2024-10-19 00:57:25 +02:00
---
tags:
- awk
- csv
---
2023-07-05 11:33:45 +02:00
# Using `awk` to deal with CSV that uses quoted/unquoted delimiters

CSV files are a mess, yes.

Assume you have CSV files that use the comma as delimiter and quoted
data fields that can contain the delimiter.

    "first", "second", "last"
    "fir,st", "second", "last"
    "firtst one", "sec,ond field", "final,ly"

Simply using the comma as field separator for `awk` won't work here, of
course: the commas inside the quoted fields would be treated as
separators too.

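To see the problem, here is a quick check with the plain comma separator. It is only a sketch: the variable names are made up, and the second sample line is fed in via `printf` instead of reading `data.csv`. The first field is cut off at the comma inside the quotes, and `awk` counts four fields instead of three.

```shell
# Split the second sample line on plain commas,
# then print field 1 and the number of fields awk found.
line='"fir,st", "second", "last"'
naive=$(printf '%s\n' "$line" | awk -F',' '{print $1 " (" NF " fields)"}')
printf '%s\n' "$naive"
# prints: "fir (4 fields)
```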
Solution: Use the field separator `", "|^"|"$` for `awk`.

`awk` treats a field separator longer than one character as a regular
expression, so this is an OR-ed list of 3 possible separators:

| Pattern | Matches                                |
|---------|----------------------------------------|
| `", "`  | the area between the data fields       |
| `^"`    | the area left of the first data field  |
| `"$`    | the area right of the last data field  |

Because `^"` matches at the start of each line, `$1` is empty and the
actual data fields start at `$2`.

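A quick way to check how a line is actually split is to print the field count and every field in brackets. This is a sketch with made-up variable names; with gawk it should show five fields, the first and last of them empty.

```shell
# Print NF followed by each field in brackets for the first sample line.
line='"first", "second", "last"'
fields=$(printf '%s\n' "$line" |
  awk -v FS='", "|^"|"$' '{
    out = NF ":"
    for (i = 1; i <= NF; i++) out = out "[" $i "]"
    print out
  }')
printf '%s\n' "$fields"
```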
You can tune these delimiters if you have other needs (for example if
you don't have a space after the commas).

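For instance, a variant that accepts commas with or without a following space could look like this. It is only a sketch: the ` *` (zero or more spaces) in the first alternative is an assumption about what the data needs, and the sample line is made up for illustration. It should print `fir,st###second###last`.

```shell
# First alternative changed from '", "' to '", *"' so the space is optional.
line='"fir,st","second", "last"'
tuned=$(printf '%s\n' "$line" |
  awk -v FS='", *"|^"|"$' '{print $2"###"$3"###"$4}')
printf '%s\n' "$tuned"
```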
Test:

The `awk` command used for the CSV above just prints the fields
separated by `###` to see what's going on:

    $ awk -v FS='", "|^"|"$' '{print $2"###"$3"###"$4}' data.csv
    first###second###last
    fir,st###second###last
    firtst one###sec,ond field###final,ly

**ATTENTION** If the CSV data is not quoted consistently (for example,
if it only quotes the data fields when needed, not always), then this
approach will not work.

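A quick illustration of that failure, as a sketch with a made-up line that quotes only the field containing a comma: none of the three separators matches, so `awk` sees the whole line as a single field and the fields `$2` to `$4` stay empty.

```shell
# Only the middle field is quoted, so '", "', '^"' and '"$' never match.
line='first, "sec,ond", last'
mixed=$(printf '%s\n' "$line" |
  awk -v FS='", "|^"|"$' '{print NF ": " $2"###"$3"###"$4}')
printf '%s\n' "$mixed"
# prints: 1: ######
```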