
Contents
COPY コマンド以外で CSV を扱う PostgreSQL の「道具立て」
昨日はまずインポート先テーブルを作ってから必要な手順と、PostgreSQL
- ⅰ)ローカルにある
CSV ファイルを読み込む → 汎用ファイルアクセス関数を使う - ⅱ)素直な
CSV は単純にカンマで分ける → 配列型へのキャストを使う - ⅲ)素直でない
CSV は何とか正しく区切る → 正規表現関数を使う - ⅳ)配列にした各行から、必要なフィールド(配列の要素)を抜き出す
以下、各作業について書きます(ⅲの途中からは明日以降に)。
(ⅰ)ローカルにある CSV ファイルを pg_read_file 関数で読み込む
PostgreSQL今回の
\pset tuples_only on Tuples only is on. \pset format unaligned Output format is unaligned. select pg_read_file('alameda-schools.csv'); school_name,county,street,city,state,zip_code,charter,magnet,title1school,title1schoolwide,students,teachers,studentteach_ratio,Geocode Score,Geocode Precision,latitude,longitude ACHIEVE ACADEMY,ALAMEDA,"303 HEGENBERGER RD., STE. 301",OAKLAND,CA,94621,Yes,,Yes,Yes,221,10.0,22.1,0.73,range,37.733459,-122.198363 ACORN WOODLAND ELEMENTARY,ALAMEDA,1025 81ST AVE.,OAKLAND,CA,94621,No,,Yes,Yes,243,13.9,17.5,0.916,range,37.753053,-122.185701 ALAMEDA COMMUNITY LEARNING CENTER,ALAMEDA,"210 CENTRAL AVE., RM. 603",ALAMEDA,CA,94501,Yes,,No,,249,10.2,24.4,0.737,range,37.771445,-122.279426 ALAMEDA COUNTY COMMUNITY,ALAMEDA,313 WEST WINTON AVE.,HAYWARD,CA,94544,No,,Yes,No,200,12.0,16.7,1.0,range,37.658168,-122.097027 (...)

psql
ただし
テンプレは ↓ こんな感じ。改行をデリミタとして
select row_number() over() as rowid, line from pg_read_file('データフォルダ下のCSVファイルのパス') as str, unnest(string_to_array(str, E'\n')) as line;
↓ 実行例。最初に
\pset tuples_only off Tuples only is off. \pset format aligned Output format is aligned. \pset border 2 Border style is 2. select row_number() over() - 1 as rowid, line from pg_read_file('alameda-schools.csv') as str, unnest(string_to_array(str, E'\n')) as line; +-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | rowid | line | +-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | 0 | school_name,county,street,city,state,zip_code,charter,magnet,title1school,title1schoolwide,students,teachers,studentteach_ratio,Geocode Score,Geocode Precision,latitude,longitude | | 1 | ACHIEVE ACADEMY,ALAMEDA,"303 HEGENBERGER RD., STE. 301",OAKLAND,CA,94621,Yes,,Yes,Yes,221,10.0,22.1,0.73,range,37.733459,-122.198363 | | 2 | ACORN WOODLAND ELEMENTARY,ALAMEDA,1025 81ST AVE.,OAKLAND,CA,94621,No,,Yes,Yes,243,13.9,17.5,0.916,range,37.753053,-122.185701 | | 3 | ALAMEDA COMMUNITY LEARNING CENTER,ALAMEDA,"210 CENTRAL AVE., RM. 603",ALAMEDA,CA,94501,Yes,,No,,249,10.2,24.4,0.737,range,37.771445,-122.279426 | | 4 | ALAMEDA COUNTY COMMUNITY,ALAMEDA,313 WEST WINTON AVE.,HAYWARD,CA,94544,No,,Yes,No,200,12.0,16.7,1.0,range,37.658168,-122.097027 | | 5 | ALAMEDA COUNTY JUVENILE HALL/COURT,ALAMEDA,2500 FAIRMONT AVE.,SAN LEANDRO,CA,94578,No,,Yes,No,249,25.5,9.8,0.92,range,37.715126,-122.1165 | (...)

(ⅱ)1 行の「素直な」CSV は、そのまま PostgreSQL の配列型にできる
前項でselect '{ 素直なCSVの1行 }' :: text[];
素直な
- カンマのあるデータは二重引用符で囲まれている、またはカンマが
\, にエスケープされている - 空フィールドは空文字
"" か NULL のどちらか。NULL は小文字でも OK
この要件はかなり幅広で、例えば「空白や改行を含むフィールドがあり、二重引用符で囲まれてない」場合でもカンマ区切りで配列になります。ただ泣き所は「カンマの連続で空フィールドを示す」CSV
今回の
select unnest(('{' || lines[1] || '}') :: text[]) as colname from pg_read_file('alameda-schools.csv') as str, string_to_array(str, E'\n') as lines; +--------------------+ | colnames | +--------------------+ | school_name | | county | | street | | city | | state | | zip_code | | charter | | magnet | | title1school | | title1schoolwide | | students | | teachers | | studentteach_ratio | | Geocode Score | | Geocode Precision | | latitude | | longitude | +--------------------+ (17 rows)

(ⅲ)そのままで配列型にできない CSV を、正規表現で置換する
今回のselect line from pg_read_file('alameda-schools.csv') as str, string_to_array(str, E'\n') as lines, unnest(lines) as line where line like '%,,,,%' limit 1; +-----------------------------------------------------------------------------------------------------------------------+ | line | +-----------------------------------------------------------------------------------------------------------------------+ | COX ACADEMY,ALAMEDA,"303 HEGENBERGER RD., STE 301",OAKLAND,CA,94621,Yes,,,,0,27.75,,0.604,street,37.736534,-122.19504 | +-----------------------------------------------------------------------------------------------------------------------+ (1 row)

普通の置換で「全ての
\pset expanded on Expanded display is on. select line, replace(line, ',,', ',""'), regexp_replace(line, ',,', ',""', 'g') from pg_read_file('alameda-schools.csv') as str, string_to_array(str, E'\n') as lines, unnest(lines) as line where line like '%,,,,%' limit 1; +-[ RECORD 1 ]---+--------------------------------------------------------------------------------------------------------------------------+ | line | COX ACADEMY,ALAMEDA,"303 HEGENBERGER RD., STE 301",OAKLAND,CA,94621,Yes,,,,0,27.75,,0.604,street,37.736534,-122.19504 | | replace | COX ACADEMY,ALAMEDA,"303 HEGENBERGER RD., STE 301",OAKLAND,CA,94621,Yes,"",""0,27.75,""0.604,street,37.736534,-122.19504 | | regexp_replace | COX ACADEMY,ALAMEDA,"303 HEGENBERGER RD., STE 301",OAKLAND,CA,94621,Yes,"",""0,27.75,""0.604,street,37.736534,-122.19504 | +----------------+--------------------------------------------------------------------------------------------------------------------------+

そこで先日書いた正規表現の肯定先読みを使えば ↓ 全ての空フィールドを空文字に変換できました。
select line, regexp_replace(line, ',(?=,)', ',""', 'g') from pg_read_file('alameda-schools.csv') as str, string_to_array(str, E'\n') as lines, unnest(lines) as line where line like '%,,,,%' limit 1 ; +-[ RECORD 1 ]---+-------------------------------------------------------------------------------------------------------------------------------+ | line | COX ACADEMY,ALAMEDA,"303 HEGENBERGER RD., STE 301",OAKLAND,CA,94621,Yes,,,,0,27.75,,0.604,street,37.736534,-122.19504 | | regexp_replace | COX ACADEMY,ALAMEDA,"303 HEGENBERGER RD., STE 301",OAKLAND,CA,94621,Yes,"","","",0,27.75,"",0.604,street,37.736534,-122.19504 | +----------------+-------------------------------------------------------------------------------------------------------------------------------+

この正規表現で
select line, concat('{', regexp_replace(line, ',(?=,)', ',""', 'g'), '}') :: text[] from pg_read_file('alameda-schools.csv') as str, string_to_array(str, E'\n') as lines, unnest(lines) as line ; +-[ RECORD 1 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | line | school_name,county,street,city,state,zip_code,charter,magnet,title1school,title1schoolwide,students,teachers,studentteach_ratio,Geocode Score,Geocode Precision,latitude,longitude | | concat | {school_name,county,street,city,state,zip_code,charter,magnet,title1school,title1schoolwide,students,teachers,studentteach_ratio,"Geocode Score","Geocode Precision",latitude,longitude} | +-[ RECORD 2 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | line | ACHIEVE ACADEMY,ALAMEDA,"303 HEGENBERGER RD., STE. 301",OAKLAND,CA,94621,Yes,,Yes,Yes,221,10.0,22.1,0.73,range,37.733459,-122.198363 | | concat | {"ACHIEVE ACADEMY",ALAMEDA,"303 HEGENBERGER RD., STE. 301",OAKLAND,CA,94621,Yes,"",Yes,Yes,221,10.0,22.1,0.73,range,37.733459,-122.198363} | +-[ RECORD 3 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | line | ACORN WOODLAND ELEMENTARY,ALAMEDA,1025 81ST AVE.,OAKLAND,CA,94621,No,,Yes,Yes,243,13.9,17.5,0.916,range,37.753053,-122.185701 | | concat | {"ACORN WOODLAND ELEMENTARY",ALAMEDA,"1025 81ST AVE.",OAKLAND,CA,94621,No,"",Yes,Yes,243,13.9,17.5,0.916,range,37.753053,-122.185701} | +-[ RECORD 4 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | line | ALAMEDA COMMUNITY LEARNING CENTER,ALAMEDA,"210 CENTRAL AVE., RM. 603",ALAMEDA,CA,94501,Yes,,No,,249,10.2,24.4,0.737,range,37.771445,-122.279426 | | concat | {"ALAMEDA COMMUNITY LEARNING CENTER",ALAMEDA,"210 CENTRAL AVE., RM. 603",ALAMEDA,CA,94501,Yes,"",No,"",249,10.2,24.4,0.737,range,37.771445,-122.279426} | (...)

後々のため、もう少し汎用的な処理を考えると
- 1.
データに含まれない適当な「一時的な字」を設定 - 2.
二重引用符で囲まれたフィールド内にカンマがあれば、上の一時的な字に置換 - 3.
全ての連続するカンマを「空文字または NULL のフィールド」に置換 - 4.
一時的な字を元のカンマに戻す
て感じでしょうか。あー面倒!