
Contents
実行環境 どんな CSV か(配布元と中身) (1)普通にインポート先テーブルを定義し COPY コマンドを使う (2)CSV のヘッダ行から直にテーブル定義クエリを作る (3)とりあえずテキスト型で入れた列の型を変換する (4)COPY コマンドは「賢い」けどアドホックには少し使いづらい(以上、1月18日) COPY コマンド以外で CSV を扱う PostgreSQL の「道具立て」 (ⅰ)ローカルにある CSV ファイルを pg_read_file 関数で読み込む (ⅱ)1 行の「素直な」CSV は、そのまま PostgreSQL の配列型にできる (ⅲ)そのままで配列型にできない CSV を、正規表現で置換する(以上、1月20日の記事) - (ⅲ)続き
- (ⅳ)配列にした
CSV の各行から、必要なフィールド(配列の要素)を抜き出す - (ⅴ)まとめ・補足
(ⅲ)続き
前回書いたとおりCSV- 1.
データに含まれない適当な「一時的な字句」を設定 - 2.
二重引用符で囲まれたフィールド内にカンマがあれば、上の一時的な字句に置換(正規表現関数を使う) - 3.
全ての連続するカンマを「空文字または NULL のフィールド」に置換( 〃 ) - 4.
一時的な字句を元のカンマに戻す - …… 結果の文字列を
{ } に入れて配列型にキャスト
正規表現では
↓ まず
with a (fpath, dummy) as ( values ('alameda-schools.csv', '$$comma$$') ) select line, case when strpos(line, dummy) > 0 then '' else regexp_replace(line, ',(?=[^"]*(",|"$))', dummy, 'g') end from a, string_to_array(pg_read_file(fpath), E'\n') as lines, unnest(lines) as line;
↓ 今回の
\pset expanded on Expanded display is on. (上のクエリ) -[ RECORD 1 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- line | school_name,county,street,city,state,zip_code,charter,magnet,title1school,title1schoolwide,students,teachers,studentteach_ratio,Geocode Score,Geocode Precision,latitude,longitude regexp_replace | school_name,county,street,city,state,zip_code,charter,magnet,title1school,title1schoolwide,students,teachers,studentteach_ratio,Geocode Score,Geocode Precision,latitude,longitude -[ RECORD 2 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- line | ACHIEVE ACADEMY,ALAMEDA,"303 HEGENBERGER RD., STE. 301",OAKLAND,CA,94621,Yes,,Yes,Yes,221,10.0,22.1,0.73,range,37.733459,-122.198363 regexp_replace | ACHIEVE ACADEMY,ALAMEDA,"303 HEGENBERGER RD.$$comma$$STE. 301",OAKLAND,CA,94621,Yes,,Yes,Yes,221,10.0,22.1,0.73,range,37.733459,-122.198363 -[ RECORD 3 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- line | ACORN WOODLAND ELEMENTARY,ALAMEDA,1025 81ST AVE.,OAKLAND,CA,94621,No,,Yes,Yes,243,13.9,17.5,0.916,range,37.753053,-122.185701 regexp_replace | ACORN WOODLAND ELEMENTARY,ALAMEDA,1025 81ST AVE.,OAKLAND,CA,94621,No,,Yes,Yes,243,13.9,17.5,0.916,range,37.753053,-122.185701 -[ RECORD 4 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- line | ALAMEDA COMMUNITY LEARNING CENTER,ALAMEDA,"210 CENTRAL AVE., RM. 603",ALAMEDA,CA,94501,Yes,,No,,249,10.2,24.4,0.737,range,37.771445,-122.279426 regexp_replace | ALAMEDA COMMUNITY LEARNING CENTER,ALAMEDA,"210 CENTRAL AVE.$$comma$$RM. 603",ALAMEDA,CA,94501,Yes,,No,,249,10.2,24.4,0.737,range,37.771445,-122.279426 (...)

上の結果に残りの手順を追加すると ↓ こんな感じ。「連続カンマ」で空フィールドを表している箇所を
with a (fpath, dummy) as ( values ('alameda-schools.csv', '$$comma$$') ), b (line, tmp) as ( select line, case when strpos(line, dummy) > 0 then '' else regexp_replace(line, ',(?=[^"]*(",|"$))', dummy, 'g') end from a, string_to_array(pg_read_file(fpath), E'\n') as lines, unnest(lines) as line ) select b.*, ('{' || replace( regexp_replace(tmp, ',(?=,)', ',NULL', 'g'), dummy, ',') || '}') :: text [] as array from a, b;
↓ 実行結果。先ほどと同様
\pset expanded on Expanded display is on. (上のクエリ) -[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- line | school_name,county,street,city,state,zip_code,charter,magnet,title1school,title1schoolwide,students,teachers,studentteach_ratio,Geocode Score,Geocode Precision,latitude,longitude tmp | school_name,county,street,city,state,zip_code,charter,magnet,title1school,title1schoolwide,students,teachers,studentteach_ratio,Geocode Score,Geocode Precision,latitude,longitude array | {school_name,county,street,city,state,zip_code,charter,magnet,title1school,title1schoolwide,students,teachers,studentteach_ratio,"Geocode Score","Geocode Precision",latitude,longitude} -[ RECORD 2 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- line | ACHIEVE ACADEMY,ALAMEDA,"303 HEGENBERGER RD., STE. 301",OAKLAND,CA,94621,Yes,,Yes,Yes,221,10.0,22.1,0.73,range,37.733459,-122.198363 tmp | ACHIEVE ACADEMY,ALAMEDA,"303 HEGENBERGER RD.$$comma$$ STE. 301",OAKLAND,CA,94621,Yes,,Yes,Yes,221,10.0,22.1,0.73,range,37.733459,-122.198363 array | {"ACHIEVE ACADEMY",ALAMEDA,"303 HEGENBERGER RD., STE. 301",OAKLAND,CA,94621,Yes,NULL,Yes,Yes,221,10.0,22.1,0.73,range,37.733459,-122.198363} -[ RECORD 3 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- line | ACORN WOODLAND ELEMENTARY,ALAMEDA,1025 81ST AVE.,OAKLAND,CA,94621,No,,Yes,Yes,243,13.9,17.5,0.916,range,37.753053,-122.185701 tmp | ACORN WOODLAND ELEMENTARY,ALAMEDA,1025 81ST AVE.,OAKLAND,CA,94621,No,,Yes,Yes,243,13.9,17.5,0.916,range,37.753053,-122.185701 array | {"ACORN WOODLAND ELEMENTARY",ALAMEDA,"1025 81ST AVE.",OAKLAND,CA,94621,No,NULL,Yes,Yes,243,13.9,17.5,0.916,range,37.753053,-122.185701} -[ RECORD 4 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- line | ALAMEDA COMMUNITY LEARNING CENTER,ALAMEDA,"210 CENTRAL AVE., RM. 603",ALAMEDA,CA,94501,Yes,,No,,249,10.2,24.4,0.737,range,37.771445,-122.279426 tmp | ALAMEDA COMMUNITY LEARNING CENTER,ALAMEDA,"210 CENTRAL AVE.$$comma$$ RM. 603",ALAMEDA,CA,94501,Yes,,No,,249,10.2,24.4,0.737,range,37.771445,-122.279426 array | {"ALAMEDA COMMUNITY LEARNING CENTER",ALAMEDA,"210 CENTRAL AVE., RM. 603",ALAMEDA,CA,94501,Yes,NULL,No,NULL,249,10.2,24.4,0.737,range,37.771445,-122.279426} (...)

もしカンマを適当に変換した字句(今回では
別の確認として ↓ 各行の配列の要素数を調べるクエリ。DISTINCT
with a (fpath, dummy) as ( values ('alameda-schools.csv', '$$comma$$') ), b (line, tmp) as ( select line, case when strpos(line, dummy) > 0 then '' else regexp_replace(line, ',(?=[^"]*(",|"$))', dummy, 'g') end from a, string_to_array(pg_read_file(fpath), E'\n') as lines, unnest(lines) as line ) select distinct array_length(('{' || replace( regexp_replace(tmp, ',(?=,)', ',NULL', 'g'), dummy, ',') || '}') :: text [], 1) as ncol_check from a, b ; +------------+ | ncol_check | +------------+ | 17 | +------------+ (1 row)

(ⅳ)配列にした CSV の各行から、必要なフィールド(配列の要素)を抜き出す
CSVwith a (colnames) as ( select ('{' || lines[1] || '}') :: text[] from pg_read_file('alameda-schools.csv') as str, string_to_array(str, E'\n') as lines ) select id, colname from a, unnest(colnames) with ordinality as foo(colname, id) ; +----+--------------------+ | id | colname | +----+--------------------+ | 1 | school_name | | 2 | county | | 3 | street | | 4 | city | | 5 | state | | 6 | zip_code | | 7 | charter | | 8 | magnet | | 9 | title1school | | 10 | title1schoolwide | | 11 | students | | 12 | teachers | | 13 | studentteach_ratio | | 14 | Geocode Score | | 15 | Geocode Precision | | 16 | latitude | | 17 | longitude | +----+--------------------+ (17 rows)

今回の
with a (fpath, dummy) as ( values ('alameda-schools.csv', '$$comma$$') ), b (line, tmp) as ( select line, case when strpos(line, dummy) > 0 then '' else regexp_replace(line, ',(?=[^"]*(",|"$))', dummy, 'g') end from a, string_to_array(pg_read_file(fpath), E'\n') as lines, unnest(lines) as line ), c (ary) as ( select ('{' || replace( regexp_replace(tmp, ',(?=,)', ',NULL', 'g'), dummy, ',') || '}') :: text [] from a, b offset 1 -- 先頭のヘッダ行を除く ) select ary[16] :: float as lat, ary[17] :: float as lon from c ; +-----------+-------------+ | lat | lon | +-----------+-------------+ | 37.733459 | -122.198363 | | 37.753053 | -122.185701 | | 37.771445 | -122.279426 | | 37.658168 | -122.097027 | | 37.715126 | -122.1165 | | 37.658168 | -122.097027 | | 37.658168 | -122.097027 | | 37.764236 | -122.248243 | (...)

CSV
(ⅴ)まとめ・補足
今回の「素直な」とは、CSVwith a (fpath) as ( values ('CSVのファイルパス') ) select ('{' || line || '}') :: text [] as array from a, string_to_array(pg_read_file(fpath), E'\n') as lines, unnest(lines) as line;
これならアドホックにも使え、COPY
with a (fpath, dummy) as ( values ('CSVのファイルパス', 'カンマの一時置換用の字句') ), b (line, tmp) as ( select line, case when strpos(line, dummy) > 0 then '' else regexp_replace(line, ',(?=[^"]*(",|"$))', dummy, 'g') end from a, string_to_array(pg_read_file(fpath), E'\n') as lines, unnest(lines) as line ) select b.*, ('{' || replace( regexp_replace(tmp, ',(?=,)', ',NULL', 'g'), dummy, ',') || '}') :: text [] as array from a, b;
結局、CSV
(1)今回紹介したクエリは
(2)連続するカンマを処理する今回の正規表現では、カンマの前後に空白を含む場合を想定してません。そこまでひどい
(3)もし対象とする
-- 空フィールド -> 空文字 select string_to_array('1,2,,,5,6,7', ','); +-------------------+ | string_to_array | +-------------------+ | {1,2,"","",5,6,7} | +-------------------+ (1 row) -- 空フィールド -> NULL select string_to_array('1,2,,,5,6,7', ',', ''); +-----------------------+ | string_to_array | +-----------------------+ | {1,2,NULL,NULL,5,6,7} | +-----------------------+ (1 row) -- 区切りを正規表現で指定 -- 空フィールド -> 空文字のみ. NULLにはできない select regexp_split_to_array('1, 2, , , 5, 6, 7', '\s*,\s*'); +-----------------------+ | regexp_split_to_array | +-----------------------+ | {1,2,"","",5,6,7} | +-----------------------+ (1 row)
