The readxl package helps with reading .xlsx, but…
Prefer CSV (comma-separated values) and TSV (tab-separated values).
Use the readr package instead of base R functions.
Specify a file with the relative path from working directory.
readr::write_tsv(iris, "data/iris.tsv")
iris2 = readr::read_tsv("data/iris.tsv")
Oops, an error occurred:
Error: Cannot open file for writing:
* 'data/iris.tsv'
Check your current directory and its content:
getwd() # GET Working Directory
fs::dir_ls(".") # List files in "." (here)
fs::dir_ls("data") # List files in "./data"
fs::dir_create("data") # Create directory
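The error above goes away once the target directory exists. A minimal sketch of the full round trip follows; it assumes readr (2.0 or later, for `show_col_types`) and fs are installed, and it writes under the session's temporary directory (an assumption made here only to avoid touching your project) instead of `./data`.

```r
# Assumption: a "data" directory under tempdir() stands in for ./data.
outdir = file.path(tempdir(), "data")
outfile = file.path(outdir, "iris.tsv")

# Create the directory only when it is missing.
if (!fs::dir_exists(outdir)) fs::dir_create(outdir)

# Write, then read back into an object with a different name.
readr::write_tsv(iris, outfile)
iris2 = readr::read_tsv(outfile, show_col_types = FALSE)

dim(iris2)
```

Note that `Species` comes back as a character column, because a plain text file does not store factor information.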
🔰 Write some data.frames to the data/ directory.
🔰 Read them and create objects with different names.
Read → Prepare → Visualize → a piece of cake … hopefully?
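2020 (2020年) → Small areas (小地域) → Population by age (5-year groups, 4 categories) and sex (年齢(5歳階級、4区分)別、男女別人口) → Miyagi Prefecture (宮城県)
Population pyramids like the following can be drawn from this CSV file, but…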
First trial, an error:
infile = "tblT001082C04.txt"
readr::read_csv(infile)
Error in nchar(x, keepNA = FALSE): invalid multibyte string, element 2
View this file in RStudio as plain text. Decoding fails:
KEY_CODE,HYOSYO,CITYNAME,NAME,HTKSYORI,HTKSAKI,GASSAN,T001082001,T001082002,T001082003,T001082004,T001082005,T001082006,T001082007,T001082008,T001082009,T001082010,T001082011,T001082012,T001082013,T001082014,T001082015,T001082016,T001082017,T001082018,T001082019,T001082020,T001082021,T001082022,T001082023,T001082024,T001082025,T001082026,T001082027,T001082028,T001082029,T001082030,T001082031,T001082032,T001082033,T001082034,T001082035,T001082036,T001082037,T001082038,T001082039,T001082040,T001082041,T001082042,T001082043,T001082044,T001082045,T001082046,T001082047,T001082048,T001082049,T001082050,T001082051,T001082052,T001082053,T001082054,T001082055,T001082056,T001082057,T001082058,T001082059,T001082060
,,,,,,,\x91\x8d\x90\x94\x81A\x94N\x97\xee\x81u\x95s\x8fځv\x8a܂\xde,\x91\x8d\x90\x94\x82O\x81`\x82S\x8d\xce,\x91\x8d\x90\x94\x82T\x81`\x82X\x8d\xce,\x91\x8d\x90\x94\x82P\x82O\x81`\x82P\x82S\x8d\xce,\x91\x8d\x90\x94\x82P\x82T\x81`\x82P\x82X\x8d\xce,\x91\x8d\x90\x94\x82Q\x82O\x81`\x82Q\x82S\x8d\xce,\x91\x8d\x90\x94\x82Q\x82T\x81`\x82Q\x82X\x8d\xce,\x91\x8d\x90\x94\x82R\x82O\x81`\x82R\x82S\x8d\xce,\x91\x8d\x90\x94\x82R\x82T\x81`\x82R\x82X\x8d\xce,\x91\x8d\x90\x94\x82S\x82O\x81`\x82S\x82S\x8d\xce,\x91\x8d\x90\x94\x82S\x82T\x81`\x82S\x82X\x8d\xce,\x91\x8d\x90\x94\x82T\x82O\x81`\x82T\x82S\x8d\xce,\x91\x8d\x90\x94\x82T\x82T\x81`\x82T\x82X\x8d\xce,\x91\x8d\x90\x94\x82U\x82O\x81`\x82U\x82S\x8d\xce,\x91\x8d\x90\x94\x82U\x82T\x81`\x82U\x82X\x8d\xce,\x91\x8d\x90\x94\x82V\x82O\x81`\x82V\x82S\x8d\xce,\x91\x8d\x90\x94\x82P\x82T\x8dΖ\xa2\x96\x9e,\x91\x8d\x90\x94\x82P\x82T\x81`\x82U\x82S\x8d\xce,\x91\x8d\x90\x94\x82U\x82T\x8dΈȏ\xe3,\x91\x8d\x90\x94\x82V\x82T\x8dΈȏ\xe3,\x92j\x82̑\x8d\x90\x94\x81A\x94N\x97\xee\x81u\x95s\x8fځv\x8a܂\xde,\x92j\x82O\x81`\x82S\x8d\xce,\x92j\x82T\x81`\x82X\x8d\xce,\x92j\x82P\x82O\x81`\x82P\x82S\x8d\xce,\x92j\x82P\x82T\x81`\x82P\x82X\x8d\xce,\x92j\x82Q\x82O\x81`\x82Q\x82S\x8d\xce,\x92j\x82Q\x82T\x81`\x82Q\x82X\x8d\xce,\x92j\x82R\x82O\x81`\x82R\x82S\x8d\xce,\x92j\x82R\x82T\x81`\x82R\x82X\x8d\xce,\x92j\x82S\x82O\x81`\x82S\x82S\x8d\xce,\x92j\x82S\x82T\x81`\x82S\x82X\x8d\xce,\x92j\x82T\x82O\x81`\x82T\x82S\x8d\xce,\x92j\x82T\x82T\x81`\x82T\x82X\x8d\xce,\x92j\x82U\x82O\x81`\x82U\x82S\x8d\xce,\x92j\x82U\x82T\x81`\x82U\x82X\x8d\xce,\x92j\x82V\x82O\x81`\x82V\x82S\x8d\xce,\x92j\x82P\x82T\x8dΖ\xa2\x96\x9e,\x92j\x82P\x82T\x81`\x82U\x82S\x8d\xce,\x92j\x82U\x82T\x8dΈȏ\xe3,\x92j\x82V\x82T\x8dΈȏ\xe3,\x8f\x97\x82̑\x8d\x90\x94\x81A\x94N\x97\xee\x81u\x95s\x8fځv\x8a܂\xde,\x8f\x97\x82O\x81`\x82S\x8d\xce,\x8f\x97\x82T\x81`\x82X\x8d\xce,\x8f\x97\x82P\x82O\x81`\x82P\x82S\x8d\xce,\x8f\x97\x82P\x82T\x81`\x82P\x82X\x8d\xce,\x8f\x97\x82Q\x82O\x81`\x82Q\x82S\x8d\xce,\x8f\x97\x82Q\x82T\x8
1`\x82Q\x82X\x8d\xce,\x8f\x97\x82R\x82O\x81`\x82R\x82S\x8d\xce,\x8f\x97\x82R\x82T\x81`\x82R\x82X\x8d\xce,\x8f\x97\x82S\x82O\x81`\x82S\x82S\x8d\xce,\x8f\x97\x82S\x82T\x81`\x82S\x82X\x8d\xce,\x8f\x97\x82T\x82O\x81`\x82T\x82S\x8d\xce,\x8f\x97\x82T\x82T\x81`\x82T\x82X\x8d\xce,\x8f\x97\x82U\x82O\x81`\x82U\x82S\x8d\xce,\x8f\x97\x82U\x82T\x81`\x82U\x82X\x8d\xce,\x8f\x97\x82V\x82O\x81`\x82V\x82S\x8d\xce,\x8f\x97\x82P\x82T\x8dΖ\xa2\x96\x9e,\x8f\x97\x82P\x82T\x81`\x82U\x82S\x8d\xce,\x8f\x97\x82U\x82T\x8dΈȏ\xe3,\x8f\x97\x82V\x82T\x8dΈȏ\xe3
04101,1,\x90\xe5\x91\xe4\x8es\x90\u0097t\x8b\xe6,,0,,,311590,10231,11633,11838,15944,23772,17838,17677,19028,21113,23299,20544,18295,16702,17137,17700,33702,194212,69969,35132,150932,5350,6087,6066,8321,12579,8669,8600,9241,10232,11412,10317,9031,7893,8010,8082,17503,96295,29477,13385,160658,4881,5546,5772,7623,11193,9169,9077,9787,10881,11887,10227,9264,8809,9127,9618,16199,97917,40492,21747
041010010,2,\x90\xe5\x91\xe4\x8es\x90\u0097t\x8b\xe6,\x90\u0097t\x92\xac,0,,,649,16,15,17,23,53,62,49,40,40,40,45,33,28,38,38,48,413,143,67,307,8,10,7,10,26,30,27,21,18,14,25,16,11,15,16,25,198,60,29,342,8,5,10,13,27,32,22,19,22,26,20,17,17,23,22,23,215,83,38
041010020,2,\x90\xe5\x91\xe4\x8es\x90\u0097t\x8b\xe6,\x82\xa0\x82\xaf\x82ڂ̒\xac,0,,,741,23,18,13,26,32,55,48,42,60,51,48,47,38,55,43,54,447,209,111,365,12,10,4,15,16,22,28,19,39,27,25,16,23,24,21,26,230,89,44,376,11,8,9,11,16,33,20,23,21,24,23,31,15,31,22,28,217,120,67
Select “File → Reopen with Encoding…”.
Modern, decent text files should be encoded in UTF-8.
Old Japanese text files tend to be encoded in SHIFT-JIS (or EUC-JP).
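The same choice of encoding can be made from R. The sketch below is only an illustration: it builds a tiny SHIFT-JIS file in a temporary location (a made-up file, not the census data), asks `readr::guess_encoding()` for candidates, and then reads the file with an explicit locale. It assumes that the platform's iconv knows the name "SHIFT-JIS".

```r
# Assumption: a tiny demo file; "\u3042\u3044" is the hiragana string "あい".
text_utf8 = "name,count\n\u3042\u3044,1\n"
sjis_bytes = iconv(text_utf8, from = "UTF-8", to = "SHIFT-JIS", toRaw = TRUE)[[1]]
tmpfile = tempfile(fileext = ".csv")
writeBin(sjis_bytes, tmpfile)

# Candidate encodings with confidence scores; short files give weak guesses.
readr::guess_encoding(tmpfile)

# Tell readr the encoding explicitly instead of relying on the UTF-8 default.
sjis = readr::locale(encoding = "SHIFT-JIS")
df = readr::read_csv(tmpfile, locale = sjis, show_col_types = FALSE)
print(df)
```

The guess is a hint, not a guarantee, so it is worth checking the decoded text by eye.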
Next problem: the second row also has column names:
sjis = readr::locale(encoding = "SHIFT-JIS")
readr::read_csv(infile, locale = sjis)
KEY_CODE HYOSYO CITYNAME NAME HTKSYORI HTKSAKI GASSAN T001082001 T001082002 T001082003 T001082004 T001082005 T001082006 T001082007 T001082008 T001082009 T001082010 T001082011 T001082012 T001082013 T001082014 T001082015 T001082016 T001082017 T001082018 T001082019 T001082020 T001082021 T001082022 T001082023 T001082024 T001082025 T001082026 T001082027 T001082028 T001082029 T001082030 T001082031 T001082032 T001082033 T001082034 T001082035 T001082036 T001082037 T001082038 T001082039 T001082040 T001082041 T001082042 T001082043 T001082044 T001082045 T001082046 T001082047 T001082048 T001082049 T001082050 T001082051 T001082052 T001082053 T001082054 T001082055 T001082056 T001082057 T001082058 T001082059 T001082060
1 <NA> NA <NA> <NA> NA <NA> <NA> 総数、年齢「不詳」含む 総数0〜4歳 総数5〜9歳 総数10〜14歳 総数15〜19歳 総数20〜24歳 総数25〜29歳 総数30〜34歳 総数35〜39歳 総数40〜44歳 総数45〜49歳 総数50〜54歳 総数55〜59歳 総数60〜64歳 総数65〜69歳 総数70〜74歳 総数15歳未満 総数15〜64歳 総数65歳以上 総数75歳以上 男の総数、年齢「不詳」含む 男0〜4歳 男5〜9歳 男10〜14歳 男15〜19歳 男20〜24歳 男25〜29歳 男30〜34歳 男35〜39歳 男40〜44歳 男45〜49歳 男50〜54歳 男55〜59歳 男60〜64歳 男65〜69歳 男70〜74歳 男15歳未満 男15〜64歳 男65歳以上 男75歳以上 女の総数、年齢「不詳」含む 女0〜4歳 女5〜9歳 女10〜14歳 女15〜19歳 女20〜24歳 女25〜29歳 女30〜34歳 女35〜39歳 女40〜44歳 女45〜49歳 女50〜54歳 女55〜59歳 女60〜64歳 女65〜69歳 女70〜74歳 女15歳未満 女15〜64歳 女65歳以上 女75歳以上
2 04101 1 仙台市青葉区 <NA> 0 <NA> <NA> 311590 10231 11633 11838 15944 23772 17838 17677 19028 21113 23299 20544 18295 16702 17137 17700 33702 194212 69969 35132 150932 5350 6087 6066 8321 12579 8669 8600 9241 10232 11412 10317 9031 7893 8010 8082 17503 96295 29477 13385 160658 4881 5546 5772 7623 11193 9169 9077 9787 10881 11887 10227 9264 8809 9127 9618 16199 97917 40492 21747
3 041010010 2 仙台市青葉区 青葉町 0 <NA> <NA> 649 16 15 17 23 53 62 49 40 40 40 45 33 28 38 38 48 413 143 67 307 8 10 7 10 26 30 27 21 18 14 25 16 11 15 16 25 198 60 29 342 8 5 10 13 27 32 22 19 22 26 20 17 17 23 22 23 215 83 38
4 041010020 2 仙台市青葉区 あけぼの町 0 <NA> <NA> 741 23 18 13 26 32 55 48 42 60 51 48 47 38 55 43 54 447 209 111 365 12 10 4 15 16 22 28 19 39 27 25 16 23 24 21 26 230 89 44 376 11 8 9 11 16 33 20 23 21 24 23 31 15 31 22 28 217 120 67
--
5941 04606004015 4 南三陸町 歌津字石浜 0 <NA> <NA> 295 9 7 11 6 6 12 6 18 16 19 21 36 18 30 26 27 158 110 54 146 5 5 5 3 5 8 2 6 8 10 14 16 8 11 20 15 80 51 20 149 4 2 6 3 1 4 4 12 8 9 7 20 10 19 6 12 78 59 34
5942 04606004016 4 南三陸町 歌津字田の浦 0 <NA> <NA> 144 5 2 5 3 5 7 5 7 6 5 15 12 17 14 4 12 82 50 32 66 - 1 3 1 3 4 3 4 3 2 6 5 9 9 4 4 40 22 9 78 5 1 2 2 2 3 2 3 3 3 9 7 8 5 - 8 42 28 23
5943 04606004017 4 南三陸町 歌津字草木沢 0 <NA> <NA> 457 21 16 18 16 12 13 20 18 18 41 34 28 37 46 43 55 237 165 76 234 11 6 8 10 11 5 9 10 11 22 18 15 17 19 29 25 128 81 33 223 10 10 10 6 1 8 11 8 7 19 16 13 20 27 14 30 109 84 43
5944 04606004018 4 南三陸町 歌津字伊里前 0 <NA> <NA> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Numeric columns contain non-numeric characters like - and X:
dfL = readr::read_csv(infile, locale = sjis, col_select = seq(1, 7)) |>
  dplyr::slice(-1)
dfR = readr::read_csv(infile, locale = sjis, col_select = -seq(1, 7),
                      skip = 1L)
raw_miyagi = dplyr::bind_cols(dfL, dfR) |> print()
KEY_CODE HYOSYO CITYNAME NAME HTKSYORI HTKSAKI GASSAN 総数、年齢「不詳」含む 総数0〜4歳 総数5〜9歳 総数10〜14歳 総数15〜19歳 総数20〜24歳 総数25〜29歳 総数30〜34歳 総数35〜39歳 総数40〜44歳 総数45〜49歳 総数50〜54歳 総数55〜59歳 総数60〜64歳 総数65〜69歳 総数70〜74歳 総数15歳未満 総数15〜64歳 総数65歳以上 総数75歳以上 男の総数、年齢「不詳」含む 男0〜4歳 男5〜9歳 男10〜14歳 男15〜19歳 男20〜24歳 男25〜29歳 男30〜34歳 男35〜39歳 男40〜44歳 男45〜49歳 男50〜54歳 男55〜59歳 男60〜64歳 男65〜69歳 男70〜74歳 男15歳未満 男15〜64歳 男65歳以上 男75歳以上 女の総数、年齢「不詳」含む 女0〜4歳 女5〜9歳 女10〜14歳 女15〜19歳 女20〜24歳 女25〜29歳 女30〜34歳 女35〜39歳 女40〜44歳 女45〜49歳 女50〜54歳 女55〜59歳 女60〜64歳 女65〜69歳 女70〜74歳 女15歳未満 女15〜64歳 女65歳以上 女75歳以上
1 04101 1 仙台市青葉区 <NA> 0 <NA> <NA> 311590 10231 11633 11838 15944 23772 17838 17677 19028 21113 23299 20544 18295 16702 17137 17700 33702 194212 69969 35132 150932 5350 6087 6066 8321 12579 8669 8600 9241 10232 11412 10317 9031 7893 8010 8082 17503 96295 29477 13385 160658 4881 5546 5772 7623 11193 9169 9077 9787 10881 11887 10227 9264 8809 9127 9618 16199 97917 40492 21747
2 041010010 2 仙台市青葉区 青葉町 0 <NA> <NA> 649 16 15 17 23 53 62 49 40 40 40 45 33 28 38 38 48 413 143 67 307 8 10 7 10 26 30 27 21 18 14 25 16 11 15 16 25 198 60 29 342 8 5 10 13 27 32 22 19 22 26 20 17 17 23 22 23 215 83 38
3 041010020 2 仙台市青葉区 あけぼの町 0 <NA> <NA> 741 23 18 13 26 32 55 48 42 60 51 48 47 38 55 43 54 447 209 111 365 12 10 4 15 16 22 28 19 39 27 25 16 23 24 21 26 230 89 44 376 11 8 9 11 16 33 20 23 21 24 23 31 15 31 22 28 217 120 67
4 041010030 3 仙台市青葉区 旭ケ丘 0 <NA> <NA> 9160 279 289 272 315 766 880 771 643 633 713 561 493 436 363 358 840 6211 1671 950 4274 149 161 141 155 315 366 352 308 296 350 278 237 220 165 167 451 2877 697 365 4886 130 128 131 160 451 514 419 335 337 363 283 256 216 198 191 389 3334 974 585
--
5940 04606004015 4 南三陸町 歌津字石浜 0 <NA> <NA> 295 9 7 11 6 6 12 6 18 16 19 21 36 18 30 26 27 158 110 54 146 5 5 5 3 5 8 2 6 8 10 14 16 8 11 20 15 80 51 20 149 4 2 6 3 1 4 4 12 8 9 7 20 10 19 6 12 78 59 34
5941 04606004016 4 南三陸町 歌津字田の浦 0 <NA> <NA> 144 5 2 5 3 5 7 5 7 6 5 15 12 17 14 4 12 82 50 32 66 - 1 3 1 3 4 3 4 3 2 6 5 9 9 4 4 40 22 9 78 5 1 2 2 2 3 2 3 3 3 9 7 8 5 - 8 42 28 23
5942 04606004017 4 南三陸町 歌津字草木沢 0 <NA> <NA> 457 21 16 18 16 12 13 20 18 18 41 34 28 37 46 43 55 237 165 76 234 11 6 8 10 11 5 9 10 11 22 18 15 17 19 29 25 128 81 33 223 10 10 10 6 1 8 11 8 7 19 16 13 20 27 14 30 109 84 43
5943 04606004018 4 南三陸町 歌津字伊里前 0 <NA> <NA> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
OK, now we have reached the starting point…
dfL = readr::read_csv(infile, locale = sjis, col_select = seq(1, 7)) |>
  dplyr::slice(-1)
dfR = readr::read_csv(infile, locale = sjis, col_select = -seq(1, 7),
                      skip = 1L, na = c("-", "X"))
raw_miyagi = dplyr::bind_cols(dfL, dfR) |> print()
KEY_CODE HYOSYO CITYNAME NAME HTKSYORI HTKSAKI GASSAN 総数、年齢「不詳」含む 総数0〜4歳 総数5〜9歳 総数10〜14歳 総数15〜19歳 総数20〜24歳 総数25〜29歳 総数30〜34歳 総数35〜39歳 総数40〜44歳 総数45〜49歳 総数50〜54歳 総数55〜59歳 総数60〜64歳 総数65〜69歳 総数70〜74歳 総数15歳未満 総数15〜64歳 総数65歳以上 総数75歳以上 男の総数、年齢「不詳」含む 男0〜4歳 男5〜9歳 男10〜14歳 男15〜19歳 男20〜24歳 男25〜29歳 男30〜34歳 男35〜39歳 男40〜44歳 男45〜49歳 男50〜54歳 男55〜59歳 男60〜64歳 男65〜69歳 男70〜74歳 男15歳未満 男15〜64歳 男65歳以上 男75歳以上 女の総数、年齢「不詳」含む 女0〜4歳 女5〜9歳 女10〜14歳 女15〜19歳 女20〜24歳 女25〜29歳 女30〜34歳 女35〜39歳 女40〜44歳 女45〜49歳 女50〜54歳 女55〜59歳 女60〜64歳 女65〜69歳 女70〜74歳 女15歳未満 女15〜64歳 女65歳以上 女75歳以上
1 04101 1 仙台市青葉区 <NA> 0 <NA> <NA> 311590 10231 11633 11838 15944 23772 17838 17677 19028 21113 23299 20544 18295 16702 17137 17700 33702 194212 69969 35132 150932 5350 6087 6066 8321 12579 8669 8600 9241 10232 11412 10317 9031 7893 8010 8082 17503 96295 29477 13385 160658 4881 5546 5772 7623 11193 9169 9077 9787 10881 11887 10227 9264 8809 9127 9618 16199 97917 40492 21747
2 041010010 2 仙台市青葉区 青葉町 0 <NA> <NA> 649 16 15 17 23 53 62 49 40 40 40 45 33 28 38 38 48 413 143 67 307 8 10 7 10 26 30 27 21 18 14 25 16 11 15 16 25 198 60 29 342 8 5 10 13 27 32 22 19 22 26 20 17 17 23 22 23 215 83 38
3 041010020 2 仙台市青葉区 あけぼの町 0 <NA> <NA> 741 23 18 13 26 32 55 48 42 60 51 48 47 38 55 43 54 447 209 111 365 12 10 4 15 16 22 28 19 39 27 25 16 23 24 21 26 230 89 44 376 11 8 9 11 16 33 20 23 21 24 23 31 15 31 22 28 217 120 67
4 041010030 3 仙台市青葉区 旭ケ丘 0 <NA> <NA> 9160 279 289 272 315 766 880 771 643 633 713 561 493 436 363 358 840 6211 1671 950 4274 149 161 141 155 315 366 352 308 296 350 278 237 220 165 167 451 2877 697 365 4886 130 128 131 160 451 514 419 335 337 363 283 256 216 198 191 389 3334 974 585
--
5940 04606004015 4 南三陸町 歌津字石浜 0 <NA> <NA> 295 9 7 11 6 6 12 6 18 16 19 21 36 18 30 26 27 158 110 54 146 5 5 5 3 5 8 2 6 8 10 14 16 8 11 20 15 80 51 20 149 4 2 6 3 1 4 4 12 8 9 7 20 10 19 6 12 78 59 34
5941 04606004016 4 南三陸町 歌津字田の浦 0 <NA> <NA> 144 5 2 5 3 5 7 5 7 6 5 15 12 17 14 4 12 82 50 32 66 NA 1 3 1 3 4 3 4 3 2 6 5 9 9 4 4 40 22 9 78 5 1 2 2 2 3 2 3 3 3 9 7 8 5 NA 8 42 28 23
5942 04606004017 4 南三陸町 歌津字草木沢 0 <NA> <NA> 457 21 16 18 16 12 13 20 18 18 41 34 28 37 46 43 55 237 165 76 234 11 6 8 10 11 5 9 10 11 22 18 15 17 19 29 25 128 81 33 223 10 10 10 6 1 8 11 8 7 19 16 13 20 27 14 30 109 84 43
5943 04606004018 4 南三陸町 歌津字伊里前 0 <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
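The effect of the `na` argument is easier to see on a toy table. This sketch uses made-up inline data (passed with `I()`, which needs readr 2.0 or later) to compare the guessed column type with and without `na = c("-", "X")`.

```r
# Assumption: a made-up CSV in which "-" and "X" mark missing counts.
csv_text = "id,count\na,1\nb,-\nc,X\nd,4\n"

# Without `na`, the markers force the whole column to be read as character.
as_is = readr::read_csv(I(csv_text), show_col_types = FALSE)
class(as_is$count)

# With `na`, the markers become NA and the column is guessed as numeric.
with_na = readr::read_csv(I(csv_text), na = c("-", "X"), show_col_types = FALSE)
class(with_na$count)
sum(is.na(with_na$count))
```

Setting `na` replaces the default `c("", "NA")`, so add those back if empty cells should also be treated as missing.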
Still many traps: leading/trailing whitespace, full-width numbers, etc.
tidy_miyagi = raw_miyagi |>
  dplyr::rename_with(stringr::str_trim) |>
  dplyr::filter(HYOSYO == 1) |>
  dplyr::select(CITYNAME, matches("[男女].+歳")) |>
  tidyr::pivot_longer(!CITYNAME, names_to = "category", values_to = "count") |>
  tidyr::separate(category, c("sex", "age"), 1) |>
  dplyr::mutate(age = stringi::stri_trans_nfkc(age)) |>
  tidyr::separate(age, c("lower", "upper"), "〜", fill = "right") |>
  dplyr::mutate(lower = readr::parse_number(lower),
                upper = readr::parse_number(upper)) |>
  dplyr::filter((upper - lower) < 5 | lower == 75) |>
  dplyr::mutate(age = (lower + upper) / 2) |>
  print()
CITYNAME sex lower upper count age
1 仙台市青葉区 男 0 4 5350 2
2 仙台市青葉区 男 5 9 6087 7
3 仙台市青葉区 男 10 14 6066 12
4 仙台市青葉区 男 15 19 8321 17
--
1245 南三陸町 女 60 64 507 62
1246 南三陸町 女 65 69 553 67
1247 南三陸町 女 70 74 450 72
1248 南三陸町 女 75 NA 1602 NA
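Two steps in the pipeline above deserve a closer look: NFKC normalization turns full-width digits into ASCII digits, and `readr::parse_number()` then ignores the trailing unit. The sketch below applies them to a single made-up label written with Unicode escapes; it assumes stringi, stringr, and readr are installed.

```r
# Assumption: one label in the style of the column names above.
# Full-width "15", wave dash (U+301C), full-width "19", and 歳 (U+6B73).
age_label = "\uff11\uff15\u301c\uff11\uff19\u6b73"

# NFKC maps full-width digits to ASCII; the wave dash and 歳 stay as they are.
normalized = stringi::stri_trans_nfkc(age_label)

# Split at the wave dash, then extract the numbers, dropping the unit.
parts = stringr::str_split_fixed(normalized, "\u301c", 2)
lower = readr::parse_number(parts[, 1])
upper = readr::parse_number(parts[, 2])
c(lower, upper)
```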
library(ggplot2)  # ggplot(), geom_col(), etc. are called without a prefix

tidy_miyagi |>
  dplyr::mutate(count = ifelse(sex == "男", -1, 1) * count) |>
  ggplot() +
  geom_col(aes(age, count, fill = sex)) +
  facet_wrap(vars(CITYNAME), nrow = 4L) +
  coord_flip() + theme_minimal(base_size = 15)
Now we have great skills for data preparation with R ✨
Not scared of messy data!
That being said,
What should we care about when we are the primary data source?
The Ministry of Internal Affairs and Communications published a document in 2020:
「統計表における機械判読可能なデータの表記方法の統一ルール」 (Unified rules for notating machine-readable data in statistical tables)
Useful functions to handle bad forms: tidyr::separate(), stringr::str_split(), stringr::str_extract()
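As a rough illustration of these three functions, the sketch below takes a made-up column in which each cell packs two values together and pulls them apart. The data and column names are hypothetical.

```r
# Assumption: a made-up table whose `range` cells hold two values each.
df = data.frame(id = 1:3, range = c("0-4", "5-9", "10-14"))

# Split one column into two; convert = TRUE turns the pieces into integers.
separated = tidyr::separate(df, range, c("lower", "upper"), sep = "-", convert = TRUE)
print(separated)

# The same pieces as a list of character vectors.
pieces = stringr::str_split(df$range, "-")

# Only the leading number of each cell.
first = stringr::str_extract(df$range, "^[0-9]+")
```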
No unit, no comma, no space should be included in a cell.
Useful functions to handle bad forms: readr::parse_number(), stringr::str_remove(), stringr::str_replace()
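A small sketch of what these functions do to cells that mix numbers with units, commas, and spaces. The values are made up, and the default readr locale is assumed, which treats "," as a grouping mark.

```r
# Assumption: made-up cells with a unit, thousands separators, and stray spaces.
cells = c("1,234 kg", "56 kg", " 7,890kg ")

# parse_number() drops the surrounding text and the grouping marks.
weights = readr::parse_number(cells)
weights

# A manual alternative: strip everything that is not a digit, then convert.
digits = stringr::str_remove_all(cells, "[^0-9]")
as.numeric(digits)
```

The manual version would also delete decimal points and minus signs, so `parse_number()` is usually the safer choice.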
No footnote should be included in a table.
Useful functions to handle bad forms: readr::parse_number(), stringr::str_remove(), stringr::str_replace()
"A", " A", and " A" are different for machines.
Useful functions to handle bad forms: stringr::str_trim(), stringr::str_remove(), stringr::str_replace()
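The point about whitespace can be checked directly. This sketch compares three made-up strings that look alike to humans and then trims them.

```r
# Assumption: made-up labels that differ only in surrounding spaces.
x = c("A", " A", "A ")

# Only the first element is exactly "A".
x == "A"

# After trimming leading and trailing whitespace, all three match.
trimmed = stringr::str_trim(x)
trimmed == "A"
```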
Useful functions to handle bad forms: tidyr::fill()
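A sketch of `tidyr::fill()` on a made-up table where a group label appears only on the first row of each group, which is a layout often left behind by merged or omitted cells. The city names and counts are hypothetical.

```r
# Assumption: made-up data; `city` is written only once per group.
df = data.frame(
  city = c("Sendai", NA, NA, "Ishinomaki", NA),
  sex = c("M", "F", "total", "M", "F"),
  count = c(10, 12, 22, 5, 6)
)

# Carry the last non-missing value downward (the default direction).
filled = tidyr::fill(df, city)
print(filled)
```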
They are not recognized automatically.
Useful functions to handle bad forms: tidyr::fill(), tidyr::separate(), stringr::str_replace()
Stick to only ASCII characters whenever possible.
Useful functions to handle bad forms: stringi::stri_trans_nfkc()
For that matter, one sheet in one file.
… \w+ …
… (.xlsx).
… (.tsv) or Comma- (.csv)
In other words, think about the future you who will analyze it.
Excel is widely used as software for viewing and editing tabular data.
It has many nice features, but often brings chaos.
22-4, 4-14
MARCH1, SEPT1
🔰 Experience the fear:
gene,label
MARCH1,22-4
SEPT1,4-14
… excel.csv.
… excel2.csv.
✅ Data input
⬜ Data interpretation (very basic introduction)
no matter how fantastic your statistical analysis is.
What impairs data input?
e.g., Weigh yourself 10 times with your clothes on.
🔰 Selection bias in biological research?
Each card has a number on one side and a letter on the other.
“If a card has an even number, then it has A on the other side.”
To test this statement, which cards should be turned over?
There are many other cognitive biases and errors.
We will learn it next time.
Various shapes depending on underlying mechanisms.
The impression and the amount of information vary by graph type.
considering dispersion as well as central tendency.
“The probability of observing this difference by chance is very low.”
A method to show this is called statistical hypothesis testing.
🎲 Out of 10 rolls of a die, 6 appeared 9 times. Is this die unfair?
🎲 Out of 12 rolls of a die, 6 appeared 4 times. Is this die unfair?
Probability of rolling 4 or more sixes out of 12 under $H_0$: $p \approx 0.125 > \alpha$
Failed to reject $H_0$ this time.
The probability to roll a 6 is not significantly different from 1/6.
Note: NOT accepting $H_0$. NOT saying the probability is equal to 1/6.
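The p-values for both dice questions can be reproduced with base R. The sketch assumes a one-sided test of $H_0$: P(six) = 1/6 and uses the binomial distribution; the variable names are only for illustration.

```r
# P(X >= 4) for X ~ Binomial(12, 1/6), i.e. 1 - P(X <= 3).
p_12 = pbinom(3, size = 12, prob = 1 / 6, lower.tail = FALSE)
p_12

# P(X >= 9) for X ~ Binomial(10, 1/6): far below any usual alpha.
p_10 = pbinom(8, size = 10, prob = 1 / 6, lower.tail = FALSE)
p_10

# The same one-sided p-value through the exact binomial test.
bt = binom.test(4, 12, p = 1 / 6, alternative = "greater")
bt$p.value
```

With 4 sixes in 12 rolls the p-value is about 0.125, so $H_0$ is not rejected at $\alpha = 0.05$; with 9 sixes in 10 rolls it is rejected.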
A test with $\alpha=0.05$ mistakenly rejects a true $H_0$ with a probability of at most 5%.
When such a test is repeated 10 times, the probability of getting at least one false positive (family-wise error rate, FWER) can be as high as
$1 - (1 - 0.05)^{10} \approx 0.40$
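The same arithmetic in R, extended to a few other numbers of tests. This is a sketch that assumes the tests are independent and that every $H_0$ is true; `n_tests` is an illustrative name.

```r
alpha = 0.05

# FWER for 10 independent tests, each run at level alpha.
fwer_10 = 1 - (1 - alpha)^10
fwer_10

# How the rate grows with the number of tests.
n_tests = c(1, 5, 10, 20)
fwer = 1 - (1 - alpha)^n_tests
data.frame(n_tests, fwer = round(fwer, 3))
```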
Wrong: Increasing 🍦 ice cream sales cause increasing 🍺 beer sales.
Right: Both 🍦 ice cream and 🍺 beer sell better when it is 🌞 hot.
Wrong: More police officers cause more crimes.
Right: More crime leads to more police deployment.
Collecting only pairs whose (x + y) falls within a specific range.
The correlation coefficient can change drastically because of a few outliers or group structure.
The number of people who drowned and the number of films in which Nicolas Cage appeared.
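Two of the pitfalls above, selection on x + y and sensitivity to outliers, can be reproduced with base R. The sketch uses made-up data; the seed, sample size, and cutoff are arbitrary choices for illustration.

```r
# Selection: x and y are independent, so their correlation is near zero.
set.seed(42)
x = rnorm(1000)
y = rnorm(1000)
cor_all = cor(x, y)

# Keeping only pairs whose sum falls in a narrow range induces
# a strong negative correlation among the selected pairs.
keep = abs(x + y) < 0.5
cor_selected = cor(x[keep], y[keep])
c(all = cor_all, selected = cor_selected)

# Outlier: four points with exactly zero correlation...
u = c(1, 2, 3, 4)
v = c(1, 2, 2, 1)
cor_without = cor(u, v)

# ...and the same points plus a single extreme observation.
cor_with = cor(c(u, 20), c(v, 20))
c(without = cor_without, with = cor_with)
```

Plotting the points before computing a coefficient is usually enough to catch both problems.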
🔰 Find examples of these four types of relationships.