Double-click r-training-2024.Rproj.
select(), filter()
group_by(), summarize()
arrange(), relocate()
*_join()
pivot_longer(), pivot_wider()
install.packages("tidyverse")
library(conflicted) # a safety charm: name conflicts become errors
library(tidyverse) # load everything at once
── Attaching core tidyverse packages ──── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
Covers the various stages of data analysis with a consistent design.
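vector: the basic type. A one-dimensional array. (👈 today's star)
  logical: logical values (TRUE or FALSE)
  numeric: numbers (integer 42L or real 3.1416)
  character: strings ("a string")
  factor: factors (string-like, but subtly different)
array: a multi-dimensional array. Like vector, all elements share the same type.
  matrix: a matrix = a two-dimensional array.
list: a generous vector that can hold elements of different types.
data.frame: a rectangular table made of equal-length vectors side by side. Important. There are also variants called tibble or tbl_df, but they are nearly the same.
Even a single value is treated as a vector.
A calculation can be applied to every element of a vector at once.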
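x = c(1, 2, 9) # numeric vector of length 3
x + x # calculation between vectors of the same length
[1] 2 4 18
y = 10 # numeric vector of length 1
x + y # length 3 + length 1 = length 3 (added element by element)
[1] 11 12 19
x < 5 # compare each element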
[1] TRUE TRUE FALSE
Numbers are normally treated as double, the double-precision floating-point type:
answer = 42
typeof(answer)
[1] "double"
They can also be treated as integers, either by converting explicitly or by appending L:
typeof(as.integer(answer))
[1] "integer"
whoami = 24601L
typeof(whoami)
[1] "integer"
In R you rarely need to care about the difference.
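A minimal sketch of why the integer/double distinction rarely matters in practice: comparisons and arithmetic convert between the two automatically. Only base R is used; the values are arbitrary illustrations.

```r
# Integer and double values compare equal even though their types differ.
1L == 1            # TRUE
identical(1L, 1)   # FALSE: identical() also compares the type

# Arithmetic promotes to double whenever it is needed.
typeof(1L + 1L)    # "integer"
typeof(1L + 0.5)   # "double"
typeof(5L / 2L)    # "double"; the value is 2.5
```

The one place where the type tends to show up is a strict check such as identical().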
Functions take a vector and are applied to each element:
x = c(1, 2, 3)
sqrt(x)
[1] 1.000000 1.414214 1.732051
log(x)
[1] 0.0000000 0.6931472 1.0986123
log10(x)
[1] 0.0000000 0.3010300 0.4771213
exp(x)
[1] 2.718282 7.389056 20.085537
There are several ways to modify the contents.
An example that converts the price column of diamonds from dollars to yen:
dia = diamonds # copy under another name
# specify the column with the dollar operator $
dia$price = 105.59 * dia$price
# specify the name with [[string]] (an alternative; running both converts twice)
dia[["price"]] = 105.59 * dia[["price"]]
# dplyr::mutate with pipe
dia = diamonds |>
  dplyr::mutate(price = 105.59 * price) |>
  dplyr::filter(carat > 1) |>
  dplyr::summarize(avg_price = mean(price))
For a one-off change, any of these is fine. In a pipeline, mutate() is convenient.
Rescale so that min = 0 and max = 1:
normalized_minmax = diamonds |>
  dplyr::mutate(price = (price - min(price)) / (max(price) - min(price))) |>
  print()
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 0.000000e+00 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 0.000000e+00 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 5.406282e-05 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 4.325026e-04 4.20 4.23 2.63
--
53937 0.72 Good D SI1 63.1 55 1.314267e-01 5.69 5.75 3.61
53938 0.70 Very Good D SI1 62.8 60 1.314267e-01 5.66 5.68 3.56
53939 0.86 Premium H SI2 61.0 58 1.314267e-01 6.15 6.12 3.74
53940 0.75 Ideal D SI2 62.2 55 1.314267e-01 5.83 5.87 3.64
Note that this is strongly affected by outliers.
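To make the outlier sensitivity concrete, here is a small sketch with made-up numbers. The helper name minmax and the value 1000 are hypothetical and are not part of the diamonds example.

```r
# Hypothetical helper: min-max normalization of a numeric vector.
minmax = function(x) (x - min(x)) / (max(x) - min(x))

x = c(1, 2, 3, 4, 5)
minmax(x)
# [1] 0.00 0.25 0.50 0.75 1.00

# Add a single extreme value and everything else is squeezed toward 0.
x_out = c(x, 1000)
round(minmax(x_out), 3)
# [1] 0.000 0.001 0.002 0.003 0.004 1.000
```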
Rescale so that mean = 0 and standard deviation = 1:
normalized_z = diamonds |>
  dplyr::mutate(price = (price - mean(price)) / sd(price)) |>
  print()
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 -0.9040868 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 -0.9040868 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 -0.9038361 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 -0.9020815 4.20 4.23 2.63
--
53937 0.72 Good D SI1 63.1 55 -0.2947280 5.69 5.75 3.61
53938 0.70 Very Good D SI1 62.8 60 -0.2947280 5.66 5.68 3.56
53939 0.86 Premium H SI2 61.0 58 -0.2947280 6.15 6.12 3.74
53940 0.75 Ideal D SI2 62.2 55 -0.2947280 5.83 5.87 3.64
price = as.vector(scale(price)) also works.
scale() returns a matrix, so as.vector() is needed.
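A quick base R check of the point above, on an arbitrary small vector: scale() gives the same numbers as the manual formula, but wrapped in a one-column matrix.

```r
x = c(1, 2, 9)

z_manual = (x - mean(x)) / sd(x)   # plain numeric vector
z_scale = scale(x)                 # same values, stored as a 3 x 1 matrix

is.matrix(z_scale)                       # TRUE
dim(z_scale)                             # 3 1
all.equal(as.vector(z_scale), z_manual)  # TRUE
```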
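The shape of the distribution stays the same; only the range changes.
The z-score presupposes a normal distribution, so it is hard to use with a distribution this asymmetric.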
An example that removes values three or more standard deviations away from the mean ($\lvert z \rvert \ge 3$):
result = diamonds |>
  dplyr::filter(abs(price - mean(price)) / sd(price) < 3) |>
  print()
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
--
52731 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
52732 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
52733 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
52734 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
This is not the only method, and whether to remove outliers at all deserves careful thought.
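As one illustration of "not the only method", the sketch below uses a different, commonly used rule (values outside 1.5 times the interquartile range) instead of the 3-sigma rule above. The data are made up, and the 1.5 multiplier is a convention rather than something this material prescribes.

```r
x = c(10, 12, 11, 13, 12, 11, 100)   # hypothetical data with one extreme value

# 3-sigma rule: the extreme value inflates sd(x), so nothing is removed.
all(abs(x - mean(x)) / sd(x) < 3)    # TRUE

# IQR rule: quartiles are barely affected by the extreme value.
q = quantile(x, c(0.25, 0.75))
iqr = q[[2]] - q[[1]]
keep = x >= q[[1]] - 1.5 * iqr & x <= q[[2]] + 1.5 * iqr
x[keep]
# [1] 10 12 11 13 12 11
```

The two rules disagree on the same data, which is one reason the choice needs to be justified case by case.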
tidyr::drop_na()
Drops rows that contain NA (in the specified columns).
df = tibble::tibble(x = c(1, 2, NA), y = c("a", NA, "c"), z = c("D", "E", NA))
df |> tidyr::drop_na()
x y z
1 1 a D
🔰 Extract only the rows of starwars that have both height and mass data.
name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
1 Luke Skywalker 172 77 blond fair blue 19.0 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]>
2 C-3PO 167 75 <NA> gold yellow 112.0 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
3 R2-D2 96 32 <NA> white, blue red 33.0 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
4 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]>
--
84 Rey NA NA brown light hazel NA female feminine <NA> Human <chr [1]> <chr [0]> <chr [0]>
85 Poe Dameron NA NA brown light brown NA male masculine <NA> Human <chr [1]> <chr [0]> <chr [1]>
86 BB8 NA NA none none black NA none masculine <NA> Droid <chr [1]> <chr [0]> <chr [0]>
87 Captain Phasma NA NA none none unknown NA female feminine <NA> Human <chr [1]> <chr [0]> <chr [0]>
tidyr::replace_na()
Replaces missing values (NA) with a value of your choice.
df = tibble::tibble(x = c(1, 2, NA), y = c("a", NA, "c"), z = c("D", "E", NA))
df |> tidyr::replace_na(list(x = 9999, y = "unknown"))
x y z
1 1 a D
2 2 unknown E
3 9999 c <NA>
🔰 In starwars, replace unknown hair and eye colors with "UNKNOWN".
name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
1 Luke Skywalker 172 77 blond fair blue 19.0 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]>
2 C-3PO 167 75 <NA> gold yellow 112.0 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
3 R2-D2 96 32 <NA> white, blue red 33.0 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
4 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]>
--
84 Rey NA NA brown light hazel NA female feminine <NA> Human <chr [1]> <chr [0]> <chr [0]>
85 Poe Dameron NA NA brown light brown NA male masculine <NA> Human <chr [1]> <chr [0]> <chr [1]>
86 BB8 NA NA none none black NA none masculine <NA> Droid <chr [1]> <chr [0]> <chr [0]>
87 Captain Phasma NA NA none none unknown NA female feminine <NA> Human <chr [1]> <chr [0]> <chr [0]>
dplyr::na_if()
Replaces a specific value with NA:
df = tibble::tibble(x = c(1, 2, NA), y = c("a", NA, "c"), z = c("D", "E", NA))
df |> dplyr::mutate(x = dplyr::na_if(x, 1), y = dplyr::na_if(y, "a"))
x y z
1 NA <NA> D
2 2 <NA> E
3 NA c <NA>
🔰 In starwars, turn the sex value "none" into a missing value.
name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
1 Luke Skywalker 172 77 blond fair blue 19.0 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]>
2 C-3PO 167 75 <NA> gold yellow 112.0 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
3 R2-D2 96 32 <NA> white, blue red 33.0 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
4 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]>
--
84 Rey NA NA brown light hazel NA female feminine <NA> Human <chr [1]> <chr [0]> <chr [0]>
85 Poe Dameron NA NA brown light brown NA male masculine <NA> Human <chr [1]> <chr [0]> <chr [1]>
86 BB8 NA NA none none black NA none masculine <NA> Droid <chr [1]> <chr [0]> <chr [0]>
87 Captain Phasma NA NA none none unknown NA female feminine <NA> Human <chr [1]> <chr [0]> <chr [0]>
dplyr::coalesce()
Where the first vector has NA, the value from the next vector is used:
df = tibble::tibble(x = c(1, 2, NA), y = c("a", NA, "c"), z = c("D", "E", NA))
df |> dplyr::mutate(y_or_z = dplyr::coalesce(y, z))
x y z y_or_z
1 1 a D a
2 2 <NA> E E
3 NA c <NA> c
Mixing different types raises an error:
df |> dplyr::mutate(x_or_y = dplyr::coalesce(x, y))
Error in `dplyr::mutate()`:
ℹ In argument: `x_or_y = dplyr::coalesce(x, y)`.
Caused by error in `dplyr::coalesce()`:
! Can't combine `..1` <double> and `..2` <character>.
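One possible way around the type error, sketched here as an assumption rather than a recommended fix: convert the numeric vector to character explicitly before combining. The vectors are the same x and y as in the df above.

```r
x = c(1, 2, NA)
y = c("a", NA, "c")

# as.character() keeps NA as NA, so coalesce() can still fill it from y.
dplyr::coalesce(as.character(x), y)
# [1] "1" "2" "c"
```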
🔰 In starwars, fill missing hair colors with the skin color.
dplyr::if_else()
Takes x where the condition is TRUE and y where it is FALSE:
condition = c(TRUE, TRUE, FALSE)
x = c(1, 2, 3)
y = c(100, 200, 300)
dplyr::if_else(condition, x, y)
[1] 1 2 300
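The example above has no missing values in the condition. The sketch below shows what happens when it does, using the missing argument of dplyr::if_else(); the strings are arbitrary.

```r
condition = c(TRUE, NA, FALSE)

# By default an NA condition produces NA in the result.
dplyr::if_else(condition, "yes", "no")
# "yes" NA "no"

# `missing` supplies a value for those positions instead.
dplyr::if_else(condition, "yes", "no", missing = "unknown")
# "yes" "unknown" "no"
```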
🔰 In starwars, multiply height by 100 only in the rows whose species is Droid.
name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
1 Luke Skywalker 172 77 blond fair blue 19.0 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]>
2 C-3PO 167 75 <NA> gold yellow 112.0 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
3 R2-D2 96 32 <NA> white, blue red 33.0 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
4 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]>
--
84 Rey NA NA brown light hazel NA female feminine <NA> Human <chr [1]> <chr [0]> <chr [0]>
85 Poe Dameron NA NA brown light brown NA male masculine <NA> Human <chr [1]> <chr [0]> <chr [1]>
86 BB8 NA NA none none black NA none masculine <NA> Droid <chr [1]> <chr [0]> <chr [0]>
87 Captain Phasma NA NA none none unknown NA female feminine <NA> Human <chr [1]> <chr [0]> <chr [0]>
Enclose strings in double quotes. Single quotes also work.
x = "This is a string"
y = 'If I want to include a "quote" inside a string, I use single quotes'
If you forget to close a quote, the console keeps waiting for more input; stay calm and press Esc or Ctrl-C.
> "This is a string without a closing quote
+
+
+ HELP I'M STUCK
It is hard to tell from the name what a function does:
grep, grepl, regexpr, gregexpr, regexec
sub, gsub, substr, substring
Which argument takes the target string? It differs from function to function, e.g.,
grep(pattern, x, ...)
sub(pattern, replacement, x, ...)
substr(x, start, stop)
Behavior with missing values (NA) is awkward:
x = c(1, 2, NA)
y = c("a", NA, "c")
paste(x, y) # NA is not distinguished from character "NA"
[1] "1 a" "2 NA" "NA c"
|>able (easy to use with the pipe)
When the input contains NA, the corresponding output is also NA.
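A side-by-side sketch of the NA behavior, reusing the x and y from the paste() example: base paste() turns NA into the text "NA", whereas stringr::str_c() keeps the result missing.

```r
x = c(1, 2, NA)
y = c("a", NA, "c")

paste(x, y)
# [1] "1 a"  "2 NA" "NA c"

# Any NA input makes the corresponding output NA.
stringr::str_c(x, y, sep = " ")
# "1 a" NA NA
```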
fruit4 = head(fruit, 4L) |> print()
[1] "apple" "apricot" "avocado" "banana"
stringr::str_length(fruit4) # length
[1] 5 7 7 6
stringr::str_sub(fruit4, 2, 4) # extract a substring
[1] "ppl" "pri" "voc" "ana"
stringr::str_c(1:4, " ", fruit4, "!") # concatenate
[1] "1 apple!" "2 apricot!" "3 avocado!" "4 banana!"
🔰 Extract the elements of words that are longer than 9 characters.
🔰 Apply str_sub() and str_c() to those long words.
Matching is not limited to simple equality; many kinds of conditions can be expressed:
# starts with a
stringr::str_subset(fruit, "^a")
[1] "apple" "apricot" "avocado"
# ends with r
stringr::str_subset(fruit, "r$")
[1] "bell pepper" "chili pepper" "cucumber" "pear"
# 3 to 4 word characters
stringr::str_subset(fruit, "^\\w{3,4}$")
[1] "date" "fig" "lime" "nut" "pear" "plum"
So what are these ^ and $?
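| Metacharacter | Meaning | Operator | Meaning |
|---|---|---|---|
| \d | digit (inverse: \D) | a? | a, zero or one time |
| \s | whitespace (inverse: \S) | a* | a repeated zero or more times |
| \w | word character: letter, digit, or underscore (inverse: \W) | a+ | a repeated one or more times |
| . | any single character | a{n,m} | a repeated at least n and at most m times |
| ^ | start of the string (or line) | a(?=c) | a followed by c |
| $ | end of the string (or line) | (?<=b)a | a preceded by b |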
In an ordinary R string literal, backslashes must be doubled: "^\\d".
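A short sketch of what the doubled backslash means, plus the raw string syntax r"(...)" that avoids it. Raw strings require R 4.0.0 or later; grepl() with perl = TRUE is used here only to keep the example free of package dependencies.

```r
# The literal "^\\d" holds three characters: ^, \ and d.
nchar("^\\d")                 # 3

# A raw string writes the same pattern without doubling the backslash.
identical(r"(^\d)", "^\\d")   # TRUE

# Both forms therefore behave identically as a pattern.
grepl(r"(^\d)", c("42 apples", "apple 42"), perl = TRUE)
# [1]  TRUE FALSE
```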
🔰 Practice pattern matching with str_subset() and fruit:
🔰 Set starnames = starwars[["name"]] and try the following matches as well:
str_detect()
Returns TRUE/FALSE depending on whether each element matches.
fruit4 = head(fruit, 4L)
stringr::str_detect(fruit4, "^a")
[1] TRUE TRUE TRUE FALSE
🔰 From starwars, extract the rows whose name column contains no whitespace.
name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
1 C-3PO 167 75 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
2 R2-D2 96 32 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
3 R5-D4 97 32 <NA> white, red red NA none masculine Tatooine Droid <chr [1]> <chr [0]> <chr [0]>
4 Chewbacca 228 112 brown unknown blue 200 male masculine Kashyyyk Wookiee <chr [5]> <chr [1]> <chr [2]>
--
21 Tarfful 234 136 brown brown blue NA male masculine Kashyyyk Wookiee <chr [1]> <chr [0]> <chr [0]>
22 Finn NA NA black dark dark NA male masculine <NA> Human <chr [1]> <chr [0]> <chr [0]>
23 Rey NA NA brown light hazel NA female feminine <NA> Human <chr [1]> <chr [0]> <chr [0]>
24 BB8 NA NA none none black NA none masculine <NA> Droid <chr [1]> <chr [0]> <chr [0]>
str_extract()
Extracts the matched substring. Elements that do not match become NA.
fruit4 = head(fruit, 4L)
stringr::str_extract(fruit4, "^a..")
[1] "app" "apr" "avo" NA
🔰 Strip the digits from the clarity column of diamonds.
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS 62.4 58 334 4.20 4.23 2.63
--
53937 0.72 Good D SI 63.1 55 2757 5.69 5.75 3.61
53938 0.70 Very Good D SI 62.8 60 2757 5.66 5.68 3.56
53939 0.86 Premium H SI 61.0 58 2757 6.15 6.12 3.74
53940 0.75 Ideal D SI 62.2 55 2757 5.83 5.87 3.64
str_replace(), str_replace_all()
A match enclosed in parentheses () can be referenced later:
fruit4 = head(fruit, 4L)
stringr::str_replace(fruit4, "..$", "!!")
[1] "app!!" "apric!!" "avoca!!" "bana!!"
stringr::str_replace(fruit4, "(..)$", "_\\1_")
[1] "app_le_" "apric_ot_" "avoca_do_" "bana_na_"
🔰 Replace every digit in the name column of starwars with zero.
name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
1 C-0PO 167 75 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
2 R0-D0 96 32 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
3 R0-D0 97 32 <NA> white, red red NA none masculine Tatooine Droid <chr [1]> <chr [0]> <chr [0]>
4 IG-00 200 140 none metal red 15 none masculine <NA> Droid <chr [1]> <chr [0]> <chr [0]>
5 R0-P00 96 NA none silver, red red, blue NA none feminine <NA> Droid <chr [2]> <chr [0]> <chr [0]>
6 BB0 NA NA none none black NA none masculine <NA> Droid <chr [1]> <chr [0]> <chr [0]>
matches() alone can also do the job of starts_with()/ends_with():
# starts_with("c")
diamonds |> dplyr::select(matches("^c"))
# ends_with("s")
starwars |> dplyr::select(matches("s$"))
# digits only
world_bank_pop |>
  tidyr::pivot_longer(matches("^\\d+$"), names_to = "year")
See selection helpers for more details.
fruit4 = head(fruit, 4L)
stringr::str_to_upper(fruit4) # to upper case
[1] "APPLE" "APRICOT" "AVOCADO" "BANANA"
stringr::str_pad(fruit4, 8, "left", "_") # pad to the given width
[1] "___apple" "_apricot" "_avocado" "__banana"
The stringi package offers even more features:
stringi::stri_trans_nfkc(c("ｶﾀｶﾅ", "４２")) # normalize half-width kana and full-width digits
[1] "カタカナ" "42"
🔰 Convert the name column of starwars to all lower case.
name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
1 luke skywalker 172 77 blond fair blue 19.0 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]>
2 c-3po 167 75 <NA> gold yellow 112.0 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
3 r2-d2 96 32 <NA> white, blue red 33.0 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
4 darth vader 202 136 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]>
--
84 rey NA NA brown light hazel NA female feminine <NA> Human <chr [1]> <chr [0]> <chr [0]>
85 poe dameron NA NA brown light brown NA male masculine <NA> Human <chr [1]> <chr [0]> <chr [1]>
86 bb8 NA NA none none black NA none masculine <NA> Droid <chr [1]> <chr [0]> <chr [0]>
87 captain phasma NA NA none none unknown NA female feminine <NA> Human <chr [1]> <chr [0]> <chr [0]>
This is the job of readr, not stringr:
readr::parse_number(c("p = 0.02 *", "N_A = 6e23"))
[1] 2e-02 6e+23
readr::parse_double(c("0.02", "6e+23"))
[1] 2e-02 6e+23
readr::parse_logical(c("1", "true", "0", "false"))
[1] TRUE TRUE FALSE FALSE
readr::parse_date("2024-09-21")
[1] "2024-09-21"
6e+23 is how programs write $6 \times 10^{23}$.
It does not mean $6e^{23}$.
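The notation can be checked directly in base R. In this sketch all.equal() is used for the large value because floating-point arithmetic need not be bit-for-bit exact.

```r
1e3 == 1000                  # TRUE: "e3" means "times 10^3"
all.equal(6e23, 6 * 10^23)   # TRUE

# Euler's number is not involved; this is a completely different value.
6 * exp(23)                  # about 5.85e+10
```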
Handling categorical (qualitative) variables with factor
month_levels = c( # the possible values
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x1 = c("Dec", "Apr", "Jan", "Mar") # a plain character vector
y1 = factor(x1, levels = month_levels) # convert to a factor
print(y1)
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
It looks like a string, but underneath it is an integer:
typeof(y1)
[1] "integer"
as.integer(y1) # can be converted to integer
[1] 12 4 1 3
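A small sketch of how those integers relate to the labels: each code is simply a position in levels(). The built-in constant month.abb is used here as a stand-in for month_levels, since it holds the same twelve abbreviations.

```r
y1 = factor(c("Dec", "Apr", "Jan", "Mar"), levels = month.abb)

codes = as.integer(y1)   # 12 4 1 3
levels(y1)[codes]        # "Dec" "Apr" "Jan" "Mar"

# Indexing levels by the codes is what as.character() gives back.
identical(levels(y1)[codes], as.character(y1))   # TRUE
```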
🔰 Check which columns of iris are factors: str(iris)
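factor: difference from strings, part 1
The possible values (levels) are known in advance.
Values that fall outside the levels, e.g. through a typo, are treated as NA.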
x2 = c("Dec", "Apr", "Jam", "Mar")
factor(x2, levels = month_levels)
[1] Dec Apr <NA> Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
If the original character vector contains all of the levels, conversion is simple:
as.factor(starwars[["gender"]])
[1] masculine masculine masculine masculine feminine masculine feminine masculine masculine masculine masculine masculine masculine masculine masculine masculine masculine <NA> masculine masculine masculine masculine masculine masculine masculine masculine feminine masculine masculine masculine masculine masculine masculine feminine masculine masculine masculine masculine masculine masculine masculine feminine masculine masculine feminine masculine masculine masculine masculine masculine masculine masculine masculine feminine masculine masculine masculine masculine <NA> <NA> masculine masculine feminine feminine feminine masculine masculine masculine feminine masculine masculine feminine feminine feminine masculine masculine feminine masculine masculine masculine <NA> masculine masculine feminine masculine masculine feminine
Levels: feminine masculine
factor: difference from strings, part 2
It has an order that need not be alphabetical:
x1 = c("Dec", "Apr", "Jan", "Mar")
sort(x1) # sorted as strings: alphabetical order
[1] "Apr" "Dec" "Jan" "Mar"
y1 = factor(x1, levels = month_levels)
sort(y1) # sorted as a factor: order of the levels
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
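factor: the order information pays off in plots
With strings, the order is alphabetical whether you like it or not. With a factor, any order can be specified: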
mpg_fct = mpg |>
  dplyr::mutate(drv = factor(drv, levels = c("f", "r", "4")))
ordered
An ordered factor supports greater-than/less-than comparison.
x1 = c("Dec", "Apr", "Jan", "Mar")
y3 = factor(x1, levels = month_levels, ordered = TRUE)
class(y3)
[1] "ordered" "factor"
print(y3)
[1] Dec Apr Jan Mar
Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec
y3 < "Sep"
[1] FALSE TRUE TRUE TRUE
🔰 Check which columns of diamonds are ordered factors: str(diamonds)
🔰 For example, extract only the rows whose cut is Premium or better.
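A base R sketch of the comparison behind the "Premium or better" exercise, on a tiny made-up vector. The level order Fair < Good < Very Good < Premium < Ideal mirrors the cut column of diamonds; the vector name cuts is hypothetical.

```r
cut_levels = c("Fair", "Good", "Very Good", "Premium", "Ideal")
cuts = factor(c("Ideal", "Good", "Premium", "Fair"),
              levels = cut_levels, ordered = TRUE)

# Comparison follows the order of the levels, not the alphabet.
cuts >= "Premium"
# [1]  TRUE FALSE  TRUE FALSE

as.character(cuts[cuts >= "Premium"])
# [1] "Ideal"   "Premium"
```

With the real data, the same condition could go inside a filter, for example diamonds |> dplyr::filter(cut >= "Premium").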
fct_relevel(): set the order manually
fct_reorder(): reorder according to another variable
fct_infreq(): reorder by frequency
fct_lump(): lump overly rare categories together as "Other"
diamonds_fct = diamonds |>
  dplyr::mutate(color = forcats::fct_infreq(color))
mpg_fct = mpg |>
  dplyr::mutate(fl = forcats::fct_lump(fl, n = 2))
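To see what these two functions do without the full datasets, here is a sketch on a tiny made-up factor (the vector f is hypothetical). It assumes the forcats package is installed.

```r
f = factor(c("a", "b", "b", "c", "c", "c", "d", "d", "d", "d"))
levels(f)                        # "a" "b" "c" "d" (alphabetical by default)

# fct_infreq(): the most frequent level comes first.
levels(forcats::fct_infreq(f))   # "d" "c" "b" "a"

# fct_lump(n = 2): keep the two most frequent levels, merge the rest.
lumped = forcats::fct_lump(f, n = 2)
table(lumped)                    # c: 3, d: 4, Other: 3
```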
🔰 Try drawing a figure like the following with mpg.
Equivalent to giving each row a 1/0 value and reshaping to wide format.
iris |> tibble::rowid_to_column() |>
  dplyr::mutate(value = 1L) |>
  tidyr::pivot_wider(names_from = Species,
                     values_from = value, values_fill = 0L)
rowid Sepal.Length Sepal.Width Petal.Length Petal.Width setosa versicolor virginica
1 1 5.1 3.5 1.4 0.2 1 0 0
2 2 4.9 3.0 1.4 0.2 1 0 0
3 3 4.7 3.2 1.3 0.2 1 0 0
4 4 4.6 3.1 1.5 0.2 1 0 0
--
147 147 6.3 2.5 5.0 1.9 0 0 1
148 148 6.5 3.0 5.2 2.0 0 0 1
149 149 6.2 3.4 5.4 2.3 0 0 1
150 150 5.9 3.0 5.1 1.8 0 0 1
🔰 Try turning this back into the original iris.
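One possible approach to the exercise, offered as a sketch rather than the intended answer: pivot the three indicator columns back to long format, keep the rows where the indicator is 1, and restore the factor. The names wide and restored are hypothetical.

```r
wide = iris |>
  tibble::rowid_to_column() |>
  dplyr::mutate(value = 1L) |>
  tidyr::pivot_wider(names_from = Species,
                     values_from = value, values_fill = 0L)

restored = wide |>
  tidyr::pivot_longer(c(setosa, versicolor, virginica), names_to = "Species") |>
  dplyr::filter(value == 1L) |>        # keep the species each row belongs to
  dplyr::select(!c(rowid, value)) |>   # drop the helper columns
  dplyr::mutate(Species = factor(Species, levels = levels(iris$Species)))

# Compare with the original data.
all.equal(as.data.frame(restored), iris)
```

The final mutate() is needed because pivot_longer() returns the names as character, whereas Species in iris is a factor.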
now = "2024-04-10 14:10:00"
ct = as.POSIXct(now)
unclass(ct)
[1] 1712725800
attr(,"tzone")
[1] ""
lt = as.POSIXlt(now)
unclass(lt) |> as_tibble()
sec min hour mday mon year wday yday isdst zone gmtoff
1 0 10 14 10 3 124 3 100 0 JST NA
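A sketch of how to read the two outputs above. POSIXct is a single number of seconds since 1970-01-01 00:00:00 UTC, while POSIXlt is a list of calendar fields with some offsets to watch for. The time zone is given explicitly here (an assumption matching the JST shown above) so that the result does not depend on the machine's settings.

```r
ct = as.POSIXct("2024-04-10 14:10:00", tz = "Asia/Tokyo")
as.numeric(ct)    # 1712725800 seconds since the epoch

lt = as.POSIXlt(ct)
lt$year + 1900    # 2024: year is counted from 1900
lt$mon + 1        # 4: mon is zero-based (0 = January)
lt$yday           # 100: zero-based day of the year
lt$wday           # 3: day of the week (0 = Sunday), i.e. Wednesday
```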
Base R can handle dates and times too, but the lubridate package makes it much easier.
Converting to a date-time type:
lubridate::ymd(c("20240921", "2024-09-21", "24/09/21"))
[1] "2024-09-21" "2024-09-21" "2024-09-21"
Getting the value of each unit from a date-time:
today = lubridate::ymd(20240921)
lubridate::month(today)
[1] 9
lubridate::wday(today, label = TRUE)
[1] Sat
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
It is handy to keep each package's cheat sheet (PDF) at hand.
select(), filter()
group_by(), summarize()
arrange(), relocate()
*_join()
pivot_longer(), pivot_wider()
前処理大全 (Tomomitsu Motohashi)
RユーザのためのRStudio[実践]入門 (the "spaceship book"; Matsumura et al.)