Raw data are often too complex and contain too much information to make sense of.
print(ggplot2::diamonds)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
--
53937 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
53938 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
53939 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
53940 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
A data set of diamonds (53,940 observations, 10 variables).
Summary statistics (mean, standard deviation, etc.) of each column:
stat carat depth table price
1 mean 0.80 61.75 57.46 3932.80
2 sd 0.47 1.43 2.23 3989.44
3 max 5.01 79.00 95.00 18823.00
4 min 0.20 43.00 43.00 326.00
carat and price look positively correlated:
carat depth table price
carat 1.00
depth 0.03 1.00
table 0.18 -0.30 1.00
price 0.92 -0.01 0.13 1.00
OK, summary statistics may aid understanding better than raw data do.
But be aware...
Interesting relationships may be overlooked without visualization.
Reduction and reorganization of information → intuitive understanding
The larger the carat, the higher the price.
The slope seems to differ by clarity.
is the process of summarizing data and making inferences based on it.
Modelling is necessary if you want more than “intuitive understanding”
Simplified and idealized structures to represent a target system.
Mathematical expression of assumptions to simulate data generation
e.g., the larger the more expensive: $\text{price} = A \times \text{carat} + B + \epsilon$
We have now described diamond prices with a very simple equation.
→ Improving the model may lead to more accurate understanding.
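The idea of a model as "assumptions that simulate data generation" can be sketched in base R. This is only an illustration: the values of A, B, and the noise level below are made up, not estimates from the diamonds data. Data are generated from the equation and then the parameters are recovered with lm().

```r
set.seed(1)                        # make the random numbers reproducible
A = 8000                           # assumed slope (hypothetical value)
B = -2000                          # assumed intercept (hypothetical value)
carat = runif(200, min = 0.2, max = 2.0)
epsilon = rnorm(200, mean = 0, sd = 500)   # random noise
price = A * carat + B + epsilon    # data generated by the model
fit = lm(price ~ carat)            # estimate A and B back from the data
coef(fit)                          # should be close to B and A
```

If the estimates land near the assumed values, the fitting procedure behaves as expected on data that really follow the model.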
to clearly show “altering X causes phenotype Y”.
Dry-lab theoreticians tend to be called “modellers”,
but all biological researchers are modellers in a broad sense.
A new discovery is always based on previous studies.
Reproducible research makes the giants bigger.
A magnum opus thesis based on massive data
from observation of animals’ behaviors and positions in a zoological park.
The position and behavior of every individual were recorded.
Many files and many tabs —— are they all correct?
The files are the crystallization of blood, sweat, and tears,
but they cannot be opened once the free trial of the app has ended.
The dataset is larger than the previous example.
But I made less effort to handle it.
for statistical computing and graphics
There are some alternatives.
Python is comparable.
Julia is rising.
You don’t have to remember every command.
Just repeat forgetting and searching.
✅ R is a programming language/environment for data analysis
⬜ Setup R environment
⬜ Make conversation with R
⬜ Create a “project” and “scripts”
⬜ Data types and operations
⬜ R packages
⬜ Solve errors and questions
Action | macOS | Windows |
---|---|---|
Switch apps | command + tab | alt + tab |
Quit apps | command + q | alt + F4 |
Spotlight | command + space | |
Cut, Copy, Paste | command + x, -c, -v | ctrl + x, -c, -v |
Select all | command + a | ctrl + a |
Undo | command + z | ctrl + z |
Find | command + f | ctrl + f |
Save | command + s | ctrl + s |
Workspace (Environment) = a collection of temporary objects on memory
Uncheck “Restore …”. Set “Save workspace …” to Never.
File → New Project… → New Directory → New Project →
→ Directory name: r-training-2023
→ as subdirectory of: ~/project or C:/Users/yourname/project
📁 directory = folder. ~/ = home directory.
File → New File → R script
Select text with shift + ←↓↑→
Execute it with ctrl + return
hello.R
🔰 Try basic arithmetic operations and save them to hello.R.
e.g., 1 + 2 + 3, 3 * 7 * 2, 4 / 2, 4 / 3, etc.
r-training-2023/             # the root of the project
├── r-training-2023.Rproj    # double-click this to launch RStudio
├── hello.R
├── transform.R              # script for data preparation
├── visualize.R              # script for data visualization
├── data/                    # input
│   ├── iris.tsv
│   └── diamonds.xlsx
└── results/                 # output
    ├── iris-petal.png
    └── iris-summary.tsv
The next topics are working directory and relative path.
The project root is the working directory by default. Never change it.
✅ Good: read_tsv("data/iris.tsv")
❌ Bad: setwd("data"); read_tsv("iris.tsv")
head(iris)
error and warning if any.
Feel free to interrupt me any time.
x = 2 # Create x
x # What's in x?
[1] 2
y = 5 # Create y
y # What's in y?
[1] 5
R accepts <- as an assignment operator, but I recommend =.
Text following # is ignored. Useful for comments.
x + y
[1] 7
🔰 Try subtraction, multiplication, and division with x and y.
Symbols like + and * are called operators.
10 + 3 # addition
10 - 3 # subtraction
10 * 3 # multiplication
10 / 3 # division
10 %/% 3 # integer division
10 %% 3 # modulus (remainder)
10 ** 3 # exponent 10^3
🔰 Check the results of the commands above.
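Integer division and modulus are easier to understand together. As a small sketch (the variable names are chosen only for this example), the quotient and the remainder reconstruct the original number:

```r
a = 10
b = 3
quotient = a %/% b     # integer division: how many times b fits into a
remainder = a %% b     # what is left over
quotient * b + remainder == a   # TRUE: the two results reconstruct a
a ** b == a ^ b                 # TRUE: ** and ^ are the same operator
```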
A function receives some values, does some work, and returns something.
x = seq(1, 3) # receives 1 and 3, and returns a vector.
x
[1] 1 2 3
sum(x) # receives a vector, and returns a sum
[1] 6
square = function(something) { # define a new function
something ** 2
}
square(x) # use it
[1] 1 4 9
🔰 Create your own function.
e.g., a function named twice to return doubled numbers.
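Functions can take more than one argument, and arguments can have default values. The following is a minimal sketch; scale_by and its argument names are hypothetical, made up for this example.

```r
# `scale_by`, `values`, and `factor` are hypothetical names for this example.
scale_by = function(values, factor = 10) {
  values * factor    # the last evaluated expression is the return value
}
x = seq(1, 3)
scale_by(x)                # uses the default factor
scale_by(x, factor = 0.5)  # overrides the default
```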
x = 42 # Create x
x # What's in x?
[1] 42
y = "24601" # Create y
y # What's in y?
[1] "24601"
R cannot calculate the sum of them:
x + y # Error! Why?
Error in x + y: non-numeric argument to binary operator
class(x)
[1] "numeric"
is.numeric(x)
[1] TRUE
is.character(x)
[1] FALSE
as.character(x)
[1] "42"
🔰 Apply the same functions to y.
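One way to make the failed addition work is to convert the string to a number first. A short sketch using the same x and y (y_num and not_a_number are names introduced here):

```r
x = 42
y = "24601"
is.character(y)         # TRUE: y is a string, not a number
y_num = as.numeric(y)   # convert the string to a number
x + y_num               # now the addition works
# Not every string is a number: the result is NA (with a warning, silenced here)
not_a_number = suppressWarnings(as.numeric("abc"))
is.na(not_a_number)
```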
logical: TRUE or FALSE
numeric: integer 42L or real number 3.1416
character: "a string"
factor: hybrid of character and integer
array: multi-dimensional array.
matrix: two-dimensional array.
list: subspecies of vector that can be heterogeneous.
data.frame: rectangular table of vectors. Important.
tibble and tbl_df: variants of data.frame.
R is good at element-wise operations on vectors.
There is no scalar type; a single value is treated as a vector of length 1.
x = c(1, 2, 9) # length of 3
x + x # the same length
[1] 2 4 18
y = 10 # length of 1
x + y # the shorter vector is recycled
[1] 11 12 19
x < 5 # is it smaller than 5?
[1] TRUE TRUE FALSE
🔰 Try other operations on these x and y.
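A few more element-wise operations on the same x and y, as a sketch (n_small is a name introduced for this example). Comparisons return one logical value per element, and logical values can be summed because TRUE counts as 1.

```r
x = c(1, 2, 9)
y = 10
x * y               # 10 20 90: y is recycled for every element of x
x == 9              # FALSE FALSE TRUE: comparison is element-wise too
n_small = sum(x < 5)   # TRUE counts as 1 when summed
n_small             # 2
```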
Use [] to extract a subset. Indices start from 1.
letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
letters[3]
[1] "c"
letters[seq(4, 6)] # 4 5 6
[1] "d" "e" "f"
letters[seq(1, 26) < 4] # TRUE TRUE TRUE FALSE FALSE ...
[1] "a" "b" "c"
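Subsetting with [] also accepts several positions at once, negative indices, and logical vectors computed from the data. A short sketch (vowels is a name introduced here):

```r
first_last = letters[c(1, 26)]    # pick the first and the last element
first_last
head3 = letters[-seq(4, 26)]      # negative indices drop elements
head3
# A logical vector of the same length keeps the elements marked TRUE
vowels = letters[letters %in% c("a", "e", "i", "o", "u")]
vowels
```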
element-wise:
x = c(1, 2, 9)
y = sqrt(x) # square root
y
[1] 1.000000 1.414214 3.000000
aggregate (use all values to generate one output):
z = sum(x)
z
[1] 12
🔰 Try log(), exp(), length(), max(), mean() and classify them.
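A quick way to tell the two kinds apart is to compare the length of the output with the length of the input. A sketch with two functions not listed in the exercise:

```r
x = c(-2, 3, -4)
a = abs(x)     # element-wise: one output per input
p = prod(x)    # aggregate: all values are reduced to one
length(a) == length(x)   # TRUE for an element-wise function
length(p) == 1           # TRUE for an aggregate function
```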
A matrix is a rectangle made by folding a vector.
Often used in machine learning and image processing.
v = seq(1, 8) # c(1, 2, 3, 4, 5, 6, 7, 8)
x = matrix(v, nrow = 2) # fold into 2 rows; fill column by column
x
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 2 4 6 8
y = matrix(v, nrow = 2, byrow = TRUE) # fill row by row
y
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
🔰 Try x + y, dim(x), nrow(x), ncol(x).
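Elements of a matrix are addressed with [row, column], and leaving one side empty selects everything along that dimension. A small sketch with the same x:

```r
v = seq(1, 8)
x = matrix(v, nrow = 2)
x[2, 3]      # row 2, column 3
x[, 1]       # the whole first column, as a vector
tx = t(x)    # transpose: rows and columns are swapped
dim(tx)      # 4 rows and 2 columns
```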
A set of vertical vectors with the same length.
e.g., 4 numeric and 1 factor vectors with the length of 150:
print(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
--
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
Overview:
head(iris, 6) # First N rows. tail() for last rows.
nrow(iris) # Number of ROWs
ncol(iris) # Number of COLumns
names(iris) # Names of columns
summary(iris) # mean, quantiles, etc.
View(iris) # in RStudio/VSCode
str(iris) # structure
tibble [150 × 5] (S3: tbl_df/tbl/data.frame)
$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
🔰 Try some other data.frames, e.g., mtcars, quakes, data()
Subset:
iris[2, ] # 2nd row
iris[2:5, ] # 2nd to 5th rows
iris[, 3:4] # 3rd to 4th columns
iris[2:5, 3:4] # 2nd to 5th rows, 3rd to 4th columns
Extract a column as vector:
iris[[3]] # 3rd column
iris$Petal.Length # a column named Petal.Length
iris[["Petal.Length"]] # a column named Petal.Length
iris[["Petal.Length"]][2] # 2nd element of Petal.Length
Not recommended: it is hard to tell whether the result is a data.frame or a vector:
iris[, 3] # data.frame with a single column?
iris[, "Petal.Length"] # data.frame with a single column?
iris[2, "Petal.Length"] # data.frame with a single cell?
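The ambiguity can be checked with class(). The sketch below assumes a plain base-R data.frame (df is a name introduced here); a tibble behaves differently and keeps a single column as a table, which is exactly why the result is hard to predict.

```r
df = as.data.frame(iris)        # make sure it is a plain data.frame
class(df[, 3])                  # "numeric": silently became a vector
class(df[, 3, drop = FALSE])    # "data.frame": one column kept as a table
class(df[[3]])                  # "numeric": [[ ]] always gives a vector
class(df[3])                    # "data.frame": single [ ] without a comma gives a table
```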
Combine column vectors with the same length:
x = c(1, 2, 3)
y = c("A", "B", "C")
mydata = data.frame(x, y)
print(mydata)
x y
1 1 A
2 2 B
3 3 C
🔰 Create a data.frame named theDF as follows:
i s
24 x
25 y
26 z
Hint: you can do it with and without c().
R has built-in functions such as read.csv() and write.csv(),
write.csv(iris, "iris.csv")
"","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5.1,3.5,1.4,0.2,"setosa"
"2",4.9,3,1.4,0.2,"setosa"
"3",4.7,3.2,1.3,0.2,"setosa"
but they are difficult to use properly.
Use the readr package instead.
readr::write_csv(iris, "iris.csv")
Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
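One example of why the built-in functions need care: with default settings, write.csv() stores row names in an unnamed column, which comes back as an extra column when the file is read again. The sketch below uses only base R and a temporary file (path, a, and b are names introduced here).

```r
path = tempfile(fileext = ".csv")   # temporary file, not part of the project
write.csv(iris, path)               # default: row names go into an unnamed column
a = read.csv(path)
ncol(a)                             # 6: an extra column named "X" appeared
write.csv(iris, path, row.names = FALSE)   # switch the row names off
b = read.csv(path)
ncol(b)                             # 5: same columns as the original
```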
A collection of useful functions and datasets.
install.packages("readr") # once per computer
library(readr) # every time you start R
update.packages() # once in a while
install.packages("tidyverse")
library(conflicted) # charm for safe coding
library(tidyverse) # load core packages at once
── Attaching core tidyverse packages ──── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
Consistently designed to cover all the processes in data analysis.
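A function can also be called as package::function() without library(), and requireNamespace() reports whether a package is installed without attaching it. A minimal sketch; the second call uses stats, which ships with R, so that it runs even before readr is installed (has_readr and m are names introduced here).

```r
# TRUE if readr is installed on this computer, FALSE otherwise
has_readr = requireNamespace("readr", quietly = TRUE)
has_readr
# package::function() works without library()
m = stats::median(c(1, 2, 9))
m
```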
No such file or directory
str(iris), attributes(iris)
?sum, help.start()
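When "No such file or directory" appears, a few base R calls usually show what went wrong. This is a sketch of one possible checklist; the path is the one from the project layout, and the stop() call is only a stand-in for any failing command.

```r
path = "data/iris.tsv"    # relative to the working directory
getwd()                   # where R is looking from
file.exists(path)         # FALSE explains "No such file or directory"
list.files()              # what is actually here
# tryCatch() captures the error message instead of stopping the script
result = tryCatch(
  stop("something went wrong"),    # stand-in for any failing call
  error = function(e) conditionMessage(e)
)
result
```

Copying the captured message into a search engine is often the fastest next step.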
✅ R is a programming language/environment for data analysis.
✅ Create a “project” first to organize your files.
✅ Save commands to scripts before executing in the console.
✅ Data types: numeric, character, data.frame, etc.
✅ Useful R packages: tidyverse, etc.
✅ How to solve questions and errors.
You don’t have to remember every command.
Just repeat forgetting and searching.