R Programming Quick Notes :: Part - 2


Bhaskar S 03/17/2017


Overview

In Part - 1, we introduced the atomic data types, collection types, and some commonly used functions in R.

In this part, we will dive into the collection types and explore them further.

Hands-on With R - II

Vector

Let us start our journey with vector.

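Let us assume a vector with at least n elements in it. The script below exercises some of the operations one can perform on that vector: indexing by position, excluding elements with a negative index, slicing a range, element-wise arithmetic, and filtering with a logical vector.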

Let us create the following R script named vector_ops.R in RStudio:

vector_ops.R
#
# vector operations
#

a <- sample(1:25, 10)
print(a)

b <- sample(1:10, 10, replace = TRUE)
print(b)

print(a[5])

print(a[-5])

print(b[4:6])

print(b[-(4:6)])

print(a[c(3, 5, 7)])

print(a[-c(3, 5, 7)])

print(b/2)

c <- a > 15
print(a[c])

d <- (a > 15) & (a <= 20)
print(a[d])

e <- (b > 3) & (b <= 7)
print(b[e])

str(a)

length(b)

which(a > 15)

table(b)

min(a)

max(b)

sum(a)

mean(b)

median(a)

summary(b)
      

Execute the R script vector_ops.R in RStudio and the following is the output:

Output (vector_ops.R)

> a <- sample(1:25, 10)
> print(a)
 [1]  2 11  7 18 16 22 12  9 24 25
>
> b <- sample(1:10, 10, replace = TRUE)
> print(b)
 [1]  2  1  2  4  7 10  2  4  5  9
>
> print(a[5])
[1] 16
>
> print(a[-5])
[1]  2 11  7 18 22 12  9 24 25
>
> print(b[4:6])
[1]  4  7 10
>
> print(b[-(4:6)])
[1] 2 1 2 2 4 5 9
>
> print(a[c(3, 5, 7)])
[1]  7 16 12
>
> print(a[-c(3, 5, 7)])
[1]  2 11 18 22  9 24 25
>
> print(b/2)
 [1] 1.0 0.5 1.0 2.0 3.5 5.0 1.0 2.0 2.5 4.5
>
> c <- a > 15
> print(a[c])
[1] 18 16 22 24 25
>
> d <- (a > 15) & (a <= 20)
> print(a[d])
[1] 18 16
>
> e <- (b > 3) & (b <= 7)
> print(b[e])
[1] 4 7 4 5
>
> str(a)
 int [1:10] 2 11 7 18 16 22 12 9 24 25
>
> length(b)
[1] 10
>
> which(a > 15)
[1]  4  5  6  9 10
>
> table(b)
b
 1  2  4  5  7  9 10
 1  3  2  1  1  1  1
>
> min(a)
[1] 2
>
> max(b)
[1] 10
>
> sum(a)
[1] 146
>
> mean(b)
[1] 4.6
>
> median(a)
[1] 14
>
> summary(b)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     2.0     4.0     4.6     6.5    10.0
      

The sample() function returns a vector of elements drawn at random from the specified vector (first argument), of the specified size (second argument), either with replacement (replace = TRUE) or without replacement (the default).
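
Because sample() is random, the output shown above will differ from run to run. The following is a minimal sketch of how one could make the draws repeatable with set.seed(); the seed value 42 is an arbitrary choice for illustration and is not part of the original script:

```r
# Fix the seed so repeated runs draw the same values (42 is an arbitrary choice)
set.seed(42)

# Without replacement (the default): each value appears at most once
a <- sample(1:25, 10)
print(a)
print(anyDuplicated(a) == 0)   # TRUE, no value is repeated

# With replacement: the same value may be drawn more than once
b <- sample(1:10, 10, replace = TRUE)
print(b)

# Asking for more values than the population holds fails without replacement
res <- tryCatch(sample(1:5, 10), error = function(e) "error")
print(res)                     # "error"
```

Removing the set.seed() call restores the run-to-run variation seen in the original script.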

The logical expression a > 15 (or, for that matter, (a > 15) & (a <= 20)) returns a logical (TRUE or FALSE) vector that is the result of performing the logical operation on each element in the vector a.
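
To see the intermediate logical vector itself, here is a small sketch that reuses the values of a from the output shown above (typed in by hand so the result is repeatable); the variable name mask is just an illustrative choice:

```r
# The values of 'a' from the sample output above, entered by hand
a <- c(2, 11, 7, 18, 16, 22, 12, 9, 24, 25)

# Each comparison is evaluated element by element
mask <- (a > 15) & (a <= 20)
print(mask)         # FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE

# Indexing with the logical vector keeps only the TRUE positions
print(a[mask])      # 18 16

# TRUE counts as 1, so sum() gives the number of matching elements
print(sum(a > 15))  # 5
```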

The str() function displays the structure of any R object, such as a vector, list, matrix, or data.frame.

The length() function returns a count of the number of elements in the vector.

The which() function returns a vector of index positions where the specified logical expression is TRUE.
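
A short sketch of which() and two related helpers, again using the values of a from the output above; which.max() and which.min() are base R functions that are not used in the original script:

```r
a <- c(2, 11, 7, 18, 16, 22, 12, 9, 24, 25)

# Positions (not values) where the condition holds
idx <- which(a > 15)
print(idx)           # 4 5 6 9 10

# The positions can be fed back in as an index
print(a[idx])        # 18 16 22 24 25

# Position of the largest and the smallest element
print(which.max(a))  # 10
print(which.min(a))  # 1

# NA entries are skipped, only TRUE positions are reported
print(which(c(TRUE, NA, FALSE, TRUE)))  # 1 4
```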

The table() function returns a frequency table that lists each unique element of the specified vector and how many times that element occurs.
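
The object returned by table() can itself be indexed. The following sketch uses the values of b from the output above; the variable name counts is an illustrative choice:

```r
b <- c(2, 1, 2, 4, 7, 10, 2, 4, 5, 9)

counts <- table(b)
print(counts)

# The unique values become (character) names of the table
print(names(counts))               # "1" "2" "4" "5" "7" "9" "10"

# Look up how often a particular value occurs
print(as.vector(counts["2"]))      # 3

# The most frequent value(s)
print(names(counts)[counts == max(counts)])  # "2"

# The counts always add up to the length of the vector
print(sum(counts) == length(b))    # TRUE
```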

The min() function finds the minimum value in the vector.

The max() function finds the maximum value in the vector.

The sum() function computes the sum of all the elements in the numeric vector.

The mean() function computes the arithmetic mean of all the elements in the numeric vector.

The median() function finds the median from all the elements in the numeric vector.

The summary() function outputs the distribution of the values in the numeric vector: the minimum, the first quartile, the median, the mean, the third quartile, and the maximum.
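
As a cross-check, the figures reported above can be reproduced from the values of a and b in the sample output. The sketch below also uses quantile() and the na.rm argument, which are base R features that do not appear in the original script:

```r
a <- c(2, 11, 7, 18, 16, 22, 12, 9, 24, 25)
b <- c(2, 1, 2, 4, 7, 10, 2, 4, 5, 9)

print(sum(a))      # 146
print(median(a))   # 14, the average of the two middle values 12 and 16
print(mean(b))     # 4.6

# quantile() returns the same quartiles that summary() reports
print(quantile(b)) # 0% = 1, 25% = 2, 50% = 4, 75% = 6.5, 100% = 10

# A single NA makes the result NA unless it is explicitly dropped
x <- c(1, NA, 3)
print(mean(x))                # NA
print(mean(x, na.rm = TRUE))  # 2
```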

List

Now, let us move on to explore list.

A list can be created in either of the following two ways:

  list(value1, value2, value3, ...)

    OR

  list(name1 = value1, name2 = value2, name3 = value3, ...)

where name1, name2, name3, etc. are the names given to the corresponding elements.
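
The difference between the two forms shows up when accessing elements. Here is a minimal sketch; the element names id, tag, and vals are hypothetical and chosen only for illustration:

```r
# Unnamed elements can be reached by position only
p <- list(5, 'ABC', c(1, 2, 3))
print(names(p))                    # NULL
print(p[[3]])                      # 1 2 3

# Named elements can be reached by position, by name, or with $
q <- list(id = 10, tag = 'PQR', vals = 6:8)
print(names(q))                    # "id" "tag" "vals"
print(q[['vals']])                 # 6 7 8
print(identical(q[[3]], q$vals))   # TRUE

# Single brackets keep the list wrapper, double brackets extract the element
print(class(q['vals']))            # "list"
print(class(q[['vals']]))          # "integer"
```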

Let us assume a list with at least n elements in it.

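The script below exercises some of the operations one can perform on a list, such as extracting a sub-list with single brackets, extracting an element with double brackets, and accessing elements by name: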

Let us create the following R script named list_ops.R in RStudio:

list_ops.R
#
# list operations
#

a <- list(5, 'ABC', c(1, 2, 3), 7.5, FALSE)
print(a)

print(a[3])
class(a[3])

print(a[[1]])
class(a[[1]])

print(a[2:3])
class(a[2:3])

print(a[c(1, 3)])

print(a[[c(3, 2)]])

print(a[[3]][[2]])

b <- list(b1 = 10, b2 = 'PQR', b3 = 6:8, b4 = 8.75, b5 = TRUE)
print(b)

print(b[['b1']])

print(b$b3)
      

Execute the R script list_ops.R in RStudio and the following is the output:

Output (list_ops.R)

> a <- list(5, 'ABC', c(1, 2, 3), 7.5, FALSE)
> print(a)
[[1]]
[1] 5

[[2]]
[1] "ABC"

[[3]]
[1] 1 2 3

[[4]]
[1] 7.5

[[5]]
[1] FALSE

>
> print(a[3])
[[1]]
[1] 1 2 3

> class(a[3])
[1] "list"
>
> print(a[[1]])
[1] 5
> class(a[[1]])
[1] "numeric"
>
> print(a[2:3])
[[1]]
[1] "ABC"

[[2]]
[1] 1 2 3

> class(a[2:3])
[1] "list"
>
> print(a[c(1, 3)])
[[1]]
[1] 5

[[2]]
[1] 1 2 3

>
> print(a[[c(3, 2)]])
[1] 2
>
> print(a[[3]][[2]])
[1] 2
>
> b <- list(b1 = 10, b2 = 'PQR', b3 = 6:8, b4 = 8.75, b5 = TRUE)
> print(b)
$b1
[1] 10

$b2
[1] "PQR"

$b3
[1] 6 7 8

$b4
[1] 8.75

$b5
[1] TRUE

>
> print(b[['b1']])
[1] 10
>
> print(b$b3)
[1] 6 7 8
      

Matrix

Now, let us switch gears to explore matrix.

One of the ways by which a matrix can be created is using the format:

  matrix(vector_of_data, no_of_rows, no_of_columns)

By default, the data elements in a matrix are filled column-wise. One can change that behavior to fill the data elements row-wise by specifying the argument byrow = TRUE.
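
A small 2 x 3 sketch makes the two fill orders easy to compare; the variable names col_wise and row_wise are illustrative:

```r
# Default: values run down the first column, then the second, and so on
col_wise <- matrix(1:6, 2, 3)
print(col_wise)
print(col_wise[1, ])   # 1 3 5

# byrow = TRUE: values run across the first row, then the second
row_wise <- matrix(1:6, 2, 3, byrow = TRUE)
print(row_wise)
print(row_wise[1, ])   # 1 2 3

# Filling row-wise is the same as filling the transposed shape column-wise
print(all(row_wise == t(matrix(1:6, 3, 2))))  # TRUE
```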

One can assign names to each of the rows and columns in a matrix using the argument dimnames = list(row_names, column_names).
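
The following sketch names the rows and columns of a small matrix through the dimnames argument of matrix() and then renames the rows afterwards; the names 'first' and 'second' are hypothetical:

```r
m <- matrix(1:6, 2, 3,
            dimnames = list(c('r1', 'r2'), c('c1', 'c2', 'c3')))
print(m)

# Names and positions can be used interchangeably for indexing
by_name <- m['r2', 'c3']
by_pos <- m[2, 3]
print(by_name)         # 6
print(by_pos)          # 6

# Names can also be set or replaced after the matrix is created
rownames(m) <- c('first', 'second')
print(dimnames(m))
```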

Let us assume a matrix with m rows and n columns.

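The script below exercises some of the operations one can perform on a matrix, such as selecting rows, columns, and individual cells, transposing, and element-wise as well as matrix multiplication: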

Let us create the following R script named matrix_ops.R in RStudio:

matrix_ops.R
#
# matrix operations
#

a <- matrix(1:20, 4, 5)
print(a)

class(a)

dim(a)

b <- matrix(1:20, 4, 5, byrow = TRUE)
print(b)

m <- 1:5
n <- 6:10
o <- 11:15
p <- 16:20

c <- cbind(m, n, o, p)
print(c)

d <- rbind(m, n, o, p)
print(d)

e <- matrix(1:20, 4, 5,
            dimnames = list(c('r1', 'r2', 'r3', 'r4'),
                            c('c1', 'c2', 'c3', 'c4', 'c5')))

print(e)

rownames(e)

colnames(e)

print(a[1,])

print(b[,2])

print(c[2:3,])

print(d[,2:3])

print(e[2,4])

print(e['r2', 'c3'])

f <- t(e)
print(f)

a * b

a / b

a %*% f
      

Execute the R script matrix_ops.R in RStudio and the following is the output:

Output (matrix_ops.R)

> a <- matrix(1:20, 4, 5)
> print(a)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
>
> class(a)
[1] "matrix"
>
> dim(a)
[1] 4 5
>
> b <- matrix(1:20, 4, 5, byrow = TRUE)
> print(b)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20
>
> m <- 1:5
> n <- 6:10
> o <- 11:15
> p <- 16:20
>
> c <- cbind(m, n, o, p)
> print(c)
     m  n  o  p
[1,] 1  6 11 16
[2,] 2  7 12 17
[3,] 3  8 13 18
[4,] 4  9 14 19
[5,] 5 10 15 20
>
> d <- rbind(m, n, o, p)
> print(d)
  [,1] [,2] [,3] [,4] [,5]
m    1    2    3    4    5
n    6    7    8    9   10
o   11   12   13   14   15
p   16   17   18   19   20
>
> e <- matrix(1:20, 4, 5,
+             dimnames = list(c('r1', 'r2', 'r3', 'r4'),
+                             c('c1', 'c2', 'c3', 'c4', 'c5')))
>
> print(e)
   c1 c2 c3 c4 c5
r1  1  5  9 13 17
r2  2  6 10 14 18
r3  3  7 11 15 19
r4  4  8 12 16 20
>
> rownames(e)
[1] "r1" "r2" "r3" "r4"
>
> colnames(e)
[1] "c1" "c2" "c3" "c4" "c5"
>
> print(a[1,])
[1]  1  5  9 13 17
>
> print(b[,2])
[1]  2  7 12 17
>
> print(c[2:3,])
     m n  o  p
[1,] 2 7 12 17
[2,] 3 8 13 18
>
> print(d[,2:3])
  [,1] [,2]
m    2    3
n    7    8
o   12   13
p   17   18
>
> print(e[2,4])
[1] 14
>
> print(e['r2', 'c3'])
[1] 10
>
> f <- t(e)
> print(f)
   r1 r2 r3 r4
c1  1  2  3  4
c2  5  6  7  8
c3  9 10 11 12
c4 13 14 15 16
c5 17 18 19 20
>
> a * b
     [,1] [,2] [,3] [,4] [,5]
[1,]    1   10   27   52   85
[2,]   12   42   80  126  180
[3,]   33   84  143  210  285
[4,]   64  136  216  304  400
>
> a / b
          [,1]      [,2]      [,3]      [,4]     [,5]
[1,] 1.0000000 2.5000000 3.0000000 3.2500000 3.400000
[2,] 0.3333333 0.8571429 1.2500000 1.5555556 1.800000
[3,] 0.2727273 0.5833333 0.8461538 1.0714286 1.266667
[4,] 0.2500000 0.4705882 0.6666667 0.8421053 1.000000
>
> a %*% f
      r1  r2  r3  r4
[1,] 565 610 655 700
[2,] 610 660 710 760
[3,] 655 710 765 820
[4,] 700 760 820 880
      

The dim() function returns the dimensions (rows, columns) for the specified matrix.

The rownames() function returns the row names for the specified matrix.

The colnames() function returns the column names for the specified matrix.

The cbind() function combines the specified list of vectors as columns of a matrix.

The rbind() function combines the specified list of vectors as rows of a matrix.

The t() function performs a transpose operation on the specified matrix.
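
The script above also uses * and %*% without comment. The following sketch, on two short vectors, shows how cbind(), rbind(), and t() relate, and how element-wise multiplication (*) differs from the matrix product (%*%); the names by_col, by_row, and mp are illustrative:

```r
m <- 1:3
n <- 4:6

by_col <- cbind(m, n)   # vectors become columns: 3 rows x 2 columns
by_row <- rbind(m, n)   # vectors become rows:    2 rows x 3 columns
print(dim(by_col))      # 3 2
print(dim(by_row))      # 2 3

# Transposing one layout gives the other
print(all(t(by_col) == by_row))   # TRUE

# * multiplies matching cells, so both operands must have the same shape
print(by_row * by_row)

# %*% is the matrix product: (2 x 3) times (3 x 2) gives a 2 x 2 result
mp <- by_row %*% by_col
print(mp)               # 14 32 in the first row, 32 77 in the second
```

This is why the original script multiplies a by f (the transpose of e) with %*%: the number of columns on the left must match the number of rows on the right.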

Data Frame

Finally, we will explore data.frame, which is a tabular, spreadsheet-like data object.

Let us assume a data.frame with n rows and m columns named c1, c2, c3, ..., cm.

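The script below exercises some of the operations one can perform on a data.frame, such as inspecting its shape and structure, selecting columns by name, selecting rows by position, and filtering rows with a logical vector: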

Let us create the following R script named dataframe_ops.R in RStudio:

dataframe_ops.R
#
# data.frame operations
#

a <- sample(21:30, 10)
b <- sample(c('Blue', 'Green', 'Orange', 'Red', 'Yellow'), 10, replace = TRUE)
c <- sample(c(NA, 'HS', 'BS', 'MS', 'PHD'), 10, replace = TRUE)
d <- sample(c(NA, TRUE, FALSE), 10, replace = TRUE)
e <- sample(round(runif(5, min = 5, max = 6.5), digits = 1), 10, replace = TRUE)
f <- sample(c(NA, 150, 160, 170, 180, 190, 200), 10, replace = TRUE)

df <- data.frame(age = a,
                 color = b,
                 education = c,
                 employed = d,
                 height = e,
                 weight = f)
print(df)

class(df)

nrow(df)

ncol(df)

dim(df)

length(df)

str(df)

head(df)

tail(df)

names(df)

df['age']

df[c('color', 'education')]

df[['employed']]

df$age

df[5,]

df[5, c('height', 'weight')]

df[2:5,]

df[2:5, c('age', 'education')]

df[c(1, 3, 5),]

df[c(1, 3, 5), c('employed', 'height')]

df[5, 'weight']

g <- df$weight > 160

df[g,]

h <- (df$age > 22) & (df$education == 'MS')

df[h,]
      

Execute the R script dataframe_ops.R in RStudio and the following is the output:

Output (dataframe_ops.R)

> a <- sample(21:30, 10)
> b <- sample(c('Blue', 'Green', 'Orange', 'Red', 'Yellow'), 10, replace = TRUE)
> c <- sample(c(NA, 'HS', 'BS', 'MS', 'PHD'), 10, replace = TRUE)
> d <- sample(c(NA, TRUE, FALSE), 10, replace = TRUE)
> e <- sample(round(runif(5, min = 5, max = 6.5), digits = 1), 10, replace = TRUE)
> f <- sample(c(NA, 150, 160, 170, 180, 190, 200), 10, replace = TRUE)
>
> df <- data.frame(age = a,
+                  color = b,
+                  education = c,
+                  employed = d,
+                  height = e,
+                  weight = f)
> print(df)
   age  color education employed height weight
1   29    Red       PHD       NA    5.1    150
2   25 Orange        BS     TRUE    5.1    170
3   26   Blue        BS       NA    5.0    170
4   28 Yellow        MS       NA    5.9    150
5   21   Blue        MS       NA    5.1    160
6   23   Blue        MS       NA    5.9    150
7   24 Orange        HS    FALSE    5.0     NA
8   27  Green           TRUE    5.5     NA
9   30    Red        BS    FALSE    5.9    190
10  22 Yellow        HS       NA    5.9    150
>
> class(df)
[1] "data.frame"
>
> nrow(df)
[1] 10
>
> ncol(df)
[1] 6
>
> dim(df)
[1] 10  6
>
> length(df)
[1] 6
>
> str(df)
'data.frame':	10 obs. of  6 variables:
 $ age      : int  29 25 26 28 21 23 24 27 30 22
 $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
 $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 NA 1 2
 $ employed : logi  NA TRUE NA NA NA NA ...
 $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
 $ weight   : num  150 170 170 150 160 150 NA NA 190 150
>
> head(df)
  age  color education employed height weight
1  29    Red       PHD       NA    5.1    150
2  25 Orange        BS     TRUE    5.1    170
3  26   Blue        BS       NA    5.0    170
4  28 Yellow        MS       NA    5.9    150
5  21   Blue        MS       NA    5.1    160
6  23   Blue        MS       NA    5.9    150
>
> tail(df)
   age  color education employed height weight
5   21   Blue        MS       NA    5.1    160
6   23   Blue        MS       NA    5.9    150
7   24 Orange        HS    FALSE    5.0     NA
8   27  Green           TRUE    5.5     NA
9   30    Red        BS    FALSE    5.9    190
10  22 Yellow        HS       NA    5.9    150
>
> names(df)
[1] "age"       "color"     "education" "employed"  "height"    "weight"
>
> df['age']
   age
1   29
2   25
3   26
4   28
5   21
6   23
7   24
8   27
9   30
10  22
>
> df[c('color', 'education')]
    color education
1     Red       PHD
2  Orange        BS
3    Blue        BS
4  Yellow        MS
5    Blue        MS
6    Blue        MS
7  Orange        HS
8   Green      
9     Red        BS
10 Yellow        HS
>
> df[['employed']]
 [1]    NA  TRUE    NA    NA    NA    NA FALSE  TRUE FALSE    NA
>
> df$age
 [1] 29 25 26 28 21 23 24 27 30 22
>
> df[5,]
  age color education employed height weight
5  21  Blue        MS       NA    5.1    160
>
> df[5, c('height', 'weight')]
  height weight
5    5.1    160
>
> df[2:5,]
  age  color education employed height weight
2  25 Orange        BS     TRUE    5.1    170
3  26   Blue        BS       NA    5.0    170
4  28 Yellow        MS       NA    5.9    150
5  21   Blue        MS       NA    5.1    160
>
> df[2:5, c('age', 'education')]
  age education
2  25        BS
3  26        BS
4  28        MS
5  21        MS
>
> df[c(1, 3, 5),]
  age color education employed height weight
1  29   Red       PHD       NA    5.1    150
3  26  Blue        BS       NA    5.0    170
5  21  Blue        MS       NA    5.1    160
>
> df[c(1, 3, 5), c('employed', 'height')]
  employed height
1       NA    5.1
3       NA    5.0
5       NA    5.1
>
> df[5, 'weight']
[1] 160
>
> g <- df$weight > 160
>
> df[g,]
     age  color education employed height weight
2     25 Orange        BS     TRUE    5.1    170
3     26   Blue        BS       NA    5.0    170
NA    NA                NA     NA     NA
NA.1  NA                NA     NA     NA
9     30    Red        BS    FALSE    5.9    190
>
> h <- (df$age > 22) & (df$education == 'MS')
>
> df[h,]
   age  color education employed height weight
4   28 Yellow        MS       NA    5.9    150
6   23   Blue        MS       NA    5.9    150
NA  NA                NA     NA     NA
      

The runif() function returns the specified number (first argument) of random values drawn from a uniform distribution over the interval from min to max.
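
A minimal sketch of runif() on its own; the seed value 7 is an arbitrary choice made only so that repeated runs produce the same numbers:

```r
set.seed(7)   # arbitrary seed, for repeatability only

# Five values drawn uniformly between 5 and 6.5
u <- runif(5, min = 5, max = 6.5)
print(u)
print(all(u >= 5 & u <= 6.5))   # TRUE

# Rounded to one decimal place, as done for the height column
r <- round(u, digits = 1)
print(r)

# When min and max are omitted, the interval defaults to 0 to 1
v <- runif(3)
print(all(v >= 0 & v <= 1))     # TRUE
```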

The nrow() function returns the number of rows for the specified data.frame.

The ncol() function returns the number of columns for the specified data.frame.
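
The shape-related functions are closely connected, as the following sketch on a small hand-made data.frame shows; the data.frame people and its values are hypothetical:

```r
people <- data.frame(age = c(29, 25, 26, 28),
                     color = c('Red', 'Orange', 'Blue', 'Yellow'),
                     weight = c(150, 170, 170, NA))

print(nrow(people))     # 4
print(ncol(people))     # 3
print(dim(people))      # 4 3

# A data.frame is a list of columns, so length() counts columns, not rows
print(length(people))   # 3

# dim() is simply the row count followed by the column count
print(identical(dim(people), c(nrow(people), ncol(people))))  # TRUE
```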

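The head() function returns the first 6 rows of the specified data.frame by default; a different count can be requested with the n argument.

The tail() function returns the last 6 rows of the specified data.frame by default; a different count can be requested with the n argument.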
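
One detail in the dataframe_ops.R output above is worth a note: the rows labelled NA in the results of df[g,] and df[h,] are not real records. They appear because a comparison against a missing value yields NA rather than TRUE or FALSE, and indexing with NA produces a row of NAs. The following sketch, on a small hypothetical data.frame named people, shows the effect and a few common ways to avoid it:

```r
people <- data.frame(age = c(25, 26, 24, 30),
                     weight = c(170, 150, NA, 190))

# Comparing against NA yields NA, not FALSE
g <- people$weight > 160
print(g)                                    # TRUE FALSE NA TRUE

# Indexing with the raw logical vector adds an all-NA row
print(nrow(people[g, ]))                    # 3

# which() keeps only the TRUE positions
print(nrow(people[which(g), ]))             # 2

# Excluding the NA entries explicitly has the same effect
print(nrow(people[!is.na(g) & g, ]))        # 2

# subset() also drops rows where the condition is NA
print(nrow(subset(people, weight > 160)))   # 2
```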

More to come in Part-3 ...

References

R Programming Quick Notes :: Part - 1