R Programming Quick Notes :: Part - 4


Bhaskar S 04/16/2017


Overview

In Part - 3, we explored the factor type, the date and time types, control structures, and functions in R.

In this part, we will explore how to read data from different sources and how to use the apply functions.

Hands-on With R - IV

Reading Data

Data can be in different forms such as textual or binary. In addition, data can come from different sources such as files or the network.

Most often data comes in a textual form stored in a .csv format.

Let us create two csv sample data files named sample-data.csv and sample-data-2.csv respectively.

The following is the sample-data.csv file:

sample-data.csv
age,color,education,employed,height,weight
29,Red,PHD,NA,5.1,150
25,Orange,BS,TRUE,5.1,170
26,Blue,BS,NA,5.0,170
28,Yellow,MS,NA,5.9,150
21,Blue,MS,NA,5.1,160
23,Blue,MS,NA,5.9,150
24,Orange,HS,FALSE,5.0,NA
27,Green,BS,TRUE,5.5,NA
30,Red,BS,FALSE,5.9,190
22,Yellow,HS,NA,5.9,150

And, the following is the sample-data-2.csv file:

sample-data-2.csv
#
# This is a sample csv data
#

age,color,education,employed,height,weight
29,Red,PHD,NA,5.1,150
25,Orange,BS,TRUE,5.1,170
26,Blue,BS,NA,5.0,170
28,Yellow,MS,NA,5.9,150
21,Blue,MS,NA,5.1,160
23,Blue,MS,NA,5.9,150
24,Orange,HS,FALSE,5.0,NA
27,Green,BS,TRUE,5.5,NA
30,Red,BS,FALSE,5.9,190
22,Yellow,HS,NA,5.9,150

Also, create a gzipped version of sample-data.csv called sample-data.csv.gz. For convenience, we have uploaded both the csv and the gzipped file at PolarSPARC.

Let us create the following R script named read_csv_data.R in RStudio:

read_csv_data.R
        #
        # read csv data
        #

        a <- read.table('sample-data.csv', header=TRUE, sep=',')
        str(a)
        print(a)

        b <- read.table('sample-data-2.csv', header=TRUE, sep=',', skip=3, blank.lines.skip=TRUE)
        str(b)
        print(b)

        c <- read.csv('sample-data.csv')
        str(c)
        print(c)

        # Use connection to file
        d <- file('sample-data.csv', open='r')
        e <- read.csv(d)
        str(e)
        print(e)

        # Use connection to compressed file
        f <- gzfile('sample-data.csv.gz', open='r')
        g <- read.csv(f)
        str(g)
        print(g)

        # Use connection to file using URL
        h <- url('file://./sample-data.csv')
        i <- read.csv(h)
        str(i)
        print(i)

        # Use connection to a website
        j <- url('http://www.polarsparc.com/data/sample-data.csv')
        k <- read.csv(j)
        str(k)
        print(k)

        rm(list=ls())
      

Execute the R script read_csv_data.R in RStudio and the following is the output:

Output (read_csv_data.R)

        > a <- read.table('sample-data.csv', header=TRUE, sep=',')
        > str(a)
        'data.frame':	10 obs. of  6 variables:
         $ age      : int  29 25 26 28 21 23 24 27 30 22
         $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
         $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
         $ employed : logi  NA TRUE NA NA NA NA ...
         $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
         $ weight   : int  150 170 170 150 160 150 NA NA 190 150
        > print(a)
           age  color education employed height weight
        1   29    Red       PHD       NA    5.1    150
        2   25 Orange        BS     TRUE    5.1    170
        3   26   Blue        BS       NA    5.0    170
        4   28 Yellow        MS       NA    5.9    150
        5   21   Blue        MS       NA    5.1    160
        6   23   Blue        MS       NA    5.9    150
        7   24 Orange        HS    FALSE    5.0     NA
        8   27  Green        BS     TRUE    5.5     NA
        9   30    Red        BS    FALSE    5.9    190
        10  22 Yellow        HS       NA    5.9    150
        >
        > b <- read.table('sample-data-2.csv', header=TRUE, sep=',', skip=3, blank.lines.skip=TRUE)
        > str(b)
        'data.frame':	10 obs. of  6 variables:
         $ age      : int  29 25 26 28 21 23 24 27 30 22
         $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
         $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
         $ employed : logi  NA TRUE NA NA NA NA ...
         $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
         $ weight   : int  150 170 170 150 160 150 NA NA 190 150
        > print(b)
           age  color education employed height weight
        1   29    Red       PHD       NA    5.1    150
        2   25 Orange        BS     TRUE    5.1    170
        3   26   Blue        BS       NA    5.0    170
        4   28 Yellow        MS       NA    5.9    150
        5   21   Blue        MS       NA    5.1    160
        6   23   Blue        MS       NA    5.9    150
        7   24 Orange        HS    FALSE    5.0     NA
        8   27  Green        BS     TRUE    5.5     NA
        9   30    Red        BS    FALSE    5.9    190
        10  22 Yellow        HS       NA    5.9    150
        >
        > c <- read.csv('sample-data.csv')
        > str(c)
        'data.frame':	10 obs. of  6 variables:
         $ age      : int  29 25 26 28 21 23 24 27 30 22
         $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
         $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
         $ employed : logi  NA TRUE NA NA NA NA ...
         $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
         $ weight   : int  150 170 170 150 160 150 NA NA 190 150
        Warning message:
        closing unused connection 3 (sample-data.csv.gz)
        > print(c)
           age  color education employed height weight
        1   29    Red       PHD       NA    5.1    150
        2   25 Orange        BS     TRUE    5.1    170
        3   26   Blue        BS       NA    5.0    170
        4   28 Yellow        MS       NA    5.9    150
        5   21   Blue        MS       NA    5.1    160
        6   23   Blue        MS       NA    5.9    150
        7   24 Orange        HS    FALSE    5.0     NA
        8   27  Green        BS     TRUE    5.5     NA
        9   30    Red        BS    FALSE    5.9    190
        10  22 Yellow        HS       NA    5.9    150
        >
        > # Use connection to file
        > d <- file('sample-data.csv', open='r')
        > e <- read.csv(d)
        > str(e)
        'data.frame':	10 obs. of  6 variables:
         $ age      : int  29 25 26 28 21 23 24 27 30 22
         $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
         $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
         $ employed : logi  NA TRUE NA NA NA NA ...
         $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
         $ weight   : int  150 170 170 150 160 150 NA NA 190 150
        > print(e)
           age  color education employed height weight
        1   29    Red       PHD       NA    5.1    150
        2   25 Orange        BS     TRUE    5.1    170
        3   26   Blue        BS       NA    5.0    170
        4   28 Yellow        MS       NA    5.9    150
        5   21   Blue        MS       NA    5.1    160
        6   23   Blue        MS       NA    5.9    150
        7   24 Orange        HS    FALSE    5.0     NA
        8   27  Green        BS     TRUE    5.5     NA
        9   30    Red        BS    FALSE    5.9    190
        10  22 Yellow        HS       NA    5.9    150
        >
        > # Use connection to compressed file
        > f <- gzfile('sample-data.csv.gz', open='r')
        > g <- read.csv(f)
        Warning messages:
        1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
          seek on a gzfile connection returned an internal error
        2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
          seek on a gzfile connection returned an internal error
        > str(g)
        'data.frame':	10 obs. of  6 variables:
         $ age      : int  29 25 26 28 21 23 24 27 30 22
         $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
         $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
         $ employed : logi  NA TRUE NA NA NA NA ...
         $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
         $ weight   : int  150 170 170 150 160 150 NA NA 190 150
        > print(g)
           age  color education employed height weight
        1   29    Red       PHD       NA    5.1    150
        2   25 Orange        BS     TRUE    5.1    170
        3   26   Blue        BS       NA    5.0    170
        4   28 Yellow        MS       NA    5.9    150
        5   21   Blue        MS       NA    5.1    160
        6   23   Blue        MS       NA    5.9    150
        7   24 Orange        HS    FALSE    5.0     NA
        8   27  Green        BS     TRUE    5.5     NA
        9   30    Red        BS    FALSE    5.9    190
        10  22 Yellow        HS       NA    5.9    150
        >
        > # Use connection to file using URL
        > h <- url('file://./sample-data.csv')
        > i <- read.csv(h)
        > str(i)
        'data.frame':	10 obs. of  6 variables:
         $ age      : int  29 25 26 28 21 23 24 27 30 22
         $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
         $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
         $ employed : logi  NA TRUE NA NA NA NA ...
         $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
         $ weight   : int  150 170 170 150 160 150 NA NA 190 150
        > print(i)
           age  color education employed height weight
        1   29    Red       PHD       NA    5.1    150
        2   25 Orange        BS     TRUE    5.1    170
        3   26   Blue        BS       NA    5.0    170
        4   28 Yellow        MS       NA    5.9    150
        5   21   Blue        MS       NA    5.1    160
        6   23   Blue        MS       NA    5.9    150
        7   24 Orange        HS    FALSE    5.0     NA
        8   27  Green        BS     TRUE    5.5     NA
        9   30    Red        BS    FALSE    5.9    190
        10  22 Yellow        HS       NA    5.9    150
        >
        > # Use connection to a website
        > j <- url('http://www.polarsparc.com/data/sample-data.csv')
        > k <- read.csv(j)
        > str(k)
        'data.frame':	10 obs. of  6 variables:
         $ age      : int  29 25 26 28 21 23 24 27 30 22
         $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
         $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
         $ employed : logi  NA TRUE NA NA NA NA ...
         $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
         $ weight   : int  150 170 170 150 160 150 NA NA 190 150
        > print(k)
           age  color education employed height weight
        1   29    Red       PHD       NA    5.1    150
        2   25 Orange        BS     TRUE    5.1    170
        3   26   Blue        BS       NA    5.0    170
        4   28 Yellow        MS       NA    5.9    150
        5   21   Blue        MS       NA    5.1    160
        6   23   Blue        MS       NA    5.9    150
        7   24 Orange        HS    FALSE    5.0     NA
        8   27  Green        BS     TRUE    5.5     NA
        9   30    Red        BS    FALSE    5.9    190
        10  22 Yellow        HS       NA    5.9    150
        >
        > rm(list=ls())
      

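The read.table() function reads data from a specified file and returns the data as a data.frame. The important arguments used in the script above are: header, which indicates whether the first line contains the column names; sep, which is the field separator character; skip, which is the number of lines to skip before starting to read the data; and blank.lines.skip, which indicates whether blank lines in the input are ignored.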
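To see how the header, sep, skip, and blank.lines.skip arguments work together, here is a minimal sketch. It assumes a small made-up file (written to a temporary location) with a three-line comment banner followed by a blank line, similar in shape to sample-data-2.csv:

```r
# Hypothetical sample: three banner lines, a blank line, then csv data
path <- tempfile(fileext = '.csv')
writeLines(c('#', '# banner', '#', '',
             'age,color', '29,Red', '25,Orange'), path)

# skip=3 drops the banner; blank.lines.skip=TRUE ignores the empty line;
# header=TRUE treats 'age,color' as the column names; sep=',' splits fields
df <- read.table(path, header = TRUE, sep = ',', skip = 3,
                 blank.lines.skip = TRUE)
print(names(df))
print(nrow(df))

unlink(path)
```

The data.frame should end up with the two columns age and color and two rows of data.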

The read.csv() function is similar to the read.table() function except that its default argument values are suited for csv files: the header defaults to TRUE, the sep defaults to comma (,), etc.
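As a quick check of that relationship, the following sketch reads the same made-up temporary file with both functions. The file contents are an assumption for illustration only:

```r
# Hypothetical two-row csv file written to a temporary location
path <- tempfile(fileext = '.csv')
writeLines(c('age,color', '29,Red', '25,Orange'), path)

# read.csv() presets header=TRUE and sep=','
x <- read.csv(path)
# The equivalent read.table() call spells those arguments out
y <- read.table(path, header = TRUE, sep = ',')

print(all.equal(x, y))
unlink(path)
```

For simple data like this the two results should match. The functions also differ in a few other defaults (such as quote, fill, and comment.char), so they are not interchangeable for every file.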

The file() function returns a connection to the specified input file. The open argument specifies the mode of operation on the input file. The value r specifies the read mode.
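A connection that we open ourselves stays open until we close it, which is likely why the output above shows a 'closing unused connection' warning. The following is a small sketch, using a made-up temporary file, that reads from an open connection in two steps and then closes it explicitly:

```r
# Hypothetical csv file written to a temporary location
path <- tempfile(fileext = '.csv')
writeLines(c('age,color', '29,Red', '25,Orange'), path)

con <- file(path, open = 'r')     # open for reading in text mode
header <- readLines(con, n = 1)   # consume the first line only
# read.csv() continues from the current position of the open connection
rest <- read.csv(con, header = FALSE, col.names = c('age', 'color'))
close(con)                        # release the connection explicitly

print(header)
print(rest)
unlink(path)
```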

The gzfile() function returns a connection to the specified compressed file. The open argument specifies the mode of operation on the compressed file. The value r specifies the read mode.
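The same function can be used with the w mode to write a compressed file. The sketch below creates a small gzipped csv at a temporary location (the contents are made up for illustration) and reads it back:

```r
path <- tempfile(fileext = '.csv.gz')

# Write a small compressed csv (contents are made up for illustration)
out <- gzfile(path, open = 'w')
writeLines(c('age,color', '29,Red', '25,Orange'), out)
close(out)

# Passing an unopened connection lets read.csv() open and close it itself
df <- read.csv(gzfile(path))
print(df)
unlink(path)
```

Handing read.csv() an unopened connection is one way to avoid leaving connections open by accident.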

The url() function returns a connection to the specified URL. The URL schemes supported are http, https, ftp and file.
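To try url() without network access, here is a sketch that builds a file URL from an absolute path. It assumes a Unix-like system where absolute paths start with a forward slash, and the temporary file is made up for illustration:

```r
# Hypothetical csv file written to a temporary location
path <- tempfile(fileext = '.csv')
writeLines(c('age,color', '29,Red', '25,Orange'), path)

# Build a file:// URL from the absolute path (Unix-like paths assumed)
u <- paste0('file://', normalizePath(path))
df <- read.csv(url(u))
print(df)
unlink(path)
```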

Apply Functions

Assume we have a list with 3 elements, where each element is a vector with 5 elements. We want to find the max value for each element of the list. We can use the for-loop statement to achieve the result. But there is a much easier way to achieve the same using the apply family of functions in R.

Let us dive right in to explore the lapply and the sapply functions in R.

Let us create the following R script named apply_functions-1.R in RStudio:

apply_functions-1.R
        #
        # apply functions on a list of vectors
        #

        a <- list(x = round(rnorm(5, mean=5, sd=1), digits=1),
                  y = round(runif(5, min=1, max=10), digits=0),
                  z = sample(LETTERS, 5))
        print(a)

        b <- vector('list', length(a))
        c <- c('', '', '')
        names(b) <- names(a)
        names(c) <- names(a)
        for (i in seq_along(a)) {
          v <- a[[i]]
          m <- v[1]
          for (j in seq_along(v)) {
            if (v[j] > m) {
              m <- v[j]
            }
          }
          b[[i]] <- m
          c[i] <- m
        }

        print(b)

        lapply(a, max)

        print(c)

        sapply(a, max)

        rm(list=ls())
      

Execute the R script apply_functions-1.R in RStudio and the following is the output:

Output (apply_functions-1.R)

        > a <- list(x = round(rnorm(5, mean=5, sd=1), digits=1),
        +           y = round(runif(5, min=1, max=10), digits=0),
        +           z = sample(LETTERS, 5))
        > print(a)
        $x
        [1] 4.9 4.6 5.4 4.6 4.9

        $y
        [1] 9 3 3 7 7

        $z
        [1] "Z" "X" "F" "V" "T"

        >
        > b <- vector('list', length(a))
        > c <- c('', '', '')
        > names(b) <- names(a)
        > names(c) <- names(a)
        > for (i in seq_along(a)) {
        +   v <- a[[i]]
        +   m <- v[1]
        +   for (j in seq_along(v)) {
        +     if (v[j] > m) {
        +       m <- v[j]
        +     }
        +   }
        +   b[[i]] <- m
        +   c[i] <- m
        + }
        >
        > print(b)
        $x
        [1] 5.4

        $y
        [1] 9

        $z
        [1] "Z"

        >
        > lapply(a, max)
        $x
        [1] 5.4

        $y
        [1] 9

        $z
        [1] "Z"

        >
        > print(c)
            x     y     z
        "5.4"   "9"   "Z"
        >
        > sapply(a, max)
            x     y     z
        "5.4"   "9"   "Z"
        >
        > rm(list=ls())
      

As is evident from the R script above, using the for-loop to find the max of each element takes many more lines than a single call to lapply() or sapply().

The names() function gets or sets the names of the elements in a collection.
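A minimal sketch of both uses, with a made-up vector:

```r
v <- c(10, 20, 30)
print(names(v))                # no names have been set yet

names(v) <- c('x', 'y', 'z')   # set the names
print(names(v))                # get the names back
print(v[['y']])                # elements can now be looked up by name
```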

The seq_along() function generates the indices for the elements in a collection.
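One reason to prefer seq_along() over 1:length() in a for-loop is how it behaves on an empty collection. The vectors below are made up for illustration:

```r
v <- c('a', 'b', 'c')
print(seq_along(v))          # indices 1, 2, 3

empty <- character(0)
print(seq_along(empty))      # zero-length result, so a loop body never runs
print(1:length(empty))       # counts down from 1 to 0, so a loop runs twice
```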

The lapply() function stands for 'list apply'. It iterates over a list of elements and applies the specified function to each element of the list. The value returned is a list.
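The function passed to lapply() can be a built-in function or one we define on the spot. A small sketch with a made-up list:

```r
a <- list(x = c(4.9, 4.6, 5.4), y = c(9, 3, 7))

# Built-in function applied to every element; the result is always a list
r1 <- lapply(a, max)
# An anonymous function works the same way
r2 <- lapply(a, function(v) max(v) - min(v))

str(r1)
str(r2)
```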

The sapply() function stands for 'simplify list apply'. It iterates over a list of elements and applies the specified function to each element of the list. The value returned is a simplified vector if each result element is of length 1, a matrix if every result element has the same length greater than 1, and a list otherwise.
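The shape of the value returned by sapply() depends on the lengths of the individual results. The following sketch, using a made-up list, shows one case of each kind:

```r
a <- list(x = c(4.9, 4.6, 5.4), y = c(9, 3, 7))

s1 <- sapply(a, max)                    # each result has length 1
s2 <- sapply(a, range)                  # each result has length 2
s3 <- sapply(a, function(v) v[v > 5])   # results have different lengths

print(s1)   # a named vector
print(s2)   # a matrix with one column per list element
print(s3)   # a list, since the results cannot be simplified
```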

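Assume we have the data.frame loaded from sample-data.csv. We want to solve the following problems on the data.frame: find the average height and the average weight across all the rows, find the average weight for each color, find the average height and weight for each education level, and find the average height for each education level using a single function call.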

Let us create the following R script named apply_functions-2.R in RStudio:

apply_functions-2.R
        #
        # apply functions on a data.frame
        #

        a <- read.csv('sample-data.csv')

        b <- apply(a[c('height', 'weight')], 2, mean, na.rm=TRUE)
        print(b)

        c <- split(a[,'weight'], a$color)
        print(c)

        lapply(c, mean, na.rm=TRUE)

        sapply(c, mean, na.rm=TRUE)

        d <- split(a[,c('height', 'weight')], a$education)
        print(d)

        lapply(d, colMeans, na.rm=TRUE)

        sapply(d, colMeans, na.rm=TRUE)

        e <- tapply(a[,'height'], a[,'education'], mean)
        print(e)

        rm(list=ls())
      

Execute the R script apply_functions-2.R in RStudio and the following is the output:

Output (apply_functions-2.R)

        > a <- read.csv('sample-data.csv')
        >
        > b <- apply(a[c('height', 'weight')], 2, mean, na.rm=TRUE)
        > print(b)
        height weight
          5.44 161.25
        >
        > c <- split(a[,'weight'], a$color)
        > print(c)
        $Blue
        [1] 170 160 150

        $Green
        [1] NA

        $Orange
        [1] 170  NA

        $Red
        [1] 150 190

        $Yellow
        [1] 150 150

        >
        > lapply(c, mean, na.rm=TRUE)
        $Blue
        [1] 160

        $Green
        [1] NaN

        $Orange
        [1] 170

        $Red
        [1] 170

        $Yellow
        [1] 150

        >
        > sapply(c, mean, na.rm=TRUE)
          Blue  Green Orange    Red Yellow
           160    NaN    170    170    150
        >
        > d <- split(a[,c('height', 'weight')], a$education)
        > print(d)
        $BS
          height weight
        2    5.1    170
        3    5.0    170
        8    5.5     NA
        9    5.9    190

        $HS
           height weight
        7     5.0     NA
        10    5.9    150

        $MS
          height weight
        4    5.9    150
        5    5.1    160
        6    5.9    150

        $PHD
          height weight
        1    5.1    150

        >
        > lapply(d, colMeans, na.rm=TRUE)
        $BS
          height   weight
          5.3750 176.6667

        $HS
        height weight
          5.45 150.00

        $MS
            height     weight
          5.633333 153.333333

        $PHD
        height weight
           5.1  150.0

        >
        > sapply(d, colMeans, na.rm=TRUE)
                     BS     HS         MS   PHD
        height   5.3750   5.45   5.633333   5.1
        weight 176.6667 150.00 153.333333 150.0
        >
        > e <- tapply(a[,'height'], a[,'education'], mean)
        > print(e)
              BS       HS       MS      PHD
        5.375000 5.450000 5.633333 5.100000
        >
        > rm(list=ls())
      

The apply() function evaluates the specified function on either the rows or columns of the specified tabular collection (matrix or data.frame).
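The second argument selects the direction: 1 for rows and 2 for columns. A minimal sketch on a made-up matrix:

```r
m <- matrix(1:6, nrow = 2)     # 2 rows, 3 columns, filled column-wise

row_sums <- apply(m, 1, sum)   # 1 means one result per row
col_sums <- apply(m, 2, sum)   # 2 means one result per column

print(row_sums)
print(col_sums)
```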

The split() function segregates the specified collection into groups based on the specified set of factors.
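Here is a small sketch of split() on two made-up vectors, where the second vector supplies the group for each value of the first:

```r
weight <- c(150, 170, 170, 190)
color  <- c('Red', 'Orange', 'Blue', 'Red')

groups <- split(weight, color)   # one vector per distinct color
print(groups)
```

The result is a named list, with one element for each distinct value of color.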

The colMeans() function is the shorthand form for the function apply(data, 2, mean), which computes the average on the columns of data.
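The two forms can be compared side by side. The data.frame below is made up for illustration and includes an NA to show that na.rm is passed through in both cases:

```r
df <- data.frame(height = c(5.1, 5.0, 5.5),
                 weight = c(170, NA, 190))

m1 <- colMeans(df, na.rm = TRUE)
m2 <- apply(df, 2, mean, na.rm = TRUE)

print(m1)
print(all.equal(m1, m2))
```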

The tapply() function is similar to the combination of split() and sapply() functions working together.
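To make the comparison concrete, the sketch below computes the same group averages both ways on two made-up vectors:

```r
height    <- c(5.1, 5.1, 5.0, 5.9)
education <- c('PHD', 'BS', 'BS', 'MS')

t1 <- tapply(height, education, mean)
t2 <- sapply(split(height, education), mean)

print(t1)
print(t2)
```

The values and names agree. One difference is that tapply() returns a one-dimensional array, while sapply() returns a named vector.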

More to come in Part-5 ...

References

R Programming Quick Notes :: Part - 1

R Programming Quick Notes :: Part - 2

R Programming Quick Notes :: Part - 3