R Programming Quick Notes :: Part - 1


Bhaskar S 03/11/2017


Overview

R is an open-source programming language that is one of the favorites amongst statisticians and data scientists with the following characteristics:

Installation and Setup

We will assume an Ubuntu 16.04 based platform with the user id alice.

Download and install the following software:

Hands-on With R - I

The following are the basic atomic data types supported by R and used in the script below: logical, numeric, integer, character, and complex.

Let us create the following R script named basic.R in RStudio:

basic.R
#
# Basic atomic types
#

a <- TRUE
print(a)
class(a)

b <- 2.5
print(b)
class(b)

c <- 5L
print(c)
class(c)

d <- 'm'
print(d)
class(d)

e <- 'hello'
print(e)
class(e)

f <- 2 +5i
print(f)
class(f)
      

To execute the R script basic.R, select all the lines from the script and press the CTRL+<Enter> keys. The following is the output:

Output (basic.R)

> a <- TRUE
> print(a)
[1] TRUE
> class(a)
[1] "logical"
>
> b <- 2.5
> print(b)
[1] 2.5
> class(b)
[1] "numeric"
>
> c <- 5L
> print(c)
[1] 5
> class(c)
[1] "integer"
>
> d <- 'm'
> print(d)
[1] "m"
> class(d)
[1] "character"
>
> e <- 'hello'
> print(e)
[1] "hello"
> class(e)
[1] "character"
>
> f <- 2 +5i
> print(f)
[1] 2+5i
> class(f)
[1] "complex"
      

The operator <- is the assignment operator.

The print() function prints the value of the specified object.

The class() function returns the class (the high-level type) of the specified object.
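As a small supplementary sketch (not part of basic.R), the class of an object can be compared with its internal storage type, which the typeof() function reports. The variable names x, y, and z below are made up for illustration.

```r
# class() reports the high-level class, typeof() the internal storage type
x <- 2.5
y <- 5L
z <- 2 + 5i

class(x)    # "numeric"
typeof(x)   # "double"

class(y)    # "integer"
typeof(y)   # "integer"

class(z)    # "complex"

# The is.* predicates test for a type without comparing strings
is.numeric(y)     # TRUE, integers also count as numeric
is.character(x)   # FALSE
```

Note that a value such as 2.5 has the class numeric but is stored as a double, which is why the two functions can disagree.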

The following are the basic collection types supported by R and used in the script below: vector, list, matrix, and data frame.

Let us create the following R script named collection.R in RStudio:

collection.R
#
# Basic collection types
#

a <- c('A', 'B', 'C', 'D', 'E')
print(a)
class(a)

b <- 1:5
print(b)
class(b)

c <- seq(5, 10)
print(c)
class(c)

d <- seq(1, 10, 2)
print(d)
class(d)

e <- list(1, 'Two', 3L, FALSE)
print(e)
class(e)

f <- matrix(1:8, 2, 4)
print(f)
class(f)

g <- data.frame(a = 1:4, b = c('A', 'B', 'C', 'D'), c = c(TRUE, FALSE, FALSE, TRUE))
print(g)
class(g)
      

Execute the R script collection.R in RStudio and the following is the output:

Output (collection.R)

> a <- c('A', 'B', 'C', 'D', 'E')
> print(a)
[1] "A" "B" "C" "D" "E"
> class(a)
[1] "character"
>
> b <- 1:5
> print(b)
[1] 1 2 3 4 5
> class(b)
[1] "integer"
>
> c <- seq(5, 10)
> print(c)
[1]  5  6  7  8  9 10
> class(c)
[1] "integer"
>
> d <- seq(1, 10, 2)
> print(d)
[1] 1 3 5 7 9
> class(d)
[1] "numeric"
>
> e <- list(1, 'Two', 3L, FALSE)
> print(e)
[[1]]
[1] 1

[[2]]
[1] "Two"

[[3]]
[1] 3

[[4]]
[1] FALSE

> class(e)
[1] "list"
>
> f <- matrix(1:8, 2, 4)
> print(f)
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
> class(f)
[1] "matrix"
>
> g <- data.frame(a = 1:4, b = c('A', 'B', 'C', 'D'), c = c(TRUE, FALSE, FALSE, TRUE))
> print(g)
  a b     c
1 1 A  TRUE
2 2 B FALSE
3 3 C FALSE
4 4 D  TRUE
> class(g)
[1] "data.frame"
      

The c() function (short for combine, often read as concatenate) allows us to create a vector from its arguments.
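Because every element of a vector must have the same type, c() converts mixed inputs to the most general type among them. The following is a minimal sketch of that behavior; the names v1 through v4 are only illustrative.

```r
# c() coerces mixed inputs to the most general type among them
v1 <- c(1L, 2.5)          # integer + double     -> double
v2 <- c(1, TRUE, FALSE)   # double + logical     -> double (TRUE is 1, FALSE is 0)
v3 <- c(1, 'a', TRUE)     # anything + character -> character

class(v1)   # "numeric"
class(v2)   # "numeric"
class(v3)   # "character"

# c() also flattens nested vectors into a single vector
v4 <- c(c(1, 2), 3, c(4, 5))
length(v4)  # 5
```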

The m:n expression is another way to create a vector from a sequence where m is the start of the sequence and n is the end of the sequence.

The seq() function is yet another way to create a vector by generating a sequence. The first argument specifies the start of the sequence, the second argument specifies the end of the sequence, and the third argument, if specified, is the increment.
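A few more ways to generate sequences are sketched below, using the named arguments by and length.out of seq() and the helper seq_len(). The names s1 through s4 are only illustrative.

```r
# The m:n expression also counts down when m > n
s1 <- 5:1                         # 5 4 3 2 1

# Named arguments make the intent of seq() explicit
s2 <- seq(1, 10, by = 2)          # 1 3 5 7 9
s3 <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# seq_len(n) generates the integers 1 through n
s4 <- seq_len(4)                  # 1 2 3 4

class(s1)   # "integer"
class(s2)   # "numeric"
```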

The list() function allows us to create a list of data elements of different types.
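The following sketch shows one way to read elements back out of a list. The single-bracket and double-bracket forms behave differently, and elements can optionally be named; the list p and its element names are made up for illustration.

```r
e <- list(1, 'Two', 3L, FALSE)

# [[ ]] extracts the element itself, [ ] returns a sub-list
e[[2]]          # "Two"
class(e[[2]])   # "character"
class(e[2])     # "list"
length(e)       # 4

# Elements can be named and then accessed by name
p <- list(name = 'alice', age = 30L)
p$name          # "alice"
p[['age']]      # 30
names(p)        # "name" "age"
```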

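The matrix() function allows us to create a matrix from the specified data elements with the specified number of rows (second argument) and columns (third argument). By default, the elements are filled in column by column, as seen in the output above.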
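The fill order can be changed with the byrow argument of matrix(). The following is a minimal sketch comparing the two fill orders; the names m1 and m2 are only illustrative.

```r
# Default: elements fill the matrix column by column
m1 <- matrix(1:8, nrow = 2, ncol = 4)

# byrow = TRUE: elements fill the matrix row by row
m2 <- matrix(1:8, nrow = 2, ncol = 4, byrow = TRUE)

dim(m1)     # 2 4  (rows, columns)

# Elements are indexed as [row, column]
m1[2, 3]    # 6
m2[2, 3]    # 7

# Leaving an index empty selects the whole row or column
m2[1, ]     # 1 2 3 4
m1[, 1]     # 1 2
```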

The data.frame() function allows us to create tabular data where each column is a vector of a certain data type and all the columns have the same length.
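The following sketch shows a few common ways to inspect and subset a data frame like the one in collection.R. The stringsAsFactors = FALSE argument is added here as an assumption so that the column b stays a character vector regardless of the R version in use.

```r
g <- data.frame(a = 1:4,
                b = c('A', 'B', 'C', 'D'),
                c = c(TRUE, FALSE, FALSE, TRUE),
                stringsAsFactors = FALSE)

nrow(g)        # 4
ncol(g)        # 3

# A column is accessed with $ and is an ordinary vector
g$b            # "A" "B" "C" "D"
class(g$a)     # "integer"

# Rows can be selected with a logical vector: keep rows where c is TRUE
g[g$c, ]       # rows 1 and 4
g[g$c, 'a']    # 1 4
```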

In the above R script collection.R, we have assigned (or bound) values to the variables (or symbols) a through g. In R, these name-value pairs are stored in the global environment of the current working session (referred to as .GlobalEnv). Think of the global environment as the working memory consisting of a collection of R objects. The global environment is initialized when R is first started.

Execute the following R function to take a peek into what symbols are in the global environment:

ls()

The following is the typical output:

Output (ls())

> ls()
[1] "a" "b" "c" "d" "e" "f" "g"
      

To remove a particular object from the global environment working memory, execute the following R function:

rm(object)

For example, to remove the object a, execute the following R function:

rm(a)

To remove all the objects from the global environment working memory, execute the following R function:

rm(list = ls())
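To experiment with ls() and rm() without wiping the objects in your own session, the same calls can be pointed at a separate environment through their envir argument. The following is a minimal sketch; the environment name scratch and the objects x and y are made up for illustration.

```r
# Create an empty, throwaway environment
scratch <- new.env()

# Bind two objects inside it
assign('x', 1:3, envir = scratch)
assign('y', 'hello', envir = scratch)

ls(envir = scratch)                   # "x" "y"
exists('x', envir = scratch)          # TRUE

# Remove a single object by name
rm('x', envir = scratch)
ls(envir = scratch)                   # "y"

# Remove everything that is left
rm(list = ls(envir = scratch), envir = scratch)
length(ls(envir = scratch))           # 0
```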

To display the current working directory, execute the following R function:

getwd()

The following is the typical output:

Output (getwd())

> getwd()
[1] "/home/alice/Projects/R"
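
The working directory can also be changed with the setwd() function. The following is a small sketch that switches to the session's temporary directory and then restores the original one; the variable name old_dir is only illustrative, and the paths printed will differ from system to system.

```r
# Remember the current working directory
old_dir <- getwd()

# Switch to the temporary directory of this R session
setwd(tempdir())
getwd()

# Restore the original working directory
setwd(old_dir)
getwd()
```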
      

Many operations and functions in R are vectorized, meaning they work not only on a single data value but also on a whole collection of data values, element by element, in a single call.
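As a minimal sketch of what this means in practice, the following compares a vectorized expression with an explicit loop and shows how a shorter vector is recycled to match a longer one. The variable names are only illustrative.

```r
x <- c(1, 2, 3, 4)

# Arithmetic and comparisons apply to every element
x * 2            # 2 4 6 8
x > 2            # FALSE FALSE TRUE TRUE

# A shorter vector is recycled to the length of the longer one
x + c(10, 20)    # 11 22 13 24

# The loop below does the same work as the single expression x^2
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}
squares_vec <- x^2

identical(squares_loop, squares_vec)   # TRUE
```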

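The following are some of the commonly used basic functions supported by R and used in the script below: abs(), sqrt(), ceiling(), floor(), round(), substr(), paste(), tolower(), toupper(), and rep().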

Let us create the following R script named vector.R in RStudio to demonstrate the functions mentioned above:

vector.R
#
# Vectorized operations and functions
#

a <- seq(11, 20)
b <- seq(-5, -50, -5)
c <- c('abc', 'DEFGH', 'iJKlmnop', 'qrSTUVWxyz')

d <- a + b
print(d)

e <- b / a
print(e)

abs(b)

f <- sqrt(a)

ceiling(f)

floor(f)

round(f)

substr(c, 2, 3)

paste('Welcome', 'to', 'R', 'Programming')
paste('Weekdays are - Mon', 'Tue', 'Wed', 'Thu', 'Fri', sep = ',')

tolower(c)

toupper(c)

rep(1:5, 3)
      

Execute the R script vector.R in RStudio and the following is the output:

Output (vector.R)

> a <- seq(11, 20)
> b <- seq(-5, -50, -5)
> c <- c('abc', 'DEFGH', 'iJKlmnop', 'qrSTUVWxyz')
>
> d <- a + b
> print(d)
 [1]   6   2  -2  -6 -10 -14 -18 -22 -26 -30
>
> e <- b / a
> print(e)
 [1] -0.4545455 -0.8333333 -1.1538462 -1.4285714 -1.6666667 -1.8750000 -2.0588235 -2.2222222 -2.3684211 -2.5000000
>
> abs(b)
 [1]  5 10 15 20 25 30 35 40 45 50
>
> f <- sqrt(a)
>
> ceiling(f)
 [1] 4 4 4 4 4 4 5 5 5 5
>
> floor(f)
 [1] 3 3 3 3 3 4 4 4 4 4
>
> round(f)
 [1] 3 3 4 4 4 4 4 4 4 4
>
> substr(c, 2, 3)
[1] "bc" "EF" "JK" "rS"
>
> paste('Welcome', 'to', 'R', 'Programming')
[1] "Welcome to R Programming"
> paste('Weekdays are - Mon', 'Tue', 'Wed', 'Thu', 'Fri', sep = ',')
[1] "Weekdays are - Mon,Tue,Wed,Thu,Fri"
>
> tolower(c)
[1] "abc"        "defgh"      "ijklmnop"   "qrstuvwxyz"
>
> toupper(c)
[1] "ABC"        "DEFGH"      "IJKLMNOP"   "QRSTUVWXYZ"
>
> rep(1:5, 3)
 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
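
The paste() and rep() functions used above are themselves vectorized and accept a few more arguments. The following is a supplementary sketch (not part of vector.R) showing paste0(), the collapse argument of paste(), and the each argument of rep(); the variable names are made up for illustration.

```r
# paste0() joins its arguments with no separator, element by element
ids <- paste0('x', 1:3)                           # "x1" "x2" "x3"

# sep is placed between arguments, collapse joins the result into one string
tags <- paste('id', 1:2, sep = '-')               # "id-1" "id-2"
days <- paste(c('Mon', 'Tue', 'Wed'), collapse = ', ')   # "Mon, Tue, Wed"

# each repeats every element before moving on to the next
r <- rep(1:2, each = 2)                           # 1 1 2 2

# nchar() returns the number of characters in each string
n <- nchar(c('abc', 'DEFGH'))                     # 3 5
```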
      

Missing data values in R are represented as NA (not available), while NaN (not a number) represents an undefined numeric result such as 0/0. A NaN value is also considered an NA value, but an NA is never a NaN.

Let us create the following R script named missing.R in RStudio to demonstrate the missing values as mentioned above:

missing.R
#
# Missing values
#

a <- c(5, 10, NA, 20, NA, 30)
b <- c(5, 10, NaN, 20, NA, 30)

is.na(a)

is.nan(a)

is.na(b)

is.nan(b)
      

Execute the R script missing.R in RStudio and the following is the output:

Output (missing.R)

> a <- c(5, 10, NA, 20, NA, 30)
> b <- c(5, 10, NaN, 20, NA, 30)
>
> is.na(a)
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE
>
> is.nan(a)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
>
> is.na(b)
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE
>
> is.nan(b)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE
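
Missing values propagate through most computations, so they usually have to be skipped or filtered out explicitly. The following is a minimal sketch using the na.rm argument of sum() and mean() together with is.na(); the variable names are only illustrative.

```r
a <- c(5, 10, NA, 20, NA, 30)

# Any NA in the input makes the result NA
sum(a)                    # NA

# na.rm = TRUE drops the missing values before computing
sum(a, na.rm = TRUE)      # 65
mean(a, na.rm = TRUE)     # 16.25

# Keep only the values that are not missing
clean <- a[!is.na(a)]     # 5 10 20 30

# NaN arises from undefined arithmetic and also counts as NA
is.nan(0 / 0)             # TRUE
is.na(0 / 0)              # TRUE
is.nan(NA)                # FALSE
```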
      

More to come in Part-2 ...