R Language Basics

R is a free software environment for statistical computing and graphics.

Install R

Download R

Download RStudio, an application (IDE) for writing R programs

Use Swirl

Swirl is an interactive, prompt-based way to learn about R and other data science topics.

To start, open R in a terminal with the R command, or open RStudio

Install swirl

install.packages("swirl")

Load the swirl library

library(swirl)

Then start with swirl()

swirl()

Everything else will be guided

R Language

The R interpreter works much like many others in that you can do basic maths with it.

Syntax

Assignment: assigning a value to a variable is done with <- (= also works at the top level, but <- is the convention)

Data Structures

Any object containing data is a data structure

The simplest data structure is a vector. A single number is a vector of length 1.

A vector is created with the c() function, short for concatenate or combine

z = c(1.1, 4.5, 6)

You can concatenate vectors with c:

c(z, 255, z)

Numeric operations on vectors are applied to all elements in the vector. When arithmetic is done on vectors of the same length, each operation is applied element by element. If they are not the same length, the shorter vector is recycled to the length of the longer one.

Behind the scenes, R recycles a single number (a vector of length 1) to match the longer vector.

```
z <- c(5, 10, 15)
z * 2 + 100

# same as

z * c(2, 2, 2) + c(100, 100, 100)
```
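Recycling also applies when both vectors are longer than one element. The sketch below uses made-up vectors (long and short are illustrative names) to show the shorter one being repeated, and what happens when the lengths do not divide evenly.

```r
# Recycling: the shorter vector is repeated to match the longer one.
long  <- c(1, 2, 3, 4)
short <- c(10, 20)

recycled <- long + short
# same as: c(1, 2, 3, 4) + c(10, 20, 10, 20)
recycled   # 11 22 13 24

# When the longer length is not a multiple of the shorter one,
# R still recycles but emits a warning (suppressed here).
uneven <- suppressWarnings(c(1, 2, 3) + c(10, 20))
uneven     # 11 22 13
```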

Arithmetic Operators

  • +, -, /, *
  • ^: to power of
  • sqrt(): square root
  • abs(): absolute value

Getting Help

To get help on a function type: ? and the function name without calling it

Eg. ?c

Dollar Operator

Grab specific items from output with the $ operator

eg file.info("mytest.R")$mode

Workspace and Files

Get working directory getwd()

List all objects in local workspace ls()

List all files in directory: dir() or list.files()

Find what arguments a function takes: args(list.files). Pass the function name only, without parentheses, so that the function is not called

Create a directory: dir.create('testdir')

Set the working directory: setwd('testdir')

Create a file: file.create('mytest.R')

Check if a file exists: file.exists("mytest.R")

File info: file.info("mytest.R")

Rename a file: file.rename('mytest.R', 'mytest2.R')

Copy a file: file.copy('mytest2.R', 'mytest3.R')

Build a path from a single file name (it is returned unchanged): file.path('mytest3.R')

Create a path to a folder or file: file.path('folder1', 'folder2')

Create directory with recursive folders: dir.create(file.path('testdir2', 'testdir3'), recursive = TRUE)

Top tip: It is often helpful to save the settings that you had before you began an analysis and then go back to them at the end. This trick is often used within functions; you save, say, the par() settings that you started with, mess around a bunch, and then set them back to the original values at the end. This isn’t the same as what we have done here, but it seems similar enough to mention.
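A minimal sketch of that save-and-restore pattern. It uses options() rather than par() so that it runs without a graphics device; par() works the same way, returning the old values when new ones are set. The function name format_rounded is hypothetical.

```r
# Hypothetical helper: format a number with fewer digits,
# then put the global setting back the way it was.
format_rounded <- function(x) {
  old <- options(digits = 3)   # options() returns the previous values
  on.exit(options(old))        # restore them when the function exits
  format(x)
}

before <- getOption("digits")
rounded <- format_rounded(pi)  # "3.14"
after <- getOption("digits")   # same as before
```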

Sequences

Create a sequence of numbers: 1:20

Get a sequence of real numbers

> pi:10
[1] 3.141593 4.141593 5.141593 6.141593 7.141593 8.141593 9.141593

It stops before going above 10, incrementing by 1 each time

Returns a vector

Go back / decrement: 15:1

Help on special chars

Use backticks

?`:`

Use seq() for more control

> seq(1, 20)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Get 30 equally spaced values between 2 numbers (stored as my_seq for the examples that follow)

```
my_seq <- seq(5, 10, length = 30)
my_seq

[1] 5.000000 5.172414 5.344828 5.517241 5.689655 5.862069 6.034483
[8] 6.206897 6.379310 6.551724 6.724138 6.896552 7.068966 7.241379
[15] 7.413793 7.586207 7.758621 7.931034 8.103448 8.275862 8.448276
[22] 8.620690 8.793103 8.965517 9.137931 9.310345 9.482759 9.655172
[29] 9.827586 10.000000
```

Check the length of a vector

> length(my_seq)
[1] 30

Make a sequence of indices with the same length as another vector

1:length(my_seq)
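Base R also provides seq_along() for the same job. The sketch below compares the two and shows one case where they differ, an empty vector; the variable empty is made up for illustration.

```r
my_seq <- seq(5, 10, length.out = 30)

idx <- seq_along(my_seq)     # 1, 2, ..., 30, same as 1:length(my_seq)

empty <- numeric(0)
countdown <- 1:length(empty) # 1 0, because 1:0 counts downwards
no_index  <- seq_along(empty) # integer(0), usually what you want
```

Because 1:length(x) yields two indices for an empty vector, seq_along(x) tends to be the safer choice inside loops.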

There are often several approaches to solving the same problem, particularly in R. Simple approaches that involve less typing are generally best. It’s also important for your code to be readable, so that you and others can figure out what’s going on without too much hassle.

Replicate with rep()

A vector of 40 zeroes

> rep(0, times = 40)
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[40] 0

Replicate a vector 10 times

> rep(c(0, 1, 2), times = 10)
[1] 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2

Create 10 of each in sequence

> rep(c(0, 1, 2), each = 10)
[1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2

## Vectors

The simplest and most common data structure

  • atomic vectors - single data type
  • lists - contain multiple data types

Logical vectors contain the values TRUE, FALSE and NA (Not Available)

    num_vect <- c(0.5, 55, -10, 6)

    tf <- num_vect < 1

    tf
    [1]  TRUE FALSE  TRUE FALSE

Logical operators:

  • Equality: ==
  • Inequality: !=
  • Comparison: <, >, <=, >=
  • Or (union): A | B
  • And (intersection): A & B
  • Not (negation): !A

Character vectors

    my_char <- c("My", "name", "is")

Concatenate into a character vector of length 1

    paste(my_char, collapse = " ")

Append a value:

    my_name <- c(my_char, "stephen")

Joining an integer vector and a character vector, both of length 3, element by element:

    paste(1:3, c("X", "Y", "Z"), sep="")

If they are not of equal length, the shorter vector is recycled

Printing letters with vector recycling:

    > paste(LETTERS, 1:4, sep="-")
    [1] "A-1" "B-2" "C-3" "D-4" "E-1" "F-2" "G-3" "H-4" "I-1" "J-2" "K-3" "L-4" "M-1"
    [14] "N-2" "O-3" "P-4" "Q-1" "R-2" "S-3" "T-4" "U-1" "V-2" "W-3" "X-4" "Y-1" "Z-2"

Missing values

Missing values play an important role in statistics and data analysis. Often, missing values must not be ignored, but rather they should be carefully studied to see if there’s an underlying pattern or cause for their missingness.

In R, NA is used to represent any value that is ‘not available’ or ‘missing’ (in the statistical sense).

Any operation involving NA generally yields NA as the result

> x <- c(44, NA, 5, NA)
> x
[1] 44 NA  5 NA
> x * 3
[1] 132  NA  15  NA

Create a vector with 1000 draws from a standard normal distribution

y <- rnorm(1000)

Then a vector of 1000 NA’s

z <- rep(NA, 1000)

Select 100 at random from both:

my_data <- sample(c(y,z), 100)

Check which are NA in a new vector using is.na()

my_na <- is.na(my_data)

Comparing with my_data == NA returns a vector of all NAs

The reason you got a vector of all NAs is that NA is not really a value, but just a placeholder for a quantity that is not available. Therefore the logical expression is incomplete and R has no choice but to return a vector of the same length as my_data that contains all NAs.

> 5 == NA
[1] NA

The key takeaway is to be cautious when using logical expressions anytime NAs might creep in

Total number of TRUE values, which is the number of NAs in my_data:

> sum(my_na)
[1] 45
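The NA behaviour described above can be checked on a small vector. This is a sketch with made-up values; na.rm is a standard argument of mean() and similar summary functions.

```r
x <- c(44, NA, 5, NA)

cmp <- x == NA               # NA NA NA NA, never TRUE or FALSE
missing <- is.na(x)          # FALSE TRUE FALSE TRUE
n_missing <- sum(missing)    # 2, because TRUE counts as 1

m_all  <- mean(x)                # NA, the missing values propagate
m_kept <- mean(x, na.rm = TRUE)  # 24.5, the mean of 44 and 5
```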

Not a Number

NaN (not a number) is another kind of missing value

> 0 / 0
[1] NaN

In R, Inf stands for infinity

> Inf - Inf
[1] NaN
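NA and NaN are related but not interchangeable. A short sketch of how the base R test functions treat them (the vector vals is illustrative):

```r
vals <- c(1, NA, 0 / 0, Inf - Inf)

na_flags  <- is.na(vals)    # FALSE TRUE TRUE TRUE   (NaN counts as missing)
nan_flags <- is.nan(vals)   # FALSE FALSE TRUE TRUE  (NA is not NaN)

# is.finite() is FALSE for Inf, NA and NaN alike
finite_flags <- is.finite(c(1, Inf, NA, NaN))  # TRUE FALSE FALSE FALSE
```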

Subsetting Vectors

Selecting first 10 elements of a vector

> x[1:10]
[1]  3.0949871  0.1960158  0.2084758         NA -0.2614606         NA -0.4809142
[8]         NA         NA  0.6007584

Getting all results that are not NA:

y <- x[!is.na(x)]

Get a vector of all positive values

y[y > 0]

Since NA is not a value, but rather a placeholder for an unknown quantity, the expression NA > 0 evaluates to NA

Only values of x that are both non-missing AND greater than zero.

> x[!is.na(x) & x > 0]
[1] 3.09498711 0.19601584 0.20847579 0.60075844 1.72316551 0.87532455 0.27598833
[8] 0.58037652 0.10702578 0.08164542 1.65696398

Many programming languages use what’s called zero-based indexing, which means that the first element of a vector is considered element 0. R uses one-based indexing, which (you guessed it!) means the first element of a vector is considered element 1

Get the 3rd, 5th and 7th elements of vector

x[c(3, 5, 7)]

You can still ask for the 0th element; no error is thrown and you get an empty vector back

> x[0]
numeric(0)

Getting an element that does not exist:

> x[3000]
[1] NA

To get all elements except a few, use negative indices

x[c(-2, -10)]

The shorthand for the above is:

x[-c(2, 10)]
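The index styles covered so far, side by side on a small made-up vector:

```r
x <- c(10, 20, 30, 40, 50)

picked  <- x[c(1, 3)]    # positive indices keep elements: 10 30
dropped <- x[-c(1, 3)]   # negative indices drop elements: 20 40 50
large   <- x[x > 25]     # a logical vector keeps the TRUE positions: 30 40 50
beyond  <- x[6]          # past the end: NA

# Positive and negative indices cannot be mixed in one call.
mixed <- try(x[c(-1, 2)], silent = TRUE)
inherits(mixed, "try-error")   # TRUE
```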

Named vector

vect <- c(foo = 11, bar = 2, norf = NA)
> vect
foo  bar norf 
11    2   NA 

Get just the names of a named vector

> names(vect)
[1] "foo"  "bar"  "norf"

You can give names to elements retrospectively

vect2 <- c(11, 2, NA)
names(vect2) <- c("foo", "bar", "norf")

To check whether two vectors are the same, use identical()

> identical(vect, vect2)
[1] TRUE

Get a named element

vect["bar"]

Matrices and Data Frames

Both represent rectangular data types, meaning that they are used to store tabular data, with rows and columns.

  • matrices: can only contain a single class of data
  • data frames: can consist of many different classes of data

Find the dimensions of a variable

dim() function tells us how many dimensions an object has

> my_vector <- 1:20
> my_vector
[1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> dim(my_vector)
NULL

A vector does not have a dimension so it is NULL

Get length:

> length(my_vector)
[1] 20

The dim() function allows you to get OR set the dim attribute for an R object.

Give my_vector 4 rows and 5 columns:

> dim(my_vector) <- c(4, 5)

You can also check the result with attributes():

> attributes(my_vector)
$dim
[1] 4 5

Now it is a matrix, with rows and columns

> my_vector
    [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

Check the class of the object (R 4.0 and later print "matrix" "array")

> class(my_vector)
[1] "matrix"

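Open docs for matrix:

> ?matrix

Create the same matrix directly with matrix() (my_matrix, used below, is a copy of the matrix built above)

> my_matrix <- my_vector
> my_matrix2 <- matrix(1:20, nrow = 4, ncol = 5)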

Column combine for named rows

> patients <- c("Bill", "Gina", "Kelly", "Sean")
> cbind(patients, my_matrix)
    patients                       
[1,] "Bill"   "1" "5" "9"  "13" "17"
[2,] "Gina"   "2" "6" "10" "14" "18"
[3,] "Kelly"  "3" "7" "11" "15" "19"
[4,] "Sean"   "4" "8" "12" "16" "20"

cbind() returns a matrix, which can hold only one class of data, so every value is coerced to character

So we need a data frame

my_data <- data.frame(patients, my_matrix)
> my_data
patients X1 X2 X3 X4 X5
1     Bill  1  5  9 13 17
2     Gina  2  6 10 14 18
3    Kelly  3  7 11 15 19
4     Sean  4  8 12 16 20

Confirm the class:

> class(my_data)
[1] "data.frame"

Add column names

> cnames <- c("patient", "age", "weight", "bp", "rating", "test")
> colnames(my_data) <- cnames
> my_data
patient age weight bp rating test
1    Bill   1      5  9     13   17
2    Gina   2      6 10     14   18
3   Kelly   3      7 11     15   19
4    Sean   4      8 12     16   20
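A data frame can also be built directly from named columns, which skips the renaming step. This is a sketch with made-up values; patients_df and its columns are illustrative names, and stringsAsFactors = FALSE is set so the result is the same on older versions of R.

```r
patients_df <- data.frame(
  patient = c("Bill", "Gina", "Kelly", "Sean"),
  age     = c(1, 2, 3, 4),
  weight  = c(5, 6, 7, 8),
  stringsAsFactors = FALSE
)

# Each column keeps its own class, unlike in a matrix.
col_classes <- sapply(patients_df, class)

# Columns are reached with $, rows can be filtered with a logical vector.
ages  <- patients_df$age
older <- patients_df[patients_df$age > 2, ]   # the rows for Kelly and Sean
```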

Logic

The basics of logic are not covered here.

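In R:

  • & evaluates AND element by element across the entire vector
  • && evaluates AND on single (length-one) values only

> TRUE & c(TRUE, FALSE, FALSE)
[1]  TRUE FALSE FALSE

and, in versions of R before 4.3.0, where && silently used only the first element of a longer vector:

> TRUE && c(TRUE, FALSE, FALSE)
[1] TRUE

Since R 4.3.0 this call is an error, because && and || no longer accept vectors longer than one.

  • | evaluates OR element by element across the entire vector
  • || evaluates OR on single (length-one) values only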

All AND operators are evaluated before OR operators
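A quick check of that precedence rule, using only length-one values so it runs on any version of R:

```r
# && binds tighter than ||, so this reads as TRUE || (FALSE && FALSE)
no_parens <- TRUE || FALSE && FALSE      # TRUE

# Parentheses change the grouping and the answer
with_parens <- (TRUE || FALSE) && FALSE  # FALSE

# The vectorised forms follow the same order: & before |
vec <- c(TRUE, FALSE) | c(FALSE, FALSE) & c(FALSE, TRUE)  # TRUE FALSE
```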

There is an isTRUE() function

  • isTRUE() will only return TRUE if the argument passed to it is a single TRUE value

    > isTRUE(NA)
    [1] FALSE
    > isTRUE(3)
    [1] FALSE

xor() function stands for exclusive OR

> xor(TRUE, TRUE)
[1] FALSE

Get a random sample of ints 1 to 10

> ints <- sample(10)
> ints
[1]  4  6  8  7  2  9 10  5  3  1

which() function takes a logical vector as an argument and returns the indices of the vector that are TRUE

Finding which ints are greater than 7

> which(ints > 7)
[1] 3 6 7
  • any() function will return TRUE if one or more of the elements in the logical vector is TRUE
  • all() function will return TRUE if every element in the logical vector is TRUE

    > any(ints < 0)
    [1] FALSE
    > all(ints > 0)
    [1] TRUE

Functions

> Sys.Date()
[1] "2018-03-16"

Get the mean()

> mean(c(2, 4, 5))
[1] 3.666667

Writing a function:

function_name <- function(arg1, arg2){
    # Manipulate arguments in some way
    # Return a value
}

Use the function:

function_name(value1, value2)

Note: An explicit return() is not required. The value of the last expression evaluated is returned!
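A small sketch of both styles. The function names range_width and safe_sqrt are made up for illustration.

```r
# The last expression evaluated is the return value.
range_width <- function(x, na.rm = TRUE) {
  limits <- range(x, na.rm = na.rm)
  limits[2] - limits[1]
}

# return() is still available, for example to exit early.
safe_sqrt <- function(x) {
  if (x < 0) {
    return(NA_real_)
  }
  sqrt(x)
}

width <- range_width(c(3, 9, NA, 5))   # 6
root  <- safe_sqrt(16)                 # 4
bad   <- safe_sqrt(-1)                 # NA
```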

John Chambers, the creator of S (the language R is based on), said:

To understand computations in R, two slogans are helpful: 1. Everything that exists is an object. 2. Everything that happens is a function call.

You can view a function’s source code by just typing the function name without parentheses

Setting default arguments

remainder <- function(num, divisor = 2) {
    num %% divisor
}

You can use named parameters:

remainder(divisor = 11, num = 5)

Check what arguments a function expects with:

> args(remainder)
function (num, divisor = 2)

You can pass functions as arguments

evaluate <- function(func, dat){
    func(dat)
}

Running it:

> evaluate(sd, c(1.4, 3.6, 7.9, 8.8))
[1] 3.514138

Anonymous functions:

> evaluate(function(x){x + 1}, 6)
[1] 7

paste() function: concatenates vectors after converting them to character

Its first argument is an ellipsis (...), which allows an indefinite number of arguments to be passed into a function. Any number of strings can be passed to paste() and a concatenated string is returned.

Rule of thumb in R programming: arguments that come after an ellipsis must be passed by name, so they should normally have default values.

Unpacking arguments from an ellipsis (inside a function):

args <- list(...)

alpha <- args[["alpha"]]
beta  <- args[["beta"]]
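Put together as a complete function, the unpacking might look like the sketch below. Both describe and telegram are hypothetical names; the second function shows an argument placed after the ellipsis, which has a default and must be passed by name.

```r
# Unpack named values from the ellipsis.
describe <- function(...) {
  args  <- list(...)
  alpha <- args[["alpha"]]
  beta  <- args[["beta"]]
  paste("alpha is", alpha, "and beta is", beta)
}

# An argument after ... needs a name at the call site.
telegram <- function(..., stop = "STOP") {
  paste("START", ..., stop)
}

described <- describe(alpha = 1, beta = "two")
# "alpha is 1 and beta is two"
default_end <- telegram("Good", "morning")
# "START Good morning STOP"
custom_end <- telegram("Good", "morning", stop = "END")
# "START Good morning END"
```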

The +, -, *, and / symbols are called binary operators because they take two inputs, an input from the left and an input from the right.

User defined Binary Operators

"%mult_add_one%" <- function(left, right){ # Notice the quotation marks!
left * right + 1
}

I could then use this binary operator like 4 %mult_add_one% 5 which would evaluate to 21.

Lapply and Sapply

lapply() and sapply() are loop functions, used for implementing the Split-Apply-Combine strategy for data analysis

We will be using the [UCI flag dataset](http://archive.ics.uci.edu/ml/datasets/Flags)

View the first 6 lines of a dataset:

    head(flags)

Dimensions:

    > dim(flags)
    [1] 194  30

194 rows and 30 columns

To open a more complete description of the dataset in a separate text file, type viewinfo() (a helper provided by the swirl lesson, not part of base R)

Class type:

    > class(flags)
    [1] "data.frame"

But what is the class of each variable or column in the dataset?

lapply() takes a list as input and applies a function to each element of the list. A data frame is really just a list of vectors, as you can see with as.list(flags)

Remember to give only the name of the function you want to apply (don’t call it with parentheses):

> cls_list <- lapply(flags, class)

> cls_list
$name
[1] "factor"

$landmass
[1] "integer"

$zone
[1] "integer"

$area
[1] "integer"

$population
[1] "integer"

$language
[1] "integer"

$religion
[1] "integer"

$bars
[1] "integer"

$stripes
[1] "integer"

$colours
[1] "integer"

$red
[1] "integer"

$green
[1] "integer"

$blue
[1] "integer"

$gold
[1] "integer"

$white
[1] "integer"

$black
[1] "integer"

$orange
[1] "integer"

$mainhue
[1] "factor"

$circles
[1] "integer"

$crosses
[1] "integer"

$saltires
[1] "integer"

$quarters
[1] "integer"

$sunstars
[1] "integer"

$crescent
[1] "integer"

$triangle
[1] "integer"

$icon
[1] "integer"

$animate
[1] "integer"

$text
[1] "integer"

$topleft
[1] "factor"

$botright
[1] "factor"

The l in lapply stands for list

Simplified to a character vector:

> as.character(cls_list)
[1] "factor"  "integer" "integer" "integer" "integer" "integer" "integer" "integer"
[9] "integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer"
[17] "integer" "factor"  "integer" "integer" "integer" "integer" "integer" "integer"
[25] "integer" "integer" "integer" "integer" "factor"  "factor" 

The s in sapply stands for simplify. Here the result is simplified to a character vector.

> cls_vect <- sapply(flags, class)
> class(cls_vect)
[1] "character"

If the result is a list where every element is of length one, then sapply() returns a vector. If the result is a list where every element is a vector of the same length (> 1), sapply() returns a matrix. If sapply() can’t figure things out, then it just returns a list, no different from what lapply() would give you.
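The three outcomes can be seen on a small made-up list (nums and ragged are illustrative names, not part of the flags data):

```r
nums <- list(a = 1:3, b = 4:6)

# Every result has length one: a named vector
as_vector <- sapply(nums, sum)       # a = 6, b = 15

# Every result has the same length (2): a matrix with one column per element
as_matrix <- sapply(nums, range)

# Results have different lengths: a plain list, as lapply() would return
ragged  <- list(a = 1:3, b = 1:5)
as_list <- sapply(ragged, function(v) v[v > 1])
```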

See the number of flags that have orange:

> sum(flags$orange)
[1] 26

Get only certain columns but keep all the rows:

> flag_colors <- flags[, 11:17]

> lapply(flag_colors, sum)
$red
[1] 153

$green
[1] 91

$blue
[1] 99

$gold
[1] 91

$white
[1] 146

$black
[1] 52

$orange
[1] 26

Using sapply:

> sapply(flag_colors, sum)
   red  green   blue   gold  white  black orange 
   153     91     99     91    146     52     26 

> sapply(flag_colors, mean)
  red     green      blue      gold     white     black    orange 
0.7886598 0.4690722 0.5103093 0.4690722 0.7525773 0.2680412 0.1340206

The range() function returns the minimum and maximum of its first argument. Here it is applied to the shape columns (19 to 23) of flags:

> flag_shapes <- flags[, 19:23]
> shape_mat <- sapply(flag_shapes, range)
> shape_mat
 circles crosses saltires quarters sunstars
[1,]       0       0        0        0        0
[2,]       4       2        1        4       50

unique() returns a vector of only the ‘unique’ elements

> unique(c(3, 4, 5, 5, 5, 6, 6))
[1] 3 4 5 6

Use with anonymous functions, here taking the second unique value of each column:

> unique_vals <- lapply(flags, unique)
> lapply(unique_vals, function(elem) elem[2])

vapply and tapply

vapply() allows you to specify the format of the result explicitly

This lets you be more strict: it throws an error when the function does not return a result of the expected type and length (below, a single numeric value)

> vapply(flags, unique, numeric(1))
Error in vapply(flags, unique, numeric(1)) : values must be length 1,
but FUN(X[[1]]) result is length 194

To explicitly get the class of each column as a character vector, one element per column:

> vapply(flags, class, character(1))
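The same strictness can be tried without the flags data. In this sketch nums is a made-up list; the second call fails on purpose because range() returns two values where one was promised.

```r
nums <- list(a = c(1, 2, 3), b = c(4, 5, 6))

# mean() returns one number per element, matching numeric(1)
means <- vapply(nums, mean, numeric(1))      # a = 2, b = 5

# range() returns two numbers, so numeric(1) is rejected with an error
wrong_shape <- try(vapply(nums, range, numeric(1)), silent = TRUE)
inherits(wrong_shape, "try-error")           # TRUE

# Declaring the right shape works and gives a 2 x 2 matrix
ranges <- vapply(nums, range, numeric(2))
```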

As a data analyst, you’ll often wish to split your data up into groups based on the value of some variable, then apply a function to the members of each group.

See the number of flags in each landmass group:

> table(flags$landmass)
1  2  3  4  5  6 
31 17 35 52 39 20 

Splitting data into groups by landmass and running stats on each group:

> tapply(flags$animate, flags$landmass, mean)

This shows the mean of animate for each landmass

Get a summary of population for flags with/without red in:

> tapply(flags$population, flags$red, summary)
$`0`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    3.00   27.63    9.00  684.00 

$`1`
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     4.0    22.1    15.0  1008.0 
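The flags data comes with the swirl lesson, so here is the same split-and-apply idea on mtcars, a dataset that ships with R. This is only a sketch; grouping mpg by cyl is an arbitrary choice for illustration.

```r
data(mtcars)

# How many cars fall into each cylinder group
group_sizes <- table(mtcars$cyl)

# Mean miles per gallon within each cylinder group
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

# The result is named after the groups: "4" "6" "8"
names(mpg_by_cyl)

# Each entry matches the mean computed by hand for that group
by_hand <- mean(mtcars$mpg[mtcars$cyl == 4])
```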

Looking at Data

Whenever you’re working with a new dataset, the first thing you should do is look at it! What is the format of the data? What are the dimensions? What are the variable names? How are the variables stored? Are there missing data? Are there any flaws in the data?

List variables in your workspace:

> ls()

Check the class of the data:

> class(plants)
[1] "data.frame"

It’s very common for data to be stored in a data frame. It is the default class for data read into R using functions like read.csv() and read.table(), which you’ll learn about in another lesson.

Check rows and columns:

> dim(plants)
[1] 5166   10
> nrow(plants)
[1] 5166
> ncol(plants)
[1] 10

Size in memory:

> object.size(plants)
644232 bytes

Get column names:

> names(plants)
[1] "Scientific_Name"      "Duration"             "Active_Growth_Period"
[4] "Foliage_Color"        "pH_Min"               "pH_Max"              
[7] "Precip_Min"           "Precip_Max"           "Shade_Tolerance"     
[10] "Temp_Min_F"

By default head() shows you the first 6 rows. You can get the first 10 with:

> head(plants, 10)

Same for tail:

> tail(plants, 15)

Get a summary of the dataset and missing values:

> summary(plants)

Categorical values are called factors in R

Sometimes summary() truncates the list of categories and lumps the rest under (Other). In that case use:

> table(plants$Active_Growth_Period)

The most useful overview usually comes from str():

> str(plants)

str() can be used on many other data structures

Simulation

Creating random numbers

sample(x, size, replace = FALSE, prob = NULL)

Roll 4 dice (6 sided):

> sample(1:6, 4, replace=TRUE)
[1] 6 2 3 3

Choose 4 numbers, from 1 to 6, each number is replaced after selection so it can show up more than once

Get 10 numbers from 1 to 20 that won’t appear again:

> sample(1:20, 10)
[1]  1  7 20 14 13 10  6  2 15 18

LETTERS is a predefined variable in R containing a vector of all 26 letters of the English alphabet

permute a sample of letters:

> sample(LETTERS)
[1] "I" "L" "B" "R" "F" "S" "Q" "J" "G" "M" "A" "H" "W" "U" "O" "P" "K" "T" "Y" "X" "E"
[22] "D" "Z" "N" "C" "V"

If size is not given, R takes a sample equal in size to the vector being sampled.

Get an unfair coin with 100 flips:

flips <- sample(c(0, 1), 100, replace=TRUE, prob=c(0.3, 0.7))
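Random draws change on every run. When a result needs to be repeatable, base R's set.seed() can be called first. A sketch follows; the seed value 42 is arbitrary.

```r
set.seed(42)
flips <- sample(c(0, 1), 100, replace = TRUE, prob = c(0.3, 0.7))

# Resetting the same seed reproduces exactly the same draws
set.seed(42)
flips_again <- sample(c(0, 1), 100, replace = TRUE, prob = c(0.3, 0.7))

same_draws <- identical(flips, flips_again)   # TRUE

# Share of 1s; it should land near 0.7 but varies with the seed
share_of_ones <- mean(flips)
```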

Rbinom

Random binomial distribution: rbinom

Each probability distribution in R has an r* function (for “random”), a d* function (for “density”), a p* (for “probability”), and q* (for “quantile”).

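A binomial random variable is the number of successes in a given number of independent trials

rbinom(n, size, prob) draws n such counts, each from size trials with success probability prob

To see the number of successes in 100 flips: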

> rbinom(1, size = 100, prob = 0.7)

To store the outcome of each of the 100 individual flips instead:

> flips2 <- rbinom(100, size = 1, prob = 0.7)
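The d*, p* and q* companions of rbinom() can be checked by hand on a tiny case. The sketch assumes 3 flips of a fair coin, chosen only because the arithmetic is easy to verify.

```r
# Density: P(exactly 2 heads in 3 fair flips) = 3 * 0.5^3
p_exactly_two <- dbinom(2, size = 3, prob = 0.5)   # 0.375

# Probability: P(2 heads or fewer) = 1 - P(3 heads)
p_at_most_two <- pbinom(2, size = 3, prob = 0.5)   # 0.875

# Quantile: smallest count whose cumulative probability reaches 0.6
q_sixty <- qbinom(0.6, size = 3, prob = 0.5)       # 2

# Random: 100 single flips, so the total is somewhere between 0 and 100
total_heads <- sum(rbinom(100, size = 1, prob = 0.7))
```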

RNorm

The standard normal distribution has mean 0 and standard deviation 1

10 random numbers in a normal distribution:

> rnorm(10)
[1]  0.53665009 -2.39624561 -1.50745602 -1.27852621 -0.85378324 -0.04011113  0.49547350
[8] -0.21447406 -0.81949348  0.75271073

RPois

Poisson Distribution - Expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.

Generate 5 numbers with a mean of 10:

> rpois(5, lambda=10)
[1]  9  7  6 12  6

To do that 100 times use:

> my_pois <- replicate(100, rpois(5, 10))

Get the column means:

> cm <- colMeans(my_pois)

Plot a histogram of column means:

> hist(cm)
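It can help to look at the shape of what replicate() returns before taking column means. A sketch, with an arbitrary seed so the draws are repeatable:

```r
set.seed(1)

# Each replication is a vector of 5 draws, stored as one column
my_pois <- replicate(100, rpois(5, lambda = 10))
shape <- dim(my_pois)        # 5 rows, 100 columns

# One mean per column, so 100 values in total
cm <- colMeans(my_pois)
n_means <- length(cm)        # 100
```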

All the other standard probability distributions are built into R:

  • Exponential: rexp()
  • Chi-squared: rchisq()
  • Gamma: rgamma()

Dates and Times

Working with time series data or other temporal information

Dates are represented by the 'Date' class and times are represented by the 'POSIXct' and 'POSIXlt' classes. Internally, dates are stored as the number of days since 1970-01-01 and times are stored as either the number of seconds since 1970-01-01 (for 'POSIXct') or a list of seconds, minutes, hours, etc. (for 'POSIXlt').

> d1 <- Sys.Date()
> d1
[1] "2018-03-19"
> class(d1)
[1] "Date"

See how the date is stored internally

> unclass(d1)
[1] 17609

The total number of days since: 1970-01-01

Create a date before epoch:

> d2 <- as.Date("1969-01-01")
> unclass(d2)
[1] -365
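Because a Date is a day count underneath, ordinary arithmetic works on it. A short sketch using the date from the example above:

```r
d <- as.Date("2018-03-19")

days_since_epoch <- unclass(d)      # 17609

# Adding a number moves the date forward by that many days
later <- d + 30                     # "2018-04-18"

# Subtracting two dates gives the gap in days
gap <- as.numeric(d - as.Date("1970-01-01"))   # 17609

# format() controls how the date is printed
uk_style <- format(d, "%d/%m/%Y")   # "19/03/2018"

# seq() also understands dates
weekly <- seq(d, by = "week", length.out = 3)
# "2018-03-19" "2018-03-26" "2018-04-02"
```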

System time:

> t1 <- Sys.time()
> t1
[1] "2018-03-19 12:16:16 SAST"
> class(t1)
[1] "POSIXct" "POSIXt"

Coerce the result to POSIXlt, which stores the time as a list of components such as seconds, minutes and hours

> t2 <- as.POSIXlt(Sys.time())
> t2
[1] "2018-03-19 12:17:49 SAST"

> unclass(t2)
$sec
[1] 49.87161

$min
[1] 17

$hour
[1] 12

$mday
[1] 19

$mon
[1] 2

$year
[1] 118

$wday
[1] 1

$yday
[1] 77

$isdst
[1] 0

$zone
[1] "SAST"

$gmtoff
[1] 7200

attr(,"tzone")
[1] ""     "SAST" "SAST"

> str(unclass(t2))
List of 11
$ sec   : num 49.9
$ min   : int 17
$ hour  : int 12
$ mday  : int 19
$ mon   : int 2
$ year  : int 118
$ wday  : int 1
$ yday  : int 77
$ isdst : int 0
$ zone  : chr "SAST"
$ gmtoff: int 7200
- attr(*, "tzone")= chr [1:3] "" "SAST" "SAST"

Just get minutes:

> t2$min
[1] 17

Return day of the week:

> weekdays(d1)
[1] "Monday"

Similarly with months and quarters:

> months(t1)
[1] "March"

> quarters(t2)
[1] "Q1"

strptime() converts character vectors to POSIXlt. In that sense, it is similar to as.POSIXlt(), except that the input doesn’t have to be in a particular format (YYYY-MM-DD).

> t3 <- "October 17, 1986 08:24"
> t4 <- strptime(t3, "%B %d, %Y %H:%M")
> t4
[1] "1986-10-17 08:24:00 SAST"

> class(t4)
[1] "POSIXlt" "POSIXt"
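The POSIXlt components follow C conventions, which is easy to trip over: mon counts from 0 and year counts from 1900. The sketch below uses a numeric date format and tz = "UTC" so that it does not depend on the machine's locale or time zone; both choices are assumptions made for the example.

```r
parsed <- strptime("17/10/1986 08:24", "%d/%m/%Y %H:%M", tz = "UTC")

hour_part  <- parsed$hour          # 8
month_part <- parsed$mon + 1       # 10, because mon runs from 0 to 11
year_part  <- parsed$year + 1900   # 1986, because year counts from 1900

# format() turns the parsed time back into text in any layout
as_text <- format(parsed, "%Y-%m-%d %H:%M")   # "1986-10-17 08:24"

# as.POSIXct() converts to the compact seconds-based class
compact <- as.POSIXct(parsed)
class(compact)                     # "POSIXct" "POSIXt"
```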

Comparison of time:

> Sys.time() > t1
[1] TRUE

Time difference:

> Sys.time() - t1
Time difference of 9.086724 mins

Find time difference in specific unit:

> difftime(Sys.time(), t1, units = 'days')
Time difference of 0.006632809 days
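With fixed timestamps the results are repeatable. This sketch uses two made-up times 90 minutes apart, in UTC:

```r
start <- as.POSIXct("2018-03-19 12:00:00", tz = "UTC")
end   <- as.POSIXct("2018-03-19 13:30:00", tz = "UTC")

# Plain subtraction picks a unit automatically (hours here)
auto_units <- end - start
units(auto_units)                 # "hours"

# difftime() lets you choose the unit
in_mins <- difftime(end, start, units = "mins")

# as.numeric() drops the unit and leaves a plain number
mins_number <- as.numeric(in_mins)                                  # 90
secs_number <- as.numeric(difftime(end, start, units = "secs"))     # 5400
```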

Base Graphics

Not covered are more advanced graphics:

  • lattice
  • ggplot2
  • ggvis

Load dataset cars:

data(cars)

Get help page for cars:

`?cars`

Create basic chart:

> plot(cars)

If dataset has 2 columns it assumes what you want to plot. Since we do not provide labels for either axis, R uses the names of the columns

For two numeric variables, plot() draws a scatterplot

Can be plotted with:

> plot(x = cars$speed, y=cars$dist)

Setting labels:

> plot(x = cars$speed, y=cars$dist, xlab='Speed', ylab='Stopping Distance')

Plot so points are red:

> plot(cars, col = 2)

Plot and limit the x-axis:

> plot(cars, xlim=c(10, 15))

Plot with triangles:

> plot(cars, pch=2)

Boxplot

You can pass an entire data frame to boxplot()

boxplot(), like many R functions, also takes a “formula” argument, generally an expression with a tilde (“~”) which indicates the relationship between the input variables. This allows you to enter something like mpg ~ cyl to plot the relationship between cyl (number of cylinders) on the x-axis and mpg (miles per gallon) on the y-axis.

> boxplot(formula=mpg ~ cyl, data = mtcars)
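The formula interface can also be inspected without drawing anything. A sketch: with plot = FALSE, boxplot() returns the numbers it would have plotted, which shows how mpg ~ cyl splits the data into groups.

```r
data(mtcars)

# Compute the boxplot statistics without opening a graphics device
box_stats <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

# One box per level of the right-hand side of the formula
box_stats$names   # "4" "6" "8"

# Number of observations in each group
box_stats$n       # 11 7 14

# Five summary rows (whiskers, hinges and median) per group
dim(box_stats$stats)   # 5 3
```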

A histogram can be used for a single vector:

> hist(mtcars$mpg)