Exploratory Data Analysis#
Principles of Analytic Graphs#
Graphs give us a visual form of data, and the first principle of analytic graphs is to show some comparison. You’ll hear more about this when you study statistical inference (another great course BTW), but evidence for a hypothesis is always relative to another competing or alternative hypothesis.
When presented with a claim that something is good, you should always ask “Compared to What?” This is why in commercials you often hear the phrase “other leading brands”. An implicit comparison, right?
So the first principle was to show a comparison. The second principle is to show causality, or a mechanism for how your theory of the data works. The third principle is to show multivariate data:
- Most hard problems involve more than two variables
- Restricting yourself to 2 variables may mislead you into drawing an incorrect conclusion
- Ignoring a variable may give rise to Simpson's paradox, where a trend appears in individual groups but disappears (or reverses) when the groups are combined; see the sketch below. It happens when frequency data are unduly given causal interpretations, often in social science and medical studies.
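A minimal sketch of the paradox with made-up data: within each group the slope is positive, but pooling the groups reverses the apparent trend.
## Hypothetical data: two groups, each with a positive within-group trend
x <- c(rnorm(20, mean = 2), rnorm(20, mean = 6))
y <- c(2 + x[1:20], -6 + x[21:40]) + rnorm(40, sd = 0.3)
group <- factor(rep(c("A", "B"), each = 20))
coef(lm(y ~ x))                         ## pooled slope is negative
coef(lm(y ~ x, subset = group == "A"))  ## within-group slope is positive
plot(x, y, col = group)                 ## plotting both variables reveals the groups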
The fourth principle is integrating evidence. Don’t use a single form of expression.
Don’t let the tool drive the analysis
The fifth principle of graphing involves describing and documenting the evidence with sources and appropriate labels and scales.
Also, using R, you want to preserve any code you use to generate your data and graphics so that the research can be replicated if necessary. This allows for easy verification or finding bugs in your analysis.
The sixth principle is that content is king.
Analytical presentations ultimately stand or fall depending on the quality, relevance, and integrity of their content.
- Comparison
- Causality
- Multivariate Data
- Integrate Evidence
- Describe and document evidence with sources, labels and scales
- Content is king
Exploratory Graphs#
So graphics give us some visual form of data, and since our brains are very good at seeing patterns, graphs give us a compact way to present data and find or display any pattern that may be present.
Exploratory graphs serve mostly the same functions as graphs. They help us find patterns in data and understand its properties. They suggest modeling strategies and help to debug analyses. We DON’T use exploratory graphs to communicate results.
Getting a summary of the data:
> summary(pollution$pm25)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.383 8.549 10.047 9.836 11.356 18.441
- Minimum (0th percentile)
- Maximum (100th percentile)
- 25%: 1st quartile
- 50%: median
- 75%: 3rd quartile
The median indicates that half of the measurements have a value less than or equal to it, and the other half have a value greater than or equal to it.
Quantile#
> quantile(ppm)
0% 25% 50% 75% 100%
3.382626 8.548799 10.046697 11.356012 18.440731
We use a boxplot to display this:
> boxplot(ppm, col="blue")
In a boxplot the top and bottom of the box represent the 75% and 25% quartiles, and the middle horizontal line is the median. The whiskers (the horizontal lines outside the box) extend, by default, to 1.5 times the interquartile range. The interquartile range is the difference between the 75% and 25% quartiles.
Data points outside the whiskers are outliers.
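A sketch computing those whisker fences by hand (assuming ppm as above):
q <- unname(quantile(ppm, c(0.25, 0.75)))
iqr <- q[2] - q[1]  ## same value as IQR(ppm)
c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)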
Straight Lines#
You can overlay lines on the plot with:
abline(h=12)
Create a green histogram
> hist(ppm, col="green")
The rug function draws a more detailed one-dimensional greyscale representation of occurrences beneath the histogram, showing how many data points are in each bucket and where they lie within the bucket.
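The call itself, run after drawing the histogram (ppm as before):
> rug(ppm)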
Use breaks to set the number of buckets to split the data into:
> hist(ppm, col="green", breaks=100)
rug automatically adjusts its bucket size to that of the last plot plotted.
Add line with extra width:
> abline(v=12, lwd=2)
Change the colour of the abline:
> abline(v=median(ppm), col="magenta", lwd=4)
Make table of factor data:
> table(pollution$region)
east west
442 134
Barplot#
barplot(reg, col = "wheat", main = "Number of Counties in Each Region")
Boxplot#
> boxplot(formula=pm25~region, data=pollution, col="red")
Multiple Histograms#
To plot multiple histograms we need to set the layout with par
Also margins is a 4-long vector which indicates the number of lines for the bottom, left, top and right
par(mfrow=c(2,1),mar=c(4,4,2,1))
mfrow=c(2,1)
means 2 rows and 1 column
Subset#
To split the data up:
> east <- subset(pollution, region == 'east')
> hist(east$pm25, col='green')
Doing the above in one command:
> hist(subset(pollution,region=="west")$pm25, col = "green")
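Combining subset with the par layout from above, a sketch drawing both regions one above the other:
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
hist(subset(pollution, region == "east")$pm25, col = "green")
hist(subset(pollution, region == "west")$pm25, col = "green")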
Using the with function:
> with(pollution, plot(latitude, pm25))
This lets you avoid typing pollution$pm25.
Create a dashed horizontal line:
abline(h=12, lwd=2, lty=2)
Plot#
You can set the colours of points based on region
> plot(pollution$latitude, ppm, col=pollution$region)
- Exploratory plots are quick and dirty
- Plots let you summarize the data (usually graphically) and highlight any broad features
Graphics Devices in R#
There are screen devices (a window on your computer) and file devices (PDF, JPEG, SVG).
How you access your screen device depends on what computer system you're using: on a Mac the screen device is launched with the call quartz(), on Windows with windows(), and on Unix/Linux with x11().
View available devices with:
> ?Devices
Create a scatterplot:
> with(faithful, plot(eruptions, waiting))
Annotate the main title:
> title(main='Old Faithful Geyser data')
View the current plotting device:
> dev.cur()
RStudioGD
2
Creating a plot on a file device#
Launch the file device
> pdf(file="myplot.pdf")
Then run the plot again:
> with(faithful, plot(eruptions, waiting))
> title(main='Old Faithful Geyser data')
When you're done you have to close the device with:
> dev.off()
There are two basic types of file devices, vector and bitmap devices:
- Vector formats are good for line drawings and plots with solid colors using a modest number of points
- Bitmap formats are good for plots with a large number of points, natural scenes or web-based plots
Vector Formats#
- pdf: line-type graphics and papers. It resizes well and is usually portable, but it is not efficient if a plot has many objects/points.
- svg: XML-based, scalable vector graphics. This supports animation and interactivity and is potentially useful for web-based plots.
- win.metafile: a Windows-only metafile format.
- postscript (ps): an older format which also resizes well, is usually portable, and can be used to create encapsulated postscript files. Unfortunately, Windows systems often don't have a postscript viewer.
Bitmap formats#
- png: Portable Network Graphics, good for line drawings or images with solid colors. It uses lossless compression (like the old GIF format), and most web browsers can read this format natively. png is good for plots with many points, but it does not resize well.
- jpeg: good for photographs or natural scenes. jpeg files use lossy compression, so they're good for plots with many points. They don't resize well, but they can be read by almost any computer and any web browser. They're not great for line drawings.
- tiff: an older format supporting lossless compression.
- bmp: a native Windows bitmapped format.
Devices#
Every open graphics device is assigned an integer greater than or equal to 2
You can change the active device with:
dev.set(<integer>)
You can copy between devices:
The function dev.copy copies a plot from one device to another, and dev.copy2pdf specifically copies a plot to a PDF file.
> dev.copy(png, file='geyserplot.png')
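Remember that the copy isn't complete until you close the file device, so follow it with:
> dev.off()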
Plotting Systems#
Base Plotting System#
- Comes with R
- Start with a blank canvas
- Intuitive and exploratory
- You can't go backwards
- Based on the “Artist's Palette” model
Lattice plots#
- Single function call such as xyplot or bwplot
- Most useful for conditioning types of plots, which display how y changes with x across levels of z
- Cannot add to the plot once it is created
Using the R formula Life.Exp ~ Income | region, we're plotting life expectancy as it depends on income for each region:
> xyplot(Life.Exp ~ Income | region, state, layout=c(4,1))
The second argument is the data, and the third sets the layout to 4 columns and 1 row.
GGPlot2#
The best of both worlds: it handles automatic titles, margins, etc., but also allows annotations and additions.
> qplot(displ, hwy, data = mpg)
- Uses Graphics Grammar
Base Plotting System#
See the range for a column of data, excluding NA:
range(airquality$Ozone, na.rm=TRUE)
Histogram#
Uses one variable
> hist(airquality$Ozone)
Boxplot#
> boxplot(Ozone~Month, airquality)
Adding labels and colours:
> boxplot(Ozone~Month, airquality, xlab='Month', ylab='Ozone (ppb)', col.axis='blue', col.lab='red')
Check the number of parameters you can pass to par():
> length(par())
[1] 72
The available options are:
> names(par())
[1] "xlog" "ylog" "adj" "ann" "ask" "bg" "bty"
[8] "cex" "cex.axis" "cex.lab" "cex.main" "cex.sub" "cin" "col"
[15] "col.axis" "col.lab" "col.main" "col.sub" "cra" "crt" "csi"
[22] "cxy" "din" "err" "family" "fg" "fig" "fin"
[29] "font" "font.axis" "font.lab" "font.main" "font.sub" "lab" "las"
[36] "lend" "lheight" "ljoin" "lmitre" "lty" "lwd" "mai"
[43] "mar" "mex" "mfcol" "mfg" "mfrow" "mgp" "mkh"
[50] "new" "oma" "omd" "omi" "page" "pch" "pin"
[57] "plt" "ps" "pty" "smo" "srt" "tck" "tcl"
[64] "usr" "xaxp" "xaxs" "xaxt" "xpd" "yaxp" "yaxs"
[71] "yaxt" "ylbias"
Foreground colour
> par('fg')
[1] "black"
Plot character
> par('pch')
[1] 1 (Circle)
Line type
> par('lty')
[1] "solid"
The par() function is used to specify global graphics parameters that affect all plots in an R session. (Use dev.off or plot.new to reset to the defaults.)
- las (the orientation of the axis labels on the plot)
- bg (background color)
- mar (margin size)
- oma (outer margin size)
- mfrow (plots per row)
- mfcol (plots per column)
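Since par() changes affect every later plot in the session, a common pattern (a sketch) is to save the current settings and restore them afterwards:
opar <- par(no.readonly = TRUE)  ## save all settable parameters
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
## ... draw your plots ...
par(opar)  ## restore the saved settings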
lines can be used to add lines to the plot.
Create a plot but do not plot the points:
plot(airquality$Wind, airquality$Ozone, type='n')
Then add points
> points(may$Wind,may$Ozone,col="blue",pch=17)
Adding a legend:
> legend('topright', pch=c(17, 8), col=c("blue", "red"), legend=c("May", "Other Months"))
Add a dashed vertical line at the median:
abline(v=median(airquality$Wind), lty=2, lwd=2)
Giving a single title to multiple plots:
> mtext("Ozone and Weather in New York City", outer=TRUE)
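A minimal sketch of the full pattern; mtext with outer=TRUE only appears if an outer margin was reserved with oma:
par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))  ## 2 panels plus room for an outer title
plot(airquality$Wind, airquality$Ozone, main = "Ozone and Wind")
plot(airquality$Solar.R, airquality$Ozone, main = "Ozone and Solar Radiation")
mtext("Ozone and Weather in New York City", outer = TRUE)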
Lattice Plotting System#
Must be loaded: library('lattice')
Lattice is implemented using 2 packages:
- lattice contains the code for producing trellis graphics, including xyplot, bwplot, and levelplot
- grid: the low-level graphics system that lattice is built on
The main functions:
- xyplot produces a scatterplot
- bwplot produces a box and whisker plot
- histogram produces a histogram
All plotting is done in a single call
Lattice functions generally take a formula as their first argument (y ~ x).
xyplot(y ~ x | f * g, data)
f and g are conditioning variables; the second one, g, is optional.
xyplot(Ozone ~ Wind, data=airquality)
Add snowflake marks:
> xyplot(Ozone~Wind, data=airquality, col='red', pch=8, main='Big Apple Data')
As factor:
> xyplot(Ozone ~ Wind | as.factor(Month), data=airquality, layout=c(5,1))
Since Month is a named column of the airquality dataframe we had to tell R to treat it as a factor
Lattice functions behave differently from base graphics functions in one critical way: base graphics functions plot data directly to the graphics device (e.g., screen, or a file such as a PDF), whereas lattice graphics functions return an object of class trellis.
p <- xyplot(Ozone~Wind,data=airquality)
Type p or print(p) to show the chart.
See the named properties of the plot object:
> names(p)
[1] "formula" "as.table" "aspect.fill" "legend"
[5] "panel" "page" "layout" "skip"
[9] "strip" "strip.left" "xscale.components" "yscale.components"
[13] "axis" "xlab" "ylab" "xlab.default"
[17] "ylab.default" "xlab.top" "ylab.right" "main"
[21] "sub" "x.between" "y.between" "par.settings"
[25] "plot.args" "lattice.options" "par.strip.text" "index.cond"
[29] "perm.cond" "condlevels" "call" "x.scales"
[33] "y.scales" "panel.args.common" "panel.args" "packet.sizes"
[37] "x.limits" "y.limits" "x.used.at" "y.used.at"
[41] "x.num.limit" "y.num.limit" "aspect.ratio" "prepanel.default"
[45] "prepanel"
Get the formula
> p[["formula"]]
Ozone ~ Wind
Check the limits of the x value:
> p[['x.limits']]
[1] 0.37 22.03
You can build the plot with a custom panel function (this example lives in the file plot2.R and overlays a regression line):
p2 <- xyplot(y ~ x | f, panel = function(x, y, ...) {
panel.xyplot(x, y, ...) ## First call default panel function
panel.lmline(x, y, col = 2) ## Overlay a simple linear regression line
})
print(p2)
invisible()
To run this script (pathtofile is a swirl helper):
source(pathtofile("plot2.R"), local=TRUE)
Table of multiple values:
> table(diamonds$color, diamonds$cut)
Fair Good Very Good Premium Ideal
D 163 662 1513 1603 2834
E 224 933 2400 2337 3903
F 312 909 2164 2331 3826
G 314 871 2299 2924 4884
H 303 702 1824 2360 3115
I 175 522 1204 1428 2093
J 119 307 678 808 896
Edit myLabels.R:
> myedit('myLabels.R')
source(pathtofile('myLabels.R'), local=TRUE)
Using the labels:
> xyplot(price~carat | color*cut,data=diamonds, strip=FALSE, pch=20, xlab = myxlab, ylab=myylab, main=mymain)
The strip=FALSE argument removes the labels from the panels.
The lattice system is ideal for creating conditioning plots where you examine the same kind of plot under many different conditions.
Working with Colors#
Effectively using colors can enhance your plots and presentations, emphasizing the important points you’re trying to convey
grDevices gives you the colors() function.
Get a sample of some colours:
> sample(colors(), 10)
[1] "cornsilk2" "aquamarine4" "gray75" "grey46"
[5] "honeydew" "gray43" "grey50" "palegoldenrod"
[9] "lightsteelblue2" "indianred3"
colorRamp and colorRampPalette blend colours together.
The first, colorRamp, takes a palette of colors (the arguments) and returns a function that takes values between 0 and 1 as arguments. The 0 and 1 correspond to the extremes of the color palette. Arguments between 0 and 1 return blends of these extremes.
> pal <- colorRamp(c('red', 'blue'))
So:
> pal(0)
[,1] [,2] [,3]
[1,] 255 0 0
This gives the colours as RGB values.
So pal creates colors using the palette we specified when we called colorRamp
> pal(seq(0,1,len=6))
[,1] [,2] [,3]
[1,] 255 0 0
[2,] 204 0 51
[3,] 153 0 102
[4,] 102 0 153
[5,] 51 0 204
[6,] 0 0 255
colorRampPalette:
> p1 <- colorRampPalette(c('red', 'blue'))
> p1(2)
[1] "#FF0000" "#0000FF"
The functions created by colorRamp and colorRampPalette can return a 3-long (RGB) or 4-long (RGB plus alpha) vector of colors.
Set the alpha
p3 <- colorRampPalette(c('blue', 'green'), alpha=.5)
Alpha represents an opacity level, that is, how transparent the colors should be. An alpha will help you if there are many scatter points on top of each other and you want to see the density.
> plot(x, y, pch=19, col=rgb(0, .5, .5, .3))
In the above, .3 is the alpha (transparency) value.
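A self-contained sketch with made-up overlapping points (x and y are not defined in the notes at this point):
x <- rnorm(2000)
y <- rnorm(2000)
plot(x, y, pch = 19, col = rgb(0, 0.5, 0.5, 0.3))  ## denser regions appear darker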
RColorBrewer#
Three types of palettes:
- sequential: light to dark
- divergent: diverging from a neutral centre colour (white)
- qualitative: assorted colours used to distinguish data
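brewer.pal comes from the RColorBrewer package, so load it first:
> library(RColorBrewer)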
Use a palette:
> cols <- brewer.pal(3, 'BuGn')
[1] "#E5F5F9" "#99D8C9" "#2CA25F"
> pal <- colorRampPalette(cols)
Using the palette:
> image(volcano, col=pal(20))
GGPlot2 Part1#
Complete documentation on ggplot2
ggplot2 implements the Grammar of Graphics and combines the best of base and lattice.
2 workhorse functions:
- qplot - Less flexible
- ggplot - More flexible
Basic plot:
> qplot(displ, hwy, data=mpg)
Add an aesthetic (colour based on a factor):
> qplot(displ, hwy, data=mpg, color=drv)
We can automatically add trend lines with the geom argument; the first element means data points and the second the trend line.
> qplot(displ, hwy, data=mpg, color=drv, geom=c('point', 'smooth'))
Notice the gray areas surrounding each trend line; these indicate the 95% confidence intervals for the lines.
If no x-axis is given, the plot will just show the index/order of the values in the dataset:
> qplot(y=hwy, data=mpg, color=drv)
Create a box and whisker plot:
> qplot(drv, hwy, data=mpg, geom='boxplot')
Set the colour to a factor:
> qplot(drv, hwy, data=mpg, geom='boxplot', color=manufacturer)
Create a histogram with coloured attribute:
> qplot(hwy, data=mpg, fill=drv)
. ~ drv is ggplot2's facet shorthand: the symbol to the left of the ~ gives the rows and the symbol to the right gives the columns, with . meaning a single row or column.
Scatterplot split into panels / facets#
> qplot(displ, hwy, data=mpg, facets=. ~ drv)
Histogram split into panels#
> qplot(hwy, data=mpg, facets=drv ~ ., binwidth=2)
The facet argument drv ~ . resulted in a 3 by 1 setup.
GGPlot2 Part2#
Components of GGPLot2#
- DATA FRAME - data you are plotting
- AESTHETIC MAPPINGS - how data is mapped to color, size, etc.
- GEOMS (geometric objects) - are what you see in the plot (points, lines, shapes)
- FACETS - panels used in conditional plots
- STATS - statistical transformations (binning, quantiles, and smoothing)
- SCALES - what coding an aesthetic map uses (for example, male = red, female = blue)
- COORDINATE SYSTEM - how plots are depicted
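A sketch tying several of these components to code (using the mpg data as elsewhere in these notes):
library(ggplot2)
ggplot(mpg, aes(displ, hwy, color = drv)) +  ## data frame + aesthetic mappings
  geom_point() +                             ## geom: points
  geom_smooth(method = "lm") +               ## stat: linear smoothing
  facet_grid(. ~ drv) +                      ## facets: one panel per drive type
  coord_cartesian(ylim = c(10, 45))          ## coordinate system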
Point and smooth scatterplot:
> qplot(displ, hwy, data=mpg, geom=c('point', 'smooth'), facets=.~drv)
Creating a mapping:
> g <- ggplot(mpg, aes(displ, hwy))
Summary of graphical object:
> summary(g)
data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl, class
[234x11]
mapping: x = displ, y = hwy
A 234 x 11 data frame, with an x (displ) and y (hwy) mapping.
Tell ggplot2 how to show the data:
> g+geom_point()
Show points and trend line:
> g+geom_point()+geom_smooth()
Change the smoothing function to linear model:
> g+geom_point()+geom_smooth(method="lm")
Split it into facets:
> g+geom_point()+geom_smooth(method="lm")+facet_grid(. ~ drv)
Add a title to the chart:
> g+geom_point()+geom_smooth(method="lm")+facet_grid(. ~ drv)+ggtitle('Swirl Rules!')
Two standard appearance themes are included in ggplot2: theme_gray(), the default theme (gray background with white grid lines), and theme_bw(), a plainer (black and white) color scheme.
Create geometric points with colour, alpha and size:
> g+geom_point(color='pink', size=4, alpha=1/2)
Color based on factor:
> g+geom_point(size=4, alpha=1/2, aes(color=drv))
Changing labels:
g + geom_point(aes(color = drv)) + labs(title="Swirl Rules!") + labs(x="Displacement", y="Hwy Mileage")
Dashed linear regression line and turn off confidence level grey area:
> g + geom_point(size=2, alpha=1/2, aes(color = drv)) + geom_smooth(size=4, linetype=3, method='lm', se=FALSE)
Change the theme and font style:
> g+geom_point(aes(color=drv))+theme_bw(base_family='Times')
Outlier Data#
Set limits so that an outlier is not shown.
Using plot with outlier data:
> plot(myx, myy, type='l', ylim=c(-3,3))
With ggplot:
> g <- ggplot(testdat, aes(x=myx, y=myy))
> g+geom_line()+ylim(-3,3)
Can also be done by limiting the coordinate system:
> g + geom_line() + coord_cartesian(ylim=c(-3,3))
Create facetted scatterplot with marginal totals:
> g + geom_point() + facet_grid(drv~cyl, margins=TRUE)
With a smooth trendline:
> g + geom_point() + facet_grid(drv~cyl, margins=TRUE) + geom_smooth(method='lm', se=FALSE, size=2, color='black')
GGPlot2 Extras#
You can specify the binwidth (often the range divided by the desired number of bins):
> qplot(price, data=diamonds, binwidth=18497/30)
Change fill of histogram based on factor:
> qplot(price, data=diamonds, binwidth=18497/30, fill=cut)
Histogram as a density function:
> qplot(price, data=diamonds, geom='density')
Density by colour:
> qplot(price, data=diamonds, geom='density', color=cut)
Change shape based on cut:
> qplot(carat, price, data=diamonds, shape=cut)
Add regression lines for each cut:
> qplot(carat, price, data=diamonds, color=cut) + geom_smooth(method='lm')
facets: The symbol to the left of the tilde indicates rows and the symbol to the right of the tilde indicates columns
> qplot(carat, price, data=diamonds, color=cut, facets=.~cut) + geom_smooth(method='lm')
The cut function allows you to divide your data into sets and label each entry as belonging to one of the sets, so you can create a factor out of numerical data.
> cutpoints <- quantile(diamonds$carat,seq(0,1,length=4),na.rm=TRUE)
> cutpoints
0% 33.33333% 66.66667% 100%
0.20 0.50 1.00 5.01
Create a new column for the binned data (stageVariable is a swirl helper):
> diamonds$car2 <- cut(diamonds$carat,cutpoints); stageVariable("diamonds$car2",diamonds$car2)
Because the dataset was changed we have to create the graphical object again:
> g <- ggplot(diamonds, aes(depth, price))
> g+geom_point(alpha=1/3) + facet_grid(cut ~ car2)
Boxplot#
> ggplot(diamonds, aes(carat, price))+geom_boxplot()+facet_grid(. ~ cut)
Hierarchical Clustering#
A simple way of quickly examining and displaying multi-dimensional data. This technique is usually most useful in the early stages of analysis when you’re trying to get an understanding of the data
Clustering organizes data points that are close into groups. So obvious questions are “How do we define close?”, “How do we group things?”, and “How do we interpret the grouping?” Cluster analysis is a very important topic in data analysis.
Closeness: Distance or similarity are usually the metrics used
Measuring distance: Euclidean distance and correlation similarity are continuous measures, while Manhattan (city-block) distance is often used for binary data.
Euclidean distance is distance “as the crow flies”. Many applications, however, can’t realistically use crow-flying distance. Cars, for instance, have to follow roads.
Manhattan distance is the sum of the absolute values of the distances between each coordinate
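dist supports both metrics, so you can compare them directly (a sketch with two points):
pts <- rbind(c(0, 0), c(3, 4))
dist(pts)                        ## Euclidean: 5
dist(pts, method = "manhattan")  ## Manhattan: 3 + 4 = 7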
Compute Euclidean distances between points:
> dist(dataFrame)
You can create the dendrogram with:
> hc <- hclust(distxy)
You can plot the dendrogram:
> plot(hc)
Plotting with all leaves at the same level:
> plot(as.dendrogram(hc))
Then you can cut the tree at a certain distance (height) to create clusters.
Measuring distance between clusters:
- complete linkage: the distance between the furthest elements of the respective clusters
- average linkage: the distance between the clusters' average points (mean x and y coordinates)
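hclust uses complete linkage by default; other linkages are chosen with the method argument (assuming distxy from above):
hclust(distxy)                      ## complete linkage (the default)
hclust(distxy, method = "average")  ## average linkage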
Heatmap#
> heatmap(dataMatrix, col=cm.colors(25))
K Means Clustering#
Another simple way of examining and organizing multi-dimensional data. Most useful in the early stages of analysis when you’re trying to get an understanding of the data, e.g., finding some pattern or relationship between different factors or variables.
R documentation tells us that the k-means method “aims to partition the points into k groups such that the sum of squares from points to the assigned cluster centres is minimized.”
- First guess how many clusters you have or want
- Randomly create a “centroid” (a phantom point) for each cluster
- Assign each data point to the centroid closest to it
- Readjust each centroid's position by making it the average of the points assigned to it, then repeat the assignment
k-means clustering requires some distance metric (say Euclidean), a hypothesized fixed number of clusters, and an initial guess as to cluster centroids.
k-means clustering returns a final position of each cluster’s centroid as well as the assignment of each data point or observation to a cluster
Add centroids to chart:
> points(cx, cy, col=c('red', 'orange', 'purple'), pch=3, cex=2, lwd=2)
Check the distance between the centroids and the points (mdist is a helper from the swirl lesson):
> mdist(x,y,cx,cy)
You can find the minimum with:
> apply(distTmp, 2, which.min)
[1] 2 2 2 1 3 3 3 1 3 3 3 3
Colour the points based on the above output:
> points(x,y,pch=19, cex=2, col=cols1[newClust])
Recalculate the centroids based on the clusters:
> tapply(x, newClust, mean)
1 2 3
1.210767 1.010320 2.498011
> tapply(y, newClust, mean)
1 2 3
1.730555 1.016513 1.354373
Plot the new centroid positions:
> points(newCx, newCy, col=cols1, pch=8, cex=2, lwd=2)
Some points will change cluster. Recolour the points again:
> points(x,y,pch=19, cex=2, col=cols1[newClust2])
Create a kmeans object:
> kmeans(dataFrame, centers=3)
Check number of iterations:
> kmObj$iter
[1] 2
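A sketch visualizing the result (assuming dataFrame holds the x and y columns used earlier): colour each point by its assigned cluster and mark the final centroids.
kmObj <- kmeans(dataFrame, centers = 3)
plot(x, y, col = kmObj$cluster, pch = 19, cex = 2)
points(kmObj$centers, col = 1:3, pch = 3, cex = 3, lwd = 3)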
Dimension Reduction#
Two techniques: principal component analysis (PCA) and singular value decomposition (SVD).
PCA and SVD are used in both the exploratory phase and the more formal modelling stage of analysis
We'd like to find a smaller set of multivariate variables that are uncorrelated AND explain as much variance (or variability) of the data as possible.
Orthogonal == uncorrelated
Singular Value Decomposition#
> svd(mat)
$d
[1] 9.5899624 0.1806108
$u
[,1] [,2]
[1,] -0.3897782 -0.9209087
[2,] -0.9209087 0.3897782
$v
[,1] [,2]
[1,] -0.2327012 -0.7826345
[2,] -0.5614308 0.5928424
[3,] -0.7941320 -0.1897921
Recall that in R matrix multiplication requires you to use the operator %*%
> matu %*% diag %*% t(matv)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 5 7
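Here matu, diag and matv were built from the svd output (a sketch, assuming mat from before):
s <- svd(mat)
matu <- s$u
matv <- s$v
diag <- diag(s$d)          ## diagonal matrix of the singular values
matu %*% diag %*% t(matv)  ## reconstructs mat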
Principal Component Analysis#
PCA is a simple, non-parametric method for extracting relevant information from confusing data sets.
PCA is a method to reduce a high-dimensional data set to its essential elements (not lose information) and explain the variability in the data
> svd(scale(mat))
$d
[1] 1.732051 0.000000
$u
[,1] [,2]
[1,] -0.7071068 0.7071068
[2,] 0.7071068 0.7071068
$v
[,1] [,2]
[1,] 0.5773503 -0.5773503
[2,] 0.5773503 0.7886751
[3,] 0.5773503 -0.2113249
> prcomp(scale(mat))
Standard deviations (1, .., p=2):
[1] 1.732051 0.000000
Rotation (n x k) = (3 x 2):
PC1 PC2
[1,] 0.5773503 -0.5773503
[2,] 0.5773503 0.7886751
[3,] 0.5773503 -0.2113249
SVD and PCA cannot deal with MISSING data.
Singular value decomposition is a good way to approximate data without having to store a lot.
Make sure the data are on consistent units. Separating out real patterns requires detective work.
Create a distance matrix from the first 3 columns:
> mdist <- dist(sub1[,1:3])
Create clusters
> hclustering <- hclust(mdist)
Case Study#
You can use read.table to read in data. R is smart enough to unzip it.
> dim(pm0)
[1] 117421 5
The data set has 117421 rows and 5 columns.
Reassign the variable by splitting it on the | character:
> cnames <- strsplit(cnames, '|', fixed=TRUE)
Make syntactically valid names:
> names(pm0) <- make.names(cnames[[1]][wcol])
> names(pm1) <- make.names(cnames[[1]][wcol])
Create x1 from the Sample.Value column:
> x1 <- pm1$Sample.Value
The 1999 data:
> summary(x0)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
0.00 7.20 11.50 13.74 17.90 157.10 13217
The 2012 data:
> summary(x1)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
-10.00 4.00 7.63 9.14 12.00 908.97 73133
This indicates an improved situation. The increased maximum, however, points to a possible malfunction in the capturing devices.
Boxplot the data:
> boxplot(x0, x1)
There are so many values outside the boxes and the range of x1 is so big that the boxes are flattened. It might be more informative to call boxplot on the logs (base 10) of x0 and x1. Do this now using log10(x0) and log10(x1) as the 2 arguments.
> boxplot(log10(x0), log10(x1))
R will warn about values that can't be logged (the negative values).
Get the sum of all the negative values, excluding NA:
> negative <- x1<0
> sum(negative, na.rm=TRUE)
[1] 26474
> mean(negative, na.rm=TRUE)
[1] 0.0215034
We see that just 2% of the x1 values are negative. Perhaps that’s a small enough percentage that we can ignore them
Get array of dates:
> str(dates)
int [1:1304287] 20120101 20120104 20120107 20120110 20120113 20120116 20120119 20120122 20120125 20120128 ...
The dates are hard to read though; this is improved with:
> dates <- as.Date(as.character(dates), '%Y%m%d')
Show a histogram of the dates that are negative:
> hist(dates[negative], "month")
We see the bulk of the negative measurements were taken in the winter months, with a spike in May. Not many of these negative measurements occurred in summer months. We can take a guess that because particulate measures tend to be low in winter and high in summer, coupled with the fact that higher densities are easier to measure, that measurement errors occurred when the values were low. For now we’ll attribute these negative measurements to errors. Also, since they account for only 2% of the 2012 data, we’ll ignore them.
Get the sites present in both datasets:
> both <- intersect(site0, site1)
Subset the data for a certain county and site:
> cnt0 <- subset(pm0, State.Code == 36 & county.site %in% both)
Membership is denoted with %in%.
Split data:
> sapply(split(cnt1, cnt1$county.site), nrow)
1.12 1.5 101.3 13.11 29.5 31.3 5.80 63.2008 67.1015 85.55
31 64 31 31 33 15 31 30 31 31
Create the plot#
> par(mfrow=c(1,2), mar=c(4,4,2,1))
> plot(dates0, x0sub, pch=20)
Add a line:
> abline(h=median(x0sub, na.rm=TRUE), lwd=2)
Merge data:
> mrg <- merge(d0, d1, by='state')