In this tutorial, we will be covering basics of data manipulations using in-built R functions.
For the basics, we will be using two inbuilt datasets in R for practice purposes: airquality (you can call this out directly)
Data Viewing
head(<dataframe>) – This allow you to see the first 5 rows of a dataset
head(airquality)
Output:
## Ozone Solar.R Wind Temp Month Day Year Date DateType
## 1 41 190 7.4 67 5 1 2017 2017-5-1 2017-05-01
## 2 36 118 8.0 72 5 2 2017 2017-5-2 2017-05-02
## 3 12 149 12.6 74 5 3 2017 2017-5-3 2017-05-03
## 4 18 313 11.5 62 5 4 2017 2017-5-4 2017-05-04
## 5 NA NA 14.3 56 5 5 2017 2017-5-5 2017-05-05
## 6 28 NA 14.9 66 5 6 2017 2017-5-6 2017-05-06
tail(<dataframe>) – This allows you to see the last 5 rows of a dataset
tail(airquality)
Output:
## Ozone Solar.R Wind Temp Month Day Year Date DateType
## 148 14 20 16.6 63 9 25 2017 2017-9-25 2017-09-25
## 149 30 193 6.9 70 9 26 2017 2017-9-26 2017-09-26
## 150 NA 145 13.2 77 9 27 2017 2017-9-27 2017-09-27
## 151 14 191 14.3 75 9 28 2017 2017-9-28 2017-09-28
## 152 18 131 8.0 76 9 29 2017 2017-9-29 2017-09-29
## 153 20 223 11.5 68 9 30 2017 2017-9-30 2017-09-30
str(<dataframe>) – This allows you to see an overview of your variables in the dataset.
str(airquality)
Output:
## 'data.frame': 153 obs. of 9 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R : int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : num 2017 2017 2017 2017 2017 ...
## $ Date : chr "2017-5-1" "2017-5-2" "2017-5-3" "2017-5-4" ...
## $ DateType: Date, format: "2017-05-01" "2017-05-02" ...
summary(<dataframe>) – This helps you see the statistics of observations in the variables present.
summary(airquality)
Output:
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day Year Date
## Min. :5.000 Min. : 1.0 Min. :2017 Length:153
## 1st Qu.:6.000 1st Qu.: 8.0 1st Qu.:2017 Class :character
## Median :7.000 Median :16.0 Median :2017 Mode :character
## Mean :6.993 Mean :15.8 Mean :2017
## 3rd Qu.:8.000 3rd Qu.:23.0 3rd Qu.:2017
## Max. :9.000 Max. :31.0 Max. :2017
##
## DateType
## Min. :2017-05-01
## 1st Qu.:2017-06-08
## Median :2017-07-16
## Mean :2017-07-16
## 3rd Qu.:2017-08-23
## Max. :2017-09-30
##
dim(<dataframe>) – A quick way to know the number of rows and column of a dataset.
dim(airquality)
Output:
## [1] 153 9
Subsetting Data
A quick way to isolate rows and colums can be done in the manner dataframe[x,y], where x is the column and y is row number. Leaving x or y blank will return the entire row or column that you want. Also, you can not only isolate for 1 row/column, but a certain range of it if you replace the value from x and y with a vector of integers that correspond with the row/columns number you want.
Singular row/columns
You should assign the subsetted dataframe to another variable for easy access in the future, else you will have to retype the command again if you want that specific subset again.
a <- airquality[3,] b <- airquality[,4] a
Output:
## Ozone Solar.R Wind Temp Month Day Year Date DateType
## 3 12 149 12.6 74 5 3 2017 2017-5-3 2017-05-03
b
Output:
## [1] 67 72 74 62 56 66 65 59 61 69 74 69 66 68 58 64 66 57 68 62 59 73 61
## [24] 61 57 58 57 67 81 79 76 78 74 67 84 85 79 82 87 90 87 93 92 82 80 79
## [47] 77 72 65 73 76 77 76 76 76 75 78 73 80 77 83 84 85 81 84 83 83 88 92
## [70] 92 89 82 73 81 91 80 81 82 84 87 85 74 81 82 86 85 82 86 88 86 83 81
## [93] 81 81 82 86 85 87 89 90 90 92 86 86 82 80 79 77 79 76 78 78 77 72 75
## [116] 79 81 86 88 97 94 96 94 91 92 93 93 87 84 80 78 75 73 81 76 77 71 71
## [139] 78 67 76 68 82 64 71 81 69 63 70 77 75 76 68
Range of row/columns
Let’s say I want rows from 4 to 10, I just need to create a vector c(4:10) and put it in.
REMEMBER THE COMMA, it was one of the major mistakes I made.
airquality[c(4:10),]
Output:
## Ozone Solar.R Wind Temp Month Day Year Date DateType
## 4 18 313 11.5 62 5 4 2017 2017-5-4 2017-05-04
## 5 NA NA 14.3 56 5 5 2017 2017-5-5 2017-05-05
## 6 28 NA 14.9 66 5 6 2017 2017-5-6 2017-05-06
## 7 23 299 8.6 65 5 7 2017 2017-5-7 2017-05-07
## 8 19 99 13.8 59 5 8 2017 2017-5-8 2017-05-08
## 9 8 19 20.1 61 5 9 2017 2017-5-9 2017-05-09
## 10 NA 194 8.6 69 5 10 2017 2017-5-10 2017-05-10
If you want to exclude a certain row/column, you just need to put an ‘-‘ sign in front of the vector.
airquality[-c(1:140),]
Output:
## Ozone Solar.R Wind Temp Month Day Year Date DateType
## 141 13 27 10.3 76 9 18 2017 2017-9-18 2017-09-18
## 142 24 238 10.3 68 9 19 2017 2017-9-19 2017-09-19
## 143 16 201 8.0 82 9 20 2017 2017-9-20 2017-09-20
## 144 13 238 12.6 64 9 21 2017 2017-9-21 2017-09-21
## 145 23 14 9.2 71 9 22 2017 2017-9-22 2017-09-22
## 146 36 139 10.3 81 9 23 2017 2017-9-23 2017-09-23
## 147 7 49 10.3 69 9 24 2017 2017-9-24 2017-09-24
## 148 14 20 16.6 63 9 25 2017 2017-9-25 2017-09-25
## 149 30 193 6.9 70 9 26 2017 2017-9-26 2017-09-26
## 150 NA 145 13.2 77 9 27 2017 2017-9-27 2017-09-27
## 151 14 191 14.3 75 9 28 2017 2017-9-28 2017-09-28
## 152 18 131 8.0 76 9 29 2017 2017-9-29 2017-09-29
## 153 20 223 11.5 68 9 30 2017 2017-9-30 2017-09-30
P.S. I would like to point out that instead of calling columns out by its column number, it is more intuitive to call out the header of the column through the method below:
c <- airquality$Wind c
Output:
## [1] 7.4 8.0 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 6.9 9.7 9.2 10.9
## [15] 13.2 11.5 12.0 18.4 11.5 9.7 9.7 16.6 9.7 12.0 16.6 14.9 8.0 12.0
## [29] 14.9 5.7 7.4 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 10.9
## [43] 9.2 8.0 13.8 11.5 14.9 20.7 9.2 11.5 10.3 6.3 1.7 4.6 6.3 8.0
## [57] 8.0 10.3 11.5 14.9 8.0 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7
## [71] 7.4 8.6 14.3 14.9 14.9 14.3 6.9 10.3 6.3 5.1 11.5 6.9 9.7 11.5
## [85] 8.6 8.0 8.6 12.0 7.4 7.4 7.4 9.2 6.9 13.8 7.4 6.9 7.4 4.6
## [99] 4.0 10.3 8.0 8.6 11.5 11.5 11.5 9.7 11.5 10.3 6.3 7.4 10.9 10.3
## [113] 15.5 14.3 12.6 9.7 3.4 8.0 5.7 9.7 2.3 6.3 6.3 6.9 5.1 2.8
## [127] 4.6 7.4 15.5 10.9 10.3 10.9 9.7 14.9 15.5 6.3 10.9 11.5 6.9 13.8
## [141] 10.3 10.3 8.0 12.6 9.2 10.3 10.3 16.6 6.9 13.2 14.3 8.0 11.5
At this point, you might question the usefulness of subsetting via column/row numbers. This is just to give you a basic understanding of how R handles data frames. I will introduce a more efficient method in Tutorial 7.
Conditions with Logical Operators
Now we will move on to subset data using logical operators (less than / greater than / equal to). This is a big advantage R has over Excel as Excel commands and filters are rather limited.
An inbuilt function called subset(, condition) is key to doing so. We now want to isolate observations in the dataset where the temperature fell below 65F:
colddays <- subset(airquality, airquality$Temp<65) colddays
Output:
## Ozone Solar.R Wind Temp Month Day Year Date DateType
## 4 18 313 11.5 62 5 4 2017 2017-5-4 2017-05-04
## 5 NA NA 14.3 56 5 5 2017 2017-5-5 2017-05-05
## 8 19 99 13.8 59 5 8 2017 2017-5-8 2017-05-08
## 9 8 19 20.1 61 5 9 2017 2017-5-9 2017-05-09
## 15 18 65 13.2 58 5 15 2017 2017-5-15 2017-05-15
## 16 14 334 11.5 64 5 16 2017 2017-5-16 2017-05-16
## 18 6 78 18.4 57 5 18 2017 2017-5-18 2017-05-18
## 20 11 44 9.7 62 5 20 2017 2017-5-20 2017-05-20
## 21 1 8 9.7 59 5 21 2017 2017-5-21 2017-05-21
## 23 4 25 9.7 61 5 23 2017 2017-5-23 2017-05-23
## 24 32 92 12.0 61 5 24 2017 2017-5-24 2017-05-24
## 25 NA 66 16.6 57 5 25 2017 2017-5-25 2017-05-25
## 26 NA 266 14.9 58 5 26 2017 2017-5-26 2017-05-26
## 27 NA NA 8.0 57 5 27 2017 2017-5-27 2017-05-27
## 144 13 238 12.6 64 9 21 2017 2017-9-21 2017-09-21
## 148 14 20 16.6 63 9 25 2017 2017-9-25 2017-09-25
You can also have many other conditions, that can be joined together with the ‘and’ and ‘or’ functions.
coldandwindydays <- subset(airquality, airquality$windy>15&airquality$Temp<65) coldandwindydays
Output:
## Ozone Solar.R Wind Temp Month Day Year Date DateType
## 9 8 19 20.1 61 5 9 2017 2017-5-9 2017-05-09
## 18 6 78 18.4 57 5 18 2017 2017-5-18 2017-05-18
## 25 NA 66 16.6 57 5 25 2017 2017-5-25 2017-05-25
## 148 14 20 16.6 63 9 25 2017 2017-9-25 2017-09-25
Max & Min
Thus, with the understanding how subsetting works, we can move on to utilise more functions that help you refine your subsets.
Max() and Min () returns the max/min of the column Temperature
max(airquality$Temp)
Output:
## [1] 97
min(airquality$Temp)
Output:
## [1] 56
which.min & which.max
‘which’ function here helps you isolate the row/column number of the min or max. ‘which’ can be used with other functions as well, so try it out yourself!
which.min(airquality$Temp)
Output:
## [1] 5
which.max(airquality$Temp)
Output:
## [1] 120
mean() & median()
If you would like to know if the observations are above/below the average or median you can use mean() and median().
mean(airquality$Temp)
Output:
## [1] 77.88235
median(airquality$Temp)
Output:
## [1] 79
Quantile
There is also this function called quantile, for those people who want to find out the bell curve 😉
quantile(airquality$Temp,probs = seq(0,1,0.25))
Output:
## 0% 25% 50% 75% 100%
## 56 72 79 85 97
Data Cleansing
Any data scientist/or analyst handling data will tell you they spend 90% of the time on this part, as real-world data always look like a hot mess of raw text. Identifying potential pitfalls of your data (e.g. missing, incomplete, badly formatted data) comes with experience.
The writer of this article (me) has very little experience in this. Hence, we will be covering some elementary techniques of “data-cleaning”
Removing Empty Observations
#before deleting observations with missing values, you might want to check it out first. #is.na(dataframe) returns a vector of numbers that are the rows that have missing values. airquality[is.na(airquality),]
Output:
## Ozone Solar.R Wind Temp Month Day Year Date DateType
## 5 NA NA 14.3 56 5 5 2017 2017-5-5 2017-05-05
## 10 NA 194 8.6 69 5 10 2017 2017-5-10 2017-05-10
## 25 NA 66 16.6 57 5 25 2017 2017-5-25 2017-05-25
## 26 NA 266 14.9 58 5 26 2017 2017-5-26 2017-05-26
## 27 NA NA 8.0 57 5 27 2017 2017-5-27 2017-05-27
## 32 NA 286 8.6 78 6 1 2017 2017-6-1 2017-06-01
## 33 NA 287 9.7 74 6 2 2017 2017-6-2 2017-06-02
## 34 NA 242 16.1 67 6 3 2017 2017-6-3 2017-06-03
## 35 NA 186 9.2 84 6 4 2017 2017-6-4 2017-06-04
## 36 NA 220 8.6 85 6 5 2017 2017-6-5 2017-06-05
## 37 NA 264 14.3 79 6 6 2017 2017-6-6 2017-06-06
## 39 NA 273 6.9 87 6 8 2017 2017-6-8 2017-06-08
## 42 NA 259 10.9 93 6 11 2017 2017-6-11 2017-06-11
## 43 NA 250 9.2 92 6 12 2017 2017-6-12 2017-06-12
## 45 NA 332 13.8 80 6 14 2017 2017-6-14 2017-06-14
## 46 NA 322 11.5 79 6 15 2017 2017-6-15 2017-06-15
## 52 NA 150 6.3 77 6 21 2017 2017-6-21 2017-06-21
## 53 NA 59 1.7 76 6 22 2017 2017-6-22 2017-06-22
## 54 NA 91 4.6 76 6 23 2017 2017-6-23 2017-06-23
## 55 NA 250 6.3 76 6 24 2017 2017-6-24 2017-06-24
## 56 NA 135 8.0 75 6 25 2017 2017-6-25 2017-06-25
## 57 NA 127 8.0 78 6 26 2017 2017-6-26 2017-06-26
## 58 NA 47 10.3 73 6 27 2017 2017-6-27 2017-06-27
## 59 NA 98 11.5 80 6 28 2017 2017-6-28 2017-06-28
## 60 NA 31 14.9 77 6 29 2017 2017-6-29 2017-06-29
## 61 NA 138 8.0 83 6 30 2017 2017-6-30 2017-06-30
## 65 NA 101 10.9 84 7 4 2017 2017-7-4 2017-07-04
## 72 NA 139 8.6 82 7 11 2017 2017-7-11 2017-07-11
## 75 NA 291 14.9 91 7 14 2017 2017-7-14 2017-07-14
## 83 NA 258 9.7 81 7 22 2017 2017-7-22 2017-07-22
## 84 NA 295 11.5 82 7 23 2017 2017-7-23 2017-07-23
## 102 NA 222 8.6 92 8 10 2017 2017-8-10 2017-08-10
## 103 NA 137 11.5 86 8 11 2017 2017-8-11 2017-08-11
## 107 NA 64 11.5 79 8 15 2017 2017-8-15 2017-08-15
## 115 NA 255 12.6 75 8 23 2017 2017-8-23 2017-08-23
## 119 NA 153 5.7 88 8 27 2017 2017-8-27 2017-08-27
## 150 NA 145 13.2 77 9 27 2017 2017-9-27 2017-09-27
## NA NA NA NA NA NA NA NA
## NA.1 NA NA NA NA NA NA NA
## NA.2 NA NA NA NA NA NA NA
## NA.3 NA NA NA NA NA NA NA
## NA.4 NA NA NA NA NA NA NA
## NA.5 NA NA NA NA NA NA NA
## NA.6 NA NA NA NA NA NA NA
#you might be just concerned with the NA values of 1 certain column such as Solar.R, so you can refine the search as below airquality[is.na(airquality$Solar.R),]
Output:
## Ozone Solar.R Wind Temp Month Day Year Date DateType
## 5 NA NA 14.3 56 5 5 2017 2017-5-5 2017-05-05
## 6 28 NA 14.9 66 5 6 2017 2017-5-6 2017-05-06
## 11 7 NA 6.9 74 5 11 2017 2017-5-11 2017-05-11
## 27 NA NA 8.0 57 5 27 2017 2017-5-27 2017-05-27
## 96 78 NA 6.9 86 8 4 2017 2017-8-4 2017-08-04
## 97 35 NA 7.4 85 8 5 2017 2017-8-5 2017-08-05
## 98 66 NA 4.6 87 8 6 2017 2017-8-6 2017-08-06
#From the above methods, you can assign dummy values to the rows, but if you want to remove those observations, you can use na.omit(dataframe) na.omit(airquality)
Output:
## Ozone Solar.R Wind Temp Month Day Year Date DateType
## 1 41 190 7.4 67 5 1 2017 2017-5-1 2017-05-01
## 2 36 118 8.0 72 5 2 2017 2017-5-2 2017-05-02
## 3 12 149 12.6 74 5 3 2017 2017-5-3 2017-05-03
## 4 18 313 11.5 62 5 4 2017 2017-5-4 2017-05-04
## 7 23 299 8.6 65 5 7 2017 2017-5-7 2017-05-07
## 8 19 99 13.8 59 5 8 2017 2017-5-8 2017-05-08
## 9 8 19 20.1 61 5 9 2017 2017-5-9 2017-05-09
.
.
.
## 152 18 131 8.0 76 9 29 2017 2017-9-29 2017-09-29
## 153 20 223 11.5 68 9 30 2017 2017-9-30 2017-09-30
Changing data types of variables
Though there are other data types that can be used for analyses, most of the time raw data come in raw text or number (e.g. phone numbers, address).
But for this tutorial, we will just look at the example of formatting raw text into the ‘date’ class. Why bother to change it to date type? When you do time series analysis, you will realise R packages would require the column’s data class to be “Date”.
First, we create a column for a year.
airquality$Year <- 2017
Then, we create another column for a formatted version of the date, separated with ‘-‘
airquality$Date <- with(airquality,paste(Year,Month,Day,sep='-'))
Now the class of the Date Column in the data frame is “character”.
class(airquality$Date)
Output:
## [1] "character"
Now we can use as.Date(dataframe, format) to change this raw text into a ‘Date’ class. As you type as.Date, you might realise the autocomplete shows you many other “as.xxxx” function, so you can actually explore different classes.
P.S. you can actually type the previous line into this line, without creating another column, but for the sake of understanding, it is separated.
airquality$DateType <- as.Date(airquality$Date,"%Y-%m-%d")
Check the class of the Date Column again.
class(airquality$DateType)
Output:
## [1] "Date"
Next: Tutorial 7 Data Manipulation (Continued)