Data Analysis with the Tidyverse (2023)

In this tutorial, we will learn about R functions and data analysis with the tidyverse package.


In this section, we will learn about common built-in functions that are useful for obtaining summary statistics, ranking data, and performing data analysis. We will also learn how to write our own custom functions in R.

Built-in Functions

Percentiles

The functions below are useful for studying the distribution of numeric values within a data set. All of these functions take a numeric vector as their input.

  • min()
    • Returns the minimum value
  • max()
    • Returns the maximum value
  • range()
    • Returns a vector of length 2 with the range of observed values (minimum and maximum values)
  • median()
    • Returns the median value (50th percentile)
  • fivenum()
    • Returns a vector of length 5 with the minimum, 25th percentile, median, 75th percentile, maximum values
  • quantile()
    • Returns the specified percentile(s) of a set of numeric values

Examples

Obtaining the range of values present in a numeric vector.

data_vector <- c(3, 9, 11.2, 14, 28.7, 30, 15, 21, 5.7, 9.1, 24.6)

# minimum value in data_vector
min(data_vector)
[1] 3

# maximum value
max(data_vector)
[1] 30

# range of data values
range(data_vector)
[1]  3 30

The median() and quantile() functions are used for obtaining specific percentiles from a distribution of numbers. A percentile of a set of numbers is a value at or below which a given percentage of the values fall. For example, the 50th percentile (also called the median) represents the center of a set of numeric data: 50% of all the values are less than or equal to it.

The quantile() function requires two inputs. The first is a numeric vector of data values and the second is a vector with values ranging from 0 to 1, representing the percentile(s) to calculate.

# median
median(data_vector)
[1] 14

# 30th percentile
quantile(data_vector, 0.3)
30% 
9.1 

# 30th, 60th, and 90th percentiles
quantile(data_vector, c(0.3, 0.6, 0.9))
 30%  60%  90% 
 9.1 15.0 28.7 

The fivenum() function calculates the five number summary (min, 25th, median, 75th, max) of a numeric vector.

fivenum(data_vector)
[1] 3.00 9.05 14.00 22.80 30.00

Mean and Standard Deviation

The mean() and sd() functions are used to calculate the mean and standard deviation of a set of data values.

# mean value
mean(data_vector)
[1] 15.57273

# standard deviation
sd(data_vector)
[1] 9.241114

Adding Elements of a Numeric Vector

The sum() and cumsum() functions are used for summing the numbers within a vector. The sum() function simply returns the sum of all numbers within a vector.

The cumsum() function calculates a cumulative sum at every position within a vector. This function always returns a vector of the same length as the input.

# sum of all values
sum(data_vector)
[1] 171.3

# cumulative sum
cumsum(data_vector)
 [1]   3.0  12.0  23.2  37.2  65.9  95.9 110.9 131.9 137.6 146.7 171.3

Functions Useful for Ranking Data

The abs() and rank() functions are useful for ranking data values. The abs() function returns the absolute values of a vector.

negative_data <- c(-2, 4.5, -6, 10, 12)

# returns the absolute value of all elements
abs(negative_data)
[1]  2.0  4.5  6.0 10.0 12.0

The rank() function returns the ranks of a set of data values from smallest to largest. The smallest value is given a rank of 1.

data_vector
 [1] 3.0 9.0 11.2 14.0 28.7 30.0 15.0 21.0 5.7 9.1 24.6
rank(data_vector)
 [1] 1 3 5 6 10 11 7 8 2 4 9

To obtain ranks from largest to smallest, where rank 1 represents the largest value, just take the rank of the negative of a numeric vector. In the example below, the value 30 is given a rank of 1.

data_vector
 [1] 3.0 9.0 11.2 14.0 28.7 30.0 15.0 21.0 5.7 9.1 24.6
rank(-data_vector)
 [1] 11 9 7 6 2 1 5 4 10 8 3

Writing Functions in R

There are many cases where we will have to write our own functions to accomplish tasks in an analytics project. R functions can be defined to take any number of inputs (usually called arguments), but they return only one object.

The basic syntax of creating a function with arguments x and y is as follows:

my_function <- function(x, y) { R Code here }
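The template above shows two arguments, while the examples that follow use only one. As a minimal sketch of the two-argument form (the function name pct_of_total and its arguments are illustrative, not part of the course material):

```r
# A small two-argument function: x is a numeric vector and digits
# controls the rounding of the result (both names are arbitrary)
pct_of_total <- function(x, digits) {
  return(round(100 * x / sum(x), digits))
}

pct_of_total(c(2, 3, 5), 1)
# [1] 20 30 50
```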

Assume that we would like to write a function that takes a numeric vector as input and returns a vector of scaled values. For each value in our original vector, we will subtract the mean and divide by the standard deviation. In Statistics, this transformation is sometimes called a z-score.

In the code cell below, I demonstrate how this can be done without writing a function.

numeric_data <- c(3, 8, 4, 7, 12, 2)

# Calculate the z-scores of numeric_data
(numeric_data - mean(numeric_data)) / sd(numeric_data)
[1] -0.8017837  0.5345225 -0.5345225  0.2672612  1.6035675 -1.0690450

Instead of typing the above expression every time we need to perform this transformation, let’s write a custom function that performs this task.

I will show two equivalent ways of writing this function and discuss the difference.

Note that the input value is named x. This is completely arbitrary. The input value could also have been named input as long as the same name is used within the code of the function. In our code below, x simply represents the numeric vector that we expect to get passed into the function.

z_score_1 <- function(x) {
  return((x - mean(x))/sd(x))
}

# Let's test our function
age_vector <- c(18, 24, 21, 37, 51, 34, 41)

z_score_1(age_vector)
[1] -1.1992327 -0.6955550 -0.9473939  0.3957468  1.5709949  0.1439079  0.7315320

By default, an R function returns the results of the last operation that it performed. The code below is an equivalent way of writing the same function. In this case we do not need to use return to give us the result.

# Equivalent
z_score_2 <- function(x) {
  (x - mean(x))/sd(x)
}

# Check results
z_score_2(age_vector)
[1] -1.1992327 -0.6955550 -0.9473939  0.3957468  1.5709949  0.1439079  0.7315320

The return() call is useful when you need to return a list of results from a function. The function below creates three objects, mean_x, sd_x, and scaled_data. To obtain all of these results, we use return() with a list that contains all of the objects.

# return a list
z_score_3 <- function(x) {
  mean_x <- mean(x)                  # Calculate and save the mean
  sd_x <- sd(x)                      # Calculate and save the standard deviation
  scaled_data <- (x - mean_x)/sd_x   # Save the transformed vector
  
  return(list(mean_value = mean_x,
              sd_value = sd_x,
              scaled_vector = scaled_data))
}

detailed_results <- z_score_3(age_vector)

# View the results
detailed_results
$mean_value
[1] 32.28571

$sd_value
[1] 11.91238

$scaled_vector
[1] -1.1992327 -0.6955550 -0.9473939  0.3957468  1.5709949  0.1439079  0.7315320

This section will cover the basics of data manipulation using the tidyverse package. Before we can use the package, we must load it into our environment with the following code: library(tidyverse). This will import all of the functions available in the tidyverse package into our environment.

The tidyverse is a collection of 8 packages that are designed specifically for data science tasks.

In this course, I have installed all required packages into our RStudio Cloud environment. If you are ever working with RStudio on your desktop, you must install packages before they can be used. This is done with the following code: install.packages('tidyverse').

To get more details about the tidyverse package, see the tidyverse documentation.

We will also load the skimr package which is used for exploring the structure of a data frame.

# This will load all 8 of the tidyverse packages
library(tidyverse)
library(skimr)

Tibbles

The first package we will explore is tibble. The tibble package is used for creating special types of data frames called tibbles.

Tibbles are data frames with added properties and functionality. Many of the core functions in the tidyverse take tibbles as arguments and return them as results after execution.

Creating tibbles

R has many built-in datasets that can be loaded as data frames. One example is the iris data frame. To load this data, you just have to type iris in the R console.

Each row in iris represents a flower, with measurements of the length and width of its sepal and petal.

By default, R will try to print every row of a data frame, easily overwhelming your console. Another property of R data frames is that each row is labeled with a number. These are known as row labels.

iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
6            5.4         3.9          1.7         0.4     setosa
...
148          6.5         3.0          5.2         2.0  virginica
149          6.2         3.4          5.4         2.3  virginica
150          5.9         3.0          5.1         1.8  virginica
(all 150 rows print to the console)

Converting Data Frames to Tibbles

To convert any R data frame into a tibble, we can use the as_tibble() function from the tibble package. In the code below, we create a tibble named iris_tbl.

A nice property of tibbles is that they only print the first 10 rows of data and label each column with its respective data type. In the output below, “dbl” stands for numeric.

iris_tbl <- as_tibble(iris)

iris_tbl

When we pass iris_tbl to the str() function, we see that it lets us know that we have a tibble.

str(iris_tbl)
tibble [150 × 5] (S3: tbl_df/tbl/data.frame)
 $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Converting Tibbles to Data Frames

In general, tibbles are much easier to work with than data frames. However, not all R functions are able to work with them. If you ever encounter this situation, it is easy to convert a tibble back to a data frame with the as.data.frame() function.

The code below converts our iris_tbl back to a data frame.

iris_df <- as.data.frame(iris_tbl)

str(iris_df)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Creating Tibbles with tibble()

We can create tibbles from individual vectors using the tibble() function. This is similar to how data frames are created with the data.frame() function.

One major difference is that tibble() allows you to reference variables within the function call. You can even use R functions to create new columns. See the example below that uses tibble() to create a simple dataset.

my_tbl <- tibble(column_1 = c(1, 3, 7, 2.5, 22),
                 column_2 = c('A', 'B', 'C', 'D', 'E'),
                 column_3 = (column_1 * 2) + 10,
                 column_4 = column_1 + mean(column_1))

my_tbl

Loading Data into R

Before we are able to perform data analysis, we must import data into our R environment.

The tidyverse package loads the readr package which contains a number of functions for importing data into R.

The read_delim() function is used to import flat files such as comma-delimited (.csv) or tab-delimited (.txt) files.

The read_delim() function takes many arguments, but the 3 most important are:

  • file - the first argument is the path to a file on your computer or website address of the data file
  • delim - the type of delimiter in the data file (either “,” for comma, “\t” for tab, or any other character)
  • col_names - TRUE or FALSE to indicate whether a file has column names

To see how this function works, let’s import the Wine Dataset from the UCI Machine Learning Repository.

If there are no column names in a dataset, read_delim() will auto-generate names consisting of an X followed by the column number (X1, X2, X3, and so on).

The read_delim() function will also print a message to the R console about the data types it has assigned to each column.

wine_data <- read_delim('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', 
                        delim = ',', 
                        col_names = FALSE)
Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_double(),
  X7 = col_double(),
  X8 = col_double(),
  X9 = col_double(),
  X10 = col_double(),
  X11 = col_double(),
  X12 = col_double(),
  X13 = col_double(),
  X14 = col_double()
)
wine_data

In this course, we will be loading tibbles from our course website with the read_rds() function (as demonstrated below).

However, I recommend that you refer to the readr documentation to get more familiar with reading different types of data into your R environment.

Employee Attrition Data

The code below will import a data set from our course website. The data consists of 1,470 employee records for a U.S. based product company. The rows in this data frame represent the attributes of an employee at this company across the variables listed in the table below.

Variable              Definition
left_company          Did the employee leave the company? (Yes/No)
department            Department within the company
job_level             Job Level (Associate - Vice President)
salary                Employee yearly salary (US Dollars)
weekly_hours          Self-reported average weekly hours spent on the job (company survey)
business_travel       Level of required business travel
yrs_at_company        Tenure at the company (years)
yrs_since_promotion   Years since last promotion
previous_companies    Number of previous companies for which the employee has worked
job_satisfaction      Self-reported job satisfaction (company survey)
performance_rating    Most recent annual performance rating
marital_status        Marital status (Single, Married, or Divorced)
miles_from_home       Distance from employee address to office location

This data is stored in a special type of data frame known as a tibble. Data frames in the tidyverse are usually stored in this format, which has special properties including better printing features and labels for column data types.

employee_data <- read_rds(url('https://gmudatamining.com/data/employee_data.rds'))

# View data
employee_data

Exploring Data Frames with skimr

The first step in a data analysis project is to explore your data source. This includes summarizing the values within each column, checking for missing data, checking the data types of each column, and verifying the number of rows and columns.

The skim() function can be used to accomplish all of this. It takes your data frame as an argument. In the output below, we first get the number of rows and columns along with the data types present in our data.

The results are then grouped by the type of variables in our data.

First we get a summary of our factor variables, including the number of missing observations, whether our factor levels are ordered, the count of unique levels, and an abbreviated list of the most frequent factor levels.

Then we get a summary of our numeric variables which include the number of missing observations, the mean and standard deviation, a five number summary, and a plot of the distribution of values.

# View data frame properties and summary statistics
skim(employee_data)
Data summary

Name                     employee_data
Number of rows           1470
Number of columns        13
_______________________
Column type frequency:
  factor                 7
  numeric                6
________________________
Group variables          None

Variable type: factor

skim_variable       n_missing  complete_rate  ordered  n_unique  top_counts
left_company        0          1              FALSE    2         No: 1233, Yes: 237
department          0          1              FALSE    6         IT : 399, Res: 293, Sal: 252, Mar: 238
job_level           0          1              FALSE    5         Sen: 476, Man: 344, Dir: 331, Ass: 185
business_travel     0          1              FALSE    3         Rar: 1043, Fre: 277, Non: 150
job_satisfaction    0          1              FALSE    4         Ver: 459, Hig: 442, Low: 289, Med: 280
performance_rating  0          1              FALSE    5         Mee: 546, Exc: 472, Exc: 286, Min: 136
marital_status      0          1              FALSE    3         Mar: 673, Sin: 470, Div: 327

Variable type: numeric

skim_variable        n_missing  complete_rate  mean      sd        p0        p25       p50       p75        p100      hist
salary               0          1              94076.25  37590.24  29848.56  70379.48  88555.53  117099.92  212134.7  ▃▇▃▁▁
weekly_hours         0          1              50.02     4.82      40.00     47.00     49.00     52.0       66.0      ▂▇▃▂▁
yrs_at_company       0          1              7.01      6.13      0.00      3.00      5.00      9.0        40.0      ▇▂▁▁▁
yrs_since_promotion  0          1              2.19      3.22      0.00      0.00      1.00      3.0        15.0      ▇▁▁▁▁
previous_companies   0          1              3.24      1.58      1.00      2.00      3.00      4.0        7.0       ▇▇▂▂▃
miles_from_home      0          1              9.19      8.11      1.00      2.00      7.00      14.0       29.0      ▇▅▂▂▂

It is also possible to select a subset of variables to explore. Just pass a sequence of unquoted variable names into the skim() function.

The skimr package has many more features for exploring data. Once we cover the fundamentals of dplyr in the next sections, I encourage interested students to explore the skimr documentation.

# View data frame properties and summary statistics
skim(employee_data, left_company, department, salary, weekly_hours)
Data summary

Name                     employee_data
Number of rows           1470
Number of columns        13
_______________________
Column type frequency:
  factor                 2
  numeric                2
________________________
Group variables          None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
left_company   0          1              FALSE    2         No: 1233, Yes: 237
department     0          1              FALSE    6         IT : 399, Res: 293, Sal: 252, Mar: 238

Variable type: numeric

skim_variable  n_missing  complete_rate  mean      sd        p0        p25       p50       p75        p100      hist
salary         0          1              94076.25  37590.24  29848.56  70379.48  88555.53  117099.92  212134.7  ▃▇▃▁▁
weekly_hours   0          1              50.02     4.82      40.00     47.00     49.00     52.0       66.0      ▂▇▃▂▁

Data Manipulation with dplyr

In this section we will cover data manipulation with the dplyr package. This is one of the core tidyverse packages used for exploring data frames.

Chapter 5 of R for Data Science covers the basics of manipulating data frames in R. In this tutorial, I would like to provide additional examples of the main functions of dplyr, including filter(), select(), arrange(), summarise(), and mutate().

The first argument to all of these functions is a data frame, followed by additional arguments that perform various manipulations on the data. The output from all of these functions will also be a special type of data frame known as a tibble.

filter()

The filter() function is used for subsetting rows of a data frame. It is much more intuitive than subsetting with the base R operators [ ] and [[ ]].

The first argument to filter() is a data frame, followed by one or more logical conditions on the variables within the data frame. Logical conditions separated by a comma are treated as an AND (&) operation. The advantage of dplyr is that you can pass variable names of a data frame in raw, unquoted format to many functions. The filter() function returns a data frame that has been subsetted by the logical conditions within its arguments.

# employees that left the company
filter(employee_data, left_company == 'Yes')

# View employees that left from the Sales department
filter(employee_data, left_company == 'Yes', department == 'Sales')

To filter a data frame using an OR condition, we must use the | operator.

# employees from Sales or Marketing department
filter(employee_data, department == 'Sales' | department == 'Marketing')

Another way to execute OR statements is with the %in% operator. It checks whether each value in a column matches at least one element of a vector, which in many cases can save a lot of typing. The code below produces the same result as the previous command.

# employees from Sales or Marketing department
filter(employee_data, department %in% c('Sales', 'Marketing'))

What if we are interested in employees from Sales or Marketing that make over $80,000? We can just add another condition to the previous code. Remember that conditions separated by a comma represent an AND operation. So in the code below, we are passing the following condition: employees with salary > 80000 AND (department is Sales OR department is Marketing)

# employees from Sales or Marketing with salary over $80,000
filter(employee_data, salary > 80000, department %in% c('Sales', 'Marketing'))

select()

The select() function allows you to select a subset of columns from a data frame. There are multiple ways to enter the selection condition and many helper functions, such as starts_with(), ends_with(), and contains(). See the documentation for more examples.

We can select columns by using unquoted column names.

# Select the first three columns
select(employee_data, left_company, department, job_level)

We can also select columns by using their numeric positions.

# Select the first three columns with a numeric vector
select(employee_data, c(1, 2, 3))

We can also pass a sequence of numeric positions separated by a comma.

# Select the first three columns with raw numbers
select(employee_data, 1, 2, 3)

Adding a - in front of numeric positions or variable names excludes those variables and returns all others.

# Select all columns except department and job_level
select(employee_data, -department, -job_level)

# Exclude the first 5 columns
select(employee_data, -1, -2, -3, -4, -5)

# Many times it's easier to use a sequence
select(employee_data, -(1:5))

There are many helper functions available with the select() function. Below are two examples. The first selects columns that contain the word “job”. The second selects columns that start with the letter “y”.

# Select all variables that contain the word job
select(employee_data, contains('job'))

# Select all variables that start with "y" (or "Y")
select(employee_data, starts_with("y"))
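The ends_with() helper mentioned earlier works the same way. As a small self-contained sketch (using a toy tibble whose columns are illustrative, not the course data):

```r
library(tibble)
library(dplyr)

# A toy tibble with column names that share an ending (illustrative only)
demo_tbl <- tibble(left_company   = c('Yes', 'No'),
                   salary         = c(90000, 80000),
                   yrs_at_company = c(3, 7))

# Select all variables whose names end with "company"
select(demo_tbl, ends_with('company'))
```

Here ends_with('company') matches both left_company and yrs_at_company, but not salary.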

arrange()

The arrange() function allows you to sort a data frame by specific columns. The default sorting order is ascending.

# Sort by left_company and salary
arrange(employee_data, left_company, salary)

To sort a column in descending order, wrap it in the desc() function.

# Sort by left company (descending) and salary (ascending)
arrange(employee_data, desc(left_company), salary)

summarise()

The summarise() function allows you to create numerical summaries, such as averages, from the columns of a data frame. The summarise() function is an aggregation function, meaning that it takes input with multiple values and returns a single value. With a sample average, for example, we take a numeric vector with n numbers and summarize it with a single value, the average.

The summarise() function takes a data frame as the first argument followed by a sequence of functions to execute on the columns of the data frame.

Remember that dplyr functions will always return a data frame of results, even if the results are a single number or row.

# Let's get the average salary
summarise(employee_data, average_salary = mean(salary))

# We can also create multiple summaries at once
summarise(employee_data, 
          salary_min = min(salary),
          salary_25th = quantile(salary, 0.25),
          salary_50th = median(salary),
          salary_75th = quantile(salary, 0.75),
          salary_max = max(salary))

mutate()

The mutate() function allows you to create new variables in your data. The mutate() function is a window function. Window functions take input with n values and always return a result with n values. Like summarise(), mutate() takes a data frame as the first argument followed by a sequence of functions to execute on the columns of the data frame.

Let’s create a new variable salary_scaled that transforms the salary values by subtracting the mean and dividing by the standard deviation. In the code below, I save this new data frame as employee_data_scaled.

# Create salary_scaled variable
employee_data_scaled <- mutate(employee_data, 
                               salary_scaled = (salary - mean(salary))/sd(salary))

# Let's see the result
select(employee_data_scaled, salary, salary_scaled)

Combining Steps With The %>% Operator

One of the most useful tools in the tidyverse is the pipe operator, %>%. This operator allows users to combine multiple data manipulation steps into a single expression.

The pipe operator tells R to take the object on the left and pass it to the right as the first argument to the next function. A keyboard shortcut for the pipe operator is Ctrl + Shift + m.

Let’s go through a simple example to demonstrate the %>% operator. Suppose I have a vector of numeric values and I need to do the following:

  • Calculate a vector with the cumulative sums using the cumsum() function
  • Calculate the median of this new vector with the median() function

The code below shows how this is done without the pipe operator. Most people new to programming find the code a bit difficult to read because the operations go from right to left.

cumsum(numeric_data) is evaluated first, then passed to the median() function on the left.

If we had an operation that required 3 or more steps, writing code in this way would quickly get confusing to understand and explain.

# Data values
numeric_data <- c(2, 5, 1, 7, 9, 10)

# Cumulative sum
cumsum(numeric_data)
[1]  2  7  8 15 24 34

# Median of the cumulative sum
median(cumsum(numeric_data))
[1] 11.5

The %>% operator can be used to write cleaner code that reads from left to right in a do this - then that fashion. In the code below, we first create the cumulative sum vector with cumsum() and then pass it to the first argument of the median() function. Most people find this style of coding much easier to understand.

cumsum(numeric_data) %>% median()
[1] 11.5

Using %>% with dplyr

The pipe operator makes it easy to create data analysis workflows with dplyr functions. When splitting steps across multiple lines, make sure that each line ends with %>% to prevent errors.

Let’s use %>% to create the following data analysis:

  • Start with our data, employee_data
  • Filter the data to select employees who left the company
  • Calculate their average salary

The code below demonstrates how to achieve this with one expression using the %>% operator.

Notice that we do not have to pass employee_data into the first argument of filter() since we have piped it in with the first %>%. The filter() function then produces a new data frame that only contains rows that have left_company values of “Yes”. This is then passed to the first argument of summarise() by the second %>%.

employee_data %>% 
  filter(left_company == 'Yes') %>% 
  summarise(average_salary = mean(salary))

For comparison, the code below executes the same workflow without the %>% operator. In this case, we must save our filtered data frame and pass it into summarise().

employees_left <- filter(employee_data, left_company == 'Yes')

summarise(employees_left, average_salary = mean(salary))

Let’s use %>% to create the following data analysis workflow:

  • Start with our data, employee_data
  • Create a new column, weekly_hrs_scaled, that performs a z-score transformation on the weekly_hours variable
  • Select rows that have weekly_hrs_scaled values greater than 1.5

employee_data %>% 
  mutate(weekly_hrs_scaled = (weekly_hours - mean(weekly_hours)) / sd(weekly_hours)) %>% 
  filter(weekly_hrs_scaled > 1.5)

Let’s use %>% to create the following data analysis workflow:

  • Start with our data, employee_data
  • Select rows with employees from either Finance and Operations or Marketing and a job level of Director
  • Calculate their average salary

employee_data %>% 
  filter(department %in% c('Finance and Operations', 'Marketing'), 
         job_level == 'Director') %>% 
  summarise(average_salary = mean(salary))

Test Your Knowledge

Use dplyr and the %>% operator to execute the following data analysis workflow:

  • Start with our data, employee_data
  • Select rows with employees from the Marketing department
  • Create a new column, salary_scaled, that performs a z-score transformation on the salary variable
  • Select rows where salary_scaled is greater than or equal to 0.5
  • Select the following columns: left_company, department, job_level, salary_scaled


You should get a tibble with 65 rows as the result.
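One way to write this workflow is sketched below; treat it as a possible solution rather than the official answer. It is wrapped in a function (the name marketing_salary_outliers is illustrative) so that it can be applied to the employee_data tibble loaded earlier.

```r
library(dplyr)

# One possible solution, written as a function so it can be applied to
# employee_data (or any data frame with these columns)
marketing_salary_outliers <- function(df) {
  df %>% 
    filter(department == 'Marketing') %>%                              # Marketing only
    mutate(salary_scaled = (salary - mean(salary)) / sd(salary)) %>%   # z-score of salary
    filter(salary_scaled >= 0.5) %>%                                   # keep scaled values >= 0.5
    select(left_company, department, job_level, salary_scaled)         # keep requested columns
}

# marketing_salary_outliers(employee_data) should return a tibble with 65 rows
```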

Article information

Author: Manual Maggio

Last Updated: 22/02/2023