# Data Analysis with the Tidyverse (2023)

In this tutorial, we will learn about `R` functions and data analysis with the `tidyverse` package.

Please click the button below to open an interactive version of all course `R` tutorials through RStudio Cloud.

Note: you will need to register for an account before opening the project. Please remember to use your GMU e-mail address.

Click the button below to launch an interactive RStudio environment using `Binder.org`. This will launch a pre-configured RStudio environment within your browser. Unlike RStudio cloud, this service has no monthly usage limits, but it may take up to 10 minutes to launch and you will not be able to save your work.

In this section will learn about common built-in functions that are useful for obtaining summary statistics, ranking data, and data analysis. We will also learn how to write our own custom functions in `R`.

## Built-in Functions

### Percentiles

The functions below are useful for studying the distribution of numeric values within a data set. All of these functions take a numeric vector as their input.

• `min()`
• Returns the minimum value
• `max()`
• Returns the maximum value
• `range()`
• Returns a vector of length 2 with the range of observed values (minimum and maximum values)
• `median()`
• Returns the median value (50th percentile)
• `fivenum()`
• Returns a vector of length 5 with the minimum, 25th percentile, median, 75th percentile, maximum values
• `quantile()`
• Returns the specified percentile(s) of a set of numeric values

### Examples

Obtaining the range of values present in a numeric vector.

``data_vector <- c(3, 9, 11.2, 14, 28.7, 30, 15, 21, 5.7, 9.1, 24.6)# minimum value in data_vectormin(data_vector)``
``[1] 3``
``# maximum valuemax(data_vector)``
``[1] 30``
``# range of data valuesrange(data_vector)``
``[1] 3 30``

The `median()` and `quantile()` functions are used for obtaining specific percentiles from a distribution of numbers. A percentile of a set of numbers is a value below which a given percentage of the total values fall at or below. For example, the 50th percentile (also called the median) represents the center of a set of numeric data. This means that 50% of all the values are less than or equal to the 50th percentile.

The `quantile()` function requires two inputs. The first is a numeric vector of data values and the second is a vector with values ranging from 0 to 1, representing the percentile(s) to calculate.

``# medianmedian(data_vector)``
``[1] 14``
``# 30th percentilequantile(data_vector, 0.3)``
``30% 9.1 ``
``# 30th, 60th, and 90th percentilesquantile(data_vector, c(0.3, 0.6, 0.9))``
`` 30% 60% 90% 9.1 15.0 28.7 ``

The `fivenum()` function calculates the five number summary (min, 25th, median, 75th, max) of a numeric vector.

``fivenum(data_vector)``
``[1] 3.00 9.05 14.00 22.80 30.00``

### Mean and Standard Deviation

The `mean()` and `sd()` functions are used to calculate the mean and standard deviation of a set of data values.

``# mean valuemean(data_vector)``
``[1] 15.57273``
``# standard deviationsd(data_vector)``
``[1] 9.241114``

### Adding Elements of a Numeric Vector

The `sum()` and `cumsum()` functions are used for summing the numbers within a vector. The `sum()` function simply returns the sum of all numbers within a vector.

The `cumsum()` functions calculates a cumulative sum for every position within a vector. This function always returns a vector of the same length as the input.

``# sum of all valuessum(data_vector)``
``[1] 171.3``
``# cumulative sumcumsum(data_vector)``
`` [1] 3.0 12.0 23.2 37.2 65.9 95.9 110.9 131.9 137.6 146.7 171.3``

### Functions Useful for Ranking Data

The `abs()` and `rank()` functions are useful for ranking data values. The `abs()` function returns the absolute values of a vector.

``negative_data <- c(-2, 4.5, -6, 10, 12)# returns the absolute value of all elementsabs(negative_data)``
``[1] 2.0 4.5 6.0 10.0 12.0``

The `rank()` function returns the ranks of a set of data values from smallest to largest. The smallest value is given a rank of 1.

``data_vector``
`` [1] 3.0 9.0 11.2 14.0 28.7 30.0 15.0 21.0 5.7 9.1 24.6``
``rank(data_vector)``
`` [1] 1 3 5 6 10 11 7 8 2 4 9``

To obtain ranks from largest to smallest, where rank 1 represents the largest value, just take the rank of the negative of a numeric vector. In the example below, the value 30 is given a rank of 1.

``data_vector``
`` [1] 3.0 9.0 11.2 14.0 28.7 30.0 15.0 21.0 5.7 9.1 24.6``
``rank(-data_vector)``
`` [1] 11 9 7 6 2 1 5 4 10 8 3``

## Writing Functions in R

There are many cases when we will have to write our own functions to achieve tasks in an analytics project. `R` functions can be defined to take any amount of inputs (usually called arguments) but only return one object.

The basic syntax of creating a function with arguments x and y is as follows:

``my_function <- function(x, y) { R Code here }``

Assume that we would like to write a function that takes a numeric vector as input and returns a vector of scaled values. For each value in our original vector, we will subtract the mean and divide by the standard deviation. In Statistics, this transformation is sometimes called a z-score.

In the code cell below, I demonstrate how this can be done without writing a function.

``numeric_data <- c(3, 8, 4, 7, 12, 2)# Calculate the z-scores of numeric_data(numeric_data - mean(numeric_data)) / sd(numeric_data)``
``[1] -0.8017837 0.5345225 -0.5345225 0.2672612 1.6035675 -1.0690450``

Instead of typing the above expression every time we need to perform this transformation, let’s write a custom function that performs this task.

I will show two equivalent ways of writing this function and discuss the difference.

Note that the input value is named `x`. This is completely arbitrary. The input value could also have been named `input` as long as the same name is used within the code of the function. In our code below, `x` simply represents the numeric vector that we expect to get passed into the function.

``z_score_1 <- function(x) { return((x - mean(x))/sd(x))}``
``# Let's test our functionage_vector <- c(18, 24, 21, 37, 51, 34, 41)z_score_1(age_vector)``
``[1] -1.1992327 -0.6955550 -0.9473939 0.3957468 1.5709949 0.1439079 0.7315320``

By default, an `R` function returns the results of the last operation that it performed. The code below is an equivalent way of writing the same function. In this case we do not need to use `return` to give us the result.

``# Equivalentz_score_2 <- function(x) { (x - mean(x))/sd(x)}``
``# Check resultsz_score_2(age_vector)``
``[1] -1.1992327 -0.6955550 -0.9473939 0.3957468 1.5709949 0.1439079 0.7315320``

The `return()` call is useful when you need to return a list of results from a function. The function below creates three objects, `mean_x`, `sd_x`, and `scaled_data`. To obtain all of these results, we must use `return` and build a list that contains all of the objects.

``# return a listz_score_3 <- function(x) { mean_x <- mean(x) # Calculate and save the mean sd_x <- sd(x) # Calculate and save the standard deviation scaled_data <- (x - mean_x)/sd_x # Save the transformed vector return(list(mean_value = mean_x, sd_value = sd_x, scaled_vector = scaled_data)) }``
``detailed_results <- z_score_3(age_vector)# View the resultsdetailed_results``
``\$mean_value[1] 32.28571\$sd_value[1] 11.91238\$scaled_vector[1] -1.1992327 -0.6955550 -0.9473939 0.3957468 1.5709949 0.1439079 0.7315320``

This section will cover the basics of data manipulation using the `tidyverse` package. Before we can use the package, we must load it into our environment with the following code `library(tidyverse)`. This will import all of the functions available in the `tidyverse` package into our environment.

The `tidyverse` is a collection of 8 packages that are designed specifically for data science tasks.

In this course, I have installed all required packages into our RStudio Cloud environment. If you are ever working with RStudio on your desktop, you must install packages before they can be used. This is done with the following code `install.packages('tidyverse')`.

To get more details about the `tidyverse` package see the tidyverse documentation

We will also load the `skimr` package which is used for exploring the structure of a data frame.

``# This will load all 8 of the tidyverse packageslibrary(tidyverse)library(skimr)``

## Tibbles

The first package we will explore is `tibble`. The `tibble` package is used for creating special types of data frames called tibbles.

Tibbles are data frames with added properties and functionality. Many of the core functions in the `tidyverse` take tibbles as arguments and return them as results after execution.

### Creating tibbles

`R` has many built-in datasets that can be loaded as data frames. One example is the `iris` data frame. To load this data, you just have to type `iris` in the `R` console.

Each row in `iris` represents a flower with corresponding measurements of height and width of the sepal and petal.

By default, `R` will try to print every row of a data frame, easily overwhelming your console. Another property of `R` data frames is that each row is labeled with a number. These are known as row labels.

``iris``
`` Sepal.Length Sepal.Width Petal.Length Petal.Width Species1 5.1 3.5 1.4 0.2 setosa2 4.9 3.0 1.4 0.2 setosa3 4.7 3.2 1.3 0.2 setosa4 4.6 3.1 1.5 0.2 setosa5 5.0 3.6 1.4 0.2 setosa6 5.4 3.9 1.7 0.4 setosa7 4.6 3.4 1.4 0.3 setosa8 5.0 3.4 1.5 0.2 setosa9 4.4 2.9 1.4 0.2 setosa10 4.9 3.1 1.5 0.1 setosa11 5.4 3.7 1.5 0.2 setosa12 4.8 3.4 1.6 0.2 setosa13 4.8 3.0 1.4 0.1 setosa14 4.3 3.0 1.1 0.1 setosa15 5.8 4.0 1.2 0.2 setosa16 5.7 4.4 1.5 0.4 setosa17 5.4 3.9 1.3 0.4 setosa18 5.1 3.5 1.4 0.3 setosa19 5.7 3.8 1.7 0.3 setosa20 5.1 3.8 1.5 0.3 setosa21 5.4 3.4 1.7 0.2 setosa22 5.1 3.7 1.5 0.4 setosa23 4.6 3.6 1.0 0.2 setosa24 5.1 3.3 1.7 0.5 setosa25 4.8 3.4 1.9 0.2 setosa26 5.0 3.0 1.6 0.2 setosa27 5.0 3.4 1.6 0.4 setosa28 5.2 3.5 1.5 0.2 setosa29 5.2 3.4 1.4 0.2 setosa30 4.7 3.2 1.6 0.2 setosa31 4.8 3.1 1.6 0.2 setosa32 5.4 3.4 1.5 0.4 setosa33 5.2 4.1 1.5 0.1 setosa34 5.5 4.2 1.4 0.2 setosa35 4.9 3.1 1.5 0.2 setosa36 5.0 3.2 1.2 0.2 setosa37 5.5 3.5 1.3 0.2 setosa38 4.9 3.6 1.4 0.1 setosa39 4.4 3.0 1.3 0.2 setosa40 5.1 3.4 1.5 0.2 setosa41 5.0 3.5 1.3 0.3 setosa42 4.5 2.3 1.3 0.3 setosa43 4.4 3.2 1.3 0.2 setosa44 5.0 3.5 1.6 0.6 setosa45 5.1 3.8 1.9 0.4 setosa46 4.8 3.0 1.4 0.3 setosa47 5.1 3.8 1.6 0.2 setosa48 4.6 3.2 1.4 0.2 setosa49 5.3 3.7 1.5 0.2 setosa50 5.0 3.3 1.4 0.2 setosa51 7.0 3.2 4.7 1.4 versicolor52 6.4 3.2 4.5 1.5 versicolor53 6.9 3.1 4.9 1.5 versicolor54 5.5 2.3 4.0 1.3 versicolor55 6.5 2.8 4.6 1.5 versicolor56 5.7 2.8 4.5 1.3 versicolor57 6.3 3.3 4.7 1.6 versicolor58 4.9 2.4 3.3 1.0 versicolor59 6.6 2.9 4.6 1.3 versicolor60 5.2 2.7 3.9 1.4 versicolor61 5.0 2.0 3.5 1.0 versicolor62 5.9 3.0 4.2 1.5 versicolor63 6.0 2.2 4.0 1.0 versicolor64 6.1 2.9 4.7 1.4 versicolor65 5.6 2.9 3.6 1.3 versicolor66 6.7 3.1 4.4 1.4 versicolor67 5.6 3.0 4.5 1.5 versicolor68 5.8 2.7 4.1 1.0 versicolor69 6.2 2.2 4.5 1.5 versicolor70 5.6 2.5 3.9 1.1 versicolor71 5.9 3.2 4.8 1.8 versicolor72 6.1 2.8 4.0 1.3 versicolor73 6.3 2.5 4.9 1.5 versicolor74 6.1 2.8 4.7 1.2 versicolor75 6.4 2.9 4.3 1.3 versicolor76 6.6 3.0 4.4 1.4 versicolor77 6.8 2.8 4.8 1.4 versicolor78 6.7 3.0 5.0 1.7 versicolor79 6.0 2.9 4.5 1.5 versicolor80 5.7 2.6 3.5 1.0 versicolor81 5.5 2.4 3.8 1.1 versicolor82 5.5 2.4 3.7 1.0 versicolor83 5.8 2.7 3.9 1.2 versicolor84 6.0 2.7 5.1 1.6 versicolor85 5.4 3.0 4.5 1.5 versicolor86 6.0 3.4 4.5 1.6 versicolor87 6.7 3.1 4.7 1.5 versicolor88 6.3 2.3 4.4 1.3 versicolor89 5.6 3.0 4.1 1.3 versicolor90 5.5 2.5 4.0 1.3 versicolor91 5.5 2.6 4.4 1.2 versicolor92 6.1 3.0 4.6 1.4 versicolor93 5.8 2.6 4.0 1.2 versicolor94 5.0 2.3 3.3 1.0 versicolor95 5.6 2.7 4.2 1.3 versicolor96 5.7 3.0 4.2 1.2 versicolor97 5.7 2.9 4.2 1.3 versicolor98 6.2 2.9 4.3 1.3 versicolor99 5.1 2.5 3.0 1.1 versicolor100 5.7 2.8 4.1 1.3 versicolor101 6.3 3.3 6.0 2.5 virginica102 5.8 2.7 5.1 1.9 virginica103 7.1 3.0 5.9 2.1 virginica104 6.3 2.9 5.6 1.8 virginica105 6.5 3.0 5.8 2.2 virginica106 7.6 3.0 6.6 2.1 virginica107 4.9 2.5 4.5 1.7 virginica108 7.3 2.9 6.3 1.8 virginica109 6.7 2.5 5.8 1.8 virginica110 7.2 3.6 6.1 2.5 virginica111 6.5 3.2 5.1 2.0 virginica112 6.4 2.7 5.3 1.9 virginica113 6.8 3.0 5.5 2.1 virginica114 5.7 2.5 5.0 2.0 virginica115 5.8 2.8 5.1 2.4 virginica116 6.4 3.2 5.3 2.3 virginica117 6.5 3.0 5.5 1.8 virginica118 7.7 3.8 6.7 2.2 virginica119 7.7 2.6 6.9 2.3 virginica120 6.0 2.2 5.0 1.5 virginica121 6.9 3.2 5.7 2.3 virginica122 5.6 2.8 4.9 2.0 virginica123 7.7 2.8 6.7 2.0 virginica124 6.3 2.7 4.9 1.8 virginica125 6.7 3.3 5.7 2.1 virginica126 7.2 3.2 6.0 1.8 virginica127 6.2 2.8 4.8 1.8 virginica128 6.1 3.0 4.9 1.8 virginica129 6.4 2.8 5.6 2.1 virginica130 7.2 3.0 5.8 1.6 virginica131 7.4 2.8 6.1 1.9 virginica132 7.9 3.8 6.4 2.0 virginica133 6.4 2.8 5.6 2.2 virginica134 6.3 2.8 5.1 1.5 virginica135 6.1 2.6 5.6 1.4 virginica136 7.7 3.0 6.1 2.3 virginica137 6.3 3.4 5.6 2.4 virginica138 6.4 3.1 5.5 1.8 virginica139 6.0 3.0 4.8 1.8 virginica140 6.9 3.1 5.4 2.1 virginica141 6.7 3.1 5.6 2.4 virginica142 6.9 3.1 5.1 2.3 virginica143 5.8 2.7 5.1 1.9 virginica144 6.8 3.2 5.9 2.3 virginica145 6.7 3.3 5.7 2.5 virginica146 6.7 3.0 5.2 2.3 virginica147 6.3 2.5 5.0 1.9 virginica148 6.5 3.0 5.2 2.0 virginica149 6.2 3.4 5.4 2.3 virginica150 5.9 3.0 5.1 1.8 virginica``

### Coverting Data Frames to Tibbles

To convert any `R` data frame into a tibble, we can use the `as_tibble()` function from the `tibble` package. In the code below, we create a tibble named `iris_tbl`.

A nice property of tibbles is that they only print the first 10 rows of data and label each column with its respective data type. In the output below, “dbl” stands for numeric.

``iris_tbl <- as_tibble(iris)iris_tbl``

When we pass `iris_tbl` to the `str()` function, we see that it lets us know that we have a tibble.

``str(iris_tbl)``
``tibble [150 × 5] (S3: tbl_df/tbl/data.frame) \$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... \$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... \$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... \$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... \$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...``

### Converting Tibbles to Date Frames

In general, tibbles are much easier to work with than data frames. However, not all `R` functions are able to work with them. If you ever encounter this situation, it is easy to convert a tibble back to a data frame with the `as.data.frame()` function.

The code below converts out `iris_tbl` back to a data frame.

``iris_df <- as.data.frame(iris_tbl)str(iris_df)``
``'data.frame': 150 obs. of 5 variables: \$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... \$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... \$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... \$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... \$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...``

### Creating Tibbles with `tibble()`

We can create tibbles from individual vectors using the `tibble()` function. This is similar to how data frames are created with the `data.frame()` function.

One major difference is that `tibble()` allows you to reference variables within the function call. You can even use `R` functions to create new columns. See the example below that uses `tibble()` to create a simple dataset.

``my_tbl <- tibble(column_1 = c(1, 3, 7, 2.5, 22), column_2 = c('A', 'B', 'C', 'D', 'E'), column_3 = (column_1 * 2) + 10, column_4 = column_1 + mean(column_1))my_tbl``

Before we are able to perform data analysis, we must import data into our `R` environment.

The `tidyverse` package loads the `readr` package which contains a number of functions for importing data into `R`.

The `read_delim()` function is used to import flat files such as comma-delimited (.csv) or tab-delimited (.txt) files.

The `read_delim()` functions takes many arguments, but the 3 most important are:

• `file` - the first argument is the path to a file on your computer or website address of the data file
• `delim` - the type of delimiter in the data file (either “,” for comma, “\t” for tab, or any other character)
• `col_names` - TRUE or FALSE to indicate whether a file has column names

To see how this function works, let’s import the Wine Dataset from the UCI Machine Learning Repository.

If there are no column names in a dataset, `read_delim()` will auto-generate names that begin with an X and cycle through a sequence of integers.

The `read_delim()` function will also print a message to the `R` console about the data types it has assigned to each column.

``wine_data <- read_delim('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', delim = ',', col_names = FALSE)``
``Parsed with column specification:cols( X1 = col_double(), X2 = col_double(), X3 = col_double(), X4 = col_double(), X5 = col_double(), X6 = col_double(), X7 = col_double(), X8 = col_double(), X9 = col_double(), X10 = col_double(), X11 = col_double(), X12 = col_double(), X13 = col_double(), X14 = col_double())``
``wine_data``

In this course, we will be loading tibbles from our course website with the `read_rds()` function (as demonstrated below).

However, I recommend that you refer to the readr documentation to get more familiar with reading different types of data into your `R` environment.

## Employee Attrition Data

The code below will import a data set from our course website. The data consists of 1,470 employee records for a U.S. based product company. The rows in this data frame represent the attributes of an employee at this company across the variables listed in the table below.

VariableDefinition
left_companyDid the employee leave the company? (Yes/No)
departmentDepartment within the company
job_levelJob Level (Associate - Vice President)
salaryEmployee yearly salary (US Dollars)
weekly_hoursSelf-reported average weekly hours spent on the job (company survey)
yrs_at_companyTenure at the company (years)
yrs_since_promotionYears since last promotion
previous_companiesNumber of previous companies for which the employee has worked
job_satisfactionSelf-reported job satisfaction (company survey)
performance_ratingMost recent annual performance rating
marital_statusMarital status (Single, Married, or Divorced)
miles_from_homeDistance from employee address to office location

This data is a special type of data frame known as a `tibble`. All data frames in the `tidyverse` are usually stored in this format. It has special properties which include better printing features and labels for column data types.

``employee_data <- read_rds(url('https://gmudatamining.com/data/employee_data.rds'))# View dataemployee_data``

## Exploring Data Frames with `skimr`

The first step in a data analysis project is to explore your data source. This includes summarizing the values within each column, checking for missing data, checking the data types of each column, and verifying the number of rows and columns.

The `skim()` function can be used to accomplish all of this. It takes your data frame as an argument. In the output below, we first get the number of rows and columns along with the data types present in our data.

The results are then grouped by the type of variables in our data.

First we get a summary of our factor variables, including the number of missing observations, whether our factor levels are ordered, the count of unique levels, and an abbreviated list of the most frequent factor levels.

Then we get a summary of our numeric variables which include the number of missing observations, the mean and standard deviation, a five number summary, and a plot of the distribution of values.

``# View data frame properties and summary statisticsskim(employee_data)``
Data summary
Nameemployee_data
Number of rows1470
Number of columns13
_______________________
Column type frequency:
factor7
numeric6
________________________
Group variablesNone

Variable type: factor

skim_variablen_missingcomplete_rateorderedn_uniquetop_counts
left_company01FALSE2No: 1233, Yes: 237
department01FALSE6IT : 399, Res: 293, Sal: 252, Mar: 238
job_level01FALSE5Sen: 476, Man: 344, Dir: 331, Ass: 185
business_travel01FALSE3Rar: 1043, Fre: 277, Non: 150
job_satisfaction01FALSE4Ver: 459, Hig: 442, Low: 289, Med: 280
performance_rating01FALSE5Mee: 546, Exc: 472, Exc: 286, Min: 136
marital_status01FALSE3Mar: 673, Sin: 470, Div: 327

Variable type: numeric

skim_variablen_missingcomplete_ratemeansdp0p25p50p75p100hist
salary0194076.2537590.2429848.5670379.4888555.53117099.9212134.7▃▇▃▁▁
weekly_hours0150.024.8240.0047.0049.0052.066.0▂▇▃▂▁
yrs_at_company017.016.130.003.005.009.040.0▇▂▁▁▁
yrs_since_promotion012.193.220.000.001.003.015.0▇▁▁▁▁
previous_companies013.241.581.002.003.004.07.0▇▇▂▂▃
miles_from_home019.198.111.002.007.0014.029.0▇▅▂▂▂

It is also possible to select a subset of variables to explore. Just pass a sequence of unquoted variable names into the `skim()` function.

The `skimr` package has many more features for exploring data. Once we cover the fundamentals of `dplyr` in the next sections, I encourage interested students to explore the skimr documentation

``# View data frame properties and summary statisticsskim(employee_data, left_company, department, salary, weekly_hours)``
Data summary
Nameemployee_data
Number of rows1470
Number of columns13
_______________________
Column type frequency:
factor2
numeric2
________________________
Group variablesNone

Variable type: factor

skim_variablen_missingcomplete_rateorderedn_uniquetop_counts
left_company01FALSE2No: 1233, Yes: 237
department01FALSE6IT : 399, Res: 293, Sal: 252, Mar: 238

Variable type: numeric

skim_variablen_missingcomplete_ratemeansdp0p25p50p75p100hist
salary0194076.2537590.2429848.5670379.4888555.53117099.9212134.7▃▇▃▁▁
weekly_hours0150.024.8240.0047.0049.0052.066.0▂▇▃▂▁

## Data Manipulation with `dplyr`

In this section we will cover data manipulation with the `dplyr` package. This is one of the core `tidyverse` packages used for exploring data frames.

Chapter 5 of R for Data Science covers the basics of manipulating data frames in `R`. In this tutorial, I would like to provide additional examples of the main functions of `dplyr`, including `filter()`, `select()`, `arrange()`, `summarise()`, and `mutate()`.

The first argument to all of these functions is a data frame, followed by additional arguments that perform various manipulations on the data. The output from all of these functions will also be a special type of data frame known as a `tibble`.

### `filter()`

The `filter()` function is used for subsetting rows of a data frame. It is much more intuitive than subsetting with the base `R` functions `[ ]` and `[[ ]]`.

The first argument to `filter()` is a data frame, followed by one or more logical conditions on the variables within the data frame. Logical conditions separated by a comma are treated as an AND (&) operation. The advantage of `dplyr`, is that you can pass variable names of a data frame in raw, unquoted format to many functions. The `filter()` function returns a data frame that has been subsetted by the logical conditions within its arguments.

``# employees that left the companyfilter(employee_data, left_company == 'Yes') ``
``# View employees that left from the Sales departmentfilter(employee_data, left_company == 'Yes', department == 'Sales') ``

To filter a data frame using an OR condition, we must use the `|` operator.

``# employees from Sales or Marketing departmentfilter(employee_data, department == 'Sales' | department == 'Marketing')``

Another way to execute OR statements is by using the `%in%` function. This function is used to check whether a column’s variable values match at least one element within a vector. In many cases, it can save lots of typing. The code below will produce the same result as the previous command

``# employees from Sales or Marketing departmentfilter(employee_data, department %in% c('Sales', 'Marketing'))``

What if we are interested in employees from Sales or Marketing that make over \$80,000? We can just add another condition to the previous code. Remember that conditions separated by a comma represent an AND operation. So in the code below, we are passing the following condition: employees with `salary` > 80000 AND (`department` is Sales OR `department` is Marketing)

``# employees from Sales or Marketing departmentfilter(employee_data, salary > 80000, department %in% c('Sales', 'Marketing'))``

### `select()`

The `select()` function allows you to select a subset of columns from a data frame. There are multiple ways to enter the selection condition and many helper functions, such as `starts_with()`, `ends_with()`, and `contains()`. See the documentation for more examples.

We can select columns by used unquoted column names.

``# Select the first three columnsselect(employee_data, left_company, department, job_level)``

We can also select columns by using their numeric positions

``# Select the first three columns with a numeric vectorselect(employee_data, c(1, 2, 3))``

We can also pass a sequence of numeric positions separated by a comma.

``# Select the first three columns with raw numbersselect(employee_data, 1, 2, 3)``

Adding a `-` in front of numeric positions or variable names excludes those variables and returns all others

``# Select all columns except department and job_levelselect(employee_data, -department, -job_level)``
``# Exclude the first 5 columnsselect(employee_data, -1, -2, -3, -4, -5)``
``# Many times it's easier to use a sequenceselect(employee_data, -(1:5))``

There are many helper functions available with the `select()` function. Below are two examples. The first selects columns that contain the word “job”. The second selects columns that start with the letter “y”

``# Select all variables that contain the word jobselect(employee_data, contains('job'))``
``# Select all variables that start with "y" (or "Y")select(employee_data, starts_with("y"))``

### `arrange()`

The `arrange()` function allows you to sort a data frame by specific columns. The default sorting order is ascending.

``# Sort by left_company and salaryarrange(employee_data, left_company, salary)``

To sort a column in descending order, wrap it in the `desc()` function.

``# Sort by left company (descending) and salary (ascending)arrange(employee_data, desc(left_company), salary)``

### `summarise()`

The `summarise()` function allows you to create numerical summaries, such as averages, from the columns of a data frame. The `summarise()` function is an aggregation function, meaning that it takes input with multiple values and returns a single value. With a sample average, for example, we take a numeric vector with n numbers and summarize it with a single value, the average.

The `summarise()` function takes a data frame as the first argument followed by a sequence of functions to execute on the columns of the data frame.

Remember that `dplyr` functions will always return a data frame of results, even if the results are a single number or row.

``# Let's get the average salarysummarise(employee_data, average_salary = mean(salary))``
``# We can also create multiple summaries at oncesummarise(employee_data, salary_min = min(salary), salary_25th = quantile(salary, 0.25), salary_50th = median(salary), salary_75th = quantile(salary, 0.75), salary_max = max(salary))``

### `mutate()`

The `mutate()` function allows you to create new variables in your data. The `mutate()` function is a window function. Window functions take input with n values and always return a result with n values. Like `summarise()`, `mutate()` takes a data frame as the first argument followed by a sequence of functions to execute on the columns of the data frame.

Let’s create a new variable `salary_scaled` that transforms the `salary` values by subtracting the mean and dividing by the standard deviation. In the code below, I save this new data frame as `employee_data_scaled`.

``# Create salary_scaled variableemployee_data_scaled <- mutate(employee_data, salary_scaled = (salary - mean(salary))/sd(salary))``
``# Let's see the resultselect(employee_data_scaled, salary, salary_scaled)``

## Combining Steps With The `%>%` Operator

One of the most useful tools in the `tidyverse` is the pipe operator, `%>%` . This operator allows users to combine multiple data manipulation tasks in one line of code.

The pipe operator tells `R` to take the object on the left and pass it to the right as the first argument to the next function. A keyboard shortcut for the pipe operator is `Ctrl` + `Shift` + `m`.

Let’s go through a simple example to demonstrate the `%>%` operator. Suppose I have a vector of numeric values and I need to do the following:

• Calculate a vector with the cumulative sums using the `cumsum()` function
• Calculate the median of this new vector with the `median()` function

The code below shows how this is done without the pipe operator. Most people new to programming find the code a bit difficult to read because the operations go from right to left.

`cumsum(numeric_data)` is evaluated first, then passed to the `median()` function on the left.

If we had an operation that required 3 or more steps, writing code in this way would quickly get confusing to understand and explain.

``# Data valuesnumeric_data <- c(2, 5, 1, 7, 9, 10)# Cumulative sumcumsum(numeric_data)``
``[1] 2 7 8 15 24 34``
``# Median of the cumulative summedian(cumsum(numeric_data))``
``[1] 11.5``

The `%>%` operator can be used to write cleaner code that moves from right to left in a `do this - then that` fashion. In the code below, we first create the cumulative sum vector with `cumsum()` and then pass it to the first argument of the `median()` function. Most people find this style of coding much easier to understand.

``cumsum(numeric_data) %>% median()``
``[1] 11.5``

### Using `%>%` with `dplyr`

The pipe operator makes it easy to create data analysis workflows with `dplyr` functions. When splitting steps across multiple lines, make sure that each line ends with %>% to prevent errors.

Let’s use `%>%` to create the following data analysis:

• Start with our data, `employee_data`
• Filter the data to select employees who left the company
• Calculate their average salary

The code below demonstrates how to achieve this with one expression using the `%>%` operator.

Notice that we do not have to pass `employee_data` into the first argument of `filter()` since we have piped it in with the first `%>%`. The `filter()` function then produces a new data frame that only contains rows that have `left_company` values of “Yes”. This is then passed to the first argument of `summarise()` by the second `%>%`.

``employee_data %>% filter(left_company == 'Yes') %>% summarise(average_salary = mean(salary))``

For comparison, the code below executes the same workflow without the `%>%` operator. In this case, we must save our filtered data frame and pass it into `summarise()`.

``employees_left <- filter(employee_data, left_company == 'Yes')summarise(employees_left, average_salary = mean(salary))``

Let’s use `%>%` to create the following data analysis workflow:

• Start with our data, `employee_data`
• Create a new column, `weekly_hrs_scaled`, that performs a z-score transformation on the `weekly_hours` variable
• Select rows that have `weekly_hrs_scaled` values greater than 1.5
``employee_data %>% mutate(weekly_hrs_scaled = (weekly_hours - mean(weekly_hours)) / sd(weekly_hours)) %>% filter(weekly_hrs_scaled > 1.5)``

Let’s use `%>%` to create the following data analysis workflow:

• Start with our data, `employee_data`
• Select rows with employees from either Finance and Operations or Marketing and a job level of Director
• Calculate their average salary
``employee_data %>% filter(department %in% c('Finance and Operations', 'Marketing'), job_level == 'Director') %>% summarise(average_salary = mean(salary))``

Use `dplyr` and the `%>%` operator to execute the following data analysis workflow:

• Start with our data, `employee_data`
• Select rows with employees from the Marketing department
• Create a new column, `salary_scaled`, that performs a z-score transformation on the `salary` variable
• Select rows where `salary_scaled` is greater than or equal to 0.5
• Select the following columns: `left_company`, `department`, `job_level`, `salary_scaled`

You should get a tibble with 65 rows as the result.

Top Articles
Latest Posts
Article information

Author: Manual Maggio

Last Updated: 22/02/2023

Views: 5749

Rating: 4.9 / 5 (49 voted)

Author information

Name: Manual Maggio

Birthday: 1998-01-20

Address: 359 Kelvin Stream, Lake Eldonview, MT 33517-1242

Phone: +577037762465

Job: Product Hospitality Supervisor

Hobby: Gardening, Web surfing, Video gaming, Amateur radio, Flag Football, Reading, Table tennis

Introduction: My name is Manual Maggio, I am a thankful, tender, adventurous, delightful, fantastic, proud, graceful person who loves writing and wants to share my knowledge and understanding with you.