### =================================================================================
### DATA WRANGLING
### =================================================================================
# Toni Monleon Getino - Section of Statistics. University of Barcelona

# The exercises and examples in this section are based on the author's own and on
# freely available materials, and are intended solely for teaching purposes.
# Every effort has been made to ensure the accuracy and reliability of the
# information; however, keep in mind that information can change over time.
# Users are advised to consult additional sources and verify the information
# themselves before making any decision or taking any action based on it.

# =========Description:
# A recent article from the New York Times said "Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information."

# =========Principia:
# Data you find "in the wild" will rarely be in a format ready for analysis, and you will need to manipulate it before exploring the questions you are interested in. This may take more time than the analysis itself!
# According to Wickham, data is "tidy" when it meets three key criteria:
# =========Each observation is in a row.
# =========Each variable is in a column.
# =========Each value has its own cell.
# Being observant of these criteria allows us to recognize when data is organized or unorganized. It also provides a standardized schema and set of tools for cleaning up some of the most common ways that datasets are "messy":
# =========Column headers are values, not variable names.
# =========Multiple variables are stored in one column.
# =========Variables are stored in both rows and columns.
# =========Multiple types of observational units are stored in the same table.
# =========A single observational unit is stored in multiple tables.

# Before an R program can look for answers, your data must be cleaned up and converted to a form that makes information accessible. In this webinar, you will learn how to use the `dplyr` and `tidyr` packages to optimise the data wrangling process. You'll learn to:
# Spot the variables and observations within your data
# Quickly derive new variables and observations to explore
# Reshape your data into the layout that works best for R
# Join multiple data sets together
# Use group-wise summaries to explore hidden levels of information within your data

# TUTORIAL (to show): http://jules32.github.io/2016-07-12-Oxford/dplyr_tidyr/
# SEE A TUTORIAL IN: https://ramnathv.github.io/pycon2014-r/explore/README.html

# Install the data-wrangling libraries and packages and load them
pkgs <- c('reshape2', 'plyr', 'ggplot2', 'dplyr', 'data.table', 'Lahman')
install.packages(pkgs)  # Do not worry if this takes a while

# Another package that is important for data wranglers: https://www.tidyverse.org/packages/
if (!require("tidyverse")) install.packages("tidyverse")
library(tidyverse)

# Exercises with ggplot:
# see http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#Waffle%20Chart

# =========DATA-WRANGLING EXERCISE...
# https://programminghistorian.org/lessons/data_wrangling_and_management_in_R
# Data Wrangling and Management in R
# Nabeel Siddiqui
# This tutorial explores how scholars can organize 'tidy' data, understand R packages to manipulate data, and conduct basic data analysis.

###########################################################################
# 1-Import CSV file and save it to us_state_populations_import
# Make sure you set the path of the file correctly
# Let's go through an example to see how dplyr can aid us as historians by reading in U.S. decennial census data from 1790 to 2010
library(readr)
us_state_populations_import <- read_csv("introductory_state_example.csv")

# Filter to California and New York states only
library(dplyr)
# extract the data:
california_and_new_york_state_populations <- us_state_populations_import %>%
  filter(state %in% c("California", "New York"))

# IN THE OLD TIMES: using the subset function
newdata <- subset(us_state_populations_import,
                  state == "California" | state == "New York")

View(california_and_new_york_state_populations)

# Plot California and New York state populations
library(ggplot2)
ggplot(data = california_and_new_york_state_populations,
       aes(x = year, y = population, color = state)) +
  geom_line() +
  geom_point()

# Continue the exercise at:
# https://programminghistorian.org/lessons/data_wrangling_and_management_in_R

# Other general aspects of data wrangling (biometric data)
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(NA, 171.5, 165),
  weight = c(81, 93, 78),
  Age    = c(42, 38, 26))
print(BMI)
ls.str()

## Some useful functions to deal with missing values
## -------------------------------------------------
is.na(BMI)
which(is.na(BMI), arr.ind = T)  # row and column positions of the missings
# Missings per variable
colSums(is.na(BMI))
# Data frame containing only complete observations
na.omit(BMI)

## Function fix permits to use the data editor
## -------------------------------------------
fix(BMI)
BMI

### =================================================================================
### MORE DATA WRANGLING WITH DATA FRAMES
### =================================================================================

## Working with data frames
## ========================

## Extraction of columns and changing their order
## ----------------------------------------------
load("dfram1.Rda")
dfram1[, c(1, 3, 4)]
dfram1 <- dfram1[, c(1, 5, 2:4)]
dfram1

## Function attach
## ---------------
Height  # No such object available
attach(dfram1)
Height  # Now it works
mean(Height)
summary(Height)
detach(dfram1)
mean(Height)  # Again, no such object available

# But be VERY careful when using function attach because
# (a) Using attach too often is not a good idea
# (b) Objects in the workspace may have the same names as variables
Age <- 70:80
attach(dfram1)
Age        # the workspace object masks the attached variable
mean(Age)
rm(Age)
Age        # now the attached variable is found
mean(Age)
detach(dfram1)

## A good alternative: function with
## ---------------------------------
with(dfram1, Height)  # same as dfram1$Height
with(dfram1, sum(Sex == "M" & Age > 22))
with(dfram1, mean(Weight/(Height/100)^2))

## Adding new variables to a data frame
## ------------------------------------
dfram1$BMI <- with(dfram1, round(Weight/(Height/100)^2, 2))
dfram1$car <- rep(c("Yes", "No"), c(3, 5))
dfram1
# Delete variable BMI
dfram1$BMI <- NULL
dfram1
# Function transform can also be used
dfram1 <- transform(dfram1,
                    Height = Height/100,
                    BMI = round(Weight/(Height/100)^2, 2))
dfram1

## Subsetting data frames
## ----------------------
dfram23 <- subset(dfram1, Age == 23)
dfram23
subset(dfram1, Sex == "M" & Age > 22)
subset(dfram1, Sex == "F" | Height < 175)
rm(dfram23)

## Merging data frames
## -------------------
file.show("Data3.txt")
# To continue, the file Data3.txt must be closed
dfram3 <- read.table("Data3.txt", header = T)
dfram3
# Which variables are in both data frames?
intersect(names(dfram1), names(dfram3))
# Which individuals are in both data frames?
intersect(dfram1$Name, dfram3$Name)
# Four different ways to merge two data frames
merge(dfram1, dfram3)             # matching rows only
merge(dfram1, dfram3, all = T)    # all rows from both data frames
merge(dfram1, dfram3, all.x = T)  # all rows from dfram1
merge(dfram1, dfram3, all.y = T)  # all rows from dfram3
(dfram3 <- merge(dfram1, dfram3, all = T))
dfram3[9, "Sex"] <- "F"
dfram3[10, "Sex"] <- "M"
dfram3$BMI <- with(dfram3, round(Weight/(Height/100)^2, 2))
dfram3
summary(dfram3)

## How to sort a data frame?
## -------------------------
# Using function order
dfram3[order(dfram3$Sex, dfram3$Age), ]
dfram3[order(dfram3$Sex, -dfram3$Age), ]
# Using function orderBy of package doBy
# install.packages("doBy")  # Only necessary if not yet installed
library(doBy)
search()
dfram3 <- orderBy(~Sex + Age, dfram3)
dfram3

############ laboratory ##################################################
# EXERCISE: HOW MANY MISSINGS? ###########################################
# Install historydata package
install.packages("historydata")
# Load historydata package
library(historydata)
data(early_colleges)
View(early_colleges)
# This package contains samples of historical datasets; the earlier U.S. Census data sample was taken from this package. Throughout this tutorial, we are specifically going to work with the early_colleges dataset, which contains data about colleges founded before 1848. Let's start by loading the data and viewing it.
# colSums(is.na(early_colleges))
# How many missings? Where?

################## ADVANCED DATA WRANGLING WITH TIDYR ##########################
# See the exercise at https://rpubs.com/drkblake/wrangling and try to explain the code.
# The data were extracted from the U.S. Census Bureau's 2021 American Community Survey five-year dataset. If you want a look at the data, see:
# https://docs.google.com/spreadsheets/d/1XeAuHIBz6AAZxU6M1Bz6RvJEvRRv7OLrokEKMzE9MHs/edit?usp=sharing
# From left to right in the data frame, the variables are:
# County: The name of the county. "Anderson," "Bedford," "Benton," etc. In all, there are 95 counties in the data frame.
# Region: The Tennessee region in which the county is located. There are three regions: "West," "Middle," and "East."
# Med_HH_Income: Each county's median household income.
# Households: The number of households in each county.
# Pct_BB: The percentage of households in each county that have broadband internet access.
# Pct_College: The percentage of residents in each county who have a four-year college degree or higher (like a master's degree, a law degree, a Ph.D., or a medical degree).
# Land_area: Each county's land area, in square miles. Land area is the area in the county that is not covered by a river, lake, or other body of water.

# It is not necessary to run each block of code to learn what it does. All the information you need is contained in the video. However, if you want to run the code, just to be sure, you can do so after running the code in this block, which will read the data from a file on my website, save the file on your computer, install and activate the tidyverse library, and load the data file into memory as a data frame called "mydata."

# Read the data from the web
FetchedData <- read.csv("https://drkblake.com/wp-content/uploads/2023/11/DataWrangling.csv")
# Save the data on your computer
write.csv(FetchedData, "DataWrangling.csv", row.names = FALSE)
# Remove the data from the environment
rm(FetchedData)
# Installing required packages
if (!require("tidyverse")) install.packages("tidyverse")
library(tidyverse)
# Read the data
mydata <- read.csv("DataWrangling.csv")

# Q1: What would the code shown below do? (Answer using the choices for Q1 on the D2L quiz)
NewData <- filter(mydata, Region == "Middle")

# Q2: What would the code shown below do? (Answer using the choices for Q2 on the D2L quiz)
NewData <- filter(mydata, Pct_BB > 75.0)

# Q3: What would the code shown below do? (Answer using the choices for Q3 on the D2L quiz)
NewData <- filter(mydata, Pct_BB > 75.0, Region == "Middle")

# Q4: What would the code shown below do? (Answer using the choices for Q4 on the D2L quiz)
NewData <- filter(mydata, Pct_BB > 75.0 | Region == "Middle")

# Q5: What would the code shown below do? (Answer using the choices for Q5 on the D2L quiz)
NewData <- select(mydata, County, Region, Pct_BB)

# Q6: What would the code shown below do?
# (Answer using the choices for Q6 on the D2L quiz)
NewData <- select(mydata, Region, County, Pct_BB)

# Q7: What would the code shown below do? (Answer using the choices for Q7 on the D2L quiz)
NewData <- select(mydata, contains("Pct_"))

# Q8: What would the code shown below do? (Answer using the choices for Q8 on the D2L quiz)
?select

# Q9: What would the code shown below do? (Answer using the choices for Q9 on the D2L quiz)
NewData <- select(mydata, -Region)

# Q10: What would the code shown below do? (Answer using the choices for Q10 on the D2L quiz)
NewData <- mydata %>% select(-Region)

# Q11: What would the code shown below do? (Answer using the choices for Q11 on the D2L quiz)
NewData <- mydata %>% arrange(Region)

# Q12: What would the code shown below do? (Answer using the choices for Q12 on the D2L quiz)
NewData <- mydata %>% arrange(Pct_BB)

# Q13: What would the code shown below do? (Answer using the choices for Q13 on the D2L quiz)
NewData <- mydata %>% arrange(Region, Pct_BB)

# Q14: What would the code shown below do? (Answer using the choices for Q14 on the D2L quiz)
NewData <- mydata %>% arrange(desc(Pct_BB))

# Q15: What would the code shown below do? (Answer using the choices for Q15 on the D2L quiz)
NewData <- mydata %>% mutate(Density = Households / Land_area)

# Q16: What would the code shown below do? (Answer using the choices for Q16 on the D2L quiz)
NewData <- mydata %>% mutate(Density = Households / Land_area,
                             High_Access = Pct_BB > 75.0)

# Q17: What would the code shown below do? (Answer using the choices for Q17 on the D2L quiz)
mydata %>% group_by(Region) %>% summarize(mean(Pct_BB))

# Q18: What would the code shown below do? (Answer using the choices for Q18 on the D2L quiz)
mydata %>% group_by(Region) %>% summarize(Avg_Access = mean(Pct_BB))

# Q19: What would the code shown below do?
# (Answer using the choices for Q19 on the D2L quiz)
mydata %>%
  group_by(Region) %>%
  summarize(Avg_Access = mean(Pct_BB), SD = sd(Pct_BB))

# Q20: What would the code shown below do? (Answer using the choices for Q20 on the D2L quiz)
mydata %>%
  group_by(Region, High_Access = Pct_BB > 75.0) %>%
  summarize(Avg_Access = mean(Pct_BB), SD = sd(Pct_BB), count = n())
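
# ---------------------------------------------------------------------------
# The group_by()/summarize() pattern used in the questions above, sketched on
# a tiny toy data frame (hypothetical values, not the census data) so the
# result can be checked by hand:
library(dplyr)
toy <- data.frame(Region = c("East", "East", "West", "West"),
                  Pct_BB = c(60, 90, 70, 80))
toy %>%
  group_by(Region) %>%
  summarize(Avg_Access = mean(Pct_BB), count = n())
# Each region averages 75 with a count of 2: summarize() collapses the four
# county rows into one row per group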
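
# ---------------------------------------------------------------------------
# A closing note on joins: the four merge() variants shown in the merging
# section earlier map directly onto dplyr's join verbs. A self-contained
# sketch with hypothetical toy data frames (names and cities invented):
library(dplyr)
a <- data.frame(Name = c("Ana", "Ben", "Cleo"), Age = c(23, 31, 28))
b <- data.frame(Name = c("Ben", "Cleo", "Dan"), City = c("Leeds", "Oslo", "Kyiv"))
inner_join(a, b, by = "Name")   # like merge(a, b): matching rows only
full_join(a, b, by = "Name")    # like merge(a, b, all = TRUE)
left_join(a, b, by = "Name")    # like merge(a, b, all.x = TRUE)
right_join(a, b, by = "Name")   # like merge(a, b, all.y = TRUE)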