##Getting and Cleaning Data
###Course Project
The purpose of this project is to demonstrate your ability to collect, work with, and clean a data set. The goal is to prepare tidy data that can be used for later analysis. You will be graded by your peers on a series of yes/no questions related to the project. You will be required to submit: 1) a tidy data set as described below, 2) a link to a GitHub repository with your script for performing the analysis, and 3) a code book called CodeBook.md that describes the variables, the data, and any transformations or work that you performed to clean up the data. You should also include a README.md in the repo with your scripts. This README explains how all of the scripts work and how they are connected.
###Raw Data collection
- Get the data
    - Download the files
        - `fileURL = "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"`
        - `download.file(fileURL, destfile = "./Data/Dataset.zip", method = "curl")`
    - Unzip the files. The decompressed data is in a folder named `UCI HAR Dataset`, extracted under `./Data` so that it matches the path used in the next step
        - `unzip(zipfile = "./Data/Dataset.zip", exdir = "./Data")`
    - Get the list of the files in the `UCI HAR Dataset` folder
        - `fSource = file.path(".", "Data", "UCI HAR Dataset")`
        - `files = list.files(fSource, recursive = TRUE)`
        - `files`
- Read data from the files and assign it to data frame variables
    - Read the test and train activity files
        - `dataActivityTest = read.table(file.path(fSource, "test", "y_test.txt"), header = FALSE)`
        - `dataActivityTrain = read.table(file.path(fSource, "train", "y_train.txt"), header = FALSE)`
    - Read the test and train subject files
        - `dataSubjectTrain = read.table(file.path(fSource, "train", "subject_train.txt"), header = FALSE)`
        - `dataSubjectTest = read.table(file.path(fSource, "test", "subject_test.txt"), header = FALSE)`
    - Read the test and train features files
        - `dataFeaturesTest = read.table(file.path(fSource, "test", "X_test.txt"), header = FALSE)`
        - `dataFeaturesTrain = read.table(file.path(fSource, "train", "X_train.txt"), header = FALSE)`
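The collection steps above can be wrapped in two small helpers. This is a minimal sketch and not part of run_analysis.R: the function names `fetch_dataset` and `read_split` are hypothetical, and the lowercase `y_` file prefix is an assumption about how the archive names its activity files.

```r
# Hypothetical helper: download the archive only if it is missing, then unzip it.
fetch_dataset <- function(url, zip_path = "./Data/Dataset.zip", exdir = "./Data") {
  dir.create(dirname(zip_path), recursive = TRUE, showWarnings = FALSE)
  if (!file.exists(zip_path)) {
    download.file(url, destfile = zip_path, mode = "wb")  # binary mode for zip files
  }
  unzip(zipfile = zip_path, exdir = exdir)
  file.path(exdir, "UCI HAR Dataset")
}

# Hypothetical helper: read the subject, activity and features files of one
# split ("train" or "test") into a named list of data frames.
read_split <- function(source_dir, split) {
  read_one <- function(prefix) {
    path <- file.path(source_dir, split, paste0(prefix, "_", split, ".txt"))
    read.table(path, header = FALSE)
  }
  list(
    subject  = read_one("subject"),
    activity = read_one("y"),  # assumed lowercase; adjust if your copy differs
    features = read_one("X")
  )
}
```

With these helpers, `read_split(fSource, "test")$features` would play the role of `dataFeaturesTest`.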
###Raw Data transformation
#####The R script run_analysis.R does the following.
- Merges the training and the test sets to create one data set
    - Concatenate the data tables by rows
        - `dataSubject = rbind(dataSubjectTrain, dataSubjectTest)`
        - `dataActivity = rbind(dataActivityTrain, dataActivityTest)`
        - `dataFeatures = rbind(dataFeaturesTrain, dataFeaturesTest)`
    - Set names to variables
        - `names(dataSubject) = c("Subject")`
        - `names(dataActivity) = c("Activity")`
        - `dataFeaturesNames = read.table(file.path(fSource, "features.txt"), header = FALSE)`
        - `names(dataFeaturesNames) = c("Key", "Descripcion")`
        - `names(dataFeatures) = dataFeaturesNames$Descripcion`
        - `head(dataFeatures)`
        - `activityLabels = read.table(file.path(fSource, "activity_labels.txt"), header = FALSE)`
        - `names(activityLabels) = c("Activity", "Descripcion")`
    - Merge columns
        - `Data = cbind(dataFeatures, dataSubject, dataActivity)  # 563 variables`
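The merge step relies on every table being stacked in the same train-then-test order, so that row i of the subject, activity and features tables still describes the same observation. A minimal sketch on toy data (the values and the two-column feature table are made up for illustration):

```r
# Toy stand-ins for the train and test tables (two feature columns only).
train <- list(
  subject  = data.frame(V1 = c(1, 3)),
  activity = data.frame(V1 = c(5, 2)),
  features = data.frame(V1 = c(0.28, 0.27), V2 = c(-0.99, -0.98))
)
test <- list(
  subject  = data.frame(V1 = 2),
  activity = data.frame(V1 = 4),
  features = data.frame(V1 = 0.25, V2 = -0.93)
)
feature_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X")

# Stack train on top of test, in the same order for every table.
subject  <- rbind(train$subject,  test$subject)
activity <- rbind(train$activity, test$activity)
features <- rbind(train$features, test$features)

names(subject)  <- "Subject"
names(activity) <- "Activity"
names(features) <- feature_names

# Sanity check before binding columns: all tables must have the same row count.
stopifnot(nrow(subject) == nrow(features), nrow(activity) == nrow(features))
merged <- cbind(features, subject, activity)
```

The row-count check is cheap insurance: `cbind` on data frames of different lengths either fails or recycles rows, and neither outcome is what this step wants.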
- Extracts only the measurements on the mean and standard deviation for each measurement
    - Load the dplyr package
        - `suppressMessages(library(dplyr))`
    - Select the columns whose names contain "Mean" or "Std" (case-insensitive), plus `Subject` and `Activity`
        - `ColSelected = dataFeaturesNames %>% select(Descripcion) %>% filter(grepl('Mean|Std', Descripcion, ignore.case = TRUE))`
        - `ColSelected = c(as.character(ColSelected$Descripcion), "Subject", "Activity")`
        - `Data = Data[, ColSelected]`
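The case-insensitive pattern `Mean|Std` is deliberately broad. The sketch below, on a handful of feature names written in the style of features.txt, shows what it keeps and contrasts it with a stricter pattern. The stricter pattern is only an illustration of an alternative reading of the requirement, not what the script uses.

```r
# A few feature names in the style of features.txt, for illustration.
feature_names <- c(
  "tBodyAcc-mean()-X",
  "tBodyAcc-std()-X",
  "tBodyAcc-mad()-X",
  "fBodyAcc-meanFreq()-X",
  "angle(tBodyAccMean,gravity)"
)

# Broad selection, as in the script: any name containing "mean" or "std",
# ignoring case. This also keeps meanFreq() and the angle() features.
broad <- grepl("Mean|Std", feature_names, ignore.case = TRUE)

# Stricter alternative: only names containing a literal "-mean()" or "-std()".
strict <- grepl("-(mean|std)\\(\\)", feature_names)

data.frame(feature = feature_names, broad = broad, strict = strict)
```

Either choice can be defended as long as the code book states which columns were kept.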
- Uses descriptive activity names to name the activities in the data set
    - Use an inner join to merge the data with the activity labels
        - `Data = inner_join(Data, activityLabels, by = "Activity")`
        - `Data$Activity = Data$Descripcion`
        - `Data$Descripcion = NULL`
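The join replaces each numeric activity code with its label. The same lookup can be sketched in base R with `match()`, which is a swapped-in technique shown only to make the mapping explicit. The small lookup table here is a toy example, not the full activity_labels.txt.

```r
# Toy lookup table in the shape of activityLabels.
activity_labels <- data.frame(
  Activity    = c(1, 4, 6),
  Descripcion = c("WALKING", "SITTING", "LAYING"),
  stringsAsFactors = FALSE
)

# Toy activity codes, one per observation.
codes <- c(4, 1, 6, 1)

# match() returns, for each code, the row of the lookup table holding it;
# indexing the label column with those positions yields the labels.
labels <- activity_labels$Descripcion[match(codes, activity_labels$Activity)]
labels
```

A code that is missing from the lookup table yields `NA` here, whereas `inner_join` would silently drop that row, so the two approaches differ when the labels are incomplete.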
- Appropriately labels the data set with descriptive variable names
    - Pattern matching and replacement
        - `names(Data) = gsub("Acc", "Accelerometer", names(Data))`
        - `names(Data) = gsub("Gyro", "Gyroscope", names(Data))`
        - `names(Data) = gsub("BodyBody", "Body", names(Data))`
        - `names(Data) = gsub("Mag", "Magnitude", names(Data))`
        - `names(Data) = gsub("^t", "Time", names(Data))`
        - `names(Data) = gsub("^f", "Frequency", names(Data))`
        - `names(Data) = gsub("tBody", "TimeBody", names(Data))`
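        - `names(Data) = gsub("-mean\\(\\)", "Mean", names(Data), ignore.case = TRUE)  # parentheses escaped so that "()" is removed too`
        - `names(Data) = gsub("-std\\(\\)", "STD", names(Data), ignore.case = TRUE)`
        - `names(Data) = gsub("-meanFreq\\(\\)", "MeanFrequency", names(Data), ignore.case = TRUE)  # the original pattern "-freq()" matches no feature name`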
        - `names(Data) = gsub("angle", "Angle", names(Data))`
        - `names(Data) = gsub("gravity", "Gravity", names(Data))`
        - `as_tibble(Data)`
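One detail of the renaming step is worth a small experiment: `gsub()` treats its pattern as a regular expression by default, and there `()` is an empty group rather than a pair of literal parentheses. The sketch below uses a single feature name for illustration.

```r
feature <- "tBodyAcc-mean()-X"

# Unescaped: "()" is an empty regex group, so only "-mean" is replaced
# and the literal parentheses stay in the name.
unescaped <- gsub("-mean()", "Mean", feature)

# Escaped: "\\(\\)" matches the literal parentheses, so they are removed too.
escaped <- gsub("-mean\\(\\)", "Mean", feature)

# fixed = TRUE treats the whole pattern as a literal string (same result as escaping).
literal <- gsub("-mean()", "Mean", feature, fixed = TRUE)

# "^t" is anchored to the start of the name, so a "t" inside angle(...) is untouched.
anchored <- gsub("^t", "Time", c(feature, "angle(tBodyAccMean,gravity)"))

c(unescaped = unescaped, escaped = escaped, literal = literal)
```

Note that `fixed = TRUE` cannot be combined with `ignore.case = TRUE`, so the escaped form is the one to use when case-insensitive matching is also wanted.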
- Creates a second, independent tidy data set and outputs it
    - This step creates a second, independent tidy data set with the average of each variable for each activity and each subject, based on the data set from step 4.
    - Tidy data set
        - `DataT = Data %>% group_by(Subject, Activity) %>% summarise_all(mean) %>% arrange(Activity, Subject)`
        - `write.table(DataT, file = "tidydata.txt", row.names = FALSE)`
        - `as_tibble(DataT)`
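The final aggregation computes one row per subject and activity pair, holding the mean of every remaining column. A dependency-free sketch of the same idea uses base R's `aggregate()` instead of the dplyr pipeline. The data frame and the column names `MeanX` and `StdX` are toy values for illustration.

```r
# Toy data: two measurement columns, several observations per group.
toy <- data.frame(
  Subject  = c(1, 1, 2, 2),
  Activity = c("WALKING", "WALKING", "WALKING", "SITTING"),
  MeanX    = c(1, 3, 5, 7),
  StdX     = c(0.5, 1.5, 2.0, 4.0),
  stringsAsFactors = FALSE
)

# Average every non-grouping column within each Subject/Activity pair.
averages <- aggregate(. ~ Subject + Activity, data = toy, FUN = mean)

# Write the result the same way the script does, then read it back to
# confirm the file round-trips with a header row and no row names.
out_file <- tempfile(fileext = ".txt")
write.table(averages, file = out_file, row.names = FALSE)
round_trip <- read.table(out_file, header = TRUE)
```

Reading the file back with `read.table(..., header = TRUE)` is also how a reviewer can load tidydata.txt to inspect the submitted data set.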