Crozal/GettingAndCleaningData

Getting and Cleaning Data

##Getting and Cleaning Data

###Course Project

The purpose of this project is to demonstrate your ability to collect, work with, and clean a data set. The goal is to prepare tidy data that can be used for later analysis. You will be graded by your peers on a series of yes/no questions related to the project. You will be required to submit: 1) a tidy data set as described below, 2) a link to a GitHub repository with your script for performing the analysis, and 3) a code book called CodeBook.md that describes the variables, the data, and any transformations or work that you performed to clean up the data. You should also include a README.md in the repo with your scripts. This repo explains how all of the scripts work and how they are connected.

###Raw Data collection

  1. Get the data
    • Download the file
      • if (!dir.exists("./Data")) dir.create("./Data")
      • fileURL = "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
      • download.file(fileURL, destfile = "./Data/Dataset.zip", method = "curl")
    • Unzip the file; the decompressed files are placed in the folder named UCI HAR Dataset
      • unzip(zipfile = "./Data/Dataset.zip", exdir = "./Data")
    • Get the list of the files in the UCI HAR Dataset folder
      • fSource = file.path(".", "Data", "UCI HAR Dataset")
      • files = list.files(fSource, recursive = TRUE)
      • files
  2. Read the data from the files and assign it to data frame variables
    • Read the Test and Train Activity files
      • dataActivityTest = read.table(file.path(fSource, "test", "y_test.txt"), header = FALSE)
      • dataActivityTrain = read.table(file.path(fSource, "train", "y_train.txt"), header = FALSE)
    • Read the Subject Test and Train files
      • dataSubjectTrain = read.table(file.path(fSource, "train", "subject_train.txt"), header = FALSE)
      • dataSubjectTest = read.table(file.path(fSource, "test", "subject_test.txt"), header = FALSE)
    • Read the Features Test and Train files
      • dataFeaturesTest = read.table(file.path(fSource, "test", "X_test.txt"), header = FALSE)
      • dataFeaturesTrain = read.table(file.path(fSource, "train", "X_train.txt"), header = FALSE)
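
Before merging, it can help to confirm that the three tables read for each split describe the same observations. The following is a minimal sketch, not part of run_analysis.R: `check_split` is a hypothetical helper, and the toy data frames with made-up values stand in for the real tables.

```r
# Hypothetical helper (not part of run_analysis.R): stops with an error unless
# the features, activity and subject tables of one split have matching row
# counts and the activity and subject tables have exactly one column each.
check_split <- function(features, activity, subject) {
  stopifnot(
    nrow(features) == nrow(activity),
    nrow(features) == nrow(subject),
    ncol(activity) == 1,
    ncol(subject) == 1
  )
  invisible(TRUE)
}

# Toy stand-ins with made-up values, used only to show the call.
toyFeatures <- data.frame(V1 = c(0.1, 0.2, 0.3), V2 = c(1, 2, 3))
toyActivity <- data.frame(V1 = c(5, 5, 4))
toySubject  <- data.frame(V1 = c(2, 2, 2))
ok <- check_split(toyFeatures, toyActivity, toySubject)
```

With the real data, the equivalent calls would be `check_split(dataFeaturesTest, dataActivityTest, dataSubjectTest)` and `check_split(dataFeaturesTrain, dataActivityTrain, dataSubjectTrain)`.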

###Raw Data transformation

#####The R script run_analysis.R does the following.

  1. Merges the training and the test sets to create one data set
    • Concatenate the data tables by rows
      • dataSubject = rbind(dataSubjectTrain, dataSubjectTest)
      • dataActivity = rbind(dataActivityTrain, dataActivityTest)
      • dataFeatures = rbind(dataFeaturesTrain, dataFeaturesTest)
    • Assign names to the variables
      • names(dataSubject) = c("Subject")
      • names(dataActivity) = c("Activity")
      • dataFeaturesNames = read.table(file.path(fSource, "features.txt"), header = FALSE)
      • names(dataFeaturesNames) = c("Key", "Descripcion")
      • names(dataFeatures) = dataFeaturesNames$Descripcion
      • head(dataFeatures)
      • activityLabels = read.table(file.path(fSource, "activity_labels.txt"), header = FALSE)
      • names(activityLabels) = c("Activity", "Descripcion")
    • Merge the columns
      • Data = cbind(dataFeatures, dataSubject, dataActivity) # 563 variables
  2. Extracts only the measurements on the mean and standard deviation for each measurement
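    • Load the dplyr package
      • suppressMessages(library(dplyr))
    • Select the columns whose names contain "mean" or "std" in any letter case (this also keeps the meanFreq() and angle() variables), plus Subject and Activity
      • ColSelected = dataFeaturesNames %>% select(Descripcion) %>% filter(grepl('Mean|Std', Descripcion, ignore.case = TRUE))
      • ColSelected = c(as.character(ColSelected$Descripcion), "Subject", "Activity")
      • Data = Data[, ColSelected]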
  3. Uses descriptive activity names to name the activities in the data set
    • Use an inner join to merge the data with the activity labels
      • Data = inner_join(Data, activityLabels,by="Activity")
      • Data$Activity = Data$Descripcion
      • Data$Descripcion = NULL
  4. Appropriately labels the data set with descriptive variable names
    • Pattern matching and replacement
      • names(Data) = gsub("Acc", "Accelerometer", names(Data))
      • names(Data) = gsub("Gyro", "Gyroscope", names(Data))
      • names(Data) = gsub("BodyBody", "Body", names(Data))
      • names(Data) = gsub("Mag", "Magnitude", names(Data))
      • names(Data) = gsub("^t", "Time", names(Data))
      • names(Data) = gsub("^f", "Frequency", names(Data))
      • names(Data) = gsub("tBody", "TimeBody", names(Data))
      • names(Data) = gsub("-mean()", "Mean", names(Data), ignore.case = TRUE)
      • names(Data) = gsub("-std()", "STD", names(Data), ignore.case = TRUE)
      • names(Data) = gsub("-freq()", "Frequency", names(Data), ignore.case = TRUE)
      • names(Data) = gsub("angle", "Angle", names(Data))
      • names(Data) = gsub("gravity", "Gravity", names(Data))
      • tbl_df(Data)
  5. Creates a second, independent tidy data set with the average of each variable for each activity and each subject, based on the data set from step 4, and writes it to a file
    • Tidy data set
      • DataT = Data %>% group_by(Subject, Activity) %>% summarise_each(funs(mean(.))) %>% arrange(Activity, Subject)
      • write.table(DataT, file = "tidydata.txt", row.names = FALSE)
      • tbl_df(DataT)
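
The renaming and averaging steps can be tried on a toy data frame using base R only. The sketch below is an illustration under assumptions, not part of run_analysis.R: the `toy` data and its values are made up, `aggregate` stands in for the dplyr pipeline, and the escaped pattern is an alternative shown only for comparison. It shows that `()` in a regular expression is an empty group, so the pattern `"-mean()"` replaces only `-mean` and leaves the parentheses in the variable name.

```r
# Toy stand-in for Data after the activity labels are applied; values are made up.
toy <- data.frame(
  Subject  = c(1, 1, 2, 2),
  Activity = c("WALKING", "WALKING", "WALKING", "SITTING"),
  v1 = c(0.2, 0.4, 0.1, 0.3),
  v2 = c(1, 3, 2, 4)
)
names(toy)[3:4] <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X")

# Pattern as written in the README: "()" is an empty regex group,
# so only "-mean" is replaced and "()" stays in the name.
as_written <- gsub("-mean()", "Mean", names(toy), ignore.case = TRUE)

# Alternative for comparison (an assumption, not the author's code):
# escaped parentheses are matched literally and removed as well.
escaped <- gsub("-mean\\(\\)", "Mean", names(toy), ignore.case = TRUE)

print(as_written[3])  # "tBodyAccMean()-X"
print(escaped[3])     # "tBodyAccMean-X"

# Averaging in base R: mean of every measurement per Subject and Activity.
names(toy) <- as_written
toy_means <- aggregate(
  toy[, 3:4],
  by = list(Subject = toy$Subject, Activity = toy$Activity),
  FUN = mean
)
print(toy_means)
```

The toy result has one row per Subject and Activity combination, which is the shape written to tidydata.txt.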
