Executive Summary

Using inertial measurement devices such as those found in the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. People regularly quantify how much of a particular activity they do, but rarely how well they perform it. The goal of this project is to use machine learning techniques to classify data obtained from inertial measurement devices on the belt, forearm, arm, and dumbbell of 6 participants as they perform barbell lifts correctly and incorrectly in 5 different ways. This report demonstrates that > 99% classification accuracy can be achieved using Random Forest Classification. The classification methods described here could provide users with feedback on how well they are performing an activity.

Data Cleanup

The training and testing data sets both contain a mix of raw sensor data and summary statistics derived from the raw data (e.g., variables with avg, var, and stddev labels). The summary variables contain values only once per window, where windows were defined by the authors of the study from which the data was obtained (Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013); apart from these sparse entries, they contain only blank or NA values. The paper suggests that the authors used the summary statistics for their classification, but it is difficult to decipher exactly how they were used. Because the testing data set contains only blank or NA values for every summary variable, these variables cannot be used to classify the test cases, so they were removed entirely from both the training and testing data sets. Additionally, variables that merely identify an observation (user name, timestamps, and window IDs) were removed, since they should not be included in a model. Thus, the machine learning methods described here use only raw inertial data for classification, and the number of variables in each data set was reduced from 160 to 53.

training <- read.table("~/Desktop/programming/DataSci/RMachineLearning/finprojtrain.csv", sep = ",", header=TRUE)
testing <- read.table("~/Desktop/programming/DataSci/RMachineLearning/finprojtest.csv", sep = ",", header=TRUE)

drops <- names(training)[sapply(training, function(col) {
  # flag sparse summary columns: any NA or blank ("") entry
  anyNA(col) || any(col == "", na.rm = TRUE)
})]
subtraining <- training[, !(names(training) %in% drops)]
subtraining <- subtraining[, -(1:7)]  # drop ID variables: row number, user name, timestamps, windows

subtesting <- testing[, !(names(testing) %in% drops)]
subtesting <- subtesting[, -(1:7)]

names(subtraining)
##  [1] "roll_belt"            "pitch_belt"           "yaw_belt"            
##  [4] "total_accel_belt"     "gyros_belt_x"         "gyros_belt_y"        
##  [7] "gyros_belt_z"         "accel_belt_x"         "accel_belt_y"        
## [10] "accel_belt_z"         "magnet_belt_x"        "magnet_belt_y"       
## [13] "magnet_belt_z"        "roll_arm"             "pitch_arm"           
## [16] "yaw_arm"              "total_accel_arm"      "gyros_arm_x"         
## [19] "gyros_arm_y"          "gyros_arm_z"          "accel_arm_x"         
## [22] "accel_arm_y"          "accel_arm_z"          "magnet_arm_x"        
## [25] "magnet_arm_y"         "magnet_arm_z"         "roll_dumbbell"       
## [28] "pitch_dumbbell"       "yaw_dumbbell"         "total_accel_dumbbell"
## [31] "gyros_dumbbell_x"     "gyros_dumbbell_y"     "gyros_dumbbell_z"    
## [34] "accel_dumbbell_x"     "accel_dumbbell_y"     "accel_dumbbell_z"    
## [37] "magnet_dumbbell_x"    "magnet_dumbbell_y"    "magnet_dumbbell_z"   
## [40] "roll_forearm"         "pitch_forearm"        "yaw_forearm"         
## [43] "total_accel_forearm"  "gyros_forearm_x"      "gyros_forearm_y"     
## [46] "gyros_forearm_z"      "accel_forearm_x"      "accel_forearm_y"     
## [49] "accel_forearm_z"      "magnet_forearm_x"     "magnet_forearm_y"    
## [52] "magnet_forearm_z"     "classe"
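The dropping rule used above can be illustrated on a small hypothetical data frame (the column names and values here are invented for illustration): any column containing an NA or blank entry is removed, keeping only complete raw columns.

```r
# Toy data frame mimicking the structure of the training set
toy <- data.frame(
  roll     = c(1.2, 0.8, 1.5),    # complete raw column -> kept
  avg_roll = c(NA, NA, 1.1),      # sparse summary column -> dropped
  kurtosis = c("", "", "0.3"),    # blank-padded column -> dropped
  classe   = c("A", "B", "A"),
  stringsAsFactors = FALSE
)
drops <- names(toy)[sapply(toy, function(col) anyNA(col) || any(col == "", na.rm = TRUE))]
kept  <- toy[, !(names(toy) %in% drops)]
names(kept)  # "roll" "classe"
```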

Model Building with Random Forests

Now that the data has been cleaned, the next step is to partition the training data into a training set and a validation set. A 70% to 30% split is used for the training and validation sets. The training set is used to train a classifier using Random Forests. So that the model building is completed in a reasonable amount of time yet still achieves accurate results, the number of trees is limited to 200 and 5-fold cross-validation is performed for resampling.
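caret's createDataPartition draws a stratified random sample, so the class proportions of classe are preserved in both the training and validation sets. A minimal base-R sketch of the same idea, using a hypothetical label vector rather than the project data:

```r
set.seed(100)
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# sample 70% of the indices within each class level (stratified split)
inTrain <- unlist(lapply(split(seq_along(classe), classe),
                         function(idx) sample(idx, size = round(0.7 * length(idx)))))

table(classe[inTrain])   # A: 35, B: 21, C: 14 -- proportions preserved
table(classe[-inTrain])  # A: 15, B:  9, C:  6
```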

library(caret)
set.seed(100)
inTrain <- createDataPartition(y = subtraining$classe, p=0.7, list = FALSE)
trainSet <- subtraining[inTrain, ]
validSet <- subtraining[-inTrain, ]

tc <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
modRF <- train(classe ~., data = trainSet, method="rf", trControl=tc, ntree=200, importance=TRUE)
modRF
## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10990, 10989, 10989, 10990, 10990 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9901727  0.9875670
##   27    0.9896630  0.9869238
##   52    0.9818738  0.9770667
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
modRF$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 200, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.9%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3903    2    1    0    0 0.0007680492
## B   17 2630   11    0    0 0.0105342363
## C    0   26 2364    6    0 0.0133555927
## D    0    0   48 2201    3 0.0226465364
## E    1    0    2    6 2516 0.0035643564

Calculating Out-of-Sample Error and Variable Importance

The model achieves 99.41% accuracy on the validation set, so the estimated out-of-sample error is 0.59%. The variable importance plot shows that yaw_belt and roll_belt are the most important variables. Interestingly, a classifier built from only these two variables achieves ~70% accuracy, and one built from all of the belt-sensor data achieves ~90% accuracy (not shown).

predRF <- predict(modRF, validSet)
confusionMatrix(validSet$classe, predRF)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    1    0    0    0
##          B    7 1130    2    0    0
##          C    0    4 1020    2    0
##          D    0    0   18  946    0
##          E    0    0    0    1 1081
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9941          
##                  95% CI : (0.9917, 0.9959)
##     No Information Rate : 0.2855          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9925          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9958   0.9956   0.9808   0.9968   1.0000
## Specificity            0.9998   0.9981   0.9988   0.9964   0.9998
## Pos Pred Value         0.9994   0.9921   0.9942   0.9813   0.9991
## Neg Pred Value         0.9983   0.9989   0.9959   0.9994   1.0000
## Prevalence             0.2855   0.1929   0.1767   0.1613   0.1837
## Detection Rate         0.2843   0.1920   0.1733   0.1607   0.1837
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9978   0.9968   0.9898   0.9966   0.9999
varImpPlot(modRF$finalModel, main="Variable Importance")
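The reported accuracy and out-of-sample error follow directly from the validation confusion matrix; recomputing them by hand from the counts above:

```r
# Validation-set confusion matrix, transcribed from the confusionMatrix output
cm <- matrix(c(1673,    1,    0,    0,    0,
                  7, 1130,    2,    0,    0,
                  0,    4, 1020,    2,    0,
                  0,    0,   18,  946,    0,
                  0,    0,    0,    1, 1081),
             nrow = 5, byrow = TRUE,
             dimnames = list(LETTERS[1:5], LETTERS[1:5]))

accuracy  <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
oos_error <- 1 - accuracy

round(accuracy, 4)   # 0.9941
round(oos_error, 4)  # 0.0059
```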

Test Case Classification

Classifications for the twenty test cases are shown:

predict(modRF, subtesting[, -53])  # column 53 holds the test-case identifier, not a predictor
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Conclusion

High classification accuracy (> 99%) on the “Qualitative Activity Recognition of Weight Lifting Exercises” data set can be achieved using Random Forest Classification.