My dear reader, how are you? Peace be upon you.
“Design is not just what it looks like and feels like. Design is how it works.” – Steve Jobs
This post explains the Random Forest (RF) modelling technique in the R programming environment.
What is a Random Forest Model?
Random forests, also known as the random forest model, are a method for classification and regression tasks. A random forest operates by building a collection of decision trees and combining the outputs of the individual trees. A typical problem with decision-tree-based techniques is that they overfit the training set. Overfitting can be considered a modelling error that occurs when the trained model fits a limited set of data points too closely. The random forest is a useful technique for correcting the tendency of decision trees to overfit the input dataset.
The random forest is a form of ensemble approach, that is, a divide-and-conquer approach, and can also be considered an advanced form of the nearest-neighbor predictor. The principle that drives ensemble methods is that a group of weak learners can come together to form a strong learner: individually, each classifier is a weak learner, while taken together they form a strong learner.
The random forest starts with a standard machine learning method called a decision tree, which, in ensemble terms, corresponds to our weak learner. The input is entered at the top and, as it traverses down the tree, the data gets bucketed into increasingly smaller sets. When a new input is entered into the system, it is run down all of the trees, and the result is an average (or weighted average) of all of the terminal nodes that are reached. To probe further, watch the following video: DirectMe.
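To make the weak learner concrete, here is a minimal sketch of a single regression tree in R. It uses the rpart package and R's built-in mtcars data purely for illustration; the variables mpg, wt, and hp are just convenient stand-ins for whatever your data contains.

```r
# A single decision tree: the "weak learner" of the ensemble
library(rpart)

# Fit a regression tree predicting fuel economy from weight and horsepower
fit <- rpart(mpg ~ wt + hp, data = mtcars)

# New inputs are run down the tree; each prediction is the mean of the
# terminal node (leaf) the observation ends up in
predict(fit, mtcars[1:3, ])
```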
Advantages
- Random forest predictors naturally give rise to a dissimilarity measure between observations. A random forest is also attractive because it handles missing values well and is robust to outlying observations. In other words, an RF is able to deal with unbalanced data, missing data, and other data-based inconsistencies and unknowns.
- It has the capability to handle non-linearity in the data very well (a quick sketch illustrating this appears after this list).
- To a great extent, it prevents the over-fitting problem that is very common in a typical tree-based algorithmic model.
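As a rough, hedged illustration of the non-linearity point, the sketch below fits a random forest and a plain linear model to noisy sine-wave data. All names and values here are made up for the demonstration:

```r
# Comparing a random forest with a linear model on a non-linear signal
library(randomForest)

set.seed(7)  # arbitrary seed, for reproducibility only
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.1)
d <- data.frame(x = x, y = y)

rf <- randomForest(y ~ x, data = d)

# In-sample error of each model: the forest tracks the sine curve,
# while the straight-line fit cannot
mean((predict(rf, d) - d$y)^2)
mean((fitted(lm(y ~ x, data = d)) - d$y)^2)
```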
Disadvantages
- Random forest models cannot predict beyond the range of the training data, specifically when used for regression (a small demonstration follows this list).
- Additionally, they may over-fit data sets that are particularly noisy. The best thing is to test the algorithm against your own data sets and decide which option works best for you.
- It’s more complex and computationally expensive than a single decision tree algorithm, which makes it slow and ineffective for real-time predictions, since a more accurate prediction requires more trees.
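The first point above is easy to see in a small experiment. The following is a minimal sketch with made-up data: the forest is trained on x values from 1 to 50, so its predictions flatten out for inputs far beyond that range instead of following the trend:

```r
# Demonstrating the extrapolation limit of random forest regression
library(randomForest)

set.seed(3)  # arbitrary seed, for reproducibility only
train <- data.frame(x = 1:50)
train$y <- 2 * train$x + rnorm(50)  # a simple upward linear trend

rf <- randomForest(y ~ x, data = train)

# Predictions for inputs beyond the training range stay roughly capped
# near max(train$y) instead of continuing the trend upward
predict(rf, data.frame(x = c(60, 100, 1000)))
```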
How Does a Random Forest Regression Work?
A typical RF model follows these three steps:
Step 1: Samples are drawn repeatedly, with replacement, from the training data so that each data point has an equal probability of being selected, and each sample is the same size as the original training set.
Let’s say we have the following data:
- x = {0.1, 0.5, 0.4, 0.8, 0.6}
- y = {0.1, 0.2, 0.15, 0.11, 0.13}
where x is an independent variable with 5 data points and y is the dependent variable.
Now bootstrap samples are taken with replacement from the above data set. Suppose the number of trees in the random forest, n_estimators, is set to 3 (this corresponds to the ntree argument of R's randomForest). Then:
Each tree will have a bootstrap sample of size 5 (the same as the original dataset). Assume the three samples turn out to be the following (a sketch of drawing such samples in R appears after this list):
- x1 = {0.5, 0.1, 0.1, 0.6, 0.6}
- x2 = {0.4, 0.8, 0.6, 0.8, 0.1}
- x3 = {0.1, 0.5, 0.4, 0.8, 0.8}
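Drawing such samples in base R is a one-liner per tree. This is a minimal sketch; the seed is arbitrary, so the actual draws will differ from the hand-picked samples above:

```r
# Step 1 by hand: bootstrap samples drawn with replacement
set.seed(42)  # arbitrary seed, for reproducibility only
x <- c(0.1, 0.5, 0.4, 0.8, 0.6)

x1 <- sample(x, size = length(x), replace = TRUE)
x2 <- sample(x, size = length(x), replace = TRUE)
x3 <- sample(x, size = length(x), replace = TRUE)
```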
Step 2: A regression tree is trained on each bootstrap sample drawn in the above step, and a prediction is recorded for each sample.
Step 3: The ensemble prediction is then calculated by averaging the predictions of the above trees, which produces the final prediction.
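Putting the three steps together, here is a hand-rolled sketch on the toy data. It uses rpart trees as the base learners, with the control settings loosened so that such tiny trees can split at all; note that a real random forest would also subsample features at each split, which is omitted here:

```r
# Steps 1-3 by hand: bootstrap, fit one tree per sample, average the predictions
library(rpart)

set.seed(1)  # arbitrary seed, for reproducibility only
d <- data.frame(x = c(0.1, 0.5, 0.4, 0.8, 0.6),
                y = c(0.10, 0.20, 0.15, 0.11, 0.13))

n_trees <- 3
preds <- replicate(n_trees, {
  boot <- d[sample(nrow(d), replace = TRUE), ]  # Step 1: bootstrap sample
  tree <- rpart(y ~ x, data = boot,
                control = rpart.control(minsplit = 2, cp = 0))
  predict(tree, newdata = d)                    # Step 2: per-tree predictions
})

rowMeans(preds)                                 # Step 3: average = final prediction
```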
How to Build a Random Forest Regression Model in R?
Building a random forest model in the R programming environment is simple. First of all, install R and RStudio. Then follow the steps shown below:
```r
# Load the random forest package in R
require(randomForest)

# Load the training set
training = read.csv("../Desktop/training.csv")

# Check that the correct data has been imported
head(training)
```

A sample of the output is given below:
| energy  | y1       | y2       | y3       | y4   | y5       | y6    | y7       | y8       | y9     |
|---------|----------|----------|----------|------|----------|-------|----------|----------|--------|
| 5404.69 | 31038832 | 50798074 | 3.85E+08 | 424  | 8850802  | 20068 | 45250716 | 1.35E+10 | 265268 |
| 433.99  | 1647616  | 8710487  | 29983242 | 1178 | 257396   | 16348 | 1714747  | 8.02E+08 | 69291  |
| 16534.3 | 79108301 | 54540286 | 8.82E+08 | 1100 | 18507472 | 29909 | 1.07E+08 | 4.22E+10 | 685752 |
| 2778.26 | 6693094  | 46422085 | 1.67E+08 | 400  | 895036   | 18038 | 7619163  | 3.13E+09 | 273757 |
```r
# To get more insight into the data, check the summary
summary(training)

# Load the testing set
testing = read.csv("../Desktop/testing.csv")

# Check that the correct data has been imported
head(testing)

# Again, to get more insight into the data, check the summary
summary(testing)

# Train the model using default parameters.
# Other parameters that can be given to the model are: ntree (number of trees),
# mtry (number of features tried at each split), and importance = TRUE
output <- randomForest(energy ~ y1 + y2 + y3 + y4 + y5 + y6 + y7 + y8 + y9,
                       data = training)

# Predict on the testing set once the model has been trained
Prediction <- predict(output, testing)

# Finally, save the output results in a csv file using the following command
write.csv(Prediction, file = "firstforest.csv", row.names = FALSE)
```
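Once the default model runs, it is worth experimenting with the parameters mentioned in the comments above. The sketch below assumes the same training data frame; the parameter values are illustrative, not prescriptive:

```r
# A sketch of tuning the forest (parameter values chosen for illustration only)
tuned <- randomForest(energy ~ y1 + y2 + y3 + y4 + y5 + y6 + y7 + y8 + y9,
                      data = training,
                      ntree = 500,        # more trees: more stable, but slower
                      mtry = 3,           # features tried at each split
                      importance = TRUE)  # track variable importance

print(tuned)       # includes an out-of-bag (OOB) error estimate
varImpPlot(tuned)  # shows which predictors matter most
```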
I hope you found this tutorial useful. If you spot any errors or see room for improvement, let me know in the comments below.
Signing off for today. Stay tuned and I will see you next week! Happy learning.