My dear reader, how are you? Peace be upon you.
"Everything we hear is an opinion, not a fact. Everything we see is a perspective, not the truth." – Marcus Aurelius
This post explains linear regression modelling techniques in Python using real datasets of Performance Monitoring Counters (PMCs) collected on an Intel Haswell server. The tutorial walks through prediction with three types of linear regression models, sketched in equation form after the list:
- A linear model with intercept and coefficients.
- A linear model with no intercept.
- A linear model with no intercept and positive coefficients only.
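In equation form, with response y (dynamic energy) and predictors x1 to x4, the three models differ only in the constraints placed on the intercept b0 and the coefficients b1 to b4 (a rough sketch of what is fitted below):

Model A:  y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
Model B:  y = b1*x1 + b2*x2 + b3*x3 + b4*x4               (b0 forced to zero)
Model C:  y = b1*x1 + b2*x2 + b3*x3 + b4*x4, all bi >= 0  (b0 forced to zero)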
A few useful links to follow the project in practice:
- LinearRegression GitHub repository — DirectMe
- All other tutorials on LinearRegression — DirectMe
- Clone the LinearRegression GitHub repository using the following command:
git clone https://github.com/ArsalanShahid116/LinearRegression.git
Explaining the dataset
We predict the energy consumption of a Dense Matrix-Multiplication (DGEMM) application. There are four predictor variables in our dataset:
- Floating-point operations (FLOPS). This data is collected using the Likwid (DirectMe) tool.
- RAPL dynamic energy readings. This data is also collected using the Likwid tool.
- Problem size of the DGEMM application.
- The execution time of the DGEMM application.
The output (response) variable is the dynamic energy consumption of the application, which can be collected using the HCLWattsUp API (DirectMe).
The dataset is available on GitHub (DirectMe).
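For orientation before the code, the loading step in the next section expects a CSV whose raw columns are named time, ps, energy, rapl, and flops. A purely hypothetical illustration of the layout (the values are placeholders, not real measurements; see the repository for the actual file):

time,ps,energy,rapl,flops
0.42,512,35.1,30.2,268435456
1.65,1024,140.2,121.7,2147483648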
Explaining the CODE
Follow the steps given below to develop and analyze models.
1) First of all, create a Python virtual environment and install the necessary packages.
LinearRegression$ virtualenv lrenv
LinearRegression$ source lrenv/bin/activate
(lrenv)LinearRegression$ pip3 install numpy pandas matplotlib seaborn scikit-learn

# Generate a requirements file
(lrenv)LinearRegression$ pip3 freeze > requirements.txt

# Create a python file for the linear models.
(lrenv)LinearRegression$ touch dgemm_models.py
2) Import the necessary packages.
# open dgemm_models.py and add the following program
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as sty
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso

sty.use('seaborn-bright')
We use scikit-learn's train_test_split to divide the data into training and testing sets, as shown in the sketch below. A 70/30 or 75/25 percentage split is common practice; train_test_split defaults to 75/25.
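A minimal, self-contained sketch with toy data (not the DGEMM dataset) showing how the test_size parameter controls the split ratio:

import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 10 samples with 2 features each (illustration only)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.3 gives a 70/30 split; omitting it gives the 75/25 default
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=1)
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)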
The LinearRegression class from the scikit-learn library can be used to create simple linear models with and without intercepts. Lasso and Ridge are specialized (regularized) linear regressions; Lasso can additionally enforce constraints such as forcing the coefficients of a linear model to be positive.
3) Import the data file as shown below and make a pandas data frame. You can also rename the data tags (variable names).
data = pd.read_csv('../data/dgemm.csv')
print(data.head(5))

newData = (
    data
    .rename(columns={'time': 'Time',
                     'ps': 'ProblemSize',
                     'energy': 'WattsUpPro',
                     'rapl': 'RAPL',
                     'flops': 'CPU FLOPS'})
)

# check if the dataframe contains the correct data as expected
print(newData.head(5))
# you can also check the dimensions of the dataset to be sure it loaded completely
print(newData.shape)
4) Check the correlation of the variables as shown below; a heatmap visualizes the correlation matrix.
corr = newData.corr()
print(corr)

# Plot correlation coefficients as a heat map
sns.heatmap(corr, vmax=1, linewidths=0.5, cbar_kws={"shrink": .5})
plt.title('Correlation matrix for DGEMM Application')
plt.show()
5) Pair plots are also a nice way to understand the relationships between data variables.
sns.pairplot(newData, x_vars=['RAPL', 'CPU FLOPS', 'ProblemSize'],
             y_vars='WattsUpPro', height=4, kind='reg', aspect=0.7)
sns.pairplot(newData, diag_kind='kde', markers='+')
plt.show()
6) Prepare the input and response variables.
X = newData[['Time', 'RAPL', 'CPU FLOPS', 'ProblemSize']]
y = newData[['WattsUpPro']]

# check if the correct data loaded into the variables
print(X.head(5))
print(y.head(5))
7) Split the data into training and testing datasets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# check if the correct data loaded into the variables
print(X_train.shape)
print(X_test.shape)
print(X_test.head(6))
print(y_train.shape)
print(y_test.shape)
8) Let us now develop our first model: a simple linear model fitted with ordinary least-squares regression, without any constraints.
modelA = LinearRegression(fit_intercept=True)
modelAfit = modelA.fit(X_train, y_train)
print("\n Model A")
print("Intercept", modelAfit.intercept_)
print("Coefficients", modelAfit.coef_)

# reset y_test indexes
y_test_new = y_test.reset_index(drop=True)

modelApredict = modelA.predict(X_test)
modelAprediction = pd.DataFrame(modelApredict)\
    .rename(columns={0: 'ModelAPrediction'})\
    .assign(errors=lambda x:
            (((y_test_new.WattsUpPro - x['ModelAPrediction'])
              / y_test_new.WattsUpPro) * 100).abs())
print("Prediction Mean: %.2f , Max: %.2f , Min: %.2f " %
      (modelAprediction.errors.mean(),
       modelAprediction.errors.max(),
       modelAprediction.errors.min()))
# uncomment to print prediction results
# print(modelAprediction)
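As a side note, the hand-rolled errors column above is simply the absolute percentage error. If your scikit-learn is version 0.24 or newer, you can cross-check the mean against the built-in metric (a sketch reusing the variables defined above):

from sklearn.metrics import mean_absolute_percentage_error

# MAPE is returned as a fraction; multiply by 100 to compare it with
# the percentage errors computed above
mape = mean_absolute_percentage_error(y_test, modelApredict)
print("Model A MAPE: %.2f%%" % (mape * 100))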
9) The second model will have its intercept forced to zero, as shown below.
modelB = LinearRegression(fit_intercept=False)
modelBfit = modelB.fit(X_train, y_train)
print("\n Model B")
print("Intercept", modelBfit.intercept_)
print("Coefficients", modelBfit.coef_)

modelBpredict = modelB.predict(X_test)
modelBprediction = pd.DataFrame(modelBpredict)\
    .rename(columns={0: 'ModelBPrediction'})\
    .assign(errors=lambda x:
            (((y_test_new.WattsUpPro - x['ModelBPrediction'])
              / y_test_new.WattsUpPro) * 100).abs())
print("Prediction Mean: %.2f , Max: %.2f , Min: %.2f " %
      (modelBprediction.errors.mean(),
       modelBprediction.errors.max(),
       modelBprediction.errors.min()))
# uncomment to print prediction results
# print(modelAprediction.head(5))
# print(modelBprediction.head(5))
10) Finally, we will build a specialized linear model with constraints: the coefficients are forced to be positive and the intercept is forced to zero, as shown below. A Lasso model with a very small alpha approximates ordinary least squares while supporting the positivity constraint.
modelC = Lasso(alpha=0.0001, precompute=True, max_iter=10000,
               fit_intercept=False, positive=True,
               random_state=9999, selection='random')
modelCfit = modelC.fit(X_train, y_train)
print("\n Model C")
print("Intercept", modelCfit.intercept_)
print("Coefficients", modelCfit.coef_)

modelCpredict = modelC.predict(X_test)
modelCprediction = pd.DataFrame(modelCpredict)\
    .rename(columns={0: 'ModelCPrediction'})\
    .assign(errors=lambda x:
            (((y_test_new.WattsUpPro - x['ModelCPrediction'])
              / y_test_new.WattsUpPro) * 100).abs())
print("Prediction Mean: %.2f , Max: %.2f , Min: %.2f " %
      (modelCprediction.errors.mean(),
       modelCprediction.errors.max(),
       modelCprediction.errors.min()))
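As an aside, if you are on scikit-learn 0.24 or newer, LinearRegression itself accepts a positive flag, so the same constrained model can be requested directly without the small-alpha Lasso workaround (a sketch reusing the training data from above):

# assumes scikit-learn >= 0.24, where LinearRegression gained `positive`;
# the non-negative least-squares problem is solved exactly, so no alpha,
# iteration count, or random state is needed
altModelC = LinearRegression(fit_intercept=False, positive=True)
altModelC.fit(X_train, y_train.values.ravel())
print("Alternative Model C coefficients", altModelC.coef_)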
11) Let us now plot the results using a line chart from the Python seaborn package.
modelPlot = sns.lineplot(x=y_test.index, y=y_test.WattsUpPro,
                         label='HCLWattsUp', color='brown',
                         markers=True, marker='o', markersize=5,
                         markeredgecolor='red')
modelPlot = sns.lineplot(x=y_test.index, y=modelCprediction.ModelCPrediction,
                         label='Model MM', color='green',
                         markers=True, marker='+', markersize=5,
                         markeredgecolor='green')
modelPlot = sns.lineplot(x=y_test.index, y=X_test.RAPL,
                         label='RAPL', color='blue',
                         markers=True, marker='*', markersize=5,
                         markeredgecolor='blue')
modelPlot.set(xlabel='Problem Sizes', ylabel='Dynamic Energy [J]')
plt.title('Comparison of Model Predictions')
# Uncomment the line below to save the figure on your local machine;
# it must run before plt.show(), which clears the current figure
#plt.savefig('ResultsDgemm/dgemm-predictions.png', format='png', dpi=1000)
plt.show()
12) If you followed the tutorial correctly, you should see the following output for the model predictions once you execute python3 dgemm_models.py
Model A
Intercept [2112.04743501]
Coefficients [[-4.46923029e+02  2.81579830e-02  5.15053894e-09 -2.04198715e-01]]
Prediction Mean: 11.22 , Max: 48.16 , Min: 0.03

Model B
Intercept 0.0
Coefficients [[-3.27792392e+02 -5.31791279e-02  4.04480979e-09 -3.48395841e-02]]
Prediction Mean: 13.09 , Max: 49.50 , Min: 0.30

Model C
Intercept 0.0
Coefficients [1.75312693e+01 0.00000000e+00 1.26471747e-09 0.00000000e+00]
Prediction Mean: 18.57 , Max: 81.08 , Min: 0.23

Notice how the mean prediction error grows (11.22% for Model A, 13.09% for Model B, 18.57% for Model C) as the constraints tighten: the unconstrained model fits the test data best, while Model C trades accuracy for physically meaningful non-negative coefficients.
I hope you find this tutorial useful. If you find any errors or see room for improvement, let me know in the comments below.
Signing off for today. Stay tuned and I will see you next week! Happy learning.