Arsalan Shahid

Machine Learning Via Three Linear Regression Models in Python

My dear reader, how are you? السلام عليكم (peace be upon you).

Everything we hear is an opinion, not a fact. Everything we see is a perspective, not the truth – Marcus Aurelius

This post explains linear regression modelling techniques in Python using a real dataset of Performance Monitoring Counters (PMCs) collected on an Intel Haswell server. It walks through prediction with three types of linear regression models:

  1. A linear model with intercept and coefficients.
  2. A linear model with no intercept.
  3. A linear model with no intercept and positive coefficients only.
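
To make the difference between the first two variants concrete, here is a minimal sketch using NumPy least squares. The data is made up for illustration and is not taken from the DGEMM dataset. The third variant additionally restricts every coefficient to be non-negative.

```python
import numpy as np

# Made-up data that follows y = 3 + 2x exactly (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x

# Model 1: intercept and coefficient. A column of ones lets least squares
# estimate the intercept alongside the slope.
A_with = np.column_stack([np.ones_like(x), x])
(intercept, slope_with), *_ = np.linalg.lstsq(A_with, y, rcond=None)

# Model 2: no intercept. The fitted line is forced through the origin.
A_without = x.reshape(-1, 1)
(slope_without,), *_ = np.linalg.lstsq(A_without, y, rcond=None)

print("with intercept:    y = %.2f + %.2f x" % (intercept, slope_with))
print("without intercept: y = %.2f x" % slope_without)
```

With an intercept the fit recovers the generating line, while the no-intercept fit has to absorb the offset into a steeper slope.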

A few useful links for following the project in practice:

  1. LinearRegression GitHub repository — DirectMe
  2. All other tutorials on LinearRegression — DirectMe
  3. Clone LinearRegression GitHub repository using the following command:
git clone

Explaining the dataset

We predict the energy consumption of a Dense Matrix-Multiplication (DGEMM) Application. There are 4 predictor variables in our dataset.

  1. Floating-point operations (FLOPS). This data is collected using the Likwid (DirectMe) tool.
  2. RAPL dynamic energy readings. This data is also collected using the Likwid tool.
  3. Problem size of DGEMM application.
  4. The execution time of DGEMM application.

The output, or response variable, is the dynamic energy consumption of the application, which can be collected using the HCLWattsUp API (DirectMe).

The dataset is available on GitHub (DirectMe)

Explaining the CODE

Follow the steps given below to develop and analyze models.

1) First of all, create a Python virtual environment and install the necessary packages.

LinearRegression$ virtualenv lrenv 
LinearRegression$ source lrenv/bin/activate
(lrenv)LinearRegression$ pip3 install numpy pandas matplotlib seaborn scikit-learn

# Generate a requirements file
(lrenv)LinearRegression$ pip3 freeze > requirements.txt

# Create a python file to create linear models.
(lrenv)LinearRegression$ source

2) Import the necessary packages.

# open and add the following program

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as sty  # assumed module: the name was lost from the original line
import seaborn as sns  
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
import csv


We use train_test_split from scikit-learn to split the data into training and testing sets. A 70/30 or 75/25 split is typical; train_test_split holds out 25% of the rows for testing by default.
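
As a rough illustration of what such a split does, the sketch below shuffles row indices and holds out 25% of them for testing, using NumPy only. The helper name `split_indices` and its parameters are hypothetical; in the tutorial itself the split is done by the library function.

```python
import numpy as np

def split_indices(n_rows, test_fraction=0.25, seed=1):
    """Shuffle row indices and split them into train and test parts.

    Hypothetical helper, for illustration only.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(np.ceil(n_rows * test_fraction))
    # The first n_test shuffled rows are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))  # 75 25
```

Fixing the seed makes the split reproducible, which is the same role `random_state` plays in the library call.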

The LinearRegression class from the scikit-learn library can be used to create simple linear models with and without intercepts. Lasso and Ridge are regularized variants of linear regression, and can be used to build models with constraints such as forcing the coefficients to be positive.
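
The effect of the positivity constraint can be seen on a small synthetic example. This is only a sketch: the data is randomly generated for illustration, and the `alpha` and `max_iter` values are arbitrary choices rather than recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

# Synthetic data (illustrative only): two strongly correlated predictors.
rng = np.random.default_rng(0)
x1 = rng.uniform(1.0, 10.0, size=50)
x2 = x1 + rng.normal(0.0, 0.5, size=50)
X = np.column_stack([x1, x2])
y = 4.0 * x1 - 1.0 * x2 + rng.normal(0.0, 0.1, size=50)

# Unconstrained fit: coefficients may take either sign.
ols = LinearRegression(fit_intercept=False).fit(X, y)

# Constrained fit: positive=True restricts every coefficient to be >= 0.
lasso = Lasso(alpha=0.0001, fit_intercept=False, positive=True,
              max_iter=10000).fit(X, y)

print("LinearRegression coefficients:", ols.coef_)
print("Lasso (positive) coefficients:", lasso.coef_)
```

Because the constrained model cannot use a negative weight, it typically pushes the offending coefficient to zero and compensates with the remaining predictors.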

3) Import the data file as shown below using pd.read_csv and load it into a pandas data frame. You can also rename the data tags or variable names.

data = pd.read_csv('../data/dgemm.csv')

newData = (
    data
    .rename(columns={'time': 'Time', 'ps': 'ProblemSize', 'energy': 'WattsUpPro',
                     'rapl': 'RAPL', 'flops': 'CPU FLOPS'})
)

# check that the data frame contains the expected data
print(newData.head())
# you can also check the dimensions of the dataset to be sure that you loaded it completely
print(newData.shape)

4) Check the correlation of variables as shown below. We also plot a heatmap to show correlations.

corr = newData.corr()

# Plot correlation coefficients as a heat map
sns.heatmap(corr, vmax=1, linewidths=0.5, cbar_kws={"shrink": .5})
plt.title('Correlation matrix for DGEMM Application')
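
For intuition about what the correlation matrix contains, here is a minimal NumPy sketch. pandas' `DataFrame.corr()` computes Pearson correlation by default, and `np.corrcoef` computes the same quantity; the three columns below are made up for illustration.

```python
import numpy as np

# Made-up columns (illustrative only).
time = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
energy = 10.0 * time + 5.0                    # perfectly linear in time
noise = np.array([2.0, -1.0, 0.5, 3.0, -2.5])  # unrelated values

# Each row is one variable; the result is a symmetric 3x3 matrix
# with ones on the diagonal.
corr = np.corrcoef([time, energy, noise])
print(np.round(corr, 2))
```

A value close to 1 or -1 between a predictor and the response suggests a strong linear relationship, which is what the heat map makes visible at a glance.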

5) Pair plots are also a nice way to understand the relationships between data variables.

sns.pairplot(newData, x_vars=['RAPL', 'CPU FLOPS', 'ProblemSize'],
             y_vars='WattsUpPro', height=4, kind='reg', aspect=0.7)

sns.pairplot(newData, diag_kind='kde', markers='+')

6) Prepare the input and response variables.

X = newData[['Time', 'RAPL', 'CPU FLOPS', 'ProblemSize']]
y = newData[['WattsUpPro']]

# check if correct data loaded into variables 

7) Split the dataset into training and testing sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# check if correct data loaded into variables 

8) Let us now develop our first model. We will build a simple linear regression model without any constraints.

modelA = LinearRegression(fit_intercept=True)
modelAfit = modelA.fit(X_train, y_train)

print("\n Model A")
print("Intercept", modelAfit.intercept_)
print("Coefficients", modelAfit.coef_)

# reset y_test indexes
y_test_new = y_test.reset_index(drop=True)

modelApredict = modelA.predict(X_test)

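# Absolute percentage error per test point (assumed form: the original
# expression was cut off after the division)
modelAprediction = pd.DataFrame(modelApredict)\
    .rename(columns={0: 'ModelAPrediction'})\
    .assign(errors=lambda x: (((y_test_new.WattsUpPro -
                               x['ModelAPrediction']) /
                              y_test_new.WattsUpPro) * 100).abs())

print("Prediction Mean: %.2f , Max: %.2f , Min: %.2f " %
      (modelAprediction.errors.mean(), modelAprediction.errors.max(),
       modelAprediction.errors.min()))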

# uncomment to print prediction results
# print(modelAprediction)
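
The error summary reports the mean, maximum and minimum of the per-point prediction errors. Below is a minimal sketch of that calculation, assuming the errors are absolute percentage errors relative to the measured energy. The helper name `percentage_errors` and the sample values are made up for illustration.

```python
import numpy as np

def percentage_errors(measured, predicted):
    """Absolute percentage error of each prediction relative to the measured value."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.abs((measured - predicted) / measured) * 100.0

# Made-up measured and predicted energies (illustrative only).
errors = percentage_errors([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
print("Mean: %.2f , Max: %.2f , Min: %.2f" %
      (errors.mean(), errors.max(), errors.min()))
```

Dividing by the measured value makes errors comparable across problem sizes, since the absolute energy grows quickly with the size of the matrices.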

9) The second model has its intercept forced to zero, as shown below.

modelB = LinearRegression(fit_intercept=False)

modelBfit = modelB.fit(X_train, y_train)

print("\n Model B")
print("Intercept", modelBfit.intercept_)
print("Coefficients", modelBfit.coef_)

modelBpredict = modelB.predict(X_test)

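# Absolute percentage error per test point (assumed form: the original
# expression was cut off after the division)
modelBprediction = pd.DataFrame(modelBpredict)\
    .rename(columns={0: 'ModelBPrediction'})\
    .assign(errors=lambda x: (((y_test_new.WattsUpPro -
                               x['ModelBPrediction']) /
                              y_test_new.WattsUpPro) * 100).abs())

print("Prediction Mean: %.2f , Max: %.2f , Min: %.2f " %
      (modelBprediction.errors.mean(), modelBprediction.errors.max(),
       modelBprediction.errors.min()))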

# uncomment to print prediction results
# print(modelAprediction.head(5))
# print(modelBprediction.head(5))

10) Finally, we will build a specialized linear model with constraints: the coefficients are forced to be positive only and the intercept is forced to zero, as shown below.

# Any further arguments of the original call were lost; the call is closed here
modelC = Lasso(alpha=0.0001, precompute=True, max_iter=10000,
               fit_intercept=False, positive=True, random_state=9999)

modelCfit = modelC.fit(X_train, y_train)

print("\n Model C")
print("Intercept", modelCfit.intercept_)
print("Coefficients", modelCfit.coef_)

modelCpredict = modelC.predict(X_test)

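# Absolute percentage error per test point (assumed form: the original
# expression was cut off after the division)
modelCprediction = pd.DataFrame(modelCpredict)\
    .rename(columns={0: 'ModelCPrediction'})\
    .assign(errors=lambda x: (((y_test_new.WattsUpPro -
                               x['ModelCPrediction']) /
                              y_test_new.WattsUpPro) * 100).abs())

print("Prediction Mean: %.2f , Max: %.2f , Min: %.2f " %
      (modelCprediction.errors.mean(), modelCprediction.errors.max(),
       modelCprediction.errors.min()))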

11) Let us now plot the results as a line chart using the Python seaborn package.

# Measured dynamic energy (ground truth)
modelPlot = sns.lineplot(x=y_test.index, y=y_test.WattsUpPro,
                         label='HCLWattsUp', color='brown',
                         markers=True, marker='o', markersize=5)

# Predictions of Model C
modelPlot = sns.lineplot(x=y_test.index, y=modelCprediction.
                         ModelCPrediction, label='Model MM',
                         color='green', markers=True,
                         marker='+', markersize=5)

# RAPL readings for comparison
modelPlot = sns.lineplot(x=y_test.index, y=X_test.RAPL, label='RAPL',
                         color='blue', markers=True,
                         marker='*', markersize=5)

modelPlot.set(xlabel='Problem Sizes', ylabel='Dynamic Energy [J]')
plt.title('Comparison of Model Predictions')
# Uncomment the line below to save the plot on the local machine
#plt.savefig('ResultsDgemm/fft-predictions.png', format='png', dpi=1000)

12) If you have followed the tutorial correctly, you should see the following output for the model predictions once you run the script with python3:

 Model A
Intercept [2112.04743501]
Coefficients [[-4.46923029e+02  2.81579830e-02  5.15053894e-09 -2.04198715e-01]]
Prediction Mean: 11.22 , Max: 48.16 , Min: 0.03

 Model B
Intercept 0.0
Coefficients [[-3.27792392e+02 -5.31791279e-02  4.04480979e-09 -3.48395841e-02]]
Prediction Mean: 13.09 , Max: 49.50 , Min: 0.30

 Model C
Intercept 0.0
Coefficients [1.75312693e+01 0.00000000e+00 1.26471747e-09 0.00000000e+00]
Prediction Mean: 18.57 , Max: 81.08 , Min: 0.23
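
To show how the printed coefficients turn into a prediction, here is a small sketch that applies the Model C coefficients by hand. The coefficient values and their order (Time, RAPL, CPU FLOPS, ProblemSize) come from the output above and from the definition of X; the function name `predict_model_c` and the input values are hypothetical and chosen only to show the arithmetic.

```python
# Model C coefficients as printed above, in the column order of X:
# Time, RAPL, CPU FLOPS, ProblemSize.
MODEL_C_COEF = [1.75312693e+01, 0.0, 1.26471747e-09, 0.0]

def predict_model_c(time, rapl, flops, problem_size):
    """Dynamic energy predicted by Model C (no intercept)."""
    features = [time, rapl, flops, problem_size]
    return sum(c * f for c, f in zip(MODEL_C_COEF, features))

# Hypothetical inputs, not taken from the dataset.
energy = predict_model_c(time=10.0, rapl=1500.0, flops=1.0e12, problem_size=8192)
print("%.2f" % energy)  # 1440.03
```

Since the RAPL and ProblemSize coefficients are zero in Model C, only execution time and the FLOPS count contribute to its prediction.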

I hope you find this tutorial useful. If you find any errors or feel any need for improvement, let me know in your comments below.

Signing off for today. Stay tuned and I will see you next week! Happy learning.
