Linear Regression Analysis fits a straight line between dependent variable and one or more independent variables. It is used for predicting the dependent variable using independent variables. In this section, Linear Regression analysis will be performed using some of the Python’s libraries/modules such as ‘Panda’, ‘Scikit-Learn’, ‘Numpy’, ‘MatPlot’ and ‘Statsmodels’.
Python is an open source programming language. It has simple and easy to use syntax. It is widely used for data analysis, scientific computation, statistical modeling, machine learning methods and plotting graphs. Boston housing data set is used to perform the analysis, and this public data is available at ‘https://archive.ics.uci.edu/ml/machine-learning-databases/housing/‘, maintained by UCI.
The data contains 13 features (known as independent variables) and one target (known as dependent variable) variable. The model will be built on historical data to predict the Median Price for houses using features related to neighborhood and house characteristics listed below:
- CRIM:per capita crime rate by town
- ZN:proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS:proportion of non-retail business acres per town
- CHAS:Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX:nitric oxides concentration (parts per 10 million)
- RM:average number of rooms per dwelling
- AGE:proportion of owner-occupied units built prior to 1940
- DIS:weighted distances to five Boston employment centres
- RAD:index of accessibility to radial highways
- TAX:full-value property-tax rate per $10,000
- PTRATIO:pupil-teacher ratio by town
- B:This field is calculated through the formula 1000*(Bk – 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT:% lower status of the population
- MEDV:Median value of owner-occupied homes in $1000’s
The last column MEDV is the target.
Data Exploration is important before performing any analysis, Steps in Data Exploration are mentioned below:
- Create DataFrame from external file
- Assign columns names to the columns
- Generate basic statistics for each of the columns
- Plot histogram for target column
First 3 tasks are done using Panda library. Plot is generated using Matplotlib. The source data did not contain any headers.
import pandas as pdboston_pd=pd.read_csv('C:\..\boston.csv', header =None)
Column names can be assigned to the data frame in below manner. The last column refers Target.
boston_pd.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'target']
Below command prints top 5 rows of the data. This is to check whether the data is read properly.
Some basic statistics can be printed for all the columns through the below command.
It has been observed that there is no missing data for any columns in the source data provided.
Histogram will be plotted for ‘target’ column using Python’s Matplotlib library.
The plot looks slightly positively skewed. Skewness of the target variable can be calculated in below manner.
Since the target column is slightly positively skewed, Log transformation is used. However, two models are developed, one with Log(target) and the other without log. Both model are compared in later section.
Data Preparation for the Model
Once the data is explored and ready to be operated, Data preparation is needed before the model is built. Below are the steps performed for preparing the data.
- Split the complete data into training data and test data
- Create X (features) and Y (target) for developing the model
- Log transformation on target
Data will be split into training and test data. Training data will be used to train the model and test data will be used to see how well the it performs.
X refers features (Independent variables) and Y refers target (dependent variable). If random_state is used along with test_size in the train_test_split then split is always same. This argument is helpful when reproducible results are required. test_size = 0.2 argument splits training and test into 80:20 ratio (80% train and 20% test).
from sklearn.model_selection import train_test_splitboston_train, boston_test = train_test_split(boston_pd, test_size = 0.2)X_train = boston_train.drop('target', axis =1)Y_train =boston_train.targetX_test = boston_test.drop('target', axis =1)Y_test =boston_test.targetY_log_train = np.log(boston_train['target'])
Model Building and Performance Test
Once the data is ready, below steps are followed to build the models and test performance.
- Fit Linear Models
- Calculate Measures for the Models
- Residual Plot
- Model Performance
These steps are explained below in details.
Fit Linear Models
Scikit-Learn and Statsmodel libraries are explored in Python v3.4.6 for building the Linear Regression model. Both of the ways are demonstrated here.
Model fit with Scikit-Learn
Below are the steps used to build the model using Scikit-Learn library.
- Define LinearRegression object
- Fit the model using .fit()
- Convert the output of .coef_ into series and assign the respective features names.
- Print estimates of the features and intercept
- Calculate R2 and Adjusted R2
LinearRegression object is defined in below manner. To add the intercept in the model, fit_intercept is set as True.
from sklearn.linear_model import LinearRegressionfrom sklearn.metrics import mean_squared_error, r2_scorelr =LinearRegression(fit_intercept=True)lrmodel1 =lr.fit(X_train, Y_train)coef_lr =pd.Series(lrmodel1.coef_, index =X_train.columns)print(lrmodel1.intercept_)
Model fit with Statsmodel
Below are the steps used to build the model using Statsmodel library.
- Add constants term into features
- Fit the model
- Print intercept and features estimates
- Print model summary which includes estimates, p value, t value, standard error etc.
- Calculate VIF
- Fit the model for Log(target)
import statsmodels.api as sm
Adding intercept into the model:X_train1 = sm.add_constant(X_train)model =sm.OLS(Y_train,X_train1).fit()
Printing the coefficient with Intercept:
Printing the summary of the model:
Calculating VIF using Statsmodelsfrom statsmodels.stats.outliers_influence import variance_inflation_factorvif = [variance_inflation_factor(X_train1.values, i) for i in range(X_train1.shape)]vif_s =pd.Series(vif, index =X_train1.columns)print(vif_s)VIF for the Intercept(constant) can be ignored.
In the model, “INDUS” and “AGE” are insignificant with p values 0.86 and 0.77 respectively. Model is re-run by dropping them.
Now, model is built using log(target) as target.
Calculate Measures for the models
R2 and RMSE are calculated for the models. Scikit-learn library is used.
Model-1 (non-log model)lrmodel2 =lr.fit(X_train2, Y_train)r2 = r2_score(Y_train, lrmodel2.predict(X_train2))print(r2)RMSE_train =np.sqrt(mean_squared_error(Y_train, lrmodel2.predict(X_train2)))Print(RMSE_train)Model-2 (log model)
Scikit-Learn is used to plot Residual.
lrmodel2 =lr.fit(X_train2, Y_train)pred_train = lrmodel2.predict(X_train2)plt.scatter(pred_train, Y_train - pred_train)plt.xlabel("Predicted Y")plt.ylabel("Residual")plt.title("Residual Plot using Training data")plt.show()Residual around the line at 0 should be randomly scattered. If pattern is found that means model needs improvement.
“INDUS” and “AGE” were dropped from the model and they will be dropped from test data as well. Predict the target and calculate the Root Mean Square Error (RMSE) and R2 on test data for two models with log(target) and other without log of target.
Model-1 (non-log model)X_test1 =X_test.drop(['AGE','INDUS'], axis =1)lrmodel2 =lr.fit(X_train2, Y_train)RMSE_test =np.sqrt(mean_squared_error(Y_test, lrmodel2.predict(X_test1)))print(RMSE_test)
Calculating R2 on test data (1 denotes perfect prediction):print(r2_score(Y_test, lrmodel2.predict(X_test1)))Model-2 (log model)lrmodel3 = lr.fit(X_train2, Y_log_train)pred_y_test3 =np.exp(lrmodel3.predict(X_test1))RMSE_test3 =np.sqrt(mean_squared_error(Y_test, pred_y_test3))print(RMSE_test3)print(r2_score(Y_test, pred_y_test3))
Model Train – R2 Train – RMSE Test – R2 Test – RMSE Model-1 (without log) 0.75 0.36 0.70 5.82 Model-2 (with log) 0.78 4.10 0.77 5.13
Plotting the actual vs. predicted on test data using model-2.plt.scatter(pred_y_test3, Y_test)plt.xlabel("Predicted Price")plt.ylabel("Actual Price")plt.title("Predicted vs Actual Plot")plt.show()
Looking at all the metrics of the model and performance on the test data, model fit is better with log(target) compared to model without taking log on target.
The data was split into training and test into 80:20 ratio using the function train_test_split. Two models were built, one without taking log on target variable and the other one after applying log on target. Models were trained on training data and were evaluated on test data.
A better way of getting an estimate of the RMSE on a test data is to hold out many different test sets and then average. Cross validation is efficient way of doing this. In the below section, cross validation will be demonstrated for 1st model using K-Fold Cross Validation.
In the K-Fold Cross Validation, the data will be split randomly into k equal sized subsamples. Out of k subsample, one subsample will be kept for validation and remaining k-1 subsample will be used as training data to fit the model. The cross-validation process will be repeated k times. Average of the root mean squared error/mean squared error for all k sample of the model will be computed. This metric can be further used to compare different set of models.
from sklearn.model_selection import cross_val_scoreRMSE = np.sqrt(-cross_val_score(lrmodel2, X_train2, Y_train, scoring="neg_mean_squared_error", cv = 5))print(RMSE)
Mean value of RMSE for 5-foldsprint(np.mean(RMSE))
Boston housing data was explored and modeled to predict the median housing price using features. Two models were developed and compared on Root Mean Squared Error (RMSE) and R2. RMSE for model with log(target) was found lesser than the RMSE of the model without log target. Model fit considered better with log(target) than model without log target.