Do Airlines Lower Fares When More People Fly? – Sabbir's Data & Analytics Lab

Problem – As an airline manager, I want to know whether carrying more passengers gives me the flexibility to reduce airfare.

Overview –

A simple linear regression model was applied to examine the relationship between average airfare (FARE) and the average weekly number of passengers (PASSENGERS). The model suggests that airfare decreases slightly for every additional passenger, indicating that higher demand may lead to competitive pricing. However, the low R-squared value indicates that passenger volume alone explains little of the variation in airfare, suggesting that other factors (e.g., flight origin, landing point, distance) play a role.

Dataset link – https://drive.google.com/file/d/18qGmTqSQo6d80H8pySBxre1KVo8YFCaD/view?usp=sharing

I will not cover data preprocessing in this post; you can assume the dataset is cleaned and ready for regression analysis. The regression steps are as below –

Let’s read the data –

airfare = read.csv("your_path")

Now fit the model –

#Let's fit the model
airfare.model <- lm(FARE~PASSENGERS, data = airfare)
names(airfare.model)

Linear regression models are considered good when they meet the assumptions of linearity, independence, normality, and equal variances (i.e., homoscedasticity). To assess the linearity, normality, and homoscedasticity assumptions, we consider a scatterplot of average airfare vs. average weekly number of passengers, a normal Q-Q plot of the residuals, and a scatterplot of residuals vs. fitted values for a linear model of average airfare on the average weekly number of passengers.

# Let's check the linearity 
plot(airfare$PASSENGERS, airfare$FARE, xlab = "No of Passengers", ylab = "Airfare", main = "No of passengers vs Airfare")

abline(lm(FARE ~ PASSENGERS,  data = airfare), col = "red")

The scatterplot of average airfare vs. average weekly number of passengers does not show a clear non-linear trend, but this is admittedly difficult to assess, as much of the data are compressed toward the lower end of values for the average weekly number of passengers.

Let’s check the homoscedasticity and linearity assumptions –

# Scatter plot of residuals vs. fitted values.
plot(airfare.model$fitted.values, airfare.model$residuals, xlab = "Fitted Values", ylab = "Residuals", main = "Scatterplot of Residuals vs. Fitted Values")
abline(h = 0)

The residuals show a pattern that does not fully satisfy the model assumptions. For the linearity and equal-variance assumptions to hold, the residuals should be randomly dispersed about the horizontal line with a constant spread. The plot clearly reveals a funnel-like shape, with the spread of residuals increasing as the fitted values grow. This points mainly to heteroscedasticity (unequal variance in residuals), and a nonlinear relationship between the variables cannot be ruled out.
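The funnel shape can also be checked with a number rather than by eye. Below is a minimal sketch of a studentized (Koenker) Breusch-Pagan test written in base R. It runs on simulated stand-in data, not on the dataset from this post; the data frame `sim`, the model `sim.model`, and every number used to generate them are my own assumptions for illustration.

```r
# Simulated stand-in data (NOT the post's dataset): noise grows with PASSENGERS
set.seed(42)
n <- 300
sim <- data.frame(PASSENGERS = runif(n, 500, 8000))
sim$FARE <- 200 - 0.009 * sim$PASSENGERS + rnorm(n, sd = 0.005 * sim$PASSENGERS)

sim.model <- lm(FARE ~ PASSENGERS, data = sim)

# Auxiliary regression: squared residuals on the predictor
sim$res2 <- residuals(sim.model)^2
aux.model <- lm(res2 ~ PASSENGERS, data = sim)

# Studentized Breusch-Pagan statistic: n * R-squared of the auxiliary fit,
# compared with a chi-squared distribution (df = number of predictors = 1)
bp.stat <- n * summary(aux.model)$r.squared
bp.pvalue <- pchisq(bp.stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp.stat, p.value = bp.pvalue))
```

A small p-value is evidence against constant variance. To try it on the real model, replace `sim` and `sim.model` with `airfare` and `airfare.model`.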

#Let's check the normality
#Normal Q-Q Plot for residuals 
qqnorm(airfare.model$residuals, main = "Normal Q-Q plot")
qqline(airfare.model$residuals)

The Q-Q plot indicates a violation of the normality assumption. The residuals depart from the reference line, particularly at the extremes, indicating that the residual distribution may have heavier tails (both lower and upper) or be skewed.
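A formal test can back up the visual reading of the Q-Q plot. Here is a minimal sketch using `shapiro.test()` from base R on the residuals of a model fitted to simulated stand-in data. The data frame `sim` and the skewed noise are assumptions for illustration only, not the post's data. Note that `shapiro.test()` only accepts between 3 and 5000 observations.

```r
# Simulated stand-in data (NOT the post's dataset) with right-skewed noise
set.seed(42)
n <- 300
sim <- data.frame(PASSENGERS = runif(n, 500, 8000))
sim$FARE <- 200 - 0.009 * sim$PASSENGERS + (rexp(n, rate = 1 / 20) - 20)

sim.model <- lm(FARE ~ PASSENGERS, data = sim)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(sim.model))
print(sw)
```

A small p-value suggests the residuals are not normally distributed. With large samples the test flags even minor departures, so I would read it together with the Q-Q plot rather than on its own.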

Let’s apply a log transformation to the FARE variable to see whether it brings the residuals closer to normality.

# Apply the logarithmic transformation to the Fare variable
log_fare_model <- lm(log(FARE) ~ PASSENGERS, data = airfare)
# Check the new residual plots to assess model assumptions
plot(log_fare_model)

After the log transform on the FARE variable, the diagnostic plots do not change much. The Residuals vs. Fitted plot (1st image after log transform) still reveals a funnel-like shape, so the linearity assumption is still not satisfied. The Q-Q plot (2nd image after log transform) shows a slight improvement in the upper tail, but the residuals still deviate in the tails; the fit to a normal distribution is slightly better than for the original model but is still not satisfactory. The Scale-Location plot (3rd image after log transform) and the Residuals vs. Leverage plot (4th image after log transform) still show the funnel shape, so homoscedasticity remains violated. Note that the 4th plot flags influential observations rather than testing independence.

As the log-transformed model does not perform better, I keep the base model.

#model fit summary
summary(airfare.model)
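The numbers discussed next (slope, p-value, R-squared) can be pulled out of the summary object instead of being read off the printed output. This is a minimal sketch on simulated stand-in data; `sim`, `sim.model`, and the generating values are assumptions for illustration and will not reproduce the figures in this post.

```r
# Simulated stand-in data (NOT the post's dataset)
set.seed(42)
n <- 300
sim <- data.frame(PASSENGERS = runif(n, 500, 8000))
sim$FARE <- 200 - 0.009 * sim$PASSENGERS + rnorm(n, sd = 40)

sim.model <- lm(FARE ~ PASSENGERS, data = sim)
sim.summary <- summary(sim.model)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef.table <- coef(sim.summary)
slope <- coef.table["PASSENGERS", "Estimate"]
slope.pvalue <- coef.table["PASSENGERS", "Pr(>|t|)"]

# Share of the variance in FARE explained by PASSENGERS
r.squared <- sim.summary$r.squared

print(c(slope = slope, p.value = slope.pvalue, r.squared = r.squared))
```

The same calls should work on `summary(airfare.model)`.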

PASSENGERS is a statistically significant predictor, as its p-value is below 0.05 (p < 0.05). We can conclude that a one-passenger increase in the average weekly number of passengers is associated with a decrease of approximately 0.0091 units in the average airfare.

Let’s check the 95% confidence interval.

confint(airfare.model, level = 0.95)

With 95% confidence, we can say the mean airfare decreases by between 0.0047 and 0.0136 units for each additional passenger.
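For readers curious where the interval comes from: for a linear model, `confint()` computes the estimate plus or minus a t critical value times the standard error. The sketch below rebuilds the interval by hand and compares it with `confint()`. It uses simulated stand-in data; `sim` and `sim.model` are assumptions for illustration, so the numbers will differ from those above.

```r
# Simulated stand-in data (NOT the post's dataset)
set.seed(42)
n <- 300
sim <- data.frame(PASSENGERS = runif(n, 500, 8000))
sim$FARE <- 200 - 0.009 * sim$PASSENGERS + rnorm(n, sd = 40)

sim.model <- lm(FARE ~ PASSENGERS, data = sim)
coef.table <- coef(summary(sim.model))

# Slope estimate and its standard error
est <- coef.table["PASSENGERS", "Estimate"]
se <- coef.table["PASSENGERS", "Std. Error"]

# Two-sided 95% critical value from the t distribution
t.crit <- qt(0.975, df = df.residual(sim.model))

manual.ci <- c(est - t.crit * se, est + t.crit * se)
builtin.ci <- confint(sim.model, "PASSENGERS", level = 0.95)

print(manual.ci)
print(builtin.ci)
```

Because the per-passenger slope is tiny, it can be easier to read the interval per 1,000 passengers by multiplying both endpoints by 1000.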
