Unraveling the Mysteries of Quantile Regression: A Comprehensive Analysis and Python Implementation

Juan Esteban de la Calle
5 min read · May 12, 2023


The world of predictive modeling and machine learning is vast and filled with countless statistical techniques. One such technique, often overshadowed by its more popular counterparts, is Quantile Regression. In this article, we delve deep into the arcane world of Quantile Regression, elucidating its applications, advantages, and disadvantages. Additionally, we provide Python code to implement both a common Quantile Regression and a LightGBM-based approach.

Quantile Regression: A Brief Overview

Traditional linear regression techniques, such as Ordinary Least Squares (OLS), focus on predicting the conditional mean of a response variable given a set of predictors. Quantile Regression, on the other hand, aims to model the conditional quantile of the response variable, making it a powerful tool for understanding the relationship between variables across various quantiles of the response distribution.
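To make this concrete, quantile regression at level alpha (often written tau) minimizes the so-called pinball or check loss, which penalizes under- and over-predictions asymmetrically. The sketch below is a minimal NumPy illustration of that loss, not the API of any particular library:

import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    # Residuals: positive when the model under-predicts, negative when it over-predicts
    residual = np.asarray(y_true) - np.asarray(y_pred)
    # Under-predictions are weighted by alpha, over-predictions by (1 - alpha)
    return np.mean(np.maximum(alpha * residual, (alpha - 1) * residual))

Setting alpha = 0.5 recovers (half of) the mean absolute error, which is why the 0.5 quantile fit is the conditional median, the natural counterpart to the conditional mean estimated by OLS.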

Applications

Finance: Quantile Regression is used to model and predict financial risk, including Value-at-Risk (VaR) and Conditional Value-at-Risk (CVaR) for portfolio optimization and risk management.

Medicine: In medical research, Quantile Regression helps analyze relationships between variables across the entire distribution, offering insights into treatment effectiveness at various stages of a disease.

Ecology: Quantile Regression enables ecologists to study species distribution patterns and habitat preferences, providing valuable information for conservation efforts.

Applications of quantile regression when alpha (the desired quantile level) is set to 0.90, 0.95, 0.99, 0.10, 0.05, or 0.01

Quantile regression is a versatile technique that can be applied to various real-world problems by specifying different quantile levels (alpha). When alpha is set to 0.9 or 0.95, it can be particularly useful in applications where the focus is on the upper tail of the conditional distribution. Conversely, when alpha is set to 0.1 or 0.05, the interest lies in the lower tail of the distribution. Some practical applications for these specific quantiles include:

Risk management and finance: Quantile regression with alpha set to 0.9, 0.95, or even higher is often used to model Value-at-Risk (VaR) in financial risk management. VaR estimates the potential loss in a given investment portfolio over a specific time horizon and with a certain level of confidence (e.g., 95% confidence). By estimating the VaR, financial institutions and investors can better understand and manage the risks associated with their investments.

Health and medicine: When studying the effectiveness of a treatment or drug, researchers may be interested in quantifying the impact on both ends of the distribution (e.g., the 0.1 or 0.05 quantile for patients who respond poorly and the 0.9 or 0.95 quantile for those who respond well). Quantile regression can help uncover the heterogeneous effects of treatments or interventions across different segments of the population, which may not be captured by traditional mean regression methods.

Environmental sciences: In environmental monitoring and pollution control, quantile regression with alpha set to 0.9 or 0.95 can be employed to identify extreme values or rare events, such as unusually high concentrations of pollutants, extreme temperatures, or extreme precipitation events. These quantiles can help policymakers and researchers better understand and address the potential impacts of extreme environmental conditions.

Education: Educational researchers might be interested in assessing the impact of interventions or teaching methods on the lower (0.1 or 0.05) or upper (0.9 or 0.95) tails of the student achievement distribution. By focusing on specific quantiles, researchers can better understand the performance of students who struggle or excel and tailor educational programs to address their specific needs.

Quantile regression with alpha set to 0.1, 0.05, 0.9, or 0.95 can provide valuable insights in various fields where the focus is on the tails of the distribution. The technique allows for a more detailed understanding of the data, facilitating the development of targeted strategies and interventions.
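As a quick illustration of how these quantile levels are specified in practice, the sketch below fits the same linear specification at the 0.05, 0.50, and 0.95 levels with statsmodels (the same quantreg API used in the full example later in this article). The toy data and column names are placeholders, not a real dataset:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy heteroscedastic data; in practice df would be your own DataFrame
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
df = pd.DataFrame({'x': x, 'y': 2 * x + rng.normal(0, 1 + 0.5 * x)})

# Fit the same specification at the lower tail, the median, and the upper tail
for alpha in [0.05, 0.5, 0.95]:
    fit = smf.quantreg('y ~ x', df).fit(q=alpha)
    print(f"alpha={alpha}: intercept={fit.params['Intercept']:.2f}, slope={fit.params['x']:.2f}")

Because the error spread grows with x in this toy data, the fitted slopes at the 0.05 and 0.95 levels diverge from the median slope, which is exactly the kind of tail behavior the applications above exploit.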

Advantages

Robustness: Quantile Regression is less sensitive to outliers than traditional regression techniques, making it ideal for datasets with heavy-tailed distributions or extreme values.

Flexibility: By modeling different quantiles, Quantile Regression provides a more comprehensive view of the relationship between variables across the entire distribution.

Heteroscedasticity: Quantile Regression handles heteroscedasticity gracefully, a common problem for traditional regression techniques that arises when the variance of the response variable is not constant across levels of the predictor variables.

Disadvantages

Computational Complexity: Quantile Regression is more computationally intensive than traditional regression techniques, since the quantile loss has no closed-form solution and is typically solved with linear-programming or iterative methods, especially when several quantiles are fit.

Interpretability: Interpreting Quantile Regression results can be challenging, especially when comparing coefficients across different quantiles.

Implementing Quantile Regression in Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import lightgbm as lgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, make_scorer
from statsmodels.compat import lzip
from statsmodels.stats.diagnostic import het_breuschpagan

# Set random seed for reproducibility
np.random.seed(42)

# Simulate data with an asymmetric error distribution
n = 1000
x = np.random.uniform(-2, 2, n)
error = np.random.gamma(2, 1, n) * (x > 0) - np.random.gamma(2, 1, n) * (x <= 0)
y = 3 + 2 * x + 5 * (x > 0) - 3 * (x > 1) + error

# Create a DataFrame
data = pd.DataFrame({'x': x, 'y': y})

# Split the data into training and test sets before fitting any model
X_train, X_test, y_train, y_test = train_test_split(
    data[['x']], data['y'], test_size=0.2, random_state=42
)
train_df = pd.concat([X_train, y_train], axis=1)

# Test for heteroskedasticity using the Breusch-Pagan test
ols_model = smf.ols(formula='y ~ x', data=train_df).fit()
bp_test = het_breuschpagan(ols_model.resid, ols_model.model.exog)
labels = ['LM Statistic', 'LM-Test p-value', 'F-Statistic', 'F-Test p-value']
print(lzip(labels, bp_test))

# Ordinary Least Squares (OLS) predictions on the held-out test set
ols_preds = ols_model.predict(X_test)

# Common Quantile Regression (0.5 quantile, i.e. the conditional median)
quantreg_model = smf.quantreg('y ~ x', train_df).fit(q=0.5)
quantreg_preds = quantreg_model.predict(X_test)

# LightGBM Quantile Regression
train_data = lgb.Dataset(X_train, label=y_train)

param_grid = {
    'objective': ['quantile'],
    'alpha': [0.5],  # Target quantile level (0.5 = the median); change this for other percentiles
    'learning_rate': [0.05, 0.1, 0.15],
    'num_leaves': [31, 63],
    'min_child_samples': [10, 20],
    'verbosity': [-1],
}

lgb_estimator = lgb.LGBMRegressor()
grid_search = GridSearchCV(
    lgb_estimator,
    param_grid,
    scoring=make_scorer(mean_absolute_error, greater_is_better=False),
    cv=5,
    verbose=0
)

grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
lgb_model = lgb.train(best_params, train_data, num_boost_round=200)
lgb_predictions = lgb_model.predict(X_test)

# Plot the results
plt.scatter(x, y, alpha=0.3, label='Data', color='gray')
plt.plot(np.sort(x), ols_model.predict(data[['x']].sort_values('x')), label='OLS', color='red', linestyle='--')
plt.plot(np.sort(x), quantreg_model.predict(data[['x']].sort_values('x')), label='Quantile Regression', color='green', linestyle='-.')
plt.plot(np.sort(x), lgb_model.predict(data[['x']].sort_values('x')), label='LightGBM', color='blue', linestyle=':')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.title('Comparison of OLS, Quantile Regression, and LightGBM')
plt.grid(True)
plt.show()

# Compare performance
print('OLS Mean Absolute Error:', mean_absolute_error(y_test, ols_preds))
print('Quantile Regression Mean Absolute Error:', mean_absolute_error(y_test, quantreg_preds))
print('LightGBM Quantile Regression Mean Absolute Error:', mean_absolute_error(y_test, lgb_predictions))
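One caveat on the comparison above: mean absolute error is the natural score for the 0.5 quantile (it is proportional to the pinball loss at alpha = 0.5), but if you change alpha you should score the models with the pinball loss at that same level. A minimal sketch, continuing the script above (the helper is repeated here so the snippet runs on its own):

# Pinball (check) loss, as introduced in the overview section
def pinball_loss(y_true, y_pred, alpha):
    residual = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.maximum(alpha * residual, (alpha - 1) * residual))

# Score the three sets of test predictions at the quantile level that was fit
alpha = 0.5
for name, preds in [('OLS', ols_preds), ('Quantile Regression', quantreg_preds), ('LightGBM', lgb_predictions)]:
    print(name, 'pinball loss:', pinball_loss(y_test, preds, alpha))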

Bonus

In the designed simulation, the data has been generated with non-linearities and an asymmetric error distribution, which creates a challenging environment for linear models like Ordinary Least Squares (OLS) regression. In such a setting, the Quantile Regression method is expected to outperform OLS, as it is more robust against non-symmetric errors and better suited for estimating specific quantiles of the conditional distribution.

On the other hand, LightGBM, a gradient boosting framework that uses tree-based learning algorithms, is expected to outperform both OLS and Quantile Regression. The main advantage of LightGBM is its ability to capture complex relationships and non-linearities in the data. Furthermore, when estimating quantile regression with LightGBM, the model benefits from the combination of the powerful tree-based structure and the robustness of the quantile regression method. This combination allows LightGBM to efficiently handle the non-linearities and asymmetric error distribution present in the simulated data, resulting in better performance compared to the OLS and standalone Quantile Regression methods.
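To make that point concrete, the same LightGBM quantile objective can be refit at a low and a high alpha to bracket the median with a rough prediction interval, which is where the asymmetric errors in this simulation really show up. This is a sketch reusing X_train, y_train, and X_test from the script above, with illustrative (untuned) parameters:

# Sketch: lower and upper conditional quantiles with the LightGBM quantile objective
quantile_models = {}
for alpha in [0.05, 0.95]:
    model = lgb.LGBMRegressor(objective='quantile', alpha=alpha,
                              learning_rate=0.05, num_leaves=31, verbosity=-1)
    model.fit(X_train, y_train)
    quantile_models[alpha] = model

lower = quantile_models[0.05].predict(X_test)
upper = quantile_models[0.95].predict(X_test)
print('Average width of the approximate 90% prediction interval:', np.mean(upper - lower))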

In summary, the simulation demonstrates the power of combining tree-based models like LightGBM with quantile regression, highlighting their ability to outperform traditional linear methods such as OLS, particularly in cases with non-linearities and asymmetric error distributions.
