Fama–MacBeth regression

Given data on risk factors and portfolio returns, it is useful to estimate both the portfolio's exposure to each risk factor, that is, how much the factor drives portfolio returns, and how much that exposure is worth. In other words, what premium does the market pay for bearing a given factor risk?

The risk premium then makes it possible to estimate the return of any portfolio whose factor exposures are known or can be estimated.

More formally, we will have i = 1, ..., N asset or portfolio returns over t = 1, ..., T periods, and each asset's excess return in period t will be denoted $r^e_{i,t}$. The goal is to test whether the j = 1, ..., M factors explain the excess returns and to estimate the risk premium associated with each risk factor. In our case, we have N = 17 portfolios and M = 5 risk factors, each with T = 120 periods of data.

Inference problems will likely arise in such cross-sectional regressions because the fundamental assumptions of classical linear regression may not hold. Potential violations include measurement errors, covariation of residuals due to heteroskedasticity and serial correlation, and multicollinearity.

To address the inference problem caused by the correlation of the residuals, Fama and MacBeth proposed a two-step methodology for a cross-sectional regression of returns on risk factors. The two-stage Fama–MacBeth regression is designed to estimate the premium the market rewards for the exposure to a particular risk factor. The two stages consist of:

  • First stage: N time-series regressions, one for each asset or portfolio, of its excess returns on the factors to estimate the factor loadings. In matrix form, for each asset:

$$r^e_i = \alpha_i + F\,\beta_i + \epsilon_i$$

where $r^e_i$ is the $T \times 1$ vector of the asset's excess returns, $F$ is the $T \times M$ matrix of factor returns, and $\beta_i$ is the $M \times 1$ vector of factor loadings.

  • Second stage: T cross-sectional regressions, one for each time period, of the excess returns on the estimated loadings to estimate the risk premia. In matrix form, we obtain a vector $\hat{\lambda}_t$ of risk premia for each period:

$$r^e_t = \hat{B}\,\lambda_t + \eta_t$$

where $r^e_t$ is the $N \times 1$ vector of excess returns in period $t$ and $\hat{B}$ is the $N \times M$ matrix of first-stage factor loadings.

Now we can compute the factor risk premia as the time averages of these estimates and obtain a t-statistic to assess their individual significance, using the assumption that the risk premia estimates are independent over time:

$$t_j = \frac{\hat{\lambda}_j}{\hat{\sigma}(\hat{\lambda}_j)/\sqrt{T}}, \qquad \hat{\lambda}_j = \frac{1}{T}\sum_{t=1}^{T} \hat{\lambda}_{j,t}$$
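To make the two stages concrete, here is a minimal sketch of the two-pass estimator using the OLS and add_constant imports below. It assumes factor returns in a DataFrame factors (T rows, one column per factor) and excess portfolio returns in portfolios (T rows, one column per asset); both variable names are illustrative:

import numpy as np
import pandas as pd
from statsmodels.api import OLS, add_constant

# First pass: one time-series regression per asset yields the N x M loadings
betas = pd.DataFrame({asset: OLS(portfolios[asset],
                                 add_constant(factors)).fit().params.drop('const')
                      for asset in portfolios.columns}).T

# Second pass: one cross-sectional regression per period yields lambda_t
lambdas = pd.DataFrame([OLS(portfolios.loc[t],
                            add_constant(betas)).fit().params.drop('const')
                        for t in portfolios.index],
                       index=portfolios.index)

# Premia are the time averages; t-stats assume independence across periods
premia = lambdas.mean()
t_stats = premia / (lambdas.std() / np.sqrt(len(lambdas)))

The LinearFactorModel class from the linearmodels package, imported further below, implements the same two-stage estimator in a single call: LinearFactorModel(portfolios, factors).fit().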

Why build a linear factor model?

Algorithmic trading strategies use linear factor models to quantify the relationship between the return of an asset and the sources of risk that represent the main drivers of these returns. Each risk factor carries a premium, and the expected asset return corresponds to a weighted average of these risk premia, with the factor exposures as the weights.
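In symbols, with factor loadings $\beta_{i,j}$ as the weights and $\lambda_j$ denoting the premium on factor $j$, the expected excess return of asset $i$ is:

$$E\left[r^e_i\right] = \sum_{j=1}^{M} \beta_{i,j}\,\lambda_j$$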

There are several practical applications of factor models across the portfolio management process from construction and asset selection to risk management and performance evaluation. The importance of factor models continues to grow as common risk factors are now tradeable:

  • A summary of the returns of many assets by a much smaller number of factors reduces the amount of data required to estimate the covariance matrix when optimizing a portfolio (see the sketch after this list)
  • An estimate of the exposure of an asset or a portfolio to these factors allows for the management of the resultant risk, for instance by entering suitable hedges when risk factors are themselves traded
  • A factor model also permits the assessment of the incremental signal content of new alpha factors
  • A factor model can also help assess whether a manager's performance relative to a benchmark is indeed due to skill in selecting assets and timing the market, or if instead, the performance can be explained by portfolio tilts towards known return drivers that can today be replicated as low-cost, passively managed funds without incurring active management fees
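
To make the first point concrete: under an exact factor model with an $N \times M$ loadings matrix $B$, factor covariance matrix $\Omega_f$, and a diagonal matrix $D$ of idiosyncratic variances, the asset covariance matrix becomes

$$\Sigma = B\,\Omega_f\,B^{\top} + D$$

so only $NM + M(M+1)/2 + N$ parameters need to be estimated rather than the $N(N+1)/2$ entries of an unrestricted covariance matrix.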


In [104]:
from datetime import date
from pathlib import Path
from pprint import pprint

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyfolio as pf
import quantstats as qs
from linearmodels.asset_pricing import (TradedFactorModel, LinearFactorModel,
                                        LinearFactorModelGMM)
from pandas_datareader.famafrench import get_available_datasets
from scipy.stats import spearmanr, pearsonr
from statsmodels.api import OLS, add_constant
import finance_db
In [8]:
import seaborn as sns 
from pandas_datareader import data as web
import warnings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 500)

Get Data

Fama and French make updated risk factor and research portfolio data available through their website, and you can use the pandas_datareader package to obtain the data.
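
If you do not recall a dataset's exact name, you can list everything on offer first; a quick sketch using the get_available_datasets helper imported above:

from pandas_datareader.famafrench import get_available_datasets

# list all Fama-French datasets and keep the five-factor files
datasets = get_available_datasets()
print([name for name in datasets if '5_Factors' in name])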

Risk Factors

We will be using the five Fama–French factors, which result from sorting stocks first into two size groups and then into three groups for each of the remaining firm-specific variables.

Hence, the factors involve three sets of value-weighted portfolios formed as 2 x 3 sorts on size and book-to-market, size and operating profitability, and size and investment. The risk factor values are computed as the average returns of the portfolios (PF) as outlined in the following table:

| Label | Name | Description |
|-------|------|-------------|
| SMB | Small Minus Big | Average return on the nine small stock portfolios minus the average return on the nine big stock portfolios |
| HML | High Minus Low | Average return on the two value portfolios minus the average return on the two growth portfolios |
| RMW | Robust Minus Weak | Average return on the two robust operating profitability portfolios minus the average return on the two weak operating profitability portfolios |
| CMA | Conservative Minus Aggressive | Average return on the two conservative investment portfolios minus the average return on the two aggressive investment portfolios |
| Rm-Rf | Excess return on the market | Value-weighted return of all firms incorporated in the US and listed on the NYSE, AMEX, or NASDAQ at the beginning of month t with 'good' data for t, minus the one-month Treasury bill rate |

The Fama-French 5 factors are based on the 6 value-weight portfolios formed on size and book-to-market, the 6 value-weight portfolios formed on size and operating profitability, and the 6 value-weight portfolios formed on size and investment.

We will use returns at a monthly frequency that we obtain for the period 2010 – 2020 as follows:

In [10]:
start = pd.Timestamp('2010')
end = pd.Timestamp('2020-02')

ff_factor = 'F-F_Research_Data_5_Factors_2x3'
ff_factor_data = web.DataReader(ff_factor, 'famafrench', start=start, end=end)[0]
ff_factor_data.info()
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 121 entries, 2010-01 to 2020-01
Freq: M
Data columns (total 6 columns):
Mkt-RF    121 non-null float64
SMB       121 non-null float64
HML       121 non-null float64
RMW       121 non-null float64
CMA       121 non-null float64
RF        121 non-null float64
dtypes: float64(6)
memory usage: 6.6 KB
In [11]:
ff_factor_data.describe()

           Mkt-RF         SMB         HML         RMW         CMA          RF
count  121.000000  121.000000  121.000000  121.000000  121.000000  121.000000
mean     1.081488   -0.068182   -0.266198    0.127190   -0.001322    0.043388
std      3.730119    2.330899    2.343518    1.459477    1.456330    0.065340
min     -9.550000   -4.550000   -6.260000   -3.990000   -3.330000    0.000000
25%     -0.850000   -1.890000   -1.810000   -0.730000   -1.080000    0.000000
50%      1.290000    0.210000   -0.340000    0.170000   -0.020000    0.010000
75%      3.400000    1.330000    0.950000    1.100000    0.910000    0.070000
max     11.350000    6.810000    8.290000    3.480000    3.700000    0.210000


Fama and French also make available numerous portfolios with which we can illustrate the estimation of the risk factor exposures, as well as the value of the risk premia available in the market for a given time period. We will use a panel of the 17 industry portfolios at a monthly frequency.

We will subtract the risk-free rate from the returns because the factor model works with excess returns:

In [12]:
ff_portfolio = '17_Industry_Portfolios'
ff_portfolio_data = web.DataReader(ff_portfolio, 'famafrench', start=start, end=end)[0]
ff_portfolio_data = ff_portfolio_data.sub(ff_factor_data.RF, axis=0)
ff_portfolio_data.info()
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 121 entries, 2010-01 to 2020-01
Freq: M
Data columns (total 17 columns):
Food     121 non-null float64
Mines    121 non-null float64
Oil      121 non-null float64
Clths    121 non-null float64
Durbl    121 non-null float64
Chems    121 non-null float64
Cnsum    121 non-null float64
Cnstr    121 non-null float64
Steel    121 non-null float64
FabPr    121 non-null float64
Machn    121 non-null float64
Cars     121 non-null float64
Trans    121 non-null float64
Utils    121 non-null float64
Rtail    121 non-null float64
Finan    121 non-null float64
Other    121 non-null float64
dtypes: float64(17)
memory usage: 17.0 KB

Equity Data

Vanguard Sector & Specialty ETFs

In [98]:
symbols = ['VOX', 'VCR', 'VDC', 'VDE', 'VFH', 'VHT', 'VIS', 'VGT', 'VAW', 'VNQ', 'VPU']
In [94]:
def get_symbols(symbols, data_source, ohlc, begin_date=None, end_date=None):
    """Download one price field (e.g. 'Close') per symbol into a single DataFrame."""
    out = []
    new_symbols = []
    for symbol in symbols:
        # keep only the requested price field for each symbol
        df = web.DataReader(symbol, data_source, begin_date, end_date)[ohlc]
        out.append(df)
        new_symbols.append(symbol)
    data = pd.concat(out, axis=1)
    data.columns = new_symbols
    return data
In [100]:
prices = get_symbols(symbols, data_source='yahoo', ohlc='Close',
                     begin_date=start, end_date=end)

SPY = web.DataReader('SPY', 'yahoo', start, end)\
      [['High', 'Low', 'Open', 'Close', 'Volume', 'Adj Close']]
In [101]:
# sector names, in the same order as the ETF symbols above
secs = ['COMM', 'CONSUMER DISC', 'CONSUMER ST', 'ENERGY', 'FINANCIALS',
        'HEALTH', 'INDUSTRIALS', 'TECHNOLOGY', 'MATERIALS', 'REAL ESTATE',
        'UTILITIES']
prices.columns = secs
prices['SPY'] = SPY.Close.values
prices.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2537 entries, 2010-01-04 to 2020-01-31
Data columns (total 12 columns):
COMM             2537 non-null float64
CONSUMER DISC    2537 non-null float64
CONSUMER ST      2537 non-null float64
ENERGY           2537 non-null float64
FINANCIALS       2537 non-null float64
HEALTH           2537 non-null float64
INDUSTRIALS      2537 non-null float64
TECHNOLOGY       2537 non-null float64
MATERIALS        2537 non-null float64
REAL ESTATE      2537 non-null float64
UTILITIES        2537 non-null float64
SPY              2537 non-null float64
dtypes: float64(12)
memory usage: 257.7 KB
In [107]:
returns0 = prices.pct_change()
returns0 = returns0.dropna(how='all').dropna(axis=1)

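Note that these ETF returns are daily, whereas the factor and industry portfolio data above are monthly. To use both in the same regression, the daily returns need to be compounded to the monthly frequency. A minimal sketch, assuming the returns0 DataFrame from above (monthly_returns is an illustrative name):

# compound daily simple returns into monthly returns and align the index
# with the monthly PeriodIndex of the Fama-French data
monthly_returns = returns0.add(1).resample('M').prod().sub(1).to_period('M')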

In [106]:
pricesx = returns0.copy()  # daily returns, despite the name

n_secs = len(pricesx.columns)
colors = cm.rainbow(np.linspace(0, 1, n_secs))

# cumulative daily returns per sector
pricesx.cumsum().plot(color=colors, figsize=(12, 6))
plt.title('Cumulative Returns Series')
plt.ylabel('Returns (%)')
plt.grid(True, which='major', axis='both')
plt.legend(bbox_to_anchor=(1.01, 1.1), loc='upper left', ncol=1)
plt.show()

# distribution of daily returns per sector
pricesx.hist(bins=20, figsize=(10, 10))
plt.show()

plt.figure(figsize=(12, 6))
sns.boxplot(data=pricesx)
plt.title('Mean Return')
plt.show()

# hierarchically clustered heatmap of pairwise correlations
g = sns.clustermap(pricesx.corr(), annot=True)
ax = g.ax_heatmap
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)  # workaround for matplotlib 3.1 clipping
g.fig.suptitle('Pairwise Correlations')
plt.show()

pricesx.drop('SPY', axis=1).rolling(63).corr(pricesx.SPY).dropna().plot(color=colors, figsize=(12, 6))
plt.legend(bbox_to_anchor=(1.01, 1.1), loc='upper left', ncol=1)
plt.title('Rolling Quarterly Correlation to SPY')
plt.show()

pricesx.drop('SPY', axis=1).corrwith(pricesx.SPY).sort_values(ascending=True).plot(kind='barh', color=colors, figsize=(12, 6))
plt.title('Correlation to SPY')
plt.show()

pricesx.rolling(63).var().dropna().plot(color=colors, figsize=(12, 6))
plt.legend(bbox_to_anchor=(1.01, 1.1), loc='upper left', ncol=1)
plt.title('Rolling Quarterly Variance')
plt.show()

plt.figure(figsize=(12, 6))
sns.boxplot(data=pricesx.rolling(63).std().dropna())
plt.title('Rolling Quarterly Returns Volatility')
plt.show()