A key assumption in standard cross-validation is that the samples available for training are independent and identically distributed (iid). For financial data, this is often not the case: financial data is neither independently nor identically distributed, because of serial correlation and time-varying standard deviation, also known as heteroskedasticity.
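To see what those two symptoms look like in practice, here is a minimal sketch on a synthetic AR(1) series with drifting volatility (purely illustrative, not the price data used later): a large lag-1 autocorrelation signals serial correlation, and a widening rolling standard deviation signals heteroskedasticity.
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 500
noise = rng.randn(n) * np.linspace(0.5, 2.0, n)  # volatility grows over time
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + noise[t]             # AR(1): today depends on yesterday
s = pd.Series(x)
print(s.autocorr(lag=1))                          # well above 0 -> serially correlated
print(s.rolling(50).std().agg(['min', 'max']))    # spread widens -> heteroskedastic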
TimeSeriesSplit in the sklearn.model_selection module aims to address the linear order of time-series data. The sklearn.model_selection.TimeSeriesSplit object implements a walk-forward test with an expanding training set, where subsequent training sets are supersets of past training sets.
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import KFold
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import seaborn as sns
import sklearn
from sklearn.linear_model import ElasticNet
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
np.random.seed(1338)
cmap_data = plt.cm.Paired
cmap_cv = plt.cm.coolwarm
n_splits = 5
import pandas as pd
from pandas_datareader import data as web
import warnings
warnings.filterwarnings('ignore')
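Before going further, a quick sanity check on eight dummy samples (illustrative only) shows the expanding window in action: each training set contains the one before it.
for fold, (tr, tt) in enumerate(TimeSeriesSplit(n_splits=3).split(np.arange(8))):
    print(f'fold {fold}: train={tr} test={tt}')
# fold 0: train=[0 1]         test=[2 3]
# fold 1: train=[0 1 2 3]     test=[4 5]
# fold 2: train=[0 1 2 3 4 5] test=[6 7]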
We generate 100 random data points with 3 different imbalanced classes for the X and y data.
# random data points
n_points = 100
n_features = 10
X = np.random.randn(n_points, n_features)
# imbalanced classes
percentiles_classes = [.2, .3, .5]
y = np.hstack([[ii] * int(n_points * perc)
               for ii, perc in enumerate(percentiles_classes)])
X = pd.DataFrame(X)
X.head()
y
df = X.copy()
df['class'] = y
total = len(df)
plt.figure(figsize=(13,5))
plt.subplot(121)
g = sns.countplot(x='class', data=df)
g.set_title("class Count", fontsize=14)
g.set_ylabel('Count', fontsize=14)
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x() + p.get_width() / 2.,
           height + 1.5,
           '{:1.2f}%'.format(height / total * 100),
           ha="center", fontsize=14, fontweight='bold')
plt.margins(y=0.1)
plt.show()
def plot_cv_indices(cv, X, y, ax, n_splits, lw=10):
    """Create a sample plot for indices of a cross-validation object."""
    # Generate the training/testing visualizations for each CV split
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y, groups=None)):
        # Fill in indices with the training/test groups
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0
        # Visualize the results
        ax.scatter(range(len(indices)), [ii + .5] * len(indices),
                   c=indices, marker='_', lw=lw, cmap=cmap_cv,
                   vmin=-.2, vmax=1.2)
    # Plot the data classes at the end
    ax.scatter(range(len(X)), [ii + 1.5] * len(X),
               c=y, marker='_', lw=lw, cmap=cmap_data)
    # Formatting: one tick per CV iteration plus one for the class row
    yticklabels = list(range(n_splits)) + ['class']
    ax.set(yticks=np.arange(n_splits + 1) + .5, yticklabels=yticklabels,
           xlabel='Sample index', ylabel="CV iteration",
           ylim=[n_splits + 1.2, -.1], xlim=[0, 100])
    ax.set_title('{}'.format(type(cv).__name__), fontsize=15)
    return ax
Blocked cross-validation works by adding margins at two positions. The first is between the training and validation folds, to prevent the model from observing lag values that are used twice, once as a regressor and again as a response. The second is between the folds used at each iteration, to prevent the model from memorizing patterns from one iteration to the next.
Scikit-learn lets you define new types of splitters as long as you abide by its splitter API, i.e. expose split and get_n_splits methods.
class BlockingTimeSeriesSplit():
    def __init__(self, n_splits):
        self.n_splits = n_splits

    def get_n_splits(self, X, y, groups):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        k_fold_size = n_samples // self.n_splits
        indices = np.arange(n_samples)
        margin = 0  # optional gap between the training and validation halves; raise to add a buffer
        for i in range(self.n_splits):
            start = i * k_fold_size
            stop = start + k_fold_size
            mid = int(0.5 * (stop - start)) + start
            yield indices[start: mid], indices[mid + margin: stop]
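As a quick sanity check (illustrative only), running the splitter on twelve dummy samples shows that each fold is a disjoint block, with the first half used for training and the second half for validation:
for tr, vl in BlockingTimeSeriesSplit(n_splits=3).split(np.arange(12)):
    print(f'train={tr} validate={vl}')
# train=[0 1] validate=[2 3]
# train=[4 5] validate=[6 7]
# train=[8 9] validate=[10 11]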
cvs = [TimeSeriesSplit, BlockingTimeSeriesSplit]
for cv in cvs:
    this_cv = cv(n_splits=n_splits)
    fig, ax = plt.subplots(figsize=(10, 5))
    plot_cv_indices(this_cv, X, y, ax, n_splits)
    ax.legend([Patch(color=cmap_cv(.8)), Patch(color=cmap_cv(.02))],
              ['Testing set', 'Training set'], loc=(1.02, .8))
    plt.tight_layout()
    fig.subplots_adjust(right=.7)
    plt.show()
The two split methods are depicted above. The horizontal axis is the sample index, while the vertical axis represents the cross-validation iterations. The folds used for training are depicted in blue and the folds used for validation are depicted in orange. The final horizontal bar shows the three class labels of the response variable.
I have obtained ETH/USD exchange prices up to the year 2020 from cryptodatadownload.
df = pd.read_csv('Gemini_ETHUSD_d.csv', skiprows=1, parse_dates=True, index_col='Date')
df = df.sort_index().drop('Symbol', axis=1)
df.head()
Here, each day's closing price/volume features serve as regressors, and the closing prices of the following 8 days serve as responses. That is, given today's closing price and volume features, forecast the closing price of each of the next 8 days. The resulting rows containing NaN values are then dropped to handle missing values.
STEPS = 9
for i in np.arange(1, STEPS):
    col_name = '{}d_Fwd_Close'.format(i)
    df[col_name] = df['Close'].shift(-i)
df = df.dropna()
Next, we split the data frame into two: one for the regressors and the other for the responses. Then we split both again, into a training part and a testing part.
Features = 6
X = df.iloc[:, :Features]
y = df.iloc[:, Features:]
split = int(len(df) * 0.7)
X_train = X[:split]
y_train = y[:split]
X_test = X[split:]
y_test = y[split:]
X.head()
y.head()
Let’s define a method that creates an elastic net model from scikit-learn. Since we are going to forecast more than one future time step, we will use a multi-output regressor wrapper that trains a separate model for each target time step. This is a simple strategy for extending regressors that do not natively support multi-target regression. However, it increases the demand for computational resources.
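To make the wrapping concrete, here is a minimal sketch on throwaway random arrays (the _Xd, _yd, _m names are purely illustrative) showing that the wrapper fits one estimator per target column:
_Xd = np.random.randn(20, 3)   # 20 samples, 3 features
_yd = np.random.randn(20, 4)   # 4 target steps
_m = MultiOutputRegressor(ElasticNet(alpha=0.5)).fit(_Xd, _yd)
print(len(_m.estimators_))     # 4 -> one fitted ElasticNet per forecast step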
Elastic Net regression introduces both L1-regularization and L2-regularization to resolve overfitting; used on their own, these penalties are known as Lasso and Ridge regression, respectively. Due to the trade-offs of both Lasso and Ridge regression, Elastic Net regression was introduced to mix the two models. As a result, some of the variables' coefficients are set to zero as per the L1-norm, and others are penalized or shrunken as per the L2-norm.
This model combines the best of both regressors and the result is a more stable, robust, and sparse model. As a consequence, there are more parameters to be tuned.
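For reference, this is the objective scikit-learn documents for ElasticNet, with $\alpha$ (alpha) scaling the overall penalty and $\rho$ (l1_ratio) mixing the two norms:

$$\min_w \; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2$$

Setting $\rho = 1$ recovers the pure Lasso penalty and $\rho = 0$ the pure Ridge penalty, which is why both alpha and l1_ratio are tuned in the grid search below.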
def build_model(_alpha, _l1_ratio):
    estimator = ElasticNet(
        alpha=_alpha,
        l1_ratio=_l1_ratio,
        fit_intercept=True,
        normalize=False,
        precompute=False,
        max_iter=16,
        copy_X=True,
        tol=0.1,
        warm_start=False,
        positive=False,
        random_state=None,
        selection='random'
    )
    return MultiOutputRegressor(estimator, n_jobs=4)
sklearn.metrics.SCORERS.keys()
Time series splitter
model = build_model(_alpha=1.0, _l1_ratio=0.3)
tscv = TimeSeriesSplit(n_splits=5)
rmse = np.sqrt(-cross_val_score(model, X_train, y_train, cv=tscv, scoring='neg_mean_squared_error'))
R2 = cross_val_score(model, X_train, y_train, cv=tscv, scoring='r2')
print(f"RMSE: {rmse.mean()} (+/- {rmse.std()}")
print(f"\nR2: {R2.mean()} (+/- {R2.std()}")
Blocking time series splitter
btscv = BlockingTimeSeriesSplit(n_splits=5)
rmse = np.sqrt(-cross_val_score(model, X_train, y_train, cv=btscv, scoring='neg_mean_squared_error'))
R2 = cross_val_score(model, X_train, y_train, cv=btscv, scoring='r2')
print(f"RMSE: {rmse.mean()} (+/- {rmse.std()}")
print(f"\nR2: {R2.mean()} (+/- {R2.std()}")
Notice how the loss differs between the splitters. In order to interpret the results correctly, let's put them to the test by using grid-search cross-validation to find the optimal values for both the regularization parameter alpha and the ratio that controls how much each norm contributes to the regularization.
GridSearchCV works by exhaustively searching all the possible combinations of the model's parameters, using a score (or loss) function to rank every candidate; in effect it solves an optimization problem over the grid. Because scikit-learn explicitly tries every possible combination, it is very computationally expensive.
When cross-validation is used in the inner loop of the grid search, it is called grid-search cross-validation. The optimization objective then becomes the average loss (or score) obtained across the k folds.
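Conceptually, it boils down to the loop below, a hedged sketch with a hypothetical toy grid (scikit-learn's internals differ, e.g. it parallelizes and refits the best model): score every combination with cross-validation and keep the best.
from sklearn.model_selection import ParameterGrid

toy_grid = {'estimator__alpha': (0.1, 0.9), 'estimator__l1_ratio': (0.1, 0.9)}  # hypothetical small grid
best_cv_score, best_combo = -np.inf, None
for candidate in ParameterGrid(toy_grid):  # exhaustively enumerate every combination
    m = build_model(_alpha=1.0, _l1_ratio=0.3).set_params(**candidate)
    score = cross_val_score(m, X_train, y_train, cv=tscv, scoring='r2').mean()
    if score > best_cv_score:  # keep the combination with the best mean fold score
        best_cv_score, best_combo = score, candidate
print(best_combo, best_cv_score)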
def plot_grid_search(cv_results, grid_param_1, grid_param_2, name_param_1, name_param_2, best_params):
    # Get test scores mean and std for each grid search
    scores_mean = cv_results['mean_test_score']
    scores_mean = np.array(scores_mean).reshape(len(grid_param_2), len(grid_param_1))
    scores_sd = cv_results['std_test_score']
    scores_sd = np.array(scores_sd).reshape(len(grid_param_2), len(grid_param_1))
    # Plot grid search scores
    _, ax = plt.subplots(1, 1)
    # Param1 is the x-axis, Param 2 is represented as a different curve (color line)
    for idx, val in enumerate(grid_param_2):
        ax.plot(grid_param_1, scores_mean[idx, :], '-o', label=name_param_2 + ': ' + str(val))
    ax.set_title(f"Grid Search Best Params: {best_params}", fontsize=12, fontweight='medium')
    ax.set_xlabel(name_param_1, fontsize=12)
    ax.set_ylabel('CV Average Score', fontsize=12)
    ax.grid('on')
    ax.legend(bbox_to_anchor=(1.02, 1.02), fontsize=15)
Time series splitter
model.get_params().keys()
params = {
    'estimator__alpha': (0.1, 0.3, 0.5, 0.7, 0.9),
    'estimator__l1_ratio': (0.1, 0.3, 0.5, 0.7, 0.9)
}
scores = []
for i in range(30):
    model = build_model(_alpha=1.0, _l1_ratio=0.3)
    finder = GridSearchCV(
        estimator=model,
        param_grid=params,
        scoring='r2',
        n_jobs=4,
        iid=False,
        refit=True,
        cv=tscv,  # change this to the splitter subject to test
        verbose=1,
        pre_dispatch=8,
        error_score=-999,
        return_train_score=True
    )
    finder.fit(X_train, y_train)
    best_params = finder.best_params_
    best_score = round(finder.best_score_, 4)
    scores.append(best_score)
Blocking time series splitter
scores0 = []
for i in range(30):
    model = build_model(_alpha=1.0, _l1_ratio=0.3)
    finder0 = GridSearchCV(
        estimator=model,
        param_grid=params,
        scoring='r2',
        n_jobs=4,
        iid=False,
        refit=True,
        cv=btscv,  # change this to the splitter subject to test
        verbose=1,
        pre_dispatch=8,
        error_score=-999,
        return_train_score=True
    )
    finder0.fit(X_train, y_train)
    best_params0 = finder0.best_params_
    best_score0 = round(finder0.best_score_, 4)
    scores0.append(best_score0)
finder0.cv_results_.keys()
Grid-search cross-validation was run 30 times in order to objectively measure the consistency of the results obtained using each splitter. This way we can evaluate the effectiveness and robustness of the cross-validation method on the time series. As for standard k-fold cross-validation, the parameters it suggested were close to uniform, meaning it did not really help in discriminating the optimal parameters, since all values were either equally good or bad.
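To quantify that consistency, a small summary sketch can be printed for both splitters (a lower standard deviation across the 30 runs means more consistent tuning):
for name, s in [('TimeSeriesSplit', scores), ('BlockingTimeSeriesSplit', scores0)]:
    s = np.asarray(s)
    print(f'{name}: mean={s.mean():.4f} std={s.std():.4f}')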
scores1 = pd.DataFrame(scores)
bs = round(float(scores1.mean()),4)
print(f'\nTime series splitter best score: {bs}')
plot_grid_search(finder.cv_results_, params['estimator__l1_ratio'], params['estimator__alpha'],
                 'l1_ratio', 'alpha', best_params)
scores01 = pd.DataFrame(scores0)
bs0 = round(float(scores01.mean()),4)
print(f'\nBlocking time series splitter best score: {bs0}')
plot_grid_search(finder0.cv_results_, params['estimator__l1_ratio'], params['estimator__alpha'],
                 'l1_ratio', 'alpha', best_params0)
In both the case of time-series split cross-validation and that of blocked cross-validation, we have obtained a clear indication of the optimal values for both parameters. In the case of blocked cross-validation, the results were even more discriminative, as there is a clearer and more consistent drop-off as l1_ratio values increase with respect to alpha values.
preds = pd.DataFrame(finder.predict(X_test), columns=df.iloc[:, Features:].columns)
preds.head()
fig, ax = plt.subplots(figsize=(12,5))
ax.scatter(preds.index, y_test['8d_Fwd_Close'], color='b', alpha=0.5, label='Actual', s=50)
ax.scatter(preds.index, preds['8d_Fwd_Close'], color='r', alpha=0.5, label='Predicted', s=50)
ax.set_xticklabels(df[split:].index.strftime('%Y-%m-%d'))
ax.set_title('8d_Fwd_Close')
ax.legend()
plt.show()
After obtaining the optimal values for the model's parameters, we can train the model and evaluate it on the testing set. The results, as depicted in the plot above, indicate a smooth capture of the trend and a low error rate.
model = build_model(_alpha=1.0, _l1_ratio=0.3)
rmse = np.sqrt(-cross_val_score(model, X_test, y_test, cv=tscv, scoring='neg_mean_squared_error'))
R2 = cross_val_score(model, X_test, y_test, cv=tscv, scoring='r2')
print(f"RMSE: {rmse.mean()} (+/- {rmse.std()}")
print(f"\nR2: {R2.mean()} (+/- {R2.std()}")
Training dates for BlockingTimeSeriesSplit
btss = BlockingTimeSeriesSplit(n_splits=12)
for tr_idx, val_idx in btss.split(X, y):
    X_tr, X_vl = X.iloc[tr_idx], X.iloc[val_idx]
    print('Train:')
    print(X_tr.info())
    print()
    print('Test:')
    print(X_vl.info())
    print()
Training dates for TimeSeriesSplit
tss = TimeSeriesSplit(n_splits=23)
for tr_idx, val_idx in tss.split(X, y):
    X_tr, X_vl = X.iloc[tr_idx], X.iloc[val_idx]
    print('Train:')
    print(X_tr.info())
    print()
    print('Test:')
    print(X_vl.info())
    print()