11-5 or 11-7 program

y5432 · April 17, 2023, 11:04am

The end of 11-5 is as follows, and the ‘predictions.h5’ file contains only ‘test’ data.

for lookahead in [1, 5, 10, 21]:
    if lookahead > 1:
        continue
    print(f'\nLookahead: {lookahead:02}')
    data = (pd.read_hdf('data.h5', 'stooq/japan/equities'))
    labels = sorted(data.filter(like='fwd').columns)
    features = data.columns.difference(labels).tolist()
    label = f'fwd_ret_{lookahead:02}'
    data = data.loc[:, features + [label]].dropna()
    ・
    ・
    ・
    by_day = test_predictions.groupby(level='date')
    for position in range(10):
        if position == 0:
            ic_by_day = by_day.apply(lambda x: spearmanr(x.y_test, x[position])[0]).to_frame()
        else:
            ic_by_day[position] = by_day.apply(lambda x: spearmanr(x.y_test, x[position])[0])

    test_predictions.to_hdf(store, f'test/{lookahead:02}')

Also, the ML Predictions of 11-7 are as follows, and the ‘predictions.h5’ file contains ‘train’ and ‘test’ data.

def load_predictions(bundle):
    t = 1
    df = pd.concat([pd.read_hdf(results_path / 'predictions.h5', 'train/{:02}'.format(t)),
                    pd.read_hdf(results_path / 'predictions.h5', 'test/{:02}'.format(t))])
    df = df[~df.index.duplicated()].drop('y_test', axis=1)
    predictions = df.iloc[:, :5].mean(1).to_frame('predictions')
    ・
    ・
    ・
    return (predictions
            .unstack('ticker')
            .rename(columns=ticker_map)
            .predictions
            .tz_localize('UTC')), assets

Which of the two is correct for the ‘predictions.h5’ file: ‘test’ data only, or both ‘train’ and ‘test’ data?
If one of the programs is incomplete or wrong, could you please update it with the correct version?
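
One way to confirm which keys a given 'predictions.h5' actually contains is sketched below. The helper names (`expected_keys`, `missing_keys`, `list_store_keys`) are hypothetical and not part of the notebooks; the sketch assumes the file was written with pandas and that PyTables is installed for the part that opens the file.

```python
import pandas as pd


def expected_keys(lookahead=1):
    """Keys that load_predictions in 11-7 reads for a given lookahead."""
    return [f'/train/{lookahead:02}', f'/test/{lookahead:02}']


def missing_keys(available, lookahead=1):
    """Return the expected keys that are absent from `available`."""
    return [key for key in expected_keys(lookahead) if key not in available]


def list_store_keys(path):
    """List all keys stored in an HDF5 file (requires PyTables)."""
    with pd.HDFStore(path, mode='r') as store:
        # HDFStore.keys() returns names with a leading slash, e.g. '/test/01'
        return store.keys()


# The situation described above: only the test predictions were saved.
print(missing_keys(['/test/01']))
```

With a real file, `missing_keys(list_store_keys(results_path / 'predictions.h5'))` would report whether 'train/01' is missing before 11-7 tries to read it.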

anthonberg · May 5, 2023, 3:42pm

Maybe @Stefan can point us in the right direction here? I got stuck with the same issue… Thanks!

Stefan · May 5, 2023, 7:34pm

@y5432 @anthonberg Notebook 5 in ch11 creates predictions.h5; you’re right that the code only saves the test data, while 11-7 then asks for the train data as well.

You can store the train data in 11-5 as well if you like and then use those in 11-7 to run the backtest over both train and test periods. It’s not difficult to adapt the code accordingly, hope this helps.
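
A minimal sketch of that adaptation, not the notebook's actual code. It assumes the train predictions are available in 11-5 as a DataFrame `train_predictions` with the same layout as `test_predictions` (a `y_test` column plus one column per model, indexed by ticker and date). The helper names `save_predictions` and `combine_predictions` are hypothetical, and the toy frames below only stand in for real model output.

```python
import pandas as pd


def save_predictions(train_predictions, test_predictions, store, lookahead=1):
    """Write both frames under the keys that 11-7 reads (requires PyTables)."""
    train_predictions.to_hdf(store, key=f'train/{lookahead:02}')
    test_predictions.to_hdf(store, key=f'test/{lookahead:02}')


def combine_predictions(train_predictions, test_predictions, n_models=5):
    """Mirror the first steps of load_predictions in 11-7 on in-memory frames."""
    df = pd.concat([train_predictions, test_predictions])
    # Keep the first occurrence of any (ticker, date) present in both periods
    df = df[~df.index.duplicated()].drop('y_test', axis=1)
    # Average the first n_models prediction columns into one signal
    return df.iloc[:, :n_models].mean(axis=1).to_frame('predictions')


# Toy data (assumed layout): two tickers, overlapping on 2020-01-02
columns = ['y_test'] + list(range(10))
train_idx = pd.MultiIndex.from_product(
    [['A', 'B'], pd.to_datetime(['2020-01-01', '2020-01-02'])],
    names=['ticker', 'date'])
test_idx = pd.MultiIndex.from_product(
    [['A', 'B'], pd.to_datetime(['2020-01-02', '2020-01-03'])],
    names=['ticker', 'date'])
train_predictions = pd.DataFrame(1.0, index=train_idx, columns=columns)
test_predictions = pd.DataFrame(2.0, index=test_idx, columns=columns)

combined = combine_predictions(train_predictions, test_predictions)
print(combined)
```

In 11-5, calling `save_predictions(train_predictions, test_predictions, store, lookahead)` in place of the single `test_predictions.to_hdf(...)` line would give 11-7 both keys. Where the two periods overlap, the de-duplication step keeps the train row because it comes first in the concatenation.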

y5432 · May 6, 2023, 11:01pm

@Stefan
Thank you for your reply.
I also referenced the following.

Does it mean using the train data created in 11-5?

The train predictions are created during cross validation and stored in the last line of cell currently labeled 43 with pd.concat(predictions).to_hdf(cv_store, 'predictions/' + key).

Let’s rewrite 11-7 below.

df = pd.concat([pd.read_hdf(results_path / 'predictions.h5', 'train/{:02}'.format(t)),

By the way, I bought the book.