Inconsistency in chapter about 101 formulaic alphas; data issues

HarmonicReflux · October 27, 2023, 12:36am

@Stefan @community
I wish to revive the back-test of the 101 formulaic alphas.

The original publication from
https://arxiv.org/pdf/1601.00991.pdf defines alpha 001 to be:
Alpha#1: (rank(Ts_ArgMax(SignedPower(((returns < 0) ? stddev(returns, 20) : close), 2.), 5)) -0.5)

The Python implementation of the book reads:

def alpha001(c, r):
    """(rank(ts_argmax(power(((returns < 0)
        ? ts_std(returns, 20)
        : close), 2.), 5)) -0.5)"""
    c[r < 0] = ts_std(r, 20)
    return (rank(ts_argmax(power(c, 2), 5)).mul(-.5)
            .stack().swaplevel())

To my understanding, .mul(-.5) should be replaced by .sub(.5): the formula subtracts 0.5 from the rank, whereas .mul(-.5) multiplies the rank by -0.5.
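For reference, here is a minimal sketch of the alpha with the subtraction the paper's formula asks for. The helpers rank, ts_std and ts_argmax below are my own stand-ins and may differ from the book's definitions; I assume c and r are wide frames (dates as rows, tickers as columns), and I left out the final .stack().swaplevel() to keep it short.

```python
import numpy as np
import pandas as pd


def rank(df):
    """Cross-sectional percentile rank (across tickers, per date)."""
    return df.rank(axis=1, pct=True)


def ts_std(df, window):
    """Rolling standard deviation over the past `window` rows."""
    return df.rolling(window).std()


def ts_argmax(df, window):
    """1-based position of the maximum within each rolling window."""
    return df.rolling(window).apply(np.argmax, raw=True).add(1)


def alpha001_sub(c, r):
    """rank(ts_argmax(power((returns < 0) ? ts_std(returns, 20) : close, 2), 5)) - 0.5"""
    # mask() returns a new frame, so the caller's close prices are not mutated
    base = c.mask(r < 0, ts_std(r, 20))
    # percentile ranks lie in (0, 1], so subtracting 0.5 centres them in (-0.5, 0.5]
    return rank(ts_argmax(base.pow(2), 5)).sub(.5)
```

With .mul(-.5) the output would instead lie in [-0.5, 0), i.e. every value would be negative, which is one quick way to tell the two versions apart.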

The data used to test the alphas in 03_101_formulaic_alphas.ipynb is as follows:
data = (pd.read_hdf('data.h5', 'data/top500')
        .loc[:, ohlcv + ['ret_01', 'sector', 'ret_fwd']]
        .rename(columns={'ret_01': 'returns'})
        .sort_index())

Unfortunately, I cannot create the data set. I tried to reconstruct the shape of the data frame from the line
data.info(null_counts=True)
and think it must be a multiindex frame looking something like this:
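A small synthetic sketch of the layout I am assuming: a (ticker, date) MultiIndex with the OHLCV columns plus returns, sector and ret_fwd. Everything here is a guess on my side: the index level names, the column names in ohlcv, the made-up tickers, the integer sector codes, and ret_fwd being the next day's return. The values are random and only serve to show the shape.

```python
import numpy as np
import pandas as pd

ohlcv = ['open', 'high', 'low', 'close', 'volume']   # assumed column names

tickers = ['AAA', 'BBB', 'CCC']                       # made-up symbols
dates = pd.date_range('2020-01-01', periods=5, freq='B')
idx = pd.MultiIndex.from_product([tickers, dates], names=['ticker', 'date'])

rng = np.random.default_rng(42)
data = pd.DataFrame(rng.uniform(10, 20, size=(len(idx), len(ohlcv))),
                    index=idx, columns=ohlcv)

# 1-day return per ticker (first day of each ticker is NaN)
data['returns'] = data.groupby(level='ticker')['close'].pct_change()
# hypothetical integer sector code, constant per ticker
data['sector'] = np.repeat([0, 1, 1], len(dates))
# assumed definition: next day's return of the same ticker
data['ret_fwd'] = data.groupby(level='ticker')['returns'].shift(-1)
data = data.sort_index()

# on recent pandas versions the keyword is show_counts rather than null_counts
data.info(show_counts=True)
```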

Could anyone whose data ingestion still works please confirm?
All the unstacking, swapping, etc. in the code is confusing, and since it is rubbish in, rubbish out, I want to be sure my input has the right shape. Can anyone please help?
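As far as I can tell, the unstacking and swapping is a round trip between the long (ticker, date) frame and a wide date-by-ticker frame on which the alphas are computed. A toy sketch of that round trip, under my assumption about the index layout (the tickers and values are made up, and the sort_index() at the end is my addition so the result lines up with the input):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['AAA', 'BBB'], pd.date_range('2020-01-01', periods=3, freq='B')],
    names=['ticker', 'date'])
close = pd.Series(np.arange(6, dtype=float), index=idx, name='close')

# long -> wide: one row per date, one column per ticker
wide = close.unstack('ticker')

# any cross-sectional operation works row-wise on the wide frame
alpha_wide = wide.rank(axis=1, pct=True)

# wide -> long: stack() yields (date, ticker); swaplevel() restores (ticker, date)
alpha_long = alpha_wide.stack().swaplevel().sort_index()
```

If the original frame were indexed (date, ticker) instead, the swaplevel() would put the levels in the wrong order, so this is worth checking against the real data.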

The back-testing results for my data are not great, hence I’d very much appreciate your help: either the alphas do not work or, more likely, I am doing something wrong. It would be great if anyone could provide the original data, so that tracking down errors while reproducing the results would be much easier.

Many thanks.