Ingest fundamental data in Zipline

Hello! Does anyone know a way to import fundamental data into Zipline using an API or just CSV files? Thanks!!

Here is one way to ingest data into Zipline from .csv files.

Download the data you want to ingest (as many stocks as you want), or, if you already have a dataset, skip this step. I use the yfinance library for it.

from typing import List
import pandas as pd
import yfinance as yf

def download_data(tickers: List[str], start_date: str) -> pd.DataFrame:
    stocks = yf.Tickers(tickers)
    data = stocks.history(interval="1d", start=start_date)
    return data

def format_data(data: pd.DataFrame) -> pd.DataFrame:
    # Flatten the (field, ticker) MultiIndex columns into "Field_Ticker" strings
    data.columns = ["_".join(col).strip() for col in data.columns.values]
    data = data.reset_index()

    # Melt into long format: one row per (date, field, ticker)
    tidy_df = pd.melt(
        data,
        id_vars=["Date"],
        var_name="Attribute_Ticker",
        value_name="Value"
    )
    tidy_df[["Attribute", "Ticker"]] = tidy_df["Attribute_Ticker"].str.split("_", expand=True)
    tidy_df.drop(columns=["Attribute_Ticker"], inplace=True)

    # Pivot back so each row is one (date, ticker) with the OHLCV fields as columns
    formatted_data = tidy_df.pivot(index=["Date", "Ticker"], columns="Attribute", values="Value")
    formatted_data.reset_index(inplace=True)
    formatted_data.columns = formatted_data.columns.str.lower()
    formatted_data.sort_values(by=['ticker', 'date'], inplace=True)

    return formatted_data

The first function downloads the data and the second formats it into the appropriate OHLCV layout. Then set the environment variable ZIPLINE_ROOT to your project folder. It is not required, but it is convenient to keep all the files within your project’s folder.

import os

def set_zipline_root():
    # Keep Zipline's data directory inside the project instead of the default ~/.zipline
    os.environ['ZIPLINE_ROOT'] = os.path.join(os.getcwd(), '.zipline')
    # Verify it is set
    print(os.getenv('ZIPLINE_ROOT'))

And after that you ingest the data. The main idea, as I understand it, is that you save a separate .csv file for each stock in a folder (for instance, AAPL.csv, JNJ.csv, etc.) and then point the register method at that folder. That is why you can use any dataset that follows this layout. There are probably better ways to do it, but since no one has replied yet, I thought this could be a start for you.

from zipline.data.bundles import register, ingest
from zipline.data.bundles.csvdir import csvdir_equities

def ingest_custom_bundle(stocks: List[str], start_date: str) -> None:
    print("Start loading the data")
    data = download_data(stocks, start_date=start_date)
    print("Loaded the data")
    data = format_data(data=data)

    data.set_index('date', inplace=True)
    output_dir = 'daily_data/daily'
    os.makedirs(output_dir, exist_ok=True)

    # One .csv per ticker, e.g. daily_data/daily/AAPL.csv
    for ticker, group in data.groupby('ticker'):
        processed_data = group[['open', 'high', 'low', 'close', 'volume']]
        processed_data.to_csv(os.path.join(output_dir, f"{ticker}.csv"))

    register(
        'custom_bundle',      # Name of the bundle
        csvdir_equities(
            ['daily'],        # Subfolder(s) holding the csv files
            'daily_data',     # Path to your preprocessed data
        ),
        calendar_name='NYSE',
    )
    ingest('custom_bundle')

And then run it, for example, like this:

stocks = ['tsla', 'aapl', 'nvda', 'amzn', 'jnj']
start_date = "2018-01-01"
set_zipline_root()
ingest_custom_bundle(stocks, start_date)
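
Once the bundle is ingested, you can point a backtest at it. A minimal sketch, assuming zipline-reloaded (older Zipline versions may want tz-aware UTC timestamps for start/end); note that register() must have run in the same process, or in your extension.py, before run_algorithm can load the bundle:

import pandas as pd
from zipline import run_algorithm
from zipline.api import order_target_percent, symbol

def initialize(context):
    context.asset = symbol('AAPL')

def handle_data(context, data):
    # Trivial logic just to prove the bundle loads: hold 100% AAPL
    order_target_percent(context.asset, 1.0)

result = run_algorithm(
    start=pd.Timestamp('2019-01-02'),
    end=pd.Timestamp('2020-12-31'),
    initialize=initialize,
    handle_data=handle_data,
    capital_base=100_000,
    bundle='custom_bundle',  # the bundle registered above
)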

In my opinion, there are two or three ways to get fundamental data into your backtest (ordered by level of complexity)…

  1. Literally just make a direct call to your DB within the rebalance method, or wherever you need the data. This is the easiest and fastest to set up, but maybe not as clean, and it is prone to errors. On the other hand it is also much easier to debug than the other methods… You can’t use all the factor logic that Zipline applies to a pipeline output, but hey, you can easily build that in yourself, I guess (see the first sketch after this list)…

  2. The way Zipline was designed to include this is to create a new DataSet, as well as a custom PipelineLoader that moves the data from your DB into the DataSet class you defined. This is superbly complicated (personal opinion after having spent quite some time trying to do this ;)) but you will end up making calls to the data from within a pipeline, as it was meant to be (see the second sketch below)…

  3. Even more time-consuming, but you could also change the underlying database within the bundle and extend the SQL schema to hold fundamentals data. This is probably the cleanest approach (for instance, I have some issues because my fundamentals DB constantly updates, whereas the bundle data does not, so if there is a ticker change you will have problems matching securities unless you also always update the bundle; see the third sketch below)…
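
For option 1, a minimal sketch. The fundamentals.csv file, its pe_ratio column, and context.universe are all hypothetical placeholders, and it assumes the fundamentals table fits in memory (otherwise swap the read_csv for real DB calls):

import pandas as pd
from zipline.api import get_datetime

# Hypothetical: load the fundamentals once at startup instead of querying every bar
FUNDAMENTALS = pd.read_csv('fundamentals.csv', parse_dates=['date']).set_index(['date', 'ticker'])

def rebalance(context, data):
    # Drop the timezone so the lookup matches the naive csv dates
    today = get_datetime().tz_localize(None).normalize()
    for asset in context.universe:  # assumed to be set up in initialize()
        try:
            pe = FUNDAMENTALS.loc[(today, asset.symbol), 'pe_ratio']
        except KeyError:
            continue  # no fundamentals row for this date/ticker
        # ... rank and order based on pe here ...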
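
For option 2, a sketch of the DataSet side, using the built-in DataFrameLoader instead of writing a PipelineLoader from scratch. It assumes zipline-reloaded (for the domain import), and pe_df is a hypothetical frame indexed by trading sessions with one column per sid:

import numpy as np
import pandas as pd
from zipline.pipeline import Pipeline
from zipline.pipeline.data import Column, DataSet
from zipline.pipeline.domain import US_EQUITIES
from zipline.pipeline.loaders.frame import DataFrameLoader

class Fundamentals(DataSet):
    pe_ratio = Column(dtype=np.float64)
    domain = US_EQUITIES

# Hypothetical baseline frame: index = trading sessions, columns = sids
sessions = pd.date_range('2019-01-02', periods=3)
pe_df = pd.DataFrame(15.0, index=sessions, columns=[0, 1])

pe_loader = DataFrameLoader(Fundamentals.pe_ratio, pe_df)
pipe = Pipeline(columns={'pe': Fundamentals.pe_ratio.latest})

# Assumption: zipline-reloaded's run_algorithm accepts a custom_loader mapping,
# e.g. run_algorithm(..., custom_loader={Fundamentals.pe_ratio: pe_loader});
# check your version before relying on this.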
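
For option 3 I have no schema migration to show, but the ticker-matching problem above boils down to keying your fundamentals by the bundle’s sids instead of by symbol. A small sketch of resolving sids through the bundle’s asset finder (using the custom_bundle name from the ingest above):

from zipline.data import bundles

bundle_data = bundles.load('custom_bundle')

# Store fundamentals keyed by sid so a later ticker change does not break the join
for ticker in ['AAPL', 'JNJ']:
    asset = bundle_data.asset_finder.lookup_symbol(ticker, as_of_date=None)
    print(ticker, '->', asset.sid)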

@prince: I believe your code is just a normal ingest of pricing data, no?

Hope this helps a little…