How to Handle Missing Data in Alternative Data Sources for Financial Models?

Hello

When working with alternative data sources such as web scraping, satellite imagery, or social media sentiment, missing or incomplete data is a common issue. For instance, datasets may lack records for certain timeframes, geographical regions, or market segments because of collection limitations or API restrictions. If not handled properly, these gaps can degrade model performance and introduce bias. :innocent:

How do you approach missing data in your trading models? Are there specific imputation techniques or data augmentation strategies you would recommend? :thinking:

Additionally, how do you determine whether missing data is the result of systematic bias (e.g., underrepresentation of certain industries) or occurs at random? :thinking: I have checked the "Data - Machine Learning for Trading Java guide" for reference.

Looking forward to insights from the community on effective tools, algorithms and frameworks to address this challenge.

Thank you! :slightly_smiling_face:

There is a very interesting paper by Markus Pelger et al. that was published about 2 years ago. It looks specifically at fundamental data for stocks, but it’s well worth a closer study:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4106794

So it really depends on whether there are any patterns in the “missingness” that you can model…
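
As a rough first check for such a pattern, you could compare missing rates across a grouping you do observe (sector, region, time bucket) and use a permutation test to see whether the spread is larger than chance would produce. This is only a sketch and not the method from the paper above: the function names, the "tech"/"energy" labels, and the toy numbers are all made up for illustration. It can only flag missingness that depends on something you observe; it says nothing about missingness that depends on the unobserved values themselves.

```python
import random
from collections import defaultdict


def missing_rates(groups, is_missing):
    """Return the fraction of missing values for each group label."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for group, flag in zip(groups, is_missing):
        totals[group] += 1
        missing[group] += int(flag)
    return {group: missing[group] / totals[group] for group in totals}


def rate_spread(groups, is_missing):
    """Test statistic: highest minus lowest missing rate across groups."""
    rates = missing_rates(groups, is_missing).values()
    return max(rates) - min(rates)


def permutation_p_value(groups, is_missing, n_perm=2000, seed=0):
    """Shuffle the missing flags across rows to simulate 'missingness is
    unrelated to group', and count how often the shuffled spread is at
    least as large as the observed one."""
    rng = random.Random(seed)
    observed = rate_spread(groups, is_missing)
    flags = list(is_missing)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(flags)
        if rate_spread(groups, flags) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data, invented for illustration: 50 rows per sector,
# with far more gaps in the hypothetical "energy" rows.
sectors = ["tech"] * 50 + ["energy"] * 50
values = [1.0] * 45 + [None] * 5 + [1.0] * 20 + [None] * 30
is_missing = [v is None for v in values]

rates = missing_rates(sectors, is_missing)
spread, p_value = permutation_p_value(sectors, is_missing)
print(rates)
print(f"spread={spread:.2f}, p-value={p_value:.4f}")
```

A small p-value suggests the gaps are not spread evenly across groups, in which case dropping incomplete rows or filling with a global mean will likely bias the sample, and it is worth modelling the missingness explicitly. A large p-value does not prove the data is missing at random; it only means this particular grouping shows no clear pattern.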