Error in Chapter 2 Code

yjc95 · May 31, 2023, 4:53am

Hi all.

This is probably a basic problem with the first Jupyter notebook, but I have struggled with it for two days.

I tried to run …\02_market_and_fundamental_data\01_NASDAQ_TotalView-ITCH_Order_Book\01_parse_itch_order_flow_messages.ipynb, and it returns errors.

In cell #31 (the error is the "Cannot serialize" message):
Start of Messages
03:02:31.65 0

Start of System Hours
04:00:00.00 241,258

Start of Market Hours
09:30:00.00 9,559,279
09:44:09.23 25,000,000 00:00:52.34
Cannot serialize the column [primary_market_maker]
because its data contents are not [string] but [integer] object dtype
L
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214749 entries, 0 to 214748
Data columns (total 7 columns):
 #   Column                    Non-Null Count   Dtype
---  ------                    --------------   -----
 0   stock_locate              214749 non-null  int64
 1   tracking_number           214749 non-null  int64
 2   timestamp                 214749 non-null  timedelta64[ns]
 3   mpid                      214749 non-null  object
 4   primary_market_maker      214749 non-null  object
 5   market_maker_mode         214749 non-null  object
 6   market_participant_state  214749 non-null  object
…
D 9044692
A 10094291
dtype: int64
Duration: 00:00:54.44

In Cell #34:
KeyError: 'No object named P in the file'

I tried to run this notebook on Windows 11 with Python 3.11.2 and on Ubuntu 22.04.2 LTS with Python 3.10. It did not work on either OS.

Could anyone help me out?

Thank you.

yjc95 · May 31, 2023, 5:52am

In the Chapter 3 example, “META” should replace “FB”.

auphucdup · September 12, 2023, 11:05am

I haven’t yet solved the issue, but just putting it out there that I’m also experiencing the same error message when running through the notebook.

"KeyError: 'No object named P in the file'"

The output from the script that leverages the store_messages() function does show that it’s parsing P messages however:

 Start of Messages
	03:02:31.65	           0

 Start of System Hours
	04:00:00.00	     241,258

 Start of Market Hours
	09:30:00.00	   9,559,279
	09:44:09.23	  25,000,000	00:00:43.43
S
R
H
Y
L
Cannot serialize the column [primary_market_maker]
because its data contents are not [string] but [integer] object dtype
L
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214749 entries, 0 to 214748
Data columns (total 7 columns):
 #   Column                    Non-Null Count   Dtype          
---  ------                    --------------   -----          
 0   stock_locate              214749 non-null  int64          
 1   tracking_number           214749 non-null  int64          
 2   timestamp                 214749 non-null  timedelta64[ns]
 3   mpid                      214749 non-null  object         
 4   primary_market_maker      214749 non-null  object         
 5   market_maker_mode         214749 non-null  object         
 6   market_participant_state  214749 non-null  object         
dtypes: int64(2), object(4), timedelta64[ns](1)
memory usage: 11.5+ MB
None
S    1
U    1
Q    1
C    1
I    1
V    1
P    1
E    1
X    1
R    1
F    1
D    1
A    1
L    1
Y    1
H    1
J    1
Name: count, dtype: int64
J           1
V           1
S           3
H        8885
Q        8887
R        8887
Y        8926
C        9176
P      108412
L      214749
E      364951
F      836655
I     1072326
X     1086393
U     2132765
D     9044692
A    10094291
dtype: int64
Duration: 00:00:45.49

For whatever reason it’s not being appended to the HDF5 file, perhaps as a consequence of this error in the output:

Cannot serialize the column [primary_market_maker]
because its data contents are not [string] but [integer] object dtype
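
For context, here is a minimal sketch of the situation that message describes, using a made-up three-row frame rather than the ITCH data. A column can carry pandas' generic object dtype while actually holding Python integers, and the message says the store expects string contents in such a column. The helper name `non_string_object_columns` is hypothetical; `pd.api.types.infer_dtype` is the pandas function used to look at the contents.

```python
import pandas as pd


def non_string_object_columns(df):
    """Map each object-dtype column to the content type pandas infers,
    keeping only columns whose contents are not plain strings."""
    report = {}
    for name in df.columns:
        if df[name].dtype == object:
            kind = pd.api.types.infer_dtype(df[name], skipna=True)
            if kind != "string":
                report[name] = kind
    return report


# Toy stand-in for a parsed 'L' message frame (values are made up).
toy = pd.DataFrame({
    "mpid": pd.Series(["AAAA", "BBBB", "CCCC"], dtype=object),
    "primary_market_maker": pd.Series([1, 0, 1], dtype=object),
})

print(non_string_object_columns(toy))  # {'primary_market_maker': 'integer'}
```

Running a check like this on the frame just before it is appended should show which columns need an explicit cast.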

Jasper · November 15, 2023, 5:15pm

Has anyone got a fix for this issue?

johncban · December 12, 2023, 4:56am

Hi, I have the same problem, in my case under Anaconda (Python 3):

Cannot serialize the column [primary_market_maker]
because its data contents are not [string] but [integer] object dtype
L
<class 'pandas.core.frame.DataFrame'>

Here is the full output

Start of Messages
	03:02:31.65	  25,000,000

 Start of System Hours
	04:00:00.00	  25,241,258

 Start of Market Hours
	09:30:00.00	  34,559,279
	09:44:09.23	  50,000,000	00:01:13.99
Cannot serialize the column [primary_market_maker]
because its data contents are not [string] but [integer] object dtype
L
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429498 entries, 0 to 429497
Data columns (total 7 columns):
 #   Column                    Non-Null Count   Dtype          
---  ------                    --------------   -----          
 0   stock_locate              429498 non-null  int64          
 1   tracking_number           429498 non-null  int64          
 2   timestamp                 429498 non-null  timedelta64[ns]
 3   mpid                      429498 non-null  object         
 4   primary_market_maker      429498 non-null  object         
 5   market_maker_mode         429498 non-null  object         
 6   market_participant_state  429498 non-null  object         
dtypes: int64(2), object(4), timedelta64[ns](1)
memory usage: 22.9+ MB
None
S    1
U    1
Q    1
C    1
I    1
V    1
P    1
E    1
X    1
R    1
F    1
D    1
A    1
L    1
Y    1
H    1
J    1
Name: count, dtype: int64
J           2
V           2
S           6
H       17770
Q       17774
R       17774
Y       17852
C       18352
P      216824
L      429498
E      729902
F     1673310
I     2144652
X     2172786
U     4265530
D    18089384
A    20188582
dtype: int64
Duration: 00:01:20.92

nghnam · December 18, 2023, 11:01am

pandas cannot infer the encoded column’s type correctly, so we must convert it explicitly.
Update format_alpha() like this:

def format_alpha(mtype, data):
    """Process byte strings of type alpha"""

    for col in alpha_formats.get(mtype).keys():
        if mtype != 'R' and col == 'stock':
            data = data.drop(col, axis=1)
            continue
        data.loc[:, col] = data.loc[:, col].str.decode("utf-8").str.strip()
        if encoding.get(col):
            data.loc[:, col] = data.loc[:, col].map(encoding.get(col))
            data[col] = data[col].astype(int)  # convert to int
    return data
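
A minimal, self-contained sketch of the same idea on a toy frame, so the effect of the explicit cast can be checked without the ITCH file. The `toy_encoding` mapping and the sample byte strings are assumptions made up for illustration, not the notebook's actual tables. The cast uses "int64" rather than int so that the width does not depend on the platform.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `encoding` dict.
toy_encoding = {"primary_market_maker": {"Y": 1, "N": 0}}

# Byte strings as they might come out of a binary parser (made-up values).
data = pd.DataFrame({
    "mpid": [b"AAAA", b"BBBB", b"CCCC"],
    "primary_market_maker": [b"Y", b"N", b"Y"],
})

for col in data.columns:
    # Decode bytes to text and trim any padding.
    decoded = data[col].str.decode("utf-8").str.strip()
    mapping = toy_encoding.get(col)
    if mapping:
        # Map codes to integers and make the dtype explicit.
        data[col] = decoded.map(mapping).astype("int64")
    else:
        data[col] = decoded

print(data.dtypes)
```

After the loop, primary_market_maker is a real integer column instead of an object column that happens to contain integers.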

JB31 · January 16, 2024, 2:05pm

I found a solution


            try:
                if 'primary_market_maker' in data.columns:
                    data['primary_market_maker'] = data['primary_market_maker'].astype('category')
                if 'buy_sell_indicator' in data.columns:
                    data['buy_sell_indicator'] = data['buy_sell_indicator'].astype('category')
                    
                store.append(mtype,
                             data,
                             format='t',
                             min_itemsize=s,
                             data_columns=dc)

Now, it is ok.

JB31 · January 19, 2024, 11:14am

A better way (requires numpy, imported as np):

            try:
                for col_name in data.columns:
                    if pd.api.types.is_object_dtype(data[col_name].dtype):
                        try:
                            data.get(col_name).str
                        except AttributeError:
                            data[col_name] = pd.Series(data[col_name], dtype=np.int8)
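
A variation on the same idea, sketched under the assumption that the only problem columns are object columns holding integers. Instead of probing the .str accessor and catching AttributeError, it asks `pd.api.types.infer_dtype` what each object column contains. The helper name `coerce_integer_objects` is hypothetical, and int64 is used here instead of np.int8 purely as a conservative choice for the sketch.

```python
import pandas as pd


def coerce_integer_objects(df):
    """Return a copy where object columns that hold only integers
    are cast to int64; all other columns are left untouched."""
    out = df.copy()
    for name in out.columns:
        if out[name].dtype == object:
            if pd.api.types.infer_dtype(out[name], skipna=False) == "integer":
                out[name] = out[name].astype("int64")
    return out


# Toy frame with one genuine string column and one integer-in-object column.
toy = pd.DataFrame({
    "mpid": pd.Series(["AAAA", "BBBB"], dtype=object),
    "primary_market_maker": pd.Series([1, 0], dtype=object),
})
fixed = coerce_integer_objects(toy)
print(fixed.dtypes)
```

Because the helper works on a copy, the original frame keeps its dtypes and the cast can be compared before and after.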

oscarllerena · February 9, 2024, 11:54pm

Adding the following line before return data did the trick:

data[col] = data[col].astype(int) # convert to int

Thanks @nghnam

oscarllerena · February 10, 2024, 1:16am

Also, after using @nghnam’s contribution, I got another error in the last lines of code, in the “Top Equities by Traded Value” section:

AttributeError: 'DataFrame' object has no attribute 'append'

The error is most likely due to a newer pandas version: DataFrame.append was deprecated and then removed in pandas 2.0. The private method _append still exists, so replacing append with _append works as a quick fix, but it is not part of the public API; pd.concat is the supported replacement.

Jacob · June 17, 2024, 12:50pm

Yes, I agree with @auphucdup.

YuweiUltra · June 25, 2024, 1:55pm

Here is my solution (add it to the store_messages function):

for col_name in data.columns:
    if pd.api.types.is_object_dtype(data[col_name].dtype):
        data[col_name] = data[col_name].astype(str)

itsjustausername · August 13, 2024, 8:39pm

JB31:

 try:
                for col_name in data.columns:
                    if pd.api.types.is_object_dtype(data[col_name].dtype):
                        try:
                            data.get(col_name).str
                        except AttributeError:
                            data[col_name] = pd.Series(data[col_name], dtype=np.int8)

Tried this:
def format_alpha(mtype, data):
    """Process byte strings of type alpha"""

    for col in alpha_formats.get(mtype).keys():
        if mtype != 'R' and col == 'stock':
            data = data.drop(col, axis=1)
            continue
        data.loc[:, col] = data.loc[:, col].str.decode("utf-8").str.strip()
        if encoding.get(col):
            data.loc[:, col] = data.loc[:, col].map(encoding.get(col))
            data[col] = data[col].astype(int)  # convert to int
    return data

Did not work.

And got this:

Start of Messages
03:02:31.65 25,000,000

Start of System Hours
04:00:00.00 25,241,258

Start of Market Hours
09:30:00.00 34,559,279
09:44:09.23 50,000,000 00:00:49.38
invalid combination of [values_axes] on appending data [name->primary_market_maker,cname->primary_market_maker,dtype->int32,kind->integer,shape->(1, 429498)] vs current table [name->primary_market_maker,cname->primary_market_maker,dtype->int64,kind->integer,shape->None]
L
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429498 entries, 0 to 429497
Data columns (total 7 columns):
 #   Column                    Non-Null Count   Dtype
---  ------                    --------------   -----
 0   stock_locate              429498 non-null  int64
 1   tracking_number           429498 non-null  int64
 2   timestamp                 429498 non-null  timedelta64[ns]
 3   mpid                      429498 non-null  object
 4   primary_market_maker      429498 non-null  int32
 5   market_maker_mode         429498 non-null  object
 6   market_participant_state  429498 non-null  object
dtypes: int32(1), int64(2), object(3), timedelta64[ns](1)
memory usage: 21.3+ MB
None
S 1
U 1
Q 1
C 1
I 1
V 1
P 1
E 1
X 1
R 1
F 1
D 1
A 1
L 1
Y 1
H 1
J 1
Name: count, dtype: int64
J 2
V 2
S 6
H 17770
Q 17774
R 17774
Y 17852
C 18352
P 216824
L 429498
E 729902
F 1673310
I 2144652
X 2172786
U 4265530
D 18089384
A 20188582
dtype: int64
Duration: 00:00:57.77

Is there a working solution?

Update 08/13/2024
Deleted the .h5 file and restarted the process and it seems to work with the solution I tried first.
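
The "invalid combination of [values_axes]" message above compares an int32 column being appended with an int64 column already in the table, which suggests the .h5 file still held a table written by an earlier run. Here is a minimal sketch of clearing that file before re-running, assuming the store is an ordinary file on disk. The function name `remove_stale_store` is hypothetical, and the demonstration uses a throwaway temporary file rather than real data.

```python
import tempfile
from pathlib import Path


def remove_stale_store(path):
    """Delete an existing HDF5 store so the next run starts from scratch.

    Returns True if a file was removed, False if there was nothing to delete.
    """
    store_path = Path(path)
    if store_path.is_file():
        store_path.unlink()
        return True
    return False


# Demonstration on an empty placeholder file, not an actual store.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "itch.h5"
    demo.write_bytes(b"")
    first = remove_stale_store(demo)   # file exists, so it is removed
    second = remove_stale_store(demo)  # nothing left to remove

print(first, second)  # True False
```

In the notebook this would be called with the notebook's own store path before the parsing cell. Starting from an empty file would also avoid appending a second copy of the same messages, which the doubled counts in the output above hint at.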

Branck · January 15, 2025, 1:19pm

Hello,

I encountered the same issue.
In my opinion, it is a bit disappointing that we all lose time on this.
Mr Jansen says it is not required to run this to use the data, but it seems the data in the repo is already corrupted.
Without checking much, it seems the code appends to the file regardless of whether the data is already there, so you may be looking at duplicates.
It is also odd that the parsing has bugs and does not work out of the box; I assume it worked for the author at some point.
To be fair, there is a lot of code, and not many books or authors go to that extent and provide so much.
I guess it also assumes we are all decently proficient with Python.

Anyway, I am planning to refactor a lot of that code to make sense of it. Hopefully I stay committed to it.

Kind regards,

itsjustausername · February 3, 2025, 5:05pm

def format_alpha(mtype, data):
    """Process byte strings of type alpha"""

    for col in alpha_formats.get(mtype).keys():
        if mtype != 'R' and col == 'stock':
            data = data.drop(col, axis=1)
            continue
        data.loc[:, col] = data.loc[:, col].str.decode("utf-8").str.strip()
        if encoding.get(col):
            data.loc[:, col] = data.loc[:, col].map(encoding.get(col))
            data[col] = data[col].astype(int)  # convert to int
    return data

That should work. Well…has worked for me

itsjustausername · February 3, 2025, 6:36pm

with pd.HDFStore(itch_store) as store:
    stocks = store['R'].loc[:, ['stock_locate', 'stock']]
    # trades = store['P'].append(store['Q'].rename(columns={'cross_price': 'price'}), sort=False).merge(stocks)
    trades = pd.concat([store['P'], store['Q'].rename(columns={'cross_price': 'price'})], ignore_index=True, sort=False).merge(stocks)

trades['value'] = trades.shares.mul(trades.price)
trades['value_share'] = trades.value.div(trades.value.sum())

trade_summary = trades.groupby('stock').value_share.sum().sort_values(ascending=False)
trade_summary.iloc[:50].plot.bar(figsize=(14, 6), color='darkblue', title='Share of Traded Value')

plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
sns.despine()
plt.tight_layout()

append has been deprecated…I commented it out and replaced with pd.concat.

YsmaelCastro · June 30, 2025, 12:48pm

You need to replace append with pd.concat:

#trades = store['P'].append(store['Q'].rename(columns={'cross_price': 'price'}), sort=False).merge(stocks)
trades = pd.concat([store['P'], store['Q'].rename(columns={"cross_price": "price"})], sort=False).merge(stocks)

This is because, starting with pandas 2.0, the DataFrame.append() method has been removed. It was a common (though not very efficient) way to concatenate DataFrames. The solution is to replace it with pd.concat().
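
A small sketch of that replacement on toy frames, so the behavior can be checked without the ITCH store. The two frames and their values are made up; only the column names price and cross_price mirror the snippet above. Note that rename needs the columns= keyword to rename a column; without it, pandas looks for matching index labels instead.

```python
import pandas as pd

# Made-up stand-ins for store['P'] (trades) and store['Q'] (cross trades).
p = pd.DataFrame({"stock_locate": [1, 2], "shares": [100, 50], "price": [10.0, 20.0]})
q = pd.DataFrame({"stock_locate": [1], "shares": [300], "cross_price": [11.0]})

# pandas < 2.0 equivalent:
#   p.append(q.rename(columns={"cross_price": "price"}), sort=False)
trades = pd.concat(
    [p, q.rename(columns={"cross_price": "price"})],
    ignore_index=True,  # renumber rows 0..n-1 instead of repeating index labels
    sort=False,         # keep the column order of the first frame
)

print(trades)
```

With the column renamed first, both frames share the same three columns and the result has one price column with no missing values.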