I’m a beginner with RL and wanted to ask a silly question about the following aspect of the Trading Agent in the q learning notebook:
minibatch = map(np.array, zip(*sample(self.experience, self.batch_size)))
Since this minibatch is a random sample, it will produce batches in which the done array has interleaved entries, e.g.:
[0, 1, 1, 0, 1, …]
Since the done flag is used to compute the TD target:
td_target = rewards + done * self.gamma * target_q
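For concreteness, here is a minimal sketch of what I understand the sampling and target computation to be doing (only the `zip(*sample(...))` unpacking and the td_target line follow the notebook; the buffer contents and `target_q` values are my own toy stand-ins):

```python
import random
from collections import deque
import numpy as np

np.random.seed(0)
rng = random.Random(0)

# Toy replay buffer of (state, action, reward, next_state, done) tuples,
# where done acts as a continuation mask (0 = episode ended at this step).
experience = deque(maxlen=100)
for t in range(20):
    state = np.random.rand(4)
    next_state = np.random.rand(4)
    reward = np.random.rand()
    done = 0 if (t + 1) % 5 == 0 else 1  # episode ends every 5 steps
    experience.append((state, 0, reward, next_state, done))

batch_size = 8
gamma = 0.99

# Same unpacking pattern as the notebook line: a random, order-shuffled batch.
states, actions, rewards, next_states, done = map(
    np.array, zip(*rng.sample(list(experience), batch_size)))

# done is now interleaved, e.g. [0, 1, 1, 0, 1, ...] -- but each transition
# still carries its own next_state, so the targets remain elementwise consistent.
target_q = np.random.rand(batch_size)  # stand-in for max_a' Q_target(s', a')
td_target = rewards + done * gamma * target_q
print(td_target.shape)  # (8,)
```

Entries where done is 0 collapse to the bare reward, and the rest bootstrap from their own next state, regardless of the shuffled ordering.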
I’m trying to understand whether this will affect learning adversely, since the underlying problem is sequential in nature.