Arrows of Time for Large Language Models
(arxiv.org) Isn't it obvious that, since LLMs are trained to predict the next word, they do better at that than at predicting the previous one?
The paper mentions that the backward-predicting LLMs are themselves pre-trained to predict the previous token (i.e., trained on reversed text), so the difference is not obvious at all.
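To make the point concrete: a "backward" model doesn't need a special training objective; it can reuse a standard next-token training loop on reversed sequences. A minimal sketch (the function name and the token-id-list representation are illustrative assumptions, not from the paper):

```python
def make_lm_examples(token_ids, backward=False):
    """Build (input, target) pairs for next-token training.

    If backward=True, the sequence is reversed first, so the
    "next token" objective becomes previous-token prediction
    with respect to the original text order.
    """
    seq = list(reversed(token_ids)) if backward else list(token_ids)
    inputs = seq[:-1]   # each position sees the tokens before it
    targets = seq[1:]   # and is trained to predict the one after
    return inputs, targets

# Forward: predict 2 from 1, 3 from 2, ...
print(make_lm_examples([1, 2, 3, 4]))                 # ([1, 2, 3], [2, 3, 4])
# Backward: predict 3 from 4, 2 from 3, ...
print(make_lm_examples([1, 2, 3, 4], backward=True))  # ([4, 3, 2], [3, 2, 1])
```

Since both directions use the exact same architecture and loss, any systematic gap in perplexity between them reflects an asymmetry in the data, not in the training setup.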
Is there a link with entropy production?