A way to evaluate context windows of LLMs


I came across a competition on Kaggle that encourages users to stress-test the context window of a newly released LLM that supports a 1M-token context. One idea that came to me was to take long passages of text in different, easy-to-recognise languages, mix them up, and place random alphanumeric strings in between the text. I would then ask the LLM to detect these alphanumeric words that are out of context. I call this ‘the document prankster problem’.
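To make the setup concrete, here is a minimal sketch in Python of how such a document could be assembled. The passages, their lengths, and the number of hidden strings are illustrative placeholders, not the values used in my actual notebook.

```python
import random
import string

# A minimal sketch of the "document prankster" setup (the passages below are
# placeholders): interleave sentences from several easy-to-recognise languages,
# then plant random alphanumeric strings among them and record what we hid.

PASSAGES = {
    "english": "The quick brown fox jumps over the lazy dog. " * 50,
    "french": "Le renard brun et rapide saute par-dessus le chien paresseux. " * 50,
    "german": "Der schnelle braune Fuchs springt über den faulen Hund. " * 50,
}

def random_token(length: int = 10) -> str:
    """Generate one out-of-context alphanumeric 'prank' string."""
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))

def build_prankster_document(n_pranks: int = 20) -> tuple[str, list[str]]:
    """Shuffle sentences from all passages and scatter n_pranks
    random alphanumeric strings among them."""
    sentences = []
    for text in PASSAGES.values():
        sentences.extend(s.strip() for s in text.split(". ") if s.strip())
    random.shuffle(sentences)

    pranks = [random_token() for _ in range(n_pranks)]
    for prank in pranks:
        sentences.insert(random.randrange(len(sentences) + 1), prank)

    return " ".join(sentences), pranks

document, ground_truth = build_prankster_document()
print(f"{len(document)} characters, {len(ground_truth)} hidden strings")
```

Scaling the passage lengths up pushes the document towards the model's full context window while the list of planted strings serves as ground truth for scoring.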

For a human, such a task would be easy whether or not one knows these languages. LLMs, by contrast, tend to compress or drop information as the input approaches the limits of their context window. This test was designed to see whether LLMs can really maintain such a long window without losing context. My recent tests indicate mixed results and show that LLMs still have difficulty completing tasks that are simple for a human. The slides summarising the results can be found here. For access to the Kaggle notebook, request it in the comments below. Overall, it was a great experience learning to work with APIs and coming up with creative ways of stress-testing an LLM.
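For scoring, one simple approach (a sketch of how I think about the evaluation, not necessarily the exact metric in the slides) is to compare the strings the model reports against the ground-truth list and compute precision and recall:

```python
def score_detection(reported: list[str], ground_truth: list[str]) -> dict[str, float]:
    """Compare the strings the model reports against the planted ones.
    Precision: how many reported strings were genuine pranks.
    Recall: how many planted pranks the model actually found."""
    reported_set, truth_set = set(reported), set(ground_truth)
    hits = reported_set & truth_set
    precision = len(hits) / len(reported_set) if reported_set else 0.0
    recall = len(hits) / len(truth_set) if truth_set else 0.0
    return {"precision": precision, "recall": recall}

# Hypothetical usage: model_output_strings would be parsed from the LLM's reply.
# scores = score_detection(model_output_strings, ground_truth)
```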

