Playing chess against the Transformer

The Transformer playing chess

We trained an autoregressive Transformer language model to play chess; read on to find out how the experiment played out. Our goal was to gain insight into the kind of learning Transformers are capable of, beyond the well-known text generation examples.

The Transformer architecture

Even if you don’t follow the field of Natural Language Processing, you’ve probably heard of the Transformer by now: a neural network architecture that relies on an attention mechanism to process sequential data. Although it is not restricted to text generation, the Transformer has disrupted the NLP landscape. In recent years, various neural network architectures, all based on the original Transformer, have achieved unprecedented results on a range of NLP tasks. In one of its most ambitious incarnations, the Transformer has surfaced in the mainstream media as OpenAI’s GPT line. Given enough training data and a sufficiently large model, GPT models are capable of writing texts that are (on the surface) difficult to distinguish from those written by humans.

At InterTextueel, we regularly train Transformer models (either from scratch, or by fine-tuning existing language models) to perform a host of natural language tasks, including topic modeling, emotion recognition, text clustering and text classification. Developing a personal intuition for the fundamental capabilities and limitations of various algorithms is a great asset. For many straightforward projects, it could be sufficient to build something on top of OpenAI’s toolset. However, the process of building models from scratch almost always yields new insights into the datasets we work with, the deeper nature of a problem, or the peculiarities of the available architectures and models.

Parameters and training

You can safely skip this paragraph, unless you’re interested in the technical details of the training process.

We trained a GPT2-large model with 24 layers and 289 million trainable parameters on a dataset of 400,000 historical chess master games. At 644 tokens, the BPE vocabulary is, unsurprisingly, far smaller than the vocabularies used for training on natural language.

We opted to train the GPT with a maximum sequence length of 256 tokens. A larger sequence length would have required more training time. With a sequence length of 256, we were able to fully model a game up to around 200 moves. The vast majority of the games in our dataset were finished in fewer moves than that. (See chess records).

We trained for a single epoch with an effective batch size of 64. Training took 15 hours in total. Certainly, a more powerful model could be trained with either more training data, more parameters or both.
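As a rough back-of-the-envelope check, the figures above can be turned into a step count and a time per optimizer step. This is only a sketch: it assumes that every game forms exactly one training sequence and that no games were held out for validation, neither of which is stated above.

```python
import math

# Figures quoted in the text.
num_games = 400_000
effective_batch_size = 64
training_hours = 15

# Assumption (not from the text): one game per training sequence,
# and the full dataset is used for the single epoch.
steps_per_epoch = math.ceil(num_games / effective_batch_size)
seconds_per_step = training_hours * 3600 / steps_per_epoch

print(f"optimizer steps in one epoch: {steps_per_epoch}")
print(f"average seconds per step:     {seconds_per_step:.2f}")
```

Under these assumptions the single epoch amounts to 6,250 optimizer steps, or a little under nine seconds per step on average.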

The model will be uploaded to Huggingface soon, enabling others to experiment with it, and potentially improve it.

Exploring the games

We started evaluating our language model by prompting it with several opening moves and exploring its full output. The model generated a game with a classic Ruy Lopez opening, a type of opening that is abundant in the training data. The output of the language model looks quite natural. Nevertheless, on move 20, black played its first illegal move, 20 … Nc7, which would place the knight on a square already occupied by the black queen. In terms of game play, the most obvious correct continuation would have been axb4, with black taking a pawn. Interestingly, the language model makes its first mistake at a point in the game where the computer (Stockfish) evaluation of the position suddenly spikes.

Here is the raw output from the language model, to give you a sense of what the model was thinking.

Input sequence:

e4 e5 Nf3

Output sequence:

Nc6 Bb5 a6 Ba4 Nf6 O-O Be7 Re1 b5 Bb3 d6 c3 O-O h3 Na5 Bc2 c5 d4 Qc7 Nbd2 cxd4 cxd4 Nc6 d5 Nb4 Bb1 Bd7 Nf1 Rfc8 Ng3 g6 Bd3 a5 a3 Na6 b4 Nc7 Bd2 Qb7 Rc1 Na6 Bb1 Nb8 Bd3 Bd8 Rc2 Na6 Ne2 Nc7 Ng3 Nb5 Bb1 Na7 Bd3 Nc4 Bxc4 dxc4 d6 Qxd6 Qxd6 Qxd6 … <repetition>
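The raw sequences carry no move numbers, which makes it harder to locate a particular move such as 20 … Nc7. The small helper below is a sketch (the function name `number_moves` is ours, not part of the model or its tooling) that adds conventional move numbers to a space-separated sequence of plies. It assumes the input contains only moves, so markers such as "…" or "<repetition>" should be removed first.

```python
def number_moves(sequence: str) -> str:
    """Add move numbers to a space-separated sequence of plies.

    Each numbered move consists of a white ply and, if present, a black ply.
    """
    plies = sequence.split()
    numbered = []
    for i in range(0, len(plies), 2):
        pair = " ".join(plies[i:i + 2])  # white ply plus optional black ply
        numbered.append(f"{i // 2 + 1}. {pair}")
    return " ".join(numbered)


# The prompt from above followed by the first generated ply.
print(number_moves("e4 e5 Nf3 Nc6"))
```

Applied to the prompt plus the generated output, the 40th ply lands on black's 20th move, which is where the illegal Nc7 appears.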

Comparing our language model to GPT-3

OpenAI’s GPT-3 is also capable of playing chess, without fine-tuning, because its vast training corpus contains plenty of chess games. When prompted with 1. e4 e5 2. Nf3, GPT-3 also played the Ruy Lopez, albeit a different variation. It faltered on move 15, which is substantially earlier in the game. Comparing both games is akin to comparing apples and oranges, as GPT-3 wasn’t specifically trained to play chess. On the other hand, it is a general-purpose language model with roughly 600 times as many trainable parameters (175 billion against our 289 million). Our small-scale project at the very least suggests it is feasible to train a task-specific language model without having to resort to astronomically sized pre-trained models.

Here’s the GPT-3 game up until the first illegal move:

Playing versus a human

We were excited to play against our Transformer chess engine ourselves. In the game below, we play an unconventional series of moves against the Transformer’s Sicilian defense, in order to probe its actual chess knowledge. We expected the model to struggle more with our unconventional opening, and to some extent this turned out to be accurate. Towards the end of the opening, at move 15, the Transformer suggested a blunder that led it to produce illegal moves. Because a language model is nothing more than a means to produce a probability distribution for the next token given a series of preceding tokens, we were able to simply skip the illegal move and look at the next most likely output to continue the game.
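The idea of skipping an illegal suggestion and falling back on the next most likely one can be sketched as follows. This is a minimal illustration, not our actual tooling: the function name `pick_top_legal_move`, the candidate probabilities and the hand-written set of legal moves are all invented for the example. In practice the legality check would come from a chess rules library and the candidates from the model's output distribution.

```python
from typing import Callable, Iterable, Optional, Tuple


def pick_top_legal_move(
    candidates: Iterable[Tuple[str, float]],
    is_legal: Callable[[str], bool],
) -> Optional[Tuple[str, int]]:
    """Return the most probable legal move and its rank (0 = top suggestion).

    Returns None when none of the candidates is legal.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    for rank, (move, _probability) in enumerate(ranked):
        if is_legal(move):
            return move, rank
    return None


# Made-up probabilities, borrowing the illegal 20 ... Nc7 from the earlier game.
suggestions = [("Nc7", 0.41), ("axb4", 0.33), ("Qb7", 0.12)]
# Hypothetical stand-in for a real rules check on the current position.
legal_in_position = {"axb4", "Qb7"}

choice = pick_top_legal_move(suggestions, lambda move: move in legal_in_position)
print(choice)
```

A rank of 0 means the model's top suggestion was legal; any higher rank means at least one illegal suggestion had to be skipped.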

Surprisingly, with the exception of move 15, all moves in the above game were legal and also our language model’s top suggestions. It appears the Transformer model has a basic grasp of the game of chess. It didn’t solely follow well-established patterns, although it did overlook several tactical opportunities. Crucially, it tended to ignore the rules of the game when it reached a losing position. At the final position, after our move: O-O (short castling), it even completes the sequence by suggesting a different move for us, namely O-O-O – transforming our short castle into a long castle by adding a subtoken! While not illegal from the perspective of a language model, it’s certainly creative. The Transformer’s subsequent proposed move was the 1-0 token, indicating surrender.


Comparing chess to a natural language, we could consider the rules of chess itself as the syntax, and the purpose behind the moves as the semantics of the language. In the many text fragments of GPT-2 sized models we’ve studied, it’s remarkable how often the syntax remains correct (proper punctuation and so on) even when the semantics start to become incoherent. How did our model perform in this area? A coherent semantic pattern in chess would be, for example, steadily developing positional advantages, identifying weaknesses in the opponent’s position, and defending against attack. Our Transformer chess model seems to have a basic positional understanding of chess. It builds up towards a useful position, but it doesn’t seem to properly detect threats or take advantage of opportunities. Just like similarly sized language models for natural language, our language model produces meaningful-looking chess moves without fully grasping the ‘subject matter’ of the game it’s playing.

Towards a real chess engine

So how can we turn this Transformer into a model capable of playing chess at a reasonably competitive level? The bar is high for chess, given that it’s AI’s most studied game by far. By dramatically increasing our model size and the size of our dataset, we’d likely achieve perfect syntactical understanding: the language model should then obey the rules of the game and produce legal moves without fail. Scaling up should also yield more convincing chess play in general. To further improve the quality of play, we should consider adjusting the training objective so that the model learns to predict the next token only for the winning side. In our modest project, the language model was trained on the moves of both sides (including drawn games).
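One way to restrict learning to the winning side is a per-token loss mask, so that only the winner's moves contribute to the training loss. The sketch below is an assumption about how this could be done, not something we implemented. It assumes one token per ply (a real BPE tokenization may split a move into several tokens), and it excludes drawn games entirely, which is just one possible choice.

```python
from typing import List


def winner_loss_mask(num_plies: int, result: str) -> List[int]:
    """Return 1 for plies played by the winning side and 0 otherwise.

    Plies are 0-indexed: White plays the even plies, Black the odd ones.
    """
    if result == "1-0":
        winner_parity = 0  # White won
    elif result == "0-1":
        winner_parity = 1  # Black won
    else:
        # Draw or unknown result: no winner, so nothing is learned (assumption).
        return [0] * num_plies
    return [1 if ply % 2 == winner_parity else 0 for ply in range(num_plies)]


print(winner_loss_mask(6, "1-0"))
print(winner_loss_mask(6, "0-1"))
```

During training, the per-token loss would be multiplied by this mask before averaging, while the loser's moves would still be fed to the model as context.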

An alternative approach could be to fine-tune a RoBERTa model for classification of sequences (i.e., entire chess games). Our RoBERTa classifier would classify games as belonging to one of three classes: black wins, draw, or white wins. The class probabilities could be transformed into an evaluation function, which could then be retrofitted into an existing chess engine, provided the state is no longer represented as the board position itself but as the sequence of moves leading up to it. This should allow us to perform a minimax tree search, as is common in regular chess engines. However, the inference speed of the Transformer lies far below that of an evaluation function written in native machine code, such as the one used in Stockfish. Thus, in order to be competitive, our RoBERTa evaluations would need to be exceptionally accurate. Of course, Transformer models have shown the potential of being just that.
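To make the idea more concrete, here is a minimal sketch of such an evaluation function combined with a plain minimax search over move sequences. Everything in it is hypothetical: `classify` stands in for the fine-tuned classifier, `legal_moves` for a rules engine, and the toy game tree and its probabilities are invented. Mapping the probabilities to P(white wins) minus P(black wins) is one simple choice among several.

```python
from typing import Callable, Dict, List, Tuple

Moves = Tuple[str, ...]
Probs = Tuple[float, float, float]  # P(black wins), P(draw), P(white wins)


def evaluation(probs: Probs) -> float:
    """Map class probabilities to a score in [-1, 1]; positive favours White."""
    p_black, _p_draw, p_white = probs
    return p_white - p_black


def minimax(
    moves: Moves,
    depth: int,
    white_to_move: bool,
    legal_moves: Callable[[Moves], List[str]],
    classify: Callable[[Moves], Probs],
) -> float:
    """Search over move sequences; White maximises the score, Black minimises."""
    options = legal_moves(moves) if depth > 0 else []
    if not options:
        # Leaf: score the sequence with the classifier. A real engine would
        # handle checkmate and stalemate explicitly instead.
        return evaluation(classify(moves))
    scores = [
        minimax(moves + (m,), depth - 1, not white_to_move, legal_moves, classify)
        for m in options
    ]
    return max(scores) if white_to_move else min(scores)


# Invented toy tree: White picks "a" or "b", then Black replies.
toy_tree: Dict[Moves, List[str]] = {(): ["a", "b"], ("a",): ["c", "d"], ("b",): ["e"]}
toy_probs: Dict[Moves, Probs] = {
    ("a", "c"): (0.1, 0.2, 0.7),
    ("a", "d"): (0.6, 0.3, 0.1),
    ("b", "e"): (0.3, 0.4, 0.3),
}

best = minimax((), 2, True, lambda s: toy_tree.get(s, []), lambda s: toy_probs[s])
print(best)
```

In the toy tree, White avoids "a" because Black's best reply leads to a sequence the stand-in classifier scores as clearly favouring Black. Note that every leaf costs one classifier call, which is exactly where the inference speed of a Transformer becomes the bottleneck.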

Transformers and other games

Natural Language Processing is not normally associated with game AI. Despite this, our experiment shows that NLP can be used for playing games, and not only for designing them (which is a different subject).

We know of very few NLP game-playing projects, but one noteworthy exploration of games and Natural Language Processing is Microsoft’s TextWorld. TextWorld is a platform in which researchers create autonomous agents that can explore traditional text-based adventure games and solve the puzzles embedded in them. Although it predates the Transformer era, text-based games remain a fascinating domain for those interested in agency and Natural Language Processing.

It was a pleasure training the Transformer to play chess, and we hope you enjoyed reading about our experience just as much.