## SPRT testing

One of the best ways to test one engine against another is by using SPRT,
which stands for "Sequential Probability Ratio Test." A program that can
run such matches is *cutechess-cli*, by means of its `-sprt` parameter.

It is beyond the scope of this book to explain all the mathematics behind
this test, even *if* I understood everything in detail. I'm a
programmer, not a statistician.

It is sufficient to understand the basics.

You run a match between two engines. This match does not have a set length,
but to make sure it doesn't run 'forever', we normally set a very large
maximum number of games, such as 20,000. To make SPRT work, we need to set
two hypotheses: one we hope is true (H1), and one we hope is not true (H0,
the null hypothesis). We allow for a chance of 5% that the SPRT test gives
us the wrong result. So, we are 95% confident that the result is correct.
(You can lower the 5% margin, but the test will run *a lot longer*.)

Let's say we have a new engine, called version NEW. We also have an old version, which we call version OLD.

Now we state:

- H1: Engine NEW is at least 1 Elo stronger than engine OLD.
- H0: Engine NEW is NOT more than 5 Elo stronger than Engine OLD.
- Error margin: 5%.
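To get a feel for how small these Elo margins are, it can help to convert them into an expected score. The sketch below uses the standard logistic Elo formula; the function name `elo_to_expected_score` is only an illustration and is not part of cutechess-cli.

```python
def elo_to_expected_score(elo_difference):
    """Expected score (between 0 and 1) for the side that is
    `elo_difference` Elo stronger, using the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_difference / 400.0))


# Print the expected score for a few Elo differences.
for elo in (0, 1, 5, 100):
    print(f"{elo:>4} Elo -> expected score {elo_to_expected_score(elo):.4f}")
```

A difference of 5 Elo corresponds to an expected score of only slightly more than 50%, which is why it takes so many games to tell such engines apart.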

When running cutechess-cli, we give it the SPRT parameter:

```
-sprt elo0=1 elo1=5 alpha=0.05 beta=0.05
```

(Alpha and Beta have nothing to do with Alpha/Beta searching. In this case, Alpha is the chance we accept H1 while we shouldn't, and Beta is the chance we accept H0 while we shouldn't; i.e., a chance of 5% we get the wrong result from the test.)
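In the classic formulation of SPRT (Wald's test), Alpha and Beta determine two stopping thresholds for a running value called the log-likelihood ratio (LLR). The following is a minimal sketch of that textbook calculation; the function name `sprt_bounds` is hypothetical, and it is not taken from the cutechess-cli source code.

```python
import math


def sprt_bounds(alpha, beta):
    """Return the (lower, upper) LLR thresholds of Wald's SPRT.

    The test accepts H0 when the LLR drops to the lower bound and
    accepts H1 when it rises to the upper bound.
    """
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


lower, upper = sprt_bounds(alpha=0.05, beta=0.05)
print(f"lower = {lower:.3f}, upper = {upper:.3f}")
# prints "lower = -2.944, upper = 2.944"
```

Making Alpha and Beta smaller pushes both thresholds further away from zero, which is why a stricter error margin makes the test run longer.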

A cutechess-cli command could look like this:

```
cutechess-cli \
-engine conf="Rustic Alpha 3.15.100" \
-engine conf="Rustic Alpha 3.1.112" \
-each \
tc=inf/10+0.1 \
book="/home/marcel/Chess/OpeningBooks/gm1950.bin" \
bookdepth=4 \
-games 2 -rounds 2500 -repeat 2 -maxmoves 200 \
-sprt elo0=0 elo1=10 alpha=0.05 beta=0.05 \
-concurrency 4 \
-ratinginterval 10 \
-pgnout "/home/marcel/Chess/sprt.pgn"
```

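With this command, we're testing 3.15.100 (NEW) against 3.1.112 (OLD), where we run 2500 rounds with 2 games each (5,000 games at most), so each engine plays the same opening once with white and once with black. Time control is 10 seconds + 0.1 seconds increment per move. Note that this example uses `elo0=0` and `elo1=10` instead of the 1 and 5 from the hypotheses above.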

Now we start the match between our NEW and OLD (previous) engine version, and cutechess-cli will start to play games.

Let's say that, after 400 games, NEW is 100 Elo stronger than OLD.
This could still change, if you play enough games... but that is the point
of SPRT testing. As soon as cutechess-cli is 95% sure which of the two
hypotheses holds, it will abort the match, which saves *a lot of time*.

At that point, cutechess-cli compares the result in Elo against the stated hypotheses. With a result of +100 Elo for engine NEW, we can see:

- H1: NEW is at least 1 Elo stronger than OLD. True, because 100 > 1.
- H0: NEW is *NOT* more than 5 Elo stronger than OLD. False, because it IS more than 5 Elo stronger (100 Elo is more than 5 Elo).

So H1 is accepted, and we have determined that NEW is a stronger engine than OLD, and by how much (self-play) Elo.

If NEW scored -20 Elo, then we would have had this result:

- H1: NEW is at least 1 Elo stronger than OLD. False, because -20 < 1.
- H0: NEW is NOT more than 5 Elo stronger than OLD. True, because -20 is indeed less than 5 Elo.

So H0 is accepted, which means that our NEW engine is not stronger than our OLD engine; it is actually weaker, and we should not include this feature. (At least, not yet: a feature which causes a strength loss now, could cause a strength gain when added on top of other features, so it's worth it to try again later.)

As long as the difference between NEW and OLD is between 1 and 5 Elo, the SPRT test keeps running, because both hypotheses are still true: 3 Elo is "at least 1 Elo stronger", but it is also "not more than 5 Elo stronger." If this result doesn't change, cutechess-cli would run forever; that is the reason why we set a match limit of something like 20,000 games.
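The stop-or-continue decision described in this section can be sketched in a few lines of code. This is a simplified illustration under stated assumptions: it uses a common normal approximation of the log-likelihood ratio (LLR) based on the win/draw/loss counts, which is not necessarily the exact calculation cutechess-cli performs. All function names (`elo_to_score`, `llr`, `sprt_status`) are hypothetical.

```python
import math


def elo_to_score(elo):
    """Expected score (0..1) for a given Elo advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of 'elo1' versus 'elo0'.

    Assumption: a normal approximation of the match score is used.
    """
    games = wins + draws + losses
    if games == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / games
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    variance = (wins + 0.25 * draws) / games - mean * mean
    if variance <= 0.0:
        return 0.0  # All games had the same result: no usable estimate yet.
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return games * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * variance)


def sprt_status(wins, draws, losses, elo0, elo1, alpha=0.05, beta=0.05):
    """Return 'accept H1', 'accept H0' or 'continue' for the current score."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    value = llr(wins, draws, losses, elo0, elo1)
    if value >= upper:
        return "accept H1"
    if value <= lower:
        return "accept H0"
    return "continue"


print(sprt_status(2000, 1000, 1000, elo0=1, elo1=5))  # accept H1
print(sprt_status(1000, 1000, 2000, elo0=1, elo1=5))  # accept H0
print(sprt_status(10, 10, 10, elo0=1, elo1=5))        # continue
```

In a real test this check would be repeated after every finished game (or pair of games), and the match would end as soon as the status is no longer "continue" or the maximum number of games is reached.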