Creating an AI strategy: 5. Assessing quality of models

Once the model is trained, you have some data to analyze. Here are 3 things you’ll have to work with.

  1. Analyze the “Confidence vs Win%” chart.
  2. Explore input parameters' importance.
  3. Backtest your model on data it has never seen before.

In general, the first thing you want to do is discard bad models quickly. Models which are not promising can either be discarded or transformed (e.g., by tweaking the inputs). Models which are promising should be explored further. Ultimately, all the good models you find should be backtested on previously unseen data. Once you get a good backtest, you’re ready to move on to forward testing.

Confidence vs Win% chart

During the training process, the AI model is trained on a portion of your data, but the quality of its predictions is assessed on a different data set (see “1. Define the learning data set”). Checking model quality on the test data set is not as biased as analyzing its results on the training data set. It’s not completely unbiased either, but it’s still better (see “Backtest your model on unseen data” below). Analyzing your model’s results on the test data set is a good first step.

We display all the data about how the model performs on the test data set as one chart. This is the Confidence vs Win% chart.

Any time a model generates a signal, it typically comes with a sense of how confident the model is that this signal should exist (as opposed to “no signal here”). For example, for a Random Forest, if 40% of the trees vote for a signal, then the confidence level is 40%. While 40% may not look great (after all, 60% of the ensemble voted “no signal”!), remember that fixed-RR signals are all about Win%, not about being right all the time.
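
Here is a minimal sketch of that vote-share idea, assuming a scikit-learn Random Forest and made-up indicator data (the platform’s internal implementation may differ):

```python
# Illustrative only: confidence as the fraction of trees that vote "signal".
# X_train / y_train / X_new are random stand-ins for indicator values and labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))          # stand-ins for inputs (e.g. RSI, ROC, MOM)
y_train = (X_train[:, 0] > 0).astype(int)    # dummy labels: 0 = "no signal", 1 = "signal"
X_new = rng.normal(size=(5, 3))              # new bars to classify

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# Fraction of trees whose individual prediction is "signal" = the confidence.
votes = np.array([tree.predict(X_new) for tree in model.estimators_])
confidence = votes.mean(axis=0)
print(confidence)    # e.g. 0.40 means 40% of the trees voted "signal"
```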

Confidence is a crucial part of the equation. Naturally, the higher the confidence you demand from your signals, the fewer of them you will receive. At a confidence level of 0% you would take any prediction of the model as “there’s a signal”. In the vast majority of cases that won’t work well. At a confidence level of 100% you might have no signals at all, or 2 signals a year. It depends.

It’s very important not to mistake “being confident” for “being right”. Sometimes models will be perfectly confident about genuinely bad decisions.

Remember that you’re assessing your model’s quality on a particular data set (the “test data” set). Once training is done, we run the model several times on this test data set, each time demanding higher-confidence signals. We start from 0% confidence, then 8%, then 16% and so on, all the way up to 100%. At every level of confidence, we record 2 things: the number of signals we got, and what portion of those signals were winners. That’s how we accumulate the “win% at each level of confidence” data. We use this data directly to paint the chart.
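
A rough sketch of how such points can be assembled, assuming you already know each test-set signal’s confidence and outcome (the arrays below are random placeholders, not real model output):

```python
import numpy as np

rng = np.random.default_rng(1)
confidence = rng.uniform(0, 1, size=300)         # confidence per candidate signal
won = rng.uniform(0, 1, size=300) < confidence   # dummy win/loss outcomes

points = []
for pct in range(0, 101, 8):                     # 0%, 8%, 16%, ... (the exact grid may differ)
    threshold = pct / 100
    taken = confidence >= threshold              # signals you would accept at this confidence
    n_signals = int(taken.sum())
    win_rate = float(won[taken].mean()) if n_signals else float("nan")
    points.append((threshold, n_signals, win_rate))
    print(f"conf >= {threshold:4.0%}: {n_signals:3d} signals, win% = {win_rate:.0%}")
```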

Here’s one example of a chart like that. The point with a white circle around it literally tells you that if you accept signals with 50%+ confidence, then this model would give you 21 winning signals on the test data set, with a win rate of 53%. So you’d get 40 signals, 21 of which would result in gains (according to your combination of TP/SL).

You can see that there are colored areas on this chart, and the data points are colored as well. Colors go as follows:

  1. Red: no go. Dots in this area represent combinations which are mathematically not viable. For example, on the chart above the model has a reward-to-risk ratio of 3.0, which means that its minimum viable Win% is around 25% (see the sketch after this list). Any smaller win% will lead to a decline of your portfolio. So the area of the chart to the left of 25% is red.
  2. Orange: meh. The orange area contains data points whose win% is above the minimum required one, but without much leeway over it. That’s somewhat subjective, so we went with our experience and defined the minimum required leeway as “at least 5% above the min win% for a given RR”. So the area containing all the dots which are not a mathematical no go, but still not very far from being one, is colored orange.
  3. Green: good. This area contains data points which illustrate win% greater than “min viable win% for the RR plus 5%”. These are typically the data points you want to work with.
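
The 25% figure mentioned above comes from the breakeven math for a fixed reward-to-risk ratio. Here is a minimal sketch of that calculation, including the 5% leeway used for the orange/green boundary:

```python
# With a fixed reward-to-risk ratio RR, a winning trade gains RR units of risk
# and a losing trade loses 1 unit, so breakeven requires
# win% * RR > (1 - win%), i.e. win% > 1 / (1 + RR).
def min_viable_win_rate(rr: float) -> float:
    return 1.0 / (1.0 + rr)

rr = 3.0
red_boundary = min_viable_win_rate(rr)    # 0.25 -> everything below 25% is in the red area
green_boundary = red_boundary + 0.05      # the 5% leeway from the text -> green starts here

print(f"RR = {rr}: red below {red_boundary:.0%}, orange up to {green_boundary:.0%}, green above")
```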

Confidence vs Win% charts can take a lot of shapes. Here are a few examples, explained.

Chart Explanation

You can see that the higher the confidence, the worse the win% of the signals. That’s a disturbing sign. From this chart alone you can tell that things are not looking awesome. This model is likely to be consistently bad. Only proceed to backtest a model like this if you have no better options at all.

That’s what a healthy model looks like. The chain of dots goes from lower left to upper right, which means that as the level of confidence increases, the win% of your signals increases as well. The overall number of good signals (green dots) it gave on the test data set is not huge, but not tiny either. Definitely backtest it further. Also, consider crossbreeding it with good models of the same kind.

You can see that there’s no conclusive shape to this chain of data points. They start out heading toward the top right, but ultimately the dots are all over the place. This means the model might be onto something, but it’s not quite there yet. Analyze its features and either apply manual feature engineering, or crossbreed it with other models.

This model is consistently bad. At the same time, it’s akin to a broken clock: it happens to be right twice a day. That’s a no go.

You might have noticed that there’s more to this chart than data points only. There are a few additional items here, which help you to assess the quality of your model.

  1. 2 solid gray vertical lines illustrate how conclusive the model is. They show “Win% at a confidence of 50%” in 2 special cases. Once we’re done training your model, we train it twice more, with the train/test data sets still split 80%/20%; in one case the test data set is the first 20% of the data, and in the other it’s the last 20%. The closer these lines are to each other, the more conclusive the model is. The further apart they are, the more likely the model has captured noise instead of the “real pattern”.
  2. The dotted area illustrates the win% of random signals. Tossing a coin is always an alternative to following a complex strategy! Markets move in different directions at different times, and sometimes your model having a lot of green dots does not mean that the model is good; it may simply mean that the market was consistently going up, in which case even tossing a coin would have made you money. Any time we train a model, we also emulate a bunch of randomly placed signals (which still have TP/SL/horizon identical to yours) and measure their Win%. The dotted area illustrates how well or poorly these random signals performed on the given test data (a simplified sketch of this idea follows the list). If the majority of your model’s data points sit within the dotted area, then the odds are that the chart looks nice because of the overall direction of the market, and not necessarily because the model is doing a good job capturing “real patterns”.
  3. The pale red line is a regression line. It only appears on charts where the overall chain of data points goes in the wrong direction (that is, “higher confidence yields a worse win rate”). This is not necessarily a no go, as lower confidence levels might still yield a sufficient Win%. But it’s a reminder that things are not quite right with this model.
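
Here is a simplified sketch of that random-baseline idea, using a synthetic price series and hypothetical TP/SL/horizon values (the platform’s actual simulation may differ):

```python
# Place entries at random bars, apply the same TP/SL (as fractions of the entry
# price) and horizon, and measure the resulting win%.
import numpy as np

rng = np.random.default_rng(2)
prices = 100 * np.cumprod(1 + rng.normal(0, 0.01, size=2000))   # synthetic close prices

def random_signal_win_rate(prices, tp=0.03, sl=0.01, horizon=50, n_signals=200):
    wins, resolved = 0, 0
    entries = rng.integers(0, len(prices) - horizon, size=n_signals)
    for i in entries:
        entry = prices[i]
        window = prices[i + 1 : i + 1 + horizon]
        tp_hits = np.flatnonzero(window >= entry * (1 + tp))
        sl_hits = np.flatnonzero(window <= entry * (1 - sl))
        if tp_hits.size == 0 and sl_hits.size == 0:
            continue                              # neither TP nor SL was hit within the horizon
        resolved += 1
        if sl_hits.size == 0 or (tp_hits.size > 0 and tp_hits[0] < sl_hits[0]):
            wins += 1                             # TP was reached before SL
    return wins / resolved if resolved else float("nan")

print(f"Random long entries with TP=3%, SL=1%: win% = {random_signal_win_rate(prices):.0%}")
```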

Here are a few examples of charts with these elements added.

Chart Explanation

This model has quite a few green dots, right? However, all of them yield a Win% within the ballpark of what a dice roll could give you. That’s a bad sign, and this model is probably not capturing any signal. The overall chain of dots is going in the wrong direction (higher confidence gives worse signals). The gray vertical lines are about 10% apart, which is a lot. This model is a mess.

Now that’s a model! All of the data points are well outside of the random control area. Gray vertical lines are only 3% apart. A lot of data points are in the orange area, which is not awesome, but the overall shape of the chain of dots is right. This model is definitely worth backtesting at Confidence above 60% or so. It’s worth crossbreeding with some good peers, too.

While this model does not even have a green area, note that both vertical gray lines are in the orange area. They are at the upper end of the randomized signal win% area. This model is not overly attractive, but it might be capturing a pattern which is consistently slightly better than random. This model is worth some feature engineering. The odds are you can make it better.

While this chart is green and worth backtesting, the overall direction of the chain of data points is not right: higher confidence leads to worse signals. You can still backtest it, but the pale red regression line is painted on this chart to draw your attention to the fact that something is not quite right here.

Input parameters’ importance

Models do not pay equal attention to all of your inputs when making decisions. Some inputs always have a bigger impact than others. Once your model is trained, you can see the input importance data in your interface.

In the example above, the model turned out to rely on RSI14 the most. ROC12 was material to its decision making as well, but not as much. MOM10 was even less important, and the fancy combo of RSI and MOM was just useless, which means that if you remove it from the list of inputs, nothing will change for this model.
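
The exact way importance is computed is not described here, but as an illustration, here is one common, model-agnostic way to produce numbers like these: permutation importance with scikit-learn, on made-up data using the same input names.

```python
# Illustrative only: the platform's exact method may differ.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
X = pd.DataFrame({
    "RSI14": rng.uniform(0, 100, 1000),
    "ROC12": rng.normal(0, 1, 1000),
    "MOM10": rng.normal(0, 1, 1000),
})
X["RSI14_x_MOM10"] = X["RSI14"] * X["MOM10"]   # the "fancy combo" input
y = (X["RSI14"] < 30).astype(int)              # dummy labels driven mostly by RSI14

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for name, score in sorted(zip(X.columns, result.importances_mean), key=lambda p: -p[1]):
    print(f"{name:>14}: {score:.3f}")          # higher = the model leans on this input more
```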

Leveraging knowledge about input importance can go a long way. You can use it in a number of ways, and we also use it in our model crossbreeding algo. Here are the typical use cases:

Is your model any good?

  1. Yes: You might be onto something! Create new models where you discard the least important inputs and add more variations of the good ones. Tweak indicator parameters, apply some feature engineering and see what happens. Crossbreed with other good models.
  2. No: If you are not going to abandon this model completely, then create a new one where you remove all the most “important” inputs, because that’s mainly where your model derives its poor signals from. Replace them with something else, or simply drop them.

Remember that “importance” is not a concept which carries over between models. If feature A has an importance of 100% for a given model, it does not mean that this feature will be anywhere near as important for another model. Be very careful when comparing feature importance between models. Here are a few examples:

  1. Feature importance has no relation to the quality of signals. What an important feature means for your trading can point in opposite directions, depending on whether the model gives you good signals or bad ones. See the typical use cases above for an example.
  2. Feature importance can’t be compared between different model types. Feature A being important for a KNN tells you nothing about a Random Forest.
  3. Across different configurations of the same model type (e.g., 2 KNNs, one with N=5 and the other with N=10), feature importance can be compared with great caution, and only in a broad sense. If you see a bunch of different KNNs paying attention to RSI14 and providing signals which are not bad, then you might assume you want more RSI in the ballpark of 14 for KNNs on a given market.

Backtest your model on unseen data

During the machine learning process, we separate your data set into 2 pieces.

  1. Training data set. That’s what the model is trying to curve-fit itself to.
  2. Test data set. That’s where we assess the quality of models and collect all these “Confidence vs Win%” data points.

In the world of machine learning, this is called hold-out validation. The general idea is that you can’t trust the metrics your model produces on its training data set, because the whole purpose of training is to make it predict the signals we want as well as it can. That’s very close to what traders call curve fitting. Model results computed on the training data set are biased.
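
As a minimal illustration of the hold-out idea on market data, here is a chronological 80%/20% split on a dummy data frame (the platform handles this for you; the sketch only shows the principle):

```python
# Market data should not be shuffled: the test set is simply the later slice
# of the data the model never saw during training.
import numpy as np
import pandas as pd

def chronological_split(df: pd.DataFrame, test_fraction: float = 0.2):
    split = int(len(df) * (1 - test_fraction))
    return df.iloc[:split], df.iloc[split:]   # earlier part for training, later part for testing

# Dummy frame standing in for indicator values plus a 0/1 signal label, sorted by time.
df = pd.DataFrame({"RSI14": np.random.uniform(0, 100, 100),
                   "signal": np.random.randint(0, 2, 100)})
train_df, test_df = chronological_split(df)
print(len(train_df), len(test_df))            # 80 20
```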

Trusting the metrics a model produces on the test data set is better, because the model did not use this data to learn. However, if you’re using the “automated model optimization” option (and you should!), then despite not being used for training, the test data set still produces biased results. That’s because the test data set acts as an arbiter in the overall process of discovering the best model, and ultimately “results on the test data set” is what defines the fittest model.

That’s why you should backtest your AI models using the Strategy Tester. While doing so, make sure you backtest on data which was not part of either the training or the test data set. For example, if you select “June 2020 to June 2024” when training a model, then you can backtest it either on data prior to June 2020 or on data after June 2024. Or on a completely different symbol, in which case it might be alright to overlap with “June 2020 to June 2024”.
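
As a small illustration of this rule, here is a hypothetical helper that checks whether a backtest window overlaps the training/test window, reusing the dates from the example:

```python
# The backtest window on the same symbol must not overlap the window the model
# was trained and tested on.
from datetime import date

def windows_overlap(start_a: date, end_a: date, start_b: date, end_b: date) -> bool:
    return start_a <= end_b and start_b <= end_a

model_window = (date(2020, 6, 1), date(2024, 6, 1))       # "June 2020 to June 2024"
backtest_window = (date(2024, 6, 2), date(2024, 11, 1))   # data strictly after June 2024

assert not windows_overlap(*model_window, *backtest_window), \
    "Backtest data overlaps the training/test period; the results would be biased."
print("Backtest window is clean.")
```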

The first step in backtesting your AI model is to deploy the model and run the strategy which was automatically created during deployment. This strategy already has all the parameters pre-set to mirror what the model was trained for. However, that’s not all you can do. There are other options too; remember that you can basically use an AI model anywhere you can use an SMA. Here are a few interesting examples:

  1. Backtest “enter when signal emerges and exit at SL/TP”. That’s the default option.
  2. Design your model with tight SL/TP and then backtest “enter when there’s a signal and exit once the signal disappears”. Add a stop loss or a trailing stop to control your DD.
  3. Backtest a strategy which uses AI model signals from one symbol to trade on another symbol.
  4. Accompany your AI model signals with algorithmic conditions if you want to weed out bad entries of a certain kind and you can see clearly how to do so. If that’s doable, then consider transforming these algo conditions into AI model inputs, too; maybe that will give you even better results (a sketch of this idea follows the list).
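
As an illustration of the last option, here is a sketch of filtering AI signals with a hypothetical “price above its 200-bar SMA” condition on synthetic data; in practice you would express the condition in the Strategy Tester.

```python
# Keep a model signal only when an extra algorithmic condition also holds.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "close": 100 * np.cumprod(1 + rng.normal(0, 0.01, 1000)),   # synthetic prices
    "ai_signal": rng.uniform(0, 1, 1000) > 0.97,                # stand-in for the model's signals
})
df["sma200"] = df["close"].rolling(200).mean()

# Only take AI signals that also satisfy the algorithmic condition.
df["filtered_signal"] = df["ai_signal"] & (df["close"] > df["sma200"])
print(int(df["ai_signal"].sum()), "raw signals ->", int(df["filtered_signal"].sum()), "after the filter")
# If the filter clearly helps, consider adding "close vs SMA200" as a model input as well.
```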
