How to win Hattrick matches using data science techniques

Mar 12, 2021

Hattrick is a soccer-based strategy game in which you build and manage your team over the long term, playing against other human users in a competitive environment.

The variables you control when you play Hattrick are:
- The line-up for every match (the players and their positions, orders, tactics, even your set-piece taker)
- Buying and selling players on the transfer market
- Which skill your players train
- Your stadium, your staff, your trainer and your youth squad, among others

The match engine is based on probabilities that determine which of the two teams gets each chance to score. The attack is then compared with the defence to decide whether the chance is converted into a goal or not.

Of course, this explanation is an oversimplification: many more variables influence the outcome of a match or of a single chance, such as special events, injuries, red cards, events caused by lack of experience, etc.

At the end of this article, the following four questions will be answered:

Is there a correlation (linear or non-linear) between a team’s ratings and the possible outcome of the match?
Which variable has the biggest correlation with the outcome?
Could domain-knowledge be used to obtain variables more correlated with the outcome of the match?
Is there a specific tactic more likely to obtain victories? Is there a specific tactic more likely to obtain defeats?

The system

Hattrick’s official matches are played twice a week. For Colombia, the days are Sundays and Wednesdays.

First, I would like to describe the league system, which is important for understanding the rest of the Hattrick environment:

Hattrick’s league system is pyramidal, which means that a growing number of teams compete for places in the upper divisions.

In every division below the first, the best team is either promoted directly or plays a promotion match against one of the four worst teams in the division above. In that way, and through training, Hattrick teams advance in the league system. The best teams are decided after 14 weekly matches, in which every team faces 7 other teams in simulated but narrated matches.

The odds of a match

The details are far more complex than these few lines can convey, but the process can be summarized in three steps:

The battle for the midfield, in which the ratio between both teams’ midfield ratings decides who gets the chance to score a goal. Of the eleven skills a player can have, playmaking is the one that matters most for the level of your team’s midfield.

Once one of the two midfields has won the next goal opportunity, the zone of the attack is decided: it can go to the right, the middle or the left, or it can be a free kick or a penalty kick.

The attack from that area is then compared with the opposing team’s defence for the same area. If the attack on the selected side is sufficiently better than the defence, a goal is scored.

The full cycle is repeated several times in a match, producing several chances that may or may not be converted into goals. This is the engine on which the game runs.
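The three steps above can be sketched as a toy simulation. This is only an illustration under loud assumptions: the real engine's formulas are not given in this article, so the chance-assignment probability (the home share of midfield), the uniform choice among three zones, the goal probability attack / (attack + defence) and the number of chances per match are all made up for the example. Set pieces are ignored and every name in the code is hypothetical.

```python
import random

ZONES = ("right", "middle", "left")


def simulate_match(home, away, n_chances=10, rng=None):
    """Toy match: returns (home_goals, away_goals).

    `home` and `away` are dicts with a "midfield" rating plus "attack"
    and "defence" dicts keyed by zone. Ratings must be positive.
    Every probability below is an illustrative assumption,
    not Hattrick's real formula.
    """
    rng = rng or random.Random()
    goals = {"home": 0, "away": 0}
    for _ in range(n_chances):
        # Step 1: the midfield ratio decides who gets the chance.
        p_home = home["midfield"] / (home["midfield"] + away["midfield"])
        if rng.random() < p_home:
            side, attacker, defender = "home", home, away
        else:
            side, attacker, defender = "away", away, home
        # Step 2: pick the attack zone (uniform here, set pieces ignored).
        zone = rng.choice(ZONES)
        # Step 3: compare the attack with the defence for that zone.
        attack = attacker["attack"][zone]
        defence = defender["defence"][zone]
        if rng.random() < attack / (attack + defence):
            goals[side] += 1
    return goals["home"], goals["away"]


def make_team(midfield, attack, defence):
    """Toy helper: a team with the same rating in every zone."""
    return {
        "midfield": midfield,
        "attack": {zone: attack for zone in ZONES},
        "defence": {zone: defence for zone in ZONES},
    }


def home_win_rate(home, away, n_matches=500, seed=0):
    """Fraction of simulated matches won by the home team."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_matches):
        home_goals, away_goals = simulate_match(home, away, rng=rng)
        wins += home_goals > away_goals
    return wins / n_matches
```

Calling `home_win_rate(make_team(80, 80, 80), make_team(20, 20, 20))` lets you see how a large ratings gap turns into a high share of wins under these toy rules. The point is the shape of the cycle, not the exact numbers.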

Match after match, the same engine runs and ultimately decides league championships, national cup championships and national team matches.

The numbers that indicate the strength of each zone of the field (right defence, middle defence, left defence, midfield, right attack, middle attack, left attack) in a match are called ratings. Since the ratios between opposing ratings are the cornerstone of a Hattrick match, we can use those numbers as predictors of a match outcome before it is played.

It is also important to note that both users who will play the match have access to a preview of the ratings before the match is played. This symmetry of information also makes Hattrick a kind of psychological game.

Using data science

This phenomenon (the match) is what we analyze with the help of data science: the probability that a team wins a match given the average ratings reported after the match. The ratings after the match serve as a proxy for the ratings before the match.

In data science terms, one could think that there is nothing to predict, since we observe the ratings and odds after the matches are played. However, the ratings are available from the moment the manager prepares the match: a built-in rating calculator and comparator lets users visualize the ratios between zones before the match is played.

During and after the match the ratings are public. Before it they are not public, but the opposing team can approximate them from previous matches.

As explained before, this is a simplification of the underlying phenomenon, because several things escape what the ratings account for: special events (events generated and scored by players with special characteristics), tiredness of old players, tactical choices of both teams, and substitutions. In data science terms, these are the events most likely to escape the predictions.

The dataset

With the help of a web-scraping algorithm designed by the author of this article, 6193 matches, with their reported ratings and the goals scored by both teams, were retrieved into a table and uploaded to kaggle.com so other users can run their own analyses.

https://www.kaggle.com/juandelacalle/hattirckorg-matches-dataset

Columns

The web-scraping algorithm went over 6193 different match IDs, changing the URL of the website for each one in order to retrieve certain information about the matches.

This information corresponds to:

Match ID
Home team’s midfield ratings: [0, 100+] (a scale from 0 to 100 or more; values greater than 100 are very difficult to achieve)
Away team’s midfield ratings: [0, 100+]
Home team’s right defence ratings: [0, 100+]
Away team’s right defence ratings: [0, 100+]
Home team’s middle defence ratings: [0, 100+]
Away team’s middle defence ratings: [0, 100+]
Home team’s left defence ratings: [0, 100+]
Away team’s left defence ratings: [0, 100+]
Home team’s right attack ratings: [0, 100+]
Away team’s right attack ratings: [0, 100+]
Home team’s middle attack ratings: [0, 100+]
Away team’s middle attack ratings: [0, 100+]
Home team’s left attack ratings: [0, 100+]
Away team’s left attack ratings: [0, 100+]
Home team’s defence to indirect free kicks: [0, 100+]
Away team’s defence to indirect free kicks: [0, 100+]
Home team’s attack with indirect free kicks: [0, 100+]
Away team’s attack with indirect free kicks: [0, 100+]
Home team’s attitude (the effort the manager tells his players to put into the match): “Play it cool”, “Play it normal” or “The match of the season”. The attitude affects the strength of a team’s midfield. This information is not available, because it is private to the team’s owner.
Away team’s attitude: “Play it cool”, “Play it normal” or “The match of the season”. This information is only available to the team’s owner.
Home team tactic name (Tactic is defined as the way the match is going to be played by the team): Pressing, Attack on sides, attack on middle, long shots, counter-attacks or play creatively.
Away team tactic name: Pressing, Attack on sides, attack on middle, long shots, counter-attacks or play creatively.
Home tactic level: The level of the tactic selected by the manager (depends on the skill of the players): [0, 100+]
Away tactic level: [0, 100+]
Home Goals: Number of goals scored by the home team
Away Goals: Number of goals scored by the away team
Home_is_winner (not included in the original dataset): This variable indicates whether the home team won its game. It is calculated as Home goals > Away goals, and the result is boolean.

As you might expect, the proposed target variable is a boolean that is true if the home team won and false otherwise.

y = {1, if the home team won | 0, otherwise}
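A minimal sketch of how this target could be built from the goal columns. The key names (`Home_Goals`, `Away_Goals`, `Match_ID`) and the sample scores are assumptions made for the illustration; the column names in the Kaggle table may differ.

```python
def add_target(matches):
    """Return copies of the match records with a Home_is_winner flag."""
    labelled = []
    for match in matches:
        row = dict(match)  # copy, so the input records stay untouched
        # Draws and away wins both count as "home team did not win".
        row["Home_is_winner"] = row["Home_Goals"] > row["Away_Goals"]
        labelled.append(row)
    return labelled


# Made-up scores, only to show the three possible outcomes.
sample = [
    {"Match_ID": 1, "Home_Goals": 3, "Away_Goals": 1},  # home win
    {"Match_ID": 2, "Home_Goals": 2, "Away_Goals": 2},  # draw
    {"Match_ID": 3, "Home_Goals": 0, "Away_Goals": 1},  # away win
]
labelled = add_target(sample)
y = [int(row["Home_is_winner"]) for row in labelled]
```

Note that with this definition a draw is labelled 0, the same as a home defeat.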

Data preparation

Data preparation is very simple, as most of the difficulties of data retrieval were dealt with in the web-scraping algorithm.

In some cases, one detail makes it difficult to read the information retrieved with the beautifulsoup package correctly. The algorithm expects the following pattern:

[Name of the Home Team] [Home Goals] — [Away Goals] [Name of the Away Team].

Sometimes the name of either team contains a “-”, which makes the code think the score separator has already been found when it has not.

The data preparation process only needs to eliminate these mistakes and create the boolean target variable. This process leaves us with 6017 matches to model.
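One way to reduce this kind of mistake is to require digits on both sides of the separator, so that a dash inside a team name is not taken for the score. The sketch below is not the scraper used for this article. The function name, the accepted dash characters and the team names in the usage note are assumptions made for the example.

```python
import re

# The separator only counts when it sits between two numbers.
# Hyphen, en dash and em dash are all accepted as the separator.
SCORE_LINE = re.compile(
    r"^(?P<home>.+?)\s+(?P<home_goals>\d+)\s*[-\u2013\u2014]\s*"
    r"(?P<away_goals>\d+)\s+(?P<away>.+)$"
)


def parse_score_line(text):
    """Return (home, home_goals, away_goals, away), or None if no match."""
    match = SCORE_LINE.match(text.strip())
    if match is None:
        return None  # the caller can drop or log this row
    return (
        match["home"],
        int(match["home_goals"]),
        int(match["away_goals"]),
        match["away"],
    )
```

For instance, `parse_score_line("Real-Sociedad B 3 - 1 Atletico-Norte")` keeps both hyphenated (made-up) names whole. A name that itself contains a "digit, dash, digit" sequence would still be ambiguous, so dropping rows that fail a sanity check remains a reasonable fallback.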

Once the dataset is obtained, we can begin answering the questions:

Is there a correlation (linear or non-linear) between a team’s ratings and the possible outcome of the match?

As can be seen in the graph, there is a strong non-linear relationship between some of the variables and the outcome, specifically Home_Central_Defence and both the Home and Away Midfield.

Mutual Information of all variables vs the outcome
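The article does not show how the Mutual Information scores were computed. As a rough illustration of what the measure captures, here is a simple histogram-based (plug-in) estimate in plain Python. It is a swapped-in, simpler technique than the nearest-neighbour estimators in scikit-learn, so its values will not match the graph. The function name and the number of bins are assumptions.

```python
import math
from collections import Counter


def mutual_information(xs, ys, bins=10):
    """Plug-in estimate (in nats) of I(X; Y).

    xs: numeric feature values, discretized into equal-width bins.
    ys: discrete labels (for example 0/1 for Home_is_winner).
    """
    n = len(xs)
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins

    def bin_of(x):
        if width == 0:
            return 0  # constant feature: everything falls in one bin
        return min(int((x - lo) / width), bins - 1)

    x_bins = [bin_of(x) for x in xs]
    joint = Counter(zip(x_bins, ys))
    count_x = Counter(x_bins)
    count_y = Counter(ys)

    mi = 0.0
    for (x_bin, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[x_bin] * count_y[y]))
    return mi
```

A feature that tells you nothing about the outcome scores close to 0, while a feature that fully determines a balanced win/no-win label scores log(2), about 0.69 nats. Unlike a linear correlation coefficient, the score does not care whether the relationship is linear.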

Which variable has the biggest correlation with the outcome?

Histogram of Home Central Defence in matches where the Home team loses

Histogram of Home Central Defence in matches where the Home team wins

As seen in the Mutual Information graph, the Home Central Defence is the raw variable most related to the outcome.

In the histograms shown above, it can be seen that when the Home team’s central defence is low, the team is more likely to lose the match.
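The counts behind two histograms like these can be obtained by binning the rating separately for each outcome. This is a minimal sketch. The bin width of 10 and the function name are assumptions, and the numbers in the usage note are made up.

```python
from collections import Counter


def histogram_by_outcome(values, outcomes, bin_width=10):
    """Count values per bin, separately for home wins and non-wins.

    Returns {True: Counter, False: Counter}, where each Counter maps the
    lower edge of a bin to the number of matches that fall in it.
    """
    hist = {True: Counter(), False: Counter()}
    for value, won in zip(values, outcomes):
        bin_start = int(value // bin_width) * bin_width
        hist[bool(won)][bin_start] += 1
    return hist
```

For example, `histogram_by_outcome([12, 18, 25, 47, 52, 58], [0, 0, 0, 1, 1, 1])` puts the three low ratings under `False` and the three high ones under `True`, which is the pattern the two histograms show.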

Could domain-knowledge be used to obtain variables more correlated with the outcome of the match?

Yes. To show this, we create a feature defined as the home team’s share of the combined Home and Away midfield ratings, then compare it with both raw variables to check which one reports more Mutual Information.

Midfield_ratio = Home_midfield/(Home_midfield+Away_midfield)

Comparison between midfield ratio and both components of midfield ratio separately

As expected, the ratio of Midfields (a variable obtained from domain-knowledge) outperforms any of the raw variables available from the dataset, including Home_Central_Defence which was the previous best.
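The Midfield_ratio feature is a one-line computation. The sketch below adds one guard that is my own assumption, not part of the original definition: when both midfield ratings are zero, the ratio is set to 0.5.

```python
def midfield_ratio(home_midfield, away_midfield):
    """Home team's share of the combined midfield ratings, in [0, 1]."""
    total = home_midfield + away_midfield
    if total == 0:
        # Assumption: treat two zero-rated midfields as evenly matched.
        return 0.5
    return home_midfield / total
```

A useful property of this form is that it is bounded and symmetric: `midfield_ratio(a, b) + midfield_ratio(b, a)` equals 1, so a single number describes the midfield battle from both sides.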

Is there a specific tactic more likely to obtain victories? Is there a specific tactic more likely to obtain defeats?

Yes, it looks like some tactics are more prone to wins and others are more prone to defeats.

Attack in Middle looks very effective at obtaining victories, while Normal (no tactic) looks ineffective.

Probability of a victory by Tactic of the Home team
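A probability per tactic like the one in this chart can be estimated as the fraction of home wins within each group of matches. The sketch below assumes hypothetical key names (`Home_Tactic`, `Home_Goals`, `Away_Goals`) and uses made-up rows. It also returns the number of matches per tactic, since a rate based on very few matches is not reliable.

```python
from collections import defaultdict


def win_rate_by_tactic(matches):
    """Map each home tactic to (home win rate, number of matches)."""
    wins = defaultdict(int)
    played = defaultdict(int)
    for match in matches:
        tactic = match["Home_Tactic"]
        played[tactic] += 1
        wins[tactic] += int(match["Home_Goals"] > match["Away_Goals"])
    return {tactic: (wins[tactic] / played[tactic], played[tactic])
            for tactic in played}


# Made-up rows, only to show the shape of the result.
toy_matches = [
    {"Home_Tactic": "Attack in Middle", "Home_Goals": 2, "Away_Goals": 0},
    {"Home_Tactic": "Attack in Middle", "Home_Goals": 1, "Away_Goals": 1},
    {"Home_Tactic": "Pressing", "Home_Goals": 3, "Away_Goals": 1},
    {"Home_Tactic": "Normal", "Home_Goals": 0, "Away_Goals": 2},
]
rates = win_rate_by_tactic(toy_matches)
```

Keep in mind that a rate like this describes the matches in the sample. Stronger teams may also be the ones that choose certain tactics, so it does not by itself show that the tactic causes the wins.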

Bibliography

Hattrick: the football manager game | Join a football world for free. www.hattrick.org

Learn Feature Engineering Tutorials. www.kaggle.com

sklearn.feature_selection.mutual_info_regression, scikit-learn 0.24.1 documentation. scikit-learn.org