A data-driven way of identifying tactical styles of soccer teams

FS
5 min read · Jun 21, 2020

The problem

Characterizing the tactical style of teams is a crucial task of modern football analysis. However, media coverage often relies on narratives that are poorly suited to a proper understanding of a team’s tactics.

For example, Barcelona will always be described as a team focused on short passes and possession, regardless of how they actually play. Similarly, teams are often classified simply by the country they play in (English teams are physical, Spanish teams playful and technically talented, etc.).

These narratives are easy to understand or are based on a long history. But are they adequate?

At the other extreme, video analysis takes a huge amount of time to identify patterns and properly assess the style of a team.

Hence, it would be beneficial to have a method of characterizing team styles that is cheap in terms of time and delivers reasonably useful hypotheses for a subsequent detailed video analysis.

Modern football analysis has developed many metrics and statistics which are collected and, increasingly, published. Thus, we want to build such a method on data and some statistical modelling.

In this post I want to present a technique for identifying teams which are similar in style. I posted about this method on Twitter some time ago, but applied it to players. Independently of me, others have developed very similar techniques; one example is linked below:

However, the method can be used almost identically at team level. One of the main challenges of such a method is to separate style from quality.

The result

Let us demonstrate the method with an example. I collected team statistics from Fbref for the top 5 European leagues. The plot below shows a network which connects each team with its two closest neighbors with respect to team style.

Team style network for season of 2019/20 (so far).

This graph shows each team as a node connected to the teams it is closest to in terms of style. The color of a connection encodes the level of similarity (the brighter, the more similar). The color of a node encodes the cluster it is assigned to.

It is interesting to observe that teams from the same country often end up in the same cluster. Furthermore, very good teams are mostly connected to other elite teams. As the quality of a team is implicitly encoded in the way it plays (good teams generally play more offensively than bad teams), it is almost impossible to remove this effect completely. I think one should be very careful when interpreting the result for the European top teams, or maybe redo the analysis with Champions League data. Still, some interesting observations can be made: Real Madrid, for example, seem to play very similarly to the top Italian teams.

However, I do not want to interpret the result too much as I did not tune the feature selection very well. My intention was rather a quick demonstration of the methodology.

An overview of every category that I used and the standardized score of each team is visualized here:

Visualization of the feature space.

The method

I want to briefly describe each step from the raw data to the network plot above. Technical details and further discussion can be found in the PDF in the repository linked below. As some mathematical/statistical tools such as distance metrics and clustering techniques are involved, I recommend (from my own experience when developing this) reading up on the details in order to understand them properly.

  1. Collect statistics for each team of the Top 5 leagues from Fbref.com
  2. Select an adequate subset of the statistics for analyzing team style (rather than quality). In case you immediately thought “But what does adequate mean?”, see the disclaimer below.
  3. Standardize each statistic to zero-mean and unit-variance and interpret the list of statistics of each team as a vector.
  4. Compute pairwise distances of these vectors.
  5. Construct a network by connecting each team to its k-nearest neighbors with respect to the distance matrix from above.
  6. Cluster the teams using DBSCAN with respect to the distance matrix.
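
Steps 3 to 5 above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code from the repository: the team names and numbers are made up (they are not Fbref values), and the helper names `standardize`, `cosine_distance`, `distance_matrix` and `knn_edges` are hypothetical. It also assumes that no statistic is constant across teams, since the standard deviation would then be zero.

```python
import math
import statistics

# Hypothetical toy data: three made-up statistics per team (not real Fbref values).
raw = {
    "Team A": [620.0, 14.0, 55.0],
    "Team B": [610.0, 15.0, 54.0],
    "Team C": [380.0, 25.0, 42.0],
    "Team D": [400.0, 24.0, 45.0],
}

def standardize(table):
    """Step 3: scale every statistic (column) to zero mean and unit variance."""
    teams = list(table)
    columns = list(zip(*(table[t] for t in teams)))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # assumed to be non-zero
    return {
        t: [(x - m) / s for x, m, s in zip(table[t], means, stds)]
        for t in teams
    }

def cosine_distance(u, v):
    """1 - cos(angle between u and v); the lengths of u and v cancel out."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(vectors):
    """Step 4: pairwise distances between all team vectors."""
    teams = list(vectors)
    return {
        a: {b: cosine_distance(vectors[a], vectors[b]) for b in teams}
        for a in teams
    }

def knn_edges(dist, k=2):
    """Step 5: connect every team to its k nearest neighbors (undirected edges)."""
    edges = set()
    for team, row in dist.items():
        others = sorted((o for o in row if o != team), key=row.get)
        for other in others[:k]:
            edges.add(tuple(sorted((team, other))))
    return edges

z = standardize(raw)
dist = distance_matrix(z)
edges = knn_edges(dist, k=2)
for a, b in sorted(edges):
    print(f"{a} - {b}: distance {dist[a][b]:.3f}")
```

The resulting edge set is what a graph library would then draw as the network, with the distance of each edge mapped to its color.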

As the distance measure, I use the cosine distance: it does not take the lengths of the vectors into account and is thus able to reduce the impact of differing team quality. I mainly used DBSCAN because it only needs a precomputed distance matrix.
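
To make both points in the paragraph above concrete, here is a small self-contained sketch. It first shows that stretching a vector changes its Euclidean distance but not its cosine distance, and then runs a bare-bones DBSCAN that only ever looks at a distance matrix. The vectors, the `eps` and `min_samples` values and the function names are assumptions chosen for illustration; they are not the settings or the code from the repository.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

NOISE = -1

def dbscan(dist, eps, min_samples):
    """Minimal DBSCAN on a precomputed distance matrix (list of lists).

    Returns one label per point: 0, 1, ... for clusters, -1 for noise.
    """
    n = len(dist)
    # The eps-neighborhood of a point includes the point itself.
    neighborhoods = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighborhoods[i]) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighborhoods[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighborhoods[j]) >= min_samples:
                queue.extend(neighborhoods[j])  # core point: keep expanding
    return labels

# Hypothetical standardized team vectors (made up for this example).
a = [1.0, -1.0, 0.5]
a_scaled = [2.0, -2.0, 1.0]  # same direction as a, twice the length
vectors = [a, a_scaled, [0.9, -1.1, 0.6], [-1.0, 1.0, -0.4],
           [-1.2, 0.9, -0.5], [0.1, 0.2, -1.0]]

print("euclidean:", math.dist(a, a_scaled))       # non-zero
print("cosine:   ", cosine_distance(a, a_scaled))  # (numerically) zero

dist = [[cosine_distance(u, v) for v in vectors] for u in vectors]
labels = dbscan(dist, eps=0.1, min_samples=2)
print(labels)
```

If scikit-learn is available, the same idea is usually written as `DBSCAN(eps=..., min_samples=..., metric="precomputed").fit(dist)`, which likewise works directly on the distance matrix.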

More details can be found, again, in the documentation of the repository linked below. It also contains the code for reproducing the results (the data is not included, but can easily be downloaded from Fbref).

Disclaimer: I want to mention that, as in many statistical problems, the selection of features has a huge impact on the outcome. I did not spend much time on constructing features but rather crudely picked out some features from Fbref that, in my opinion, are adequate for analyzing the tactical style of a team. I am aware that for some features a normalization could be debated, or that new features (e.g. ratios) could be constructed from the raw data.
This post is rather meant to present the main idea and an application of the methodology.

Conclusion

Tactical styles in football are hard to identify. However, with data collection growing, we can think of new ways of approaching this task in order to get basic insights quickly and generate hypotheses for video analysis.
The methodology I presented does not claim to be flawless, nor has it been tested in any production environment. However, it produces reasonable results without too much effort on feature selection (at player level as well as at team level).

Note: If you want to use my code and run into any issues, let me know.

FS

Interested in football, mainly analytics and tactics.