Capturing context and space in football match analysis

Football has traditionally been a sport of few statistics, and most of those that were available, such as shots, fouls, and corner kicks, were neutral summaries.  (Aside from the obvious positive statistic of goals, of course.)  Thanks to companies such as Opta, Prozone, and Match Analysis, football clubs, football leagues, and media organizations are now awash in large volumes of fine-grained data that describe every event that occurs on a football pitch.  End-users now know who initiated a specific play in a match, the match time at which it occurred, where on the pitch it occurred, and additional event-specific information.  In the case of Prozone, in addition to the above play-by-play data, every movement of the 22 players, the ball, and the referees can be tracked with high-resolution cameras at rates of up to 100 frames per second.

Despite the leap in technology over the last 15 years that provides the football industry with a rich set of match data, team and player analysis remains relatively unchanged. Such analysis is little more than tabulations of events that occur on the pitch within a period of time.  A common example is the image of the football pitch gridded in zones with team passing percentages overlaid on it.  Another example is a cloud of the spatial location of player touches, movements, or passes.  If we animate the cloud over the course of the match, we can call that advanced match analysis!

Nevertheless, such ‘simple’ analysis does retain some advantages.  First, these analyses are easy to compile.  The most complicated procedure is the filtering of match events by player, zone, or time, for either a single match or multiple matches over one or several seasons.  These functions are standard features in the analysis packages supplied by the sports data companies.  Second, these analyses are easy to understand, whether by a decision maker such as a manager or a sporting director, or by the general public.  This second advantage is quite significant and has implications, which I will return to later, for those who want to develop more sophisticated analyses.

Yet this type of analysis has one critical flaw: it neglects contextual and spatial information associated with the match data and produces a misleading picture of team and player performance as a result.  To illustrate this flaw, let’s discuss the anatomy of a pass.

A football pass, to give a very precise description, is an attempt to intentionally redirect the path of the ball from one player to a teammate using the feet, head, or chest.  That event is conditional on a number of parameters, including:

  •     the player who initiates the event
  •     the match time
  •     the player’s spatial position on the pitch
  •     the relative location of any opposing players
  •     the body part used to strike the ball
  •     the state of play (open or set-piece)
  •     the final spatial position of the ball when touched

The final outcome of a pass is either success (retained team possession) or failure (loss of possession).  Some of these parameters, such as the player, the body part, and the state of play, have a limited number of values.  One can only attempt a legal pass with either foot, the head, or the chest, for example.  But the inclusion of a spatial coordinate and time results in an infinite number of possible conditions and situations for a pass.
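
To make the anatomy of a pass concrete, the parameters listed above can be written down as a single record.  The following Python sketch is only an illustration: the class name, field names, units (seconds and metres), and the example values are all assumptions of mine, not the schema of any data provider.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

Point = Tuple[float, float]  # (x, y) pitch coordinates, assumed to be in metres


class BodyPart(Enum):
    LEFT_FOOT = "left_foot"
    RIGHT_FOOT = "right_foot"
    HEAD = "head"
    CHEST = "chest"


class PlayState(Enum):
    OPEN = "open"
    SET_PIECE = "set_piece"


@dataclass(frozen=True)
class PassEvent:
    """One pass, described by the parameters listed in the text (hypothetical schema)."""
    player: str                    # player who initiates the event
    match_time_s: float            # match time, assumed to be seconds from kickoff
    origin: Point                  # passer's position on the pitch
    opponents: Tuple[Point, ...]   # positions of opposing players at that moment
    body_part: BodyPart            # body part used to strike the ball
    play_state: PlayState          # open play or set piece
    destination: Point             # position of the ball when next touched
    success: bool                  # True if team possession was retained

    def length(self) -> float:
        """Straight-line distance travelled by the ball."""
        return math.hypot(self.destination[0] - self.origin[0],
                          self.destination[1] - self.origin[1])

    def nearest_opponent_distance(self) -> float:
        """Distance from the passer to the closest opponent (inf if none recorded)."""
        return min((math.hypot(ox - self.origin[0], oy - self.origin[1])
                    for ox, oy in self.opponents), default=float("inf"))


# Made-up example values, for illustration only.
example_pass = PassEvent(
    player="Player 8", match_time_s=754.0, origin=(30.0, 20.0),
    opponents=((30.0, 23.0), (40.0, 20.0)), body_part=BodyPart.RIGHT_FOOT,
    play_state=PlayState.OPEN, destination=(33.0, 24.0), success=True,
)
```

The discrete fields (player, body part, state of play) take a handful of values, while the time and coordinate fields are continuous, which is where the unlimited variety of pass situations comes from.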

Yet the extent of current match analysis is a summary of pass events, or other field events, as if all of those events were performed under identical conditions and situations.  If we accept that a pass is a function of spatial, temporal, and contextual parameters, why do we continue to insist that a tabulation of all field events tells us anything meaningful?
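
A small numerical illustration shows how a plain tabulation can mislead.  The Python sketch below uses entirely invented pass counts for two hypothetical players, A and B, and splits the pitch into only two zones; it is meant to show the effect, not to describe any real data.

```python
from collections import defaultdict

# Invented records: (player, zone of the pitch, pass completed?)
passes = (
    [("A", "defensive", True)] * 45 + [("A", "defensive", False)] * 5
    + [("A", "attacking", True)] * 5 + [("A", "attacking", False)] * 5
    + [("B", "defensive", True)] * 19 + [("B", "defensive", False)] * 1
    + [("B", "attacking", True)] * 24 + [("B", "attacking", False)] * 16
)


def completion_rate(records, key):
    """Completion percentage grouped by whatever `key` extracts from a record."""
    made = defaultdict(int)
    tried = defaultdict(int)
    for rec in records:
        group = key(rec)
        tried[group] += 1
        made[group] += rec[2]  # True counts as 1, False as 0
    return {group: made[group] / tried[group] for group in tried}


# The usual tabulation: one number per player.
overall = completion_rate(passes, key=lambda rec: rec[0])

# The same passes, conditioned on where they were attempted.
by_zone = completion_rate(passes, key=lambda rec: (rec[0], rec[1]))

for player in ("A", "B"):
    print(player, "overall:", round(overall[player], 3),
          "defensive:", round(by_zone[(player, "defensive")], 3),
          "attacking:", round(by_zone[(player, "attacking")], 3))
```

With these made-up numbers, player A has the higher overall completion rate, yet player B completes a larger share of passes in both zones.  A's total is flattered by attempting most passes where they are easiest, which is exactly the information a summary table throws away.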

It is much more challenging to fully incorporate contextual and spatial data in an analysis than it is to develop a map of passing summaries and positional clouds.  Even though some football clubs and sports data companies are starting to hire analytical talent to extract more information from their data, analysis that makes full use of context and space remains beyond the expertise available within these organizations.  The conventional wisdom holds that football is a sport of continuous play between players who act cooperatively and interdependently, and as such is much more resistant to statistical analysis than other sports such as baseball.  Yet there are complex systems in nature, society, and technology with a dynamic structure similar to football that have been analyzed extensively, and from which meaningful insight has been extracted.

Networked systems are characterized by a large number of individual actors who interact through either communication or physical action.  An individual unit may perform a simple action, but many such units interacting cooperatively or competitively produce a rich and complex set of behaviors.  Examples of such systems abound in the natural and physical world.  The field of systems biology is devoted to studying the complex functions and behaviors of cells, which involve complex signaling networks between proteins and genes.  Communication is practically synonymous with a network, whether wired or wireless, static or mobile, or transmitting or sensing.  A third network that has attracted increased interest from researchers and the general public is the social network that describes the interactions and influences of people.  All of these systems involve continuous or semi-continuous flows of information between actors, who may be modeled individually with physical laws or statistical models but interact with other actors in complex ways.  Football, like other field invasion sports, is no different from these networks.

Networked systems, whether from biology, sociology, communication, or sport, are analyzed using statistical network models.  These models have been studied for over 50 years by sociologists and statisticians, but in the last 20 years there has been increased research from the physics, computer science, and broader mathematics communities.  Researchers describe a network in terms of a collection of nodes (actors) and edges (relationships).  The edges are then given weights that correlate with some measure of importance.  The importance of a node is estimated by examining the number of links between that node and its neighbors and the importance of those links in the network.
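
The node-and-edge description above translates directly into a small data structure.  The Python sketch below is a minimal illustration with made-up nodes and weights; it scores each node by its number of links (degree) and by the summed weight of those links (often called strength), which is only one of the simplest importance measures among many.

```python
from collections import defaultdict

# Directed, weighted edges: (source node, target node) -> weight.
# Nodes and weights are invented for illustration.
edges = {
    ("A", "B"): 3.0,
    ("B", "A"): 1.0,
    ("B", "C"): 2.0,
    ("C", "A"): 4.0,
}


def node_importance(edges):
    """Return {node: (degree, strength)} counting both incoming and outgoing links."""
    degree = defaultdict(int)
    strength = defaultdict(float)
    for (src, dst), weight in edges.items():
        for node in (src, dst):
            degree[node] += 1         # number of links touching the node
            strength[node] += weight  # importance carried by those links
    return {node: (degree[node], strength[node]) for node in degree}


importance = node_importance(edges)
for node, (deg, strg) in sorted(importance.items()):
    print(node, "links:", deg, "total weight:", strg)
```

Nodes B and C end up with the same total weight here even though B has more links, which is why the two quantities are usually examined together.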

We relate network models to football by viewing players as the actors and their passes and other interactions as relationships. The common approach by researchers is to create a matrix of the passes made by one player to another and create a map that compares the influence, or centrality, of a player on the run of play.  Such maps can also be used to determine the zones or channels through which a team’s play develops.  While these maps represent a significant leap forward in football analysis, they still don’t make full use of match contexts or spatial data.  Moreover, there is much talk about influence, but not enough, in my opinion, about effectiveness or value.  One value measurement that I would like to see is the idea of an “expected goal value” — the probability that a specific action will create a goal, for any action on the field during a game. Such a value can be made a function of spatial and contextual variables and updated through a statistical network model. This metric would turn out to be quite powerful as it would enable new levels of analysis on the effectiveness of players, tactics, set-piece plays and playing formations.
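
One possible way to turn a passing network into an expected goal value is to treat possession as a chain of moves between players that ends in either a goal or a loss of the ball.  The Python sketch below follows that idea under strong assumptions: the three players, all of the probabilities, and the function name are hypothetical, and the model ignores space, time, and opponents entirely.  It is a sketch of the concept, not a definitive formulation.

```python
# Hypothetical probabilities that a player's next action is a pass to a teammate.
pass_prob = {
    "DEF": {"MID": 0.70, "FWD": 0.10},
    "MID": {"DEF": 0.20, "FWD": 0.50},
    "FWD": {"MID": 0.20},
}

# Hypothetical probabilities that a player's next action produces a goal.
# Whatever probability is left over for a player is treated as loss of possession.
goal_prob = {"DEF": 0.00, "MID": 0.01, "FWD": 0.10}


def expected_goal_value(pass_prob, goal_prob, tol=1e-12, max_iter=10_000):
    """Probability that a possession currently held by each player ends in a goal.

    Repeatedly applies value[s] = goal_prob[s] + sum_t pass_prob[s][t] * value[t]
    until the values stop changing (or max_iter is reached).
    """
    value = {player: 0.0 for player in goal_prob}
    for _ in range(max_iter):
        updated = {
            player: goal_prob[player]
            + sum(p * value[target]
                  for target, p in pass_prob.get(player, {}).items())
            for player in goal_prob
        }
        if max(abs(updated[p] - value[p]) for p in updated) < tol:
            return updated
        value = updated
    return value


egv = expected_goal_value(pass_prob, goal_prob)
for player, v in egv.items():
    print(player, round(v, 4))
```

In a fuller model the states would be combinations of player, pitch location, and match context rather than players alone, and the probabilities would be estimated from match data, so that the value of an action is the change in this quantity that the action produces.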

While statistical network modeling presents new avenues for match analysis, there are some research and presentation issues that need to be addressed.  The first is the modeling structure: are the results capturing true differences in performance, or are they capturing artifacts of the algorithm?  In professional sport, the differences between players, and especially among elite players, are not very great.  The second issue is initialization.  Statistical network models often require an initialization of shot or passing probabilities, which can either be derived from real data or estimated from a statistical approximation.  The third issue, and perhaps the most critical one, is presentation: results have to be meaningful and understandable to be accepted and then adopted by a decision maker.  The ability to communicate the meaning will improve as researchers better understand the implications of the network analysis, but proper explanation of the analysis will be a significant task.
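
On the initialization issue, one simple way to derive starting probabilities from real data is to normalize observed pass counts and add a small pseudo-count so that combinations never seen in the sample do not start at zero.  The Python sketch below assumes this additive smoothing approach; the counts, the player labels, and the `alpha` parameter are invented for illustration, and the result is the probability of each recipient given that a pass is made.

```python
def smoothed_pass_probabilities(counts, players, alpha=1.0):
    """Initial pass probabilities from observed counts with additive smoothing.

    counts  : {(passer, receiver): number of observed passes}
    players : all players in the network
    alpha   : pseudo-count added to every passer-receiver pair (assumed value)
    """
    probs = {}
    for passer in players:
        receivers = [p for p in players if p != passer]
        total = (sum(counts.get((passer, r), 0) for r in receivers)
                 + alpha * len(receivers))
        probs[passer] = {
            r: (counts.get((passer, r), 0) + alpha) / total for r in receivers
        }
    return probs


# Invented counts from a small sample; player C has no recorded passes at all.
counts = {("A", "B"): 8, ("A", "C"): 0, ("B", "A"): 3}
initial = smoothed_pass_probabilities(counts, players=["A", "B", "C"])
for passer, row in initial.items():
    print(passer, {r: round(p, 3) for r, p in row.items()})
```

A player with no data starts from a uniform guess, while a player with many observations stays close to the observed frequencies, so the influence of the pseudo-count fades as more matches are added.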

Football is a game of simple rules that permits a rich set of plays and interactions between players, which has made it fascinating to watch and challenging to analyze.  The use of more sophisticated data provides the football industry with a much more comprehensive view of the game than ever before.  It’s time for football analysis to reach a new level of sophistication by incorporating methodologies from other communities dealing with complex and dynamic networks in order to make full use of the advanced datasets now available.