TL;DR Cointegration For Time Series Analysis
“Correlation is not causation” is seen by many as the statistician’s Hippocratic oath, which is why Granger (with Newbold) warned of “spurious regression” upon realising that the linear regression model we all love can lead analysts astray, for example into concluding that a recently discovered correlation between video game sales and nuclear energy production could somehow be causal.
What is a Spurious Correlation?
So why isn’t Rockstar Games investing as much into cold fusion as into its marketing for GTA6? Switching over to a bit of algebra for a moment, let’s assume video game sales and nuclear energy production each follow the simplest of time series models, an independent random walk:
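In symbols (my reconstruction; V for video game sales, N for nuclear energy production, with independent noise terms):

$$
V_t = V_{t-1} + \epsilon_t, \qquad N_t = N_{t-1} + \eta_t, \qquad \epsilon_t \perp \eta_t
$$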
If we gather enough observations of both series and measure the covariance, it should be zero (since the series are independent). However, if we bootstrap samples of 100 observations from both time series, the stochastic nature of the process means we can always find time periods where the pair exhibits zero, positive, or negative correlation:
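A minimal sketch of that bootstrap experiment in Python (the seed, sample sizes, and variable names are my own choices):

```python
import numpy as np

rng = np.random.default_rng(42)
T = 10_000

# Two independent random walks: cumulative sums of iid Gaussian noise.
sales = np.cumsum(rng.normal(size=T))
energy = np.cumsum(rng.normal(size=T))

# Bootstrap 100-observation windows and record each window's correlation.
window, corrs = 100, []
for _ in range(1_000):
    start = rng.integers(0, T - window)
    corrs.append(np.corrcoef(sales[start:start + window],
                             energy[start:start + window])[0, 1])

corrs = np.array(corrs)
print(f"min={corrs.min():+.2f} max={corrs.max():+.2f} mean={corrs.mean():+.2f}")
# Despite independence, individual windows show strongly positive
# and strongly negative correlations.
```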
These correlations are spurious, a fugazi, and let you confirmation-bias your way into any theory you want if you know when to look and how long to look for.
“But random walk models are an insult to modern day machine learning!” I hear you exclaim. To make our model more realistic, let’s assume that Video Game Sales and Nuclear Energy Production are both a function of the independent variable EconomicGrowth:
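One way to write this down (my notation; G_t stands for EconomicGrowth, with noise terms independent of each other and of G_t):

$$
V_t = \beta_V G_t + \epsilon_t, \qquad N_t = \beta_N G_t + \eta_t
$$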
An independent variable that influences both simultaneously, giving the illusion of a direct relationship, is known as a confounding variable. This shared independent variable imposes a covariance structure on the two dependent variables. It’s now easy to see how a spurious correlation arises from the non-zero covariance:
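Under the toy model above, the shared dependence on G_t is the only source of covariance:

$$
\mathrm{Cov}(V_t, N_t) = \mathrm{Cov}(\beta_V G_t + \epsilon_t,\; \beta_N G_t + \eta_t) = \beta_V \beta_N \mathrm{Var}(G_t) \neq 0
$$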
But if correlations are ambiguous, what other measure can be used to characterise the relationship between two time series? Granger and Engle won the Nobel Prize in Economics for inventing a new type of technique aimed at determining whether two time series truly have a different, long-running type of relationship: cointegration.
Order of Integration
Before we talk cointegration, let’s first examine what it means for a time series to be integrated. We need a time series model to work with, and there is no better example than a first order autoregressive (AR(1)) process. There are many ways to season an AR(1) model: we can add a constant, a trend, or both a constant and a trend:
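In the usual notation (my reconstruction of the three variants, with c a constant, δ a trend coefficient, and ε_t white noise):

$$
\begin{aligned}
y_t &= \phi\, y_{t-1} + \epsilon_t && \text{(plain)} \\
y_t &= c + \phi\, y_{t-1} + \epsilon_t && \text{(with constant)} \\
y_t &= c + \delta t + \phi\, y_{t-1} + \epsilon_t && \text{(with constant and trend)}
\end{aligned}
$$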
What determines whether a series is integrated comes down to whether it is stationary.
Integrated Time Series
Time series come in two flavours: stationary and non-stationary. A time series is stationary if its statistical properties (mean, variance, autocorrelation) are constant over time. A series is non-stationary if it has a unit root (ϕ = 1 for an AR(1) process; any ∣ϕ∣ ≥ 1 is non-stationary) or a deterministic trend, causing its properties to change over time.
To make a non-stationary series stationary, we take first differences (subtracting consecutive observations), which removes the unit root and stabilises the trend. From this we arrive at our definition of the order of integration: a series is I(n) if it requires n first differences to become stationary. More formally:
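Using the lag operator L (where L X_t = X_{t−1}), one standard way to state this:

$$
X_t \sim I(n) \iff (1 - L)^n X_t \ \text{is stationary}
$$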
Some I(1) series exhibit a unique linear combination whereby the resultant combination actually has a lower order of integration. After I(1), the next lowest order of integration is I(0), which literally corresponds to a stationary series. A special relationship exists between I(1) series that combine together to form an I(0) one, and those series are known as…
Cointegrated Series and Long Term Equilibrium
Cointegration tries to answer the question: do my time series move together in such a way that the average distance between them always remains more or less the same (i.e., stationary)? More formally, X, Y ~ I(1) are said to be cointegrated if a Z exists such that:
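A common way to state the condition (γ, the cointegrating coefficient, is my notation):

$$
Z_t = Y_t - \gamma X_t \sim I(0) \quad \text{for some } \gamma \neq 0
$$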
Note — it’s possible for two time series to display weak or even no correlation but still be cointegrated — cointegration focuses on long term equilibrium relationships.
Engle-Granger Two-Step Method
What we’ve just described is the basis of the Engle-Granger Two-Step Method for cointegration testing. It’s as easy as (1) estimating Z, typically as the residual from regressing Y on X, and (2) verifying that Z is indeed stationary using the Augmented Dickey-Fuller (ADF) test. This works well for two variables, but what if we want to generalise to N variables?
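A minimal sketch of the two steps with statsmodels (the synthetic cointegrated pair is my own construction; note that statsmodels also ships a ready-made coint() that applies the correct residual-based critical values):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
T = 500

# A cointegrated pair: X is a random walk, Y tracks X plus stationary noise.
x = np.cumsum(rng.normal(size=T))
y = 0.8 * x + rng.normal(size=T)

# Step 1: estimate the long-run relationship by OLS; the residual is Z.
z = sm.OLS(y, sm.add_constant(x)).fit().resid

# Step 2: ADF test on Z; a small p-value suggests Z is stationary,
# i.e. X and Y are cointegrated.
adf_stat, p_value, *_ = adfuller(z)
print(f"ADF statistic = {adf_stat:.2f}, p-value = {p_value:.4f}")
```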
Johansen Cointegration Test
Johansen generalised cointegration testing to N variables by using a vector error correction model (VECM) (explained elsewhere for brevity) to capture the dynamics and relationships among multiple time series.
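For reference, the VECM for an N-dimensional vector y_t is conventionally written as:

$$
\Delta \mathbf{y}_t = \Pi \mathbf{y}_{t-1} + \sum_{i=1}^{p-1} \Gamma_i\, \Delta \mathbf{y}_{t-i} + \boldsymbol{\epsilon}_t
$$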
Whereas the ADF test does not test for cointegration directly (it only tests for the presence of a unit root in a single series), the VECM’s Π is the key matrix that captures the long-term equilibrium relationships among the variables, and we can obtain the number of cointegrating vectors by considering the rank of Π (a test sketch follows the list below):
- If rank(Π) = 0: There are no cointegrating relationships.
- If 0 < rank(Π) < N: There are rank(Π) cointegrating vectors, meaning rank(Π) long-term equilibrium relationships among the time series.
- If rank(Π) = N: This indicates that all the time series are stationary on their own and the system does not require cointegration to be stable.
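A sketch of the rank test using statsmodels’ Johansen implementation (data and parameter choices are mine; the pair shares one stochastic trend, so we expect rank 1):

```python
import numpy as np
from statsmodels.tsa.vector_ar.vecm import coint_johansen

rng = np.random.default_rng(1)
T = 500

# Two series driven by one common random walk: expect one cointegrating vector.
x = np.cumsum(rng.normal(size=T))
y = 0.8 * x + rng.normal(size=T)
data = np.column_stack([x, y])

# det_order=0 includes a constant term; k_ar_diff=1 lagged difference.
result = coint_johansen(data, det_order=0, k_ar_diff=1)

# Trace test: reject H0 (rank <= r) when the statistic exceeds its
# critical value; cvt columns hold the 90%/95%/99% levels.
for r, (stat, crit) in enumerate(zip(result.lr1, result.cvt[:, 1])):
    print(f"H0: rank <= {r}: trace = {stat:.2f}, 95% crit = {crit:.2f}")
```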
Now that we know how many cointegrating relationships exist, how do we quantify them in a meaningful way? It’s as easy as Π….
Π = αβ′
Let’s decompose Π into αβ′ using some clever matrix factorisation. Without getting too much into linear algebra, it turns out that these factors characterise the long-term dynamics in a way analogous to how the Pearson correlation coefficient captures short-term correlations.
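In terms of dimensions, with r = rank(Π) cointegrating relationships among N variables:

$$
\underbrace{\Pi}_{N \times N} = \underbrace{\alpha}_{N \times r}\, \underbrace{\beta'}_{r \times N}
$$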
β′
If X, Y are cointegrated, β′ is the 1×2 row vector produced by the factorisation Π = αβ′, representing our cointegrating relationship (remember the rank rules we talked about before). The values in the row describe the proportional contribution of each variable to the long-term equilibrium.
More generally (N > r > 0), each of the r columns of β describes a single cointegrating relationship among the variables (y1, …, yN). We are interested in the proportional relationship between yi and yj, in terms of sign and magnitude:
- Magnitude: If abs(βi) > abs(βj), it suggests that the variable yi plays a dominant role in the cointegrating relationship.
- Sign: βi / βj > 0 suggests that yi and yj move in the same direction in the long term, whereas βi / βj < 0 indicates they move in opposing directions.
α
If X, Y are cointegrated, the values of α represent the speed at which each variable adjusts to deviations from the long-term equilibrium (a sketch extracting both α and β follows the list below).
- Magnitude: abs(αi) > abs(αj) indicates that the corresponding variable yi adjusts more quickly and plays a more significant role in correcting disequilibrium.
- Sign: if αi > 0, yi increases to restore equilibrium, whereas if αi < 0, it decreases instead.
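A minimal sketch of extracting both quantities with statsmodels’ VECM (synthetic data as before; the long-run relation is y ≈ 0.8x by construction):

```python
import numpy as np
from statsmodels.tsa.vector_ar.vecm import VECM

rng = np.random.default_rng(2)
T = 500

# Cointegrated pair: Y tracks 0.8 * X in the long run.
x = np.cumsum(rng.normal(size=T))
y = 0.8 * x + rng.normal(size=T)
data = np.column_stack([x, y])

# Fit a VECM with one cointegrating relationship.
res = VECM(data, k_ar_diff=1, coint_rank=1).fit()

# beta holds the cointegrating vector (long-run proportions);
# alpha holds each variable's speed of adjustment back to equilibrium.
print("beta:", res.beta.ravel())
print("alpha:", res.alpha.ravel())
```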
Since α governs the short-term adjustment dynamics within our long-term equilibrium relationship, it has a direct impact on another crucial measure of time-series relationships….
Granger Causality
X is said to “Granger-cause” Y if predictions of Y that consider past values of both X and Y are better than predictions that consider past values of Y alone. Moreover, if X, Y are cointegrated, it can be shown that at least one of the series is guaranteed to Granger-cause the other. If we factorise our VECM above as shown:
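My reconstruction of that factorised bivariate VECM (extra lagged-difference terms omitted for clarity):

$$
\begin{aligned}
\Delta X_t &= \alpha_X \left( \beta_1 X_{t-1} + \beta_2 Y_{t-1} \right) + \epsilon_{X,t} \\
\Delta Y_t &= \alpha_Y \left( \beta_1 X_{t-1} + \beta_2 Y_{t-1} \right) + \epsilon_{Y,t}
\end{aligned}
$$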
The error correction term (β1Xt−1 + β2Yt−1) must affect at least one of the variables Xt or Yt to restore equilibrium. Since at least one of αX or αY must be non-zero to maintain the cointegration, there is Granger causality in at least one direction.
What this implies is that a long-term equilibrium relationship guarantees a short-term predictive relationship. Beware: the converse isn’t true. Short-term predictive relationships such as Granger causality do not imply long-term cointegration.
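To close, a minimal sketch of a Granger causality test with statsmodels (synthetic, stationary data, since the vanilla test assumes stationarity; the lag choice is mine):

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(3)
T = 500

# X leads Y by one step, so past X should help predict Y.
x = rng.normal(size=T)
y = np.zeros(T)
y[1:] = 0.7 * x[:-1] + rng.normal(size=T - 1)

# The function tests whether the SECOND column Granger-causes the first;
# small p-values on the F-tests indicate X Granger-causes Y.
results = grangercausalitytests(np.column_stack([y, x]), maxlag=2)
```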