Introduction to Big Data Analytics
by Haris Karagiannakis
After a brief coding refresher, the course starts by introducing the main concepts in data analytics and econometrics for data inferencing and forecasting. It focuses primarily in addressing contemporary issues arising in macroeconomics and finance when forecasting, such as the curse of dimensionality and the use of real-time datasets. We investigate various stylized facts in macro and finance using data downloaded in real-time, and we then turn into adjusting our models to account for those facts.
The course ends with a detailed description of state-of-the-art techniques and datasets used by professional forecasters to forecasts real and nominal macroeconomic variables, and economic activity generally. Specifically, we implement two real-time forecasting exercises, one using a standardized macro dataset and the other with our own-created dataset where we hand-pick the predictors. We set as our target to forecast 1-month ahead US CPI headline inflation using various models, including a bi-variate benchmark estimated via OLS, as well as regularized regressions and factor-augmented regressions with PCA.
The material is structured as follows:
- Week 0: Coding and MATLAB refresher
- Week 1: Monte Carlo Simulation
- Week 2a: Bootstraping
- Week 2b: Cross-Validation (CV): Recursive, Rolling, K-fold
- Week 3: Time Series Decomposition
- Week 4: Outlier Detection and Correction
- Week 5: Penalized Regressions
- Week 6: PCA and PLS
All the files and datasets needed to follow and replicate the material in this tutorial are created herein. You will only have to download one 3rd party toolbox and the function listed below. To reproduce any of the material in this tutorial please acknowledge the author and the source.
Coding and MATALB Refresher
cd 'C:\Users\UserXYZ\Desktop\Intro2BDA' %Change root/working directory (WD)
%We can now use 'relative paths' to save/access stored files and folders WITHIN our WD
save('.\Data\x_data.mat')
save('C:\Users\UserXYZ\Desktop\Intro2BDA\Data\x_data.mat')
%A) .. is like pressing the backspace ONCE
%B) ..\.. is like pressing the backspace TWICE
cd 'C:\Users\UserXYZ\Desktop\Intro2BDA'
winopen(pwd) %open the WD programmatically
In these sessions we will be downloading a few 3rd party MATLAB functions and toolboxes. In order to be able to call those functions through our 'Command Window', we will need to add them to MATLAB's 'search path'. To do that, first create a folder named 'MyFunctions' inside your working directory (WD). Any 3rd party function will need to be placed inside that folder. Then, by calling function addpath(), the folders specified as input to that function are added to the MATLAB search-path. Every time you add a new set of functions or toolboxes inside 'MyFunctions', you will need to re-run the addpath() command. Also, whenever MATLAB is re-opened, you will again need to re-run the addpath() command in order to be able to call any 3rd party functions.
addpath(genpath(fullfile(pwd, 'MyFunctions'))); %Add folders to MATLAB search-path
%- addpath() adds the specified FOLDER to the MATLAB search path.
%- fullfile() creates the full path to a folder or file, using the appopriate
% separator [i.e. backslash (\) on Windows; forward slash (/) on UNIX].
% fullfile('folder','subfolder','file.m') -> 'folder\subfolder\file.m'
%- calling genpath() inside addpath() adds ALL SUBFOLDERS (of the specified
% folder) to the search path.
web('https://uk.mathworks.com/help/matlab/matlab_env/what-is-the-matlab-search-path.html')
In these sessions we will make use of the following toolboxes/functions:
Other useful resources (-we are not going to be using those in class):
Fundamental MATLAB Classes
web('https://uk.mathworks.com/help/matlab/matlab_prog/fundamental-matlab-classes.html')
Matricies or arrays (usually we refer to them as 'variables' or 'data object') can contain any of the following data types, or classes available in MATLAB:
strM = ["Hello!", "I <3 KCL"]
IM = [true true false true]
xs = rand(5,2); %draw values from the uniform distribution
tn = datetime('now') %Returns a DATETIME object of the form '28-Feb-2020 22:30:59'
%You can now use functions made for datetimes
[num, myday] = weekday(t, 'long')
%Check the Class of any of the above-created variables:
Another set of classes/objects (structs, cells, tables, and timetables) can be used to store multiple data types (numbers, strings, chars, logicals) under a single variable.
L = {[2 3]; 'Char'; ["Hi!","Big Data"]}
%Structure (or stuct) array with 3 fields named a, b, c
X2 = ["A","B","C","D","E","F","G","AA","BB","CC"]'; %transpose to have same dimensions
X = table(X1,X2, 'VariableNames',["Var1","MyLetters"])
%Access column/variable named 'Var1' using either:
X(:,1) %this is of class 'table'
X{:,1} %this is of class 'double'
tt = datetime(2022,1:10,1)'
XT = timetable(X1,X2, 'VariableNames',["Var1","MyLetters"], 'RowTimes',tt)
%Timetable holds the Time variable as a metadata
XT(:,1) %<- The 1st variable is 'Var1'
%Table holds the Time variable as simply another variable
X(:,1) %<- The 1st variable is the dates-variable
%Check the classes of each variable
%Delete variables from 'workspace' (that's the window on the upper right)
clear tt L K IM I X1 X2 str strM
%Alternatively select the variables NOT to delete
You can change the displayed format of a datetime. Two approaches to do it:
%1. Change the format of the DATETIME (supresses the time) after creating it
%2. Set the desired format, when creating the datetime (using Name-Value pair 'Format')
tn = datetime('now', 'Format','yyyy-MM-dd')
tn_2 = char(tn); %convert from datetime into character
Finally, let's now take a look on how we can use iterative procedures (loops) to implement repeatitive tasks in MATLAB.
for i = 1:2:10 %<- loop over values 1,3,5,7,9
disp("Currently iterating over i=" + num2str(i) + ", and z=" + num2str(z) + ".")
if mod(i,2)==0 %Skip iteration if the remainder from the division with 2 is equal to 0
disp("Currently iterating over i=" + num2str(i) + ", and z=" + num2str(z) + ".")
Throughout the course we will also do some basic plotting. To create more advanced figures, you can take a look here. Let us now download and save the time series we will be working with in the next few weeks, the S&P500 index.
y = fred_api("SP500", 'StartDate',datetime(2012,1,1));
y = diff(log(y{:,1}))*100; %continuously companted returns (aka log-returns)
Week 1: Monte Carlo (MC) Simulation
Construct the distribution of a statistic/parameter/quantity of interest by sampling values from a known distribution. We need to make an assumption about the underlying distribution of your random variable or our stochastic process (in a time-series setup).
The MC procedure can be described as follows:
- Simulate (i.e. draw) R independent and identically distributed (iid) random samples for X, where X has a know distribution, say . Therefore i = 1, .., R are the number of replications.
- For every random sample i, evaluate the statistic of interest (e.g. sample mean or variance) .
- The collection of all provides the approximation to the distribution of the statistic.
%Draw random numbers from a known distribution
%1) Standard Normal x~N(0,1)
x = randn(T,1); %Tx1 vector of random scalars drawn from the standard normal distribution
histogram(x, 'Normalization','pdf')
x = mu + sigma*randn(T,1);
%x = normrnd(mu,sigma,T,1) %Same as above
histogram(x, 'Normalization','pdf')
histfit(x) %fits a normal density function together with the histogram
Example 1: Simulate Normal variable - Construct the Mean () Distribution We have a time series y, for instance the S&P500 daily returns, which we can assume it is normally distributed, and we want to construct a distribution for the mean estimator. This will then allow us to create the confidence interval for the mean estimator. We assume that the log-returns follow the process:
xlim([1 length(y)]); title('S&P500 Daily Log-Returns');
x = mu + sigma*randn(T,R); %TxR matrix of random scalars drawn from the N(mu, sigma) distribution
MCmu = mean(x); %Distribution for the mean estimator
histfit(MCmu,[],'normal')
prctile(MCmu, [2.5 97.5]) %Confidence interval (for mu_hat) at the 95% confidence level
Is the Normality assumption underlying the MC simulation for the mean estimator a valid one for the timeseries at hand? Plot the histogram of your series against the Normal with the same mean and variance to see.
histfit(y,[],'normal'); title('S&P500 Daily Log-Returns');
%NB: Earlier we plotted the simulated means, here we plot yt itself
Take-home Exercise
We saw from the MC experiment above, that the mean for daily S&P500 log-returns is approximately 0.04% with 95% CI [-0.01%, 0.08%] (assuming Normality). On an annual basis, the log-returns accumulate on average to around 7%. Re-run the MC simulation exercise on the monthly and/or annual S&P500 log-returns and calculate the confidence interval at the 95% confidence level for the (monthly/annual) mean. To do this, download the daily data from FRED using fred_api() and then accumulate the log-returns to the desired frequency (monthly/annual) by summing the observations in the corresponding period. Hint: To downsample from daily to a lower frequency use MATLAB command retime() applied on a timetable.
Example 2: Simulate AR(1) process - Construct the Distribution We have the same timeseries y (i.e. the S&P500 daily returns) but we now believe that it follows a first-order autoregressive process:
We now want to estimate the vector and then (MC) simulate the process y, multiple times, in order to estimate multiple AR(1) models, one for each of the simulated paths of y, and consequently obtain a distribution of 's. %------------------------%
% Estimation on actual y %
%------------------------%
%Estimate the parameters of the ARIMA(p,D,q) model via maximum likelihood given the observed time series y
% If P ~ ARIMA(1,1,0) then y=diff(P) ~ ARIMA(1,0,0)
mdl = arima('Constant',NaN,'ARLags',1);
%mdl = arima('Constant',NaN,'ARLags',1,'MALags',1) %Also adding MA
[ESTreg, vcov_hat, logLikelihood, info] = estimate(mdl, y);
ARIMA(1,0,0) Model (Gaussian Distribution):
Value StandardError TStatistic PValue
________ _____________ __________ ________
Constant 0.046713 0.022536 2.0728 0.038192
AR{1} -0.14964 0.0071289 -20.991 0
Variance 1.1512 0.012025 95.729 0
summarize(ESTreg); %summary of ML regression results
%Extract the estimated coefficients
%ma1 = ESTreg.MA{1}; %If we also had MA component
sigma2 = ESTreg.Variance;
% Now the objective is to SIMULATE y (using Monte Carlo) %
%Simulate R yt paths, each having T observations
%NB: To be able to simulate, the model-object needs to have the variance specified
mdl = arima('Constant',phi0,'AR',phi1,'Variance',sigma2);
[YR, ER, V] = simulate(mdl,T,'NumPaths',R); % Simulate 1000 y~AR(1)
%- YR: Simulated paths for y
%- ER: innovations from the ARIMA model
%- V: conditional variances
%-----------------------------%
% Estimation on simulated y's %
%-----------------------------%
%For each of the simulated y paths/columns in YR, estimate the AR(1) model
%and obtain the coefficients
mdl = arima('Constant',NaN,'ARLags',1) ;
[ESTreg, ~, ~, ~] = estimate(mdl,YR(:,i),'Display','off');
phi(i,:) = [sphi0 sphi1];
histfit(phi(:,1),[],'normal')
title('Phi0 Distribution')
histfit(phi(:,2),[],'normal')
title('Phi1 Distribution')
CI = table(); %table preallocation
CI.Pct = string([2.5; 97.5; 50]);
tmp = prctile(phi, [2.5 97.5 50]); %Confidence interval at the 95% confidence level
CI =[CI array2table(tmp,'VariableNames',["phi0"; "phi1"])];
%Add estimated simulated-mean
CI{end,2:end} = mean(phi);
%Add estimated phi's [Conditional means: E(phi/yt)] from original sample
CI{end+1,1} = "E(phi/yt)";
CI{end,2:end} = [phi0 phi1];
CI %Return the table
CI = 5×3 table
| Pct | phi0 | phi1 |
---|
1 | "2.5" | 0.0041 | -0.1909 |
---|
2 | "97.5" | 0.0874 | -0.1120 |
---|
3 | "50" | 0.0479 | -0.1509 |
---|
4 | "Mean" | 0.0472 | -0.1505 |
---|
5 | "E(phi/yt)" | 0.0467 | -0.1496 |
---|
Take-home Exrecise:
Potentially the AR(1) assumption for the daily S&P500 log-returns is not a good one, because as we saw from the time-series plot earlier, the actual S&P500 daily returns exhibit volatility clustering. You can repeat the simulation exercise but instead of simulating an AR(1) process (for y), you can now simulate a GARCH(1,1) process and construct the Distribution (of course, now, the 's will be the GARCH coefficients). The GARCH(1,1) model is given by: (1) The conditional mean equation:
with and (2) The conditional variance equation:
. The μ parameter in the conditional mean equation is called the 'mean-offset'. Modify the code given below to (MC) simulate R paths of as a GARCH(1,1) process. Recall that GARCH(0,1) = ARCH(1). plot(y) %observe the volatility clustering
title('S&P500 Daily Log-Returns');
%------------------------%
% Estimation on actual y %
%------------------------%
mdl = garch('Constant',NaN,'GARCHLags',1,'ARCHLags',1,'Offset',NaN);
[ESTmdl, vcov_hat, logLikelihood, info] = estimate(mdl, y);
GARCH(1,1) Conditional Variance Model with Offset (Gaussian Distribution):
Value StandardError TStatistic PValue
________ _____________ __________ __________
Constant 0.046132 0.0044393 10.392 0
GARCH{1} 0.74141 0.016811 44.102 0
ARCH{1} 0.2197 0.016364 13.426 0
Offset 0.07973 0.013702 5.8188 5.9258e-09
mu = ESTmdl.Offset; %parameter 'mu' in the conditional mean equation
%Simulate R yt paths, each having T observations
mdl = garch('Constant',phi0,'GARCH',phi1,'ARCH',phi2,'Offset',mu);
[VR, ER] = simulate(mdl,T,'NumPaths',R); % Simulate 1000 y~GARCH(1,1)
%- VR: Simulated paths for conditional variance (NOT y)
%- ER: Simulated response y. Specifically:
% 1) If (conditional mean) model contains offset (mu), then:
% ER = innovations plus an offset
% 2) If (conditional mean) model does NOT contain offset (mu), then:
%NB: You can also simulate by directly passing the estimated model object
%(ESTmdl), instead of specifying it.
[VR, ER] = simulate(ESTmdl,T,'NumPaths',R);
%NB: We simulated R=1000 paths, but let's plot only 2 (out of the 1000).
title('Conditional Variances')
Week 2: Bootstraping & Cross-Validation (CV)
Bootstraping
Construct the distribution of a statistic/parameter/quantity of interest by sampling values/rows from our own dataset and then putting them back (i.e. with replacement). Therefore, one row can be selected multiple times. Specifically, we construct B bootstrap samples (aka replicas), with each bootstrap replica created by randomly selecting T out of T observations with replacement from our original series or dataset. No distributional assumptions are required. However, we should carefully choose which bootstrap procedure to use (block vs IID) depending on whether the series exerts time-dependence or not.
In IID Bootstrap, we shuffle and pick individual observations. In Block bootstrap we shuffle and pick blocks of consecutive observations. The general procedure can be described as follows:
- Draw B random samples with replacement from data X (each of which having the same number of obs as the initial matrix/series X, say T), and collect their bootstrap counterparts {}. Therefore, b = 1, .., B are the number of Bootstrap resamples.
- For every resample b, evaluate the statistic of interest (e.g. sample mean or variance) .
- The collection of all provides the bootstrap approximation to the distribution of the statistic.
IID & Block Bootstrap
addpath(genpath(fullfile(pwd, 'MyFunctions'))); %Add folders to MATLAB search-path
B=10; %# of Bootstrap samples
[~, b8t] = bootstrp(B, [], (1:T)'); %MATLAB's built-in function
%For bootstrapping stationary, dependent series
w = ceil(T^1/3); %Block size
b8t = block_bootstrap((1:T)',B,w); %MFE Toolbox function
b8t(:,1:4) %Print the indicies for the first 4 (IID) bootstrap samples
5 6 8 2
6 7 9 3
7 8 10 4
8 9 11 5
9 10 12 6
10 11 13 7
11 12 14 8
6 11 14 16
7 12 15 17
8 13 16 18
In IID bootstrap sampling (-when sampling with replacement), about one-third of the observations are left out of the sample, or in other words, 2/3 are included at each Bootstrap sample. To see that:
[~, b8t] = bootstrp(B, [], (1:T)'); %IID bootstrap
obsIN(i) = length(unique(b8t(:,i)))/T;
Machine Learning models such as Linear Bagging, and Random Forests (which are Bagging Decision/Regression Trees), are based on bootstrap sampling followed by (prediction) aggregation (which essentially is ensemble of forecasts), hence the name 'bagging' coming from the combination of the words bootstrap and aggregation.
In those models, the 1/3 of observations that are left out, are called oob (out-of-bag) data and can be used as the equivalent of the test-set or out-of-sample observations to estimate the unbiased prediction error (aka out-of-bag (oob) error). As such, for each bootstrap sample (i.e. each column of 'b8t'), we can think of the (initial) sample being separated into 'training data' and 'out-of-bag (oob) data' or 'test data'.
Series Time-Dependence
To choose between block or IID bootstrap we need to first check the series dependence. For stationary but dependent series we need to use block boosting, instead of IID as the latter would break the series' pattern wrt time.
There are two main routes one can take to assess the presence of autocorrelation (aka serial correlation) in a series:
- The sample autocorrelation function (ACF) and partial autocorrelation function (PACF) are useful qualitative tools to visually assess the presence of autocorrelation at individual lags.
- The Ljung-Box Q-test is a more quantitative way to test for autocorrelation at multiple lags, jointly.
Using the Ljung-Box Q-test you can test for instance the null hypothesis that the first 10 autocorrelations are jointly zero. This is done in MATLAB using function lbqtest() by setting the Number of Lags to . The usual choice for the lags is 20, unless the series has less observations, in which case you can set the lags to be tested equal to T-1. The degrees of freedom (DOF) of the null distribution must be set equal to the number of lags (unless the residuals come from a model with K parameters, in which case, the DOF should be set equal to the number of lags being tested minus K (). The Ljung–Box Q test is appropriate for assessing autocorrelation in any series with a constant mean, therefore, the series needs to be transformed to stationary before passed to the lbqtest(). The Q test statistic (under the null) is defined as follows: testing the null hypothesis:
where m denotes the number of autocorrelations calculated (i.e. the number of lags being tested), is the estimated sample autocorrelation for lag j, and T the sample size. If the null hypothesis is rejected, then this indicates volatility clustering in the residual series, since it is an indication that the residuals are autocorrelated. You can also test for conditional heteroscedasticity by conducting a Ljung-Box Q-test on a squared residual series. An alternative test for conditional heteroscedasticity is Engle’s ARCH test. If there is no significant autocorrelation present (as shown by the Ljung-Box Q-test), then this suggests that a conditional mean model is not needed for modelling the observed series.
%%% Ljung–Box Q test for Serial Correlation %%%
%lags2test = min([20,T-1])
%[h,pValue,testStat,critValue] = lbqtest(residuals,'DOF',lags2test,'Lags',lags2test,'Alpha',0.05);
%pValue: P-value from the Ljung–Box Q test for Serial Correlation in the residuals
% [h,pValue,teststat,critValue] = lbqtest(residuals);
%- H0: Residuals are not autocorrelated.
%- if h=0 -> There is not enough evidence to reject the H0 (at the
% 5% level of significance) that the residuals are NOT autocorrelated.
%- Note: when the 'residual' series to be tested is constructed using out-of-sample
% predictions, the degrees of freedom must be set to: lags2test-0.
Let's now use the S&P500 daily log-returns to see if the series exhibits serial correlation.
y = fred_api("SP500", 'StartDate',datetime(2005,1,1));
p = p{:,1}; %S&P500 daily index
y = diff(log(p))*100; %log-returns
subplot(2,1,1); plot(p); xlim([0,length(p)]); title('S&P500 Index')
subplot(2,1,2); plot(y); xlim([0,length(y)]); title('Daily S&P500 log-return')
residuals = y - mean(y); %Compute the centered return series.
lags2test = min([20,T-1]);
[h,pValue,teststat,critValue] = lbqtest(residuals,'DOF',lags2test,'Lags',lags2test,'Alpha',0.05);
%h=1: We reject the no autocorrelation null hypothesis in favor of the alternative.
pValue %print the pValue for the Ljung–Box Q test on daily S&P500 log-returns
Since the daily log-returns series is autocorrelated, we then need to use block bootstrap.
Example: Block Bootstraping on the daily S&P500 return timeseries
Assume AR(1) process - Construct the Distribution %Estimate the AR(1) model using yt
mdl = arima('Constant',NaN,'ARLags',1);
ESTreg = estimate(mdl, y);
summarize(ESTreg); %summary of ML regression results
ARIMA(1,0,0) Model (Gaussian Distribution)
Effective Sample Size: 2514
Number of Estimated Parameters: 3
LogLikelihood: -3744.16
AIC: 7494.33
BIC: 7511.82
Value StandardError TStatistic PValue
________ _____________ __________ ________
Constant 0.046713 0.022536 2.0728 0.038192
AR{1} -0.14964 0.0071289 -20.991 0
Variance 1.1512 0.012025 95.729 0
%Extract the estimated coefficients
%------------------------------------------%
% Create the distributions for phi0 & phi1 %
%------------------------------------------%
B=1000; %# of Bootstrap samples
b8t = block_bootstrap((1:T)',B,w); %MFE Toolbox function
for i = 1:B %loop over the B bootstrap samples
disp(['Bootstrap Iter i=', num2str(i), ' out of B=', num2str(B)]);
mdl = arima('Constant',NaN,'ARLags',1); %yt ~ AR(1)
ESTreg = estimate(mdl,y(smpl),'Display','off');
b_phi0 = ESTreg.Constant;
phi(i,:) = [b_phi0 b_phi1];
histfit(phi(:,1),[],'normal')
title('Phi0 Distribution')
histfit(phi(:,2),[],'normal')
title('Phi1 Distribution')
CI = table(); %table preallocation
CI.Pct = string([2.5; 97.5; 50]);
tmp = prctile(phi, [2.5 97.5 50]); %Confidence interval at the 95% confidence level
CI =[CI array2table(tmp,'VariableNames',["phi0"; "phi1"])];
CI{end,2:end} = mean(phi);
%Add estimated phi's [Conditional means: E(phi/yt)] from original sample
CI{end+1,1} = "E(phi/yt)";
CI{end,2:end} = [phi0 phi1];
CI %Return the table
CI = 5×3 table
| Pct | phi0 | phi1 |
---|
1 | "2.5" | 0.0327 | -0.2506 |
---|
2 | "97.5" | 0.0626 | 0.0028 |
---|
3 | "50" | 0.0456 | -0.1498 |
---|
4 | "Mean" | 0.0463 | -0.1315 |
---|
5 | "E(phi/yt)" | 0.0467 | -0.1496 |
---|
Cross-Validation (CV): Recursive, Rolling, K-fold
Cross-validation refers to methods for model comparison (and hyperparameter selection). The main idea is to dichotomize the sample into training set and test set, where you estimate your model using the former, and then use the latter to assess the forecasting ability of your model. There are a few possible ways to do that.
For cross-sectional data we can use K-fold (or LOO-CV, which is K-fold with K=T). However, for timeseries data the test set can only include observations after the training set. If the test set was before the training set, then it would be as if we use the future to 'predict' the past. Therefore, for timeseries data many practitioners will use recursive or rolling CV, and avoid K-fold or LOO. Just like in block vs IID bootstrap, the idea is that the sequence of the observation matters (when having time series).
In recursive and rolling CV the date that corresponds to the last observation of the training set is called 'the forecast origin', because this is (supposedly) the point in time we stand, when we make our forecast. In those two methods, recursive and rolling, we effectively roll the forecast origin forward in time. Recursive and rolling CV are also commonly referred to as classical out-of-sample (OOS) evaluation.
1. Recursive CV
Recursive estimation is also known as expanding-window estimation. The process is as follows: We start by taking the desired minimum number of observations as our training set (-estimation sample), and we then recursively add one additional observation to the sample, until all (but one) observations have been added to the training set. Therefore, at each iteration the sample size (T) increases by one.
%%% Recursive CV Indicies %%%
frac = 0.2; %NB: 'frac' contains the proportion of the TEST set => 1-frac = % of TRAINING set
Ntrain = floor((1-frac)*T);
Ntest = numel(Ntrain+1:T);
sp.test = false(T,Ntest);
sp.train = false(T,Ntest);
%Training Set (in-sample obs)
train_idx = 1:Ntrain-1+i; %Recursive (expanding-window)
%train_idx = i:Ntrain-1+i; %Rolling (fixed-window)
trainset(train_idx) = true;
sp.train(:,i) = trainset;
%Test Set (out-of-sample obs)
testset(test_idx) = true;
Please note that sometimes the process/exercise of evaluating the out-of-sample predictability of competing specifications is referred to as cross-validation (CV), and sometimes as pseudo out-of-sample (POOS) forecast evaluation exercise. Machine learning practitioners will usually use the former term, while econometricians the latter. Out-of-sample (OOS) observations are also referred to as test-set, or simply 'unseen' data.
2. K-fold CV
In K-fold cross-validation we create a partition that divides the T observations into K disjoint subsamples (or folds), chosen randomly but with roughly equal size. This produces K test sets containing approximately T/K obs each, and K training sets containing T-T/K obs each.
The generic procedure for comparing the candidate models using the K-fold CV can then be described as follows:
- Split the data in X (and y) into K=10 groups, each of which has approximately the same number of observations.
- Use the 1st group (out of the K groups) of data as the test set (Ztest = ytest,Xtest). Use the remaining (K-1) 9 groups of data as training set (Ztrain = ytrain,Xtrain).
- Compute the loss [e.g. (R)MSE, MAE] over Ztest using the model estimated over Ztrain.
- Repeat steps 2 & 3 but with the rest groups. E.g. use the 2nd group of data as the Ztest. Use the remaining as Ztrain.
- Proceed in a similar manner until each group is used as Ztest set exactly once.
- Return the 10 computed values (MSE's MAE's etc) as the vector values.
How can we use K-fold CV to select the optimal model, i.e. to choose between different λ's in LASSO or different p's in AR(p)? The lag order p and the penalty parameter λ are called hyperparameters.
The way to go about optimizing hyperparameters is usually by creating a grid of potential values and and then CV'ing over each of those values. - For each of the K test/train splits of the sample:
- (1) loop over all lambda values , and estimate the LASSO (obtained using the training part of the sample).
- (2) Predict the Y for the TEST set (using the betas estimated over the TRAINING set).
- Repeat the procedure for each of the K sets.
- You will then obtain a matrix of CV MSEs (ntests x L) (where ntests=K).
- Averaging column-wise (i.e. taking the mean of K values) you will obtain the so-called CROSS-VALIDATION ERROR. It will be 1-by-L vector. Essentially, the CV Error will be the Average of the K Mean MSE's for each λ.
- We select the (optimal) lambda by picking the one with the minimum CV-Error.
c = cvpartition(T,'Kfold',K); %For K-fold & LOO use MATLAB's resampling algorithm cvpartition()
%c = cvpartition(T,'LeaveOut');
Ntest = c.NumTestSets; %<- This is equal to 'K'
sp.test = false(T,Ntest);
sp.train = false(T,Ntest);
%Training Set (in-sample obs)
sp.test(:,i) = c.test(i); %<- The 1st out of the K groups
%Test Set (out-of-sample obs)
sp.train(:,i) = c.training(i); %<- The remaining K-1 groups
%Each test set will contain T/K obs
A Note on Other Resampling Techniques: LOO-CV and Jackknife
Leave-One-Out CV is a special case of K-Fold CV in which the number of folds (K) equals the number of observations (T).
Jackknife is another resampling method, very similar to Leave-One-Out CV. The difference between the two is that Leave-One-Out CV uses the observation that was left out as a TEST set (i.e. for the purposes of out-of-sample prediction), whereas Jackknife does not utilise the left-out observation in any manner. Therefore, even though the underlying resampling procedure results in the same 'training sets', the setup in which each of the two methods is using these training sets, is different.
Forecasting Performance Evaluation
The Kfold-CV and recursive/rolling estimation procedures above have provided us with one forecast for each observation in each of the test sets created by the corresponding procedure. We can use these out-of-samlpe (aka test-set) forecasts to compare the candidate forecasting models and pick the best one (and/or choose the optimal hyperparameters, as mentioned above).
To compare candidate models, we need to create a measure reflecting how well each of the those competing models perform in an out-of-sample (OOS) forecasting setup. We refer to those as 'error measures'. There are multiple such measures. Denote the pseudo-out-of-sample (POOS) forecast error (FE) coming from model m, with . Some relearchers refer to the RMSE as the out-of-sample (root) MSE, or (root) MSFE.
Question & Take-home Exercise 1
Now that we know what are the different methods for model comparison (and hyperparameter selection), and bearing in mind the concept of time-dependence (we talked about during the IID/block bootstrap section), would you use K-fold CV to select the optimal lag order (i.e. the optimal number of lags, p*) in an simple univariate AR(p) model?
The answer can be found in Bergmeir, Hyndman & Koo (2018).
Take-home Exercise 2
Simulate different stochastic processes and test whether they exert any signs of depedence (Hint: use either the Ljung-Box Q-test or ACFs). Use the 7 processes given below.
%------------------------------------------%
%Simulation of popular stochastic processes%
%------------------------------------------%
%y(t) = a + e(t) -> ARMA(0,0)
mdl = arima('Constant',0.5,'Variance',1);
%mdl = arima(0,0,0); %Equivalent to above
[YR, ER, V] = simulate(mdl,T,'NumPaths',R);
%- YR: Simulated paths for y
%- ER: innovations from the ARIMA model
%- V: conditional variances
%y(t) = a + b*y(t-1) + e(t) -> AR(1)
mdl = arima('Constant',0.5,'Variance',1,'AR',0.8);
[YR, ER, V] = simulate(mdl,T,'NumPaths',R);
%y(t) = a + b*y(t-1) + e(t) -> AR(1)
mdl = arima('Constant',0.5,'Variance',1,'AR',-0.8);
[YR, ER, V] = simulate(mdl,T,'NumPaths',R);
%y(t) = a + e(t) + h*e(t-1) -> MA(1)
mdl = arima('Constant',0.5,'Variance',1,'MA',0.2);
[YR, ER, V] = simulate(mdl,T,'NumPaths',R);
%y(t) = a + b*y(t-1) + e(t) + h*e(t-1) -> ARMA(1,1)
mdl = arima('Constant',0.5,'Variance',1,'AR',0.8,'MA',0.2);
[YR, ER, V] = simulate(mdl,T,'NumPaths',R);
Mdl_RW = arima('D',1,'Constant',0,'Variance',1);
[YR, ER, V] = simulate(Mdl_RW,T, 'Y0',0,'NumPaths',R);
%%% GARCH(1,1) with offset (offset = constant 'a' in the conditional-mean equation) %%%
%y(t) = a + e(t) -> ARMA(0,0)
%e(t) = sigma(t)Z(t) with Zt~N(0,1)
%sigma(t)^2 = k + g*sigma(t-1)^2 + g*e(t-1)^2 -> GARCH(1,1)
CMmdl = arima('Constant',2.5); %ARMA(0,0) conditional mean model
CVmdl = garch('Constant',0.05,'GARCH',0.8,'ARCH',0.15); %GARCH(1,1) conditional variance model
%Recall: For stationarity we want: (GARCH + ARCH < 1)
[YR, ER, V] = simulate(CMmdl,T,'NumPaths',R); %Simulate R y~GARCH(1,1) with offset
subplot(M,1,1); plot(Ys(:,1)); xlim([0,T]); title('WN')
subplot(M,1,2); plot(Ys(:,2)); xlim([0,T]); title('AR(1) b=0.8')
subplot(M,1,3); plot(Ys(:,3)); xlim([0,T]); title('AR(1) b=-0.8')
subplot(M,1,4); plot(Ys(:,4)); xlim([0,T]); title('MA(1)')
subplot(M,1,5); plot(Ys(:,5)); xlim([0,T]); title('ARMA(1,1)')
subplot(M,1,6); plot(Ys(:,6)); xlim([0,T]); title('RW')
subplot(M,1,7); plot(Ys(:,7)); xlim([0,T]); title('GARCH(1,1)')
Week 3: Time Series Decomposition (Anomaly Detection I)
In the main lecture we decomposed the time series into all its main components. Specifically, a time series, , can be have the following 3 components: - Trend component, Tt
- Seasonal component, St with some known periodicity s
- Irregular (stationary) stochastic component, It
Sometimes there is also a fourth component, the Cyclical component (Ct) that you saw in the lecture. Assuming an additive decomposition, the time series is then given by:
. In what follows, we will assume that series are obtained Seasonally Adjusted (SA) from the source and we will focus on dealing with the trend (whether this trend being stochastic or deterministic) and see how to test for the existence of a unit root I(1). Then, we will proceed to make the series compatible with CLRM modelling assumptions, by transforming it into stationary (if the Unit root test concludes the series has a unit root).
Seasonality
Here's an example and a definition for the concepts of seasonality and seasonal adjustment, from a FT article on the challenges posed by the pandemic in accurately measuring economic statistics (including the distortion of commonly observed seasonality patterns that are factored into, when seasonally adjusting statistics): "Economists look closely at the “seasonally adjusted” figures, because they are thought to offer the most direct view of ongoing trends once regularly occurring fluctuations in the data tied to events like the start of the school year are stripped out."
"Retail businesses in November (2021), for example, reported 331,600 more people on payroll compared with October. But the BLS adjusted that number downward because recruitment in the retail sector tends to pick up just before the start of the holiday season. The agency reported that the sector had “lost” 20,400 jobs after taking into account seasonal adjustment."
addpath(genpath(fullfile(pwd, 'MyFunctions'))); %Add folders to MATLAB search-path
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Download SA & NSA series from FRED %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%1. Industrial Production (Monthly)
TT = fred_api(["INDPRO","IPB50001N"], 'StartDate', datetime(2006,1,1));
TT.Properties.VariableNames = ["SA","NSA"];
p.Title = 'Industrial Production';
plot(TT.Time, TT{:,1}, TT.Time, TT{:,2})
title('Industrial Production'); legend(["SA","NSA"])
%2. Vehicle Miles Traveled (Units: Millions of Miles; Monthly)
TT = fred_api(["VMTD11","TRFVOLUSM227NFWA"], 'StartDate', datetime(2006,1,1));
TT.Properties.VariableNames = ["SA","NSA"];
p.Title = 'Vehicle Miles Traveled';
plot(TT.Time, TT{:,1}, TT.Time, TT{:,2})
title('Vehicle Miles Traveled'); legend(["SA","NSA"])
Recall that the seasonal component captures level shifts that repeat systematically within the same period (e.g. month or quarter) between successive years. To clearly see the seasonality effect take a look at a 2-year period of the above series.
%-> Peak in July's; Trough in February's
tr = timerange(datetime(2018,1,1),datetime(2019,12,31));
plot(TT.Time, TT{:,1}, TT.Time, TT{:,2})
title('Vehicle Miles Traveled - 2-year Window'); ; legend(["SA","NSA"])
%3. Quarterly Series: Real GDP
%Real Gross Domestic Product: Billions of Chained 2012 Dollars, SA Annual Rate (GDPC1) -> 2020Q2 $17282.1
%Real Gross Domestic Product: Billions of Chained 2012 Dollars, NSA (ND000334Q) -> 2020Q2: $4329.3
[TT,Meta] = fred_api(["GDPC1","ND000334Q"], 'StartDate', datetime(2006,1,1));
TT.Properties.VariableNames = ["SA","NSA"];
%plot(TT.Time, TT{:,1}, TT.Time, TT{:,2})
There are numerous ways to deal with seasonality in the timeseries:
- STL (Seasonal and Trend Decomposition using Loess) filtering,
- X13-ARIMA seasonal adjustment program of the U.S. Census Bureau.
Instead of de-seasonalizing a time series, you can also model the seasonality directly by employing models that account for seasonality. For instance, use plain OLS and add time dummies (4 for Q series; 12 for M series) to capture the seasonality. Of course, you must then remove one dummy variables as a reference group (and in order to avoid the dummy variable trap), if your model also contains a constant. Another method is to estimate a Seasonal ARIMA (SARIMA) model. Also note that annual frequency data, or year-on-year (YoY) differences of Q or M frequency series, are free of seasonality.
Trend
To test for the existence of a unit root (UR), we can employ the Augmented Dickey-Fuller (ADF) test, which is an improved version of the Dickey-Fuller (DF; or sometimes referred to as Said-Dickey-Fuller) test. The difference between the DF and the ADF tests is that in the latter the underlying ADF regression contains the lagged differences of the series in order to account for the possibility of serial correlation in the innovations. The nested specification which represents the alternative hypothesis, and hence, the one that is estimated is the following:
Under follows a stationary process. The null hypothesis of a unit root is:
→ The true process is a RW (i.e. is a unit root process = contains stochastic trend) As such, the process under : NOTE:
Different software use different versions of the the ADF regression. In some cases you will find the LHS of the ADF regression to be instead of . However the test result remains the same. - When regressing on , test .
- When regressing on , test .
Also note that when p=0 the test is equal to the DF test.
%Augmented Dickey-Fuller (ADF) test for a unit root
%[h,pValue,testStat,critValue,reg] = adftest(y, 'Model','AR', 'Lags',0, 'Alpha',0.05)
%h==1 rejection of the unit-root null model in favor of the alternative model at the 0.05 level of significance
%The 'Model' Name-Value pair defines the alternative hypothesis (Ha):
% - AR = Ha: stationary AR model
% - ARD = Ha: stationary AR model with drift
% - TS = Ha: trend stationary
%-------------------------------------------%
% How to choose the optimal number of lags? %
%-------------------------------------------%
%'Lags' Name-Value pair defines the number of lagged y values in the ADF regression.
%We can use the information criteria (e.g. BIC) to choose the optimal lag order.
y = fred_api("GDPC1", 'StartDate',datetime(2005,1,1));
[h,pValue,testStat,critValue,ADFreg] = adftest(y{:,:}, 'Model','AR', 'Lags',0:5, 'Alpha',0.05);
allBIC = [ADFreg.BIC]; %The BIC's
[minval, idx] = min(allBIC);
%Look/keep the test results of the ADF model with the (optimally)
%BIC-selected lag structure. NB 'idx' reflects the optimal 'p'
If we do not remove the (stochastic) trend from the series then we would encounter a spurious regression problem, where the underlying estimated betas could be misleadingly large and significant, without that reflecting the actual relationship between y and X, but a (spurious) correlation that is solely due to the trending nature of the two series.
Cyclical Component
The Hodrick-Prescott filter (aka HP filter) decomposes the time series into trend and cyclical components. The broad time-series/econometrics literature condemns the use of the filter because of 'spurious dynamics in the filtered series, unrelated to the underlying true data-generating process', as well as anomalous endpoint dynamics. See for instance Hamilton (2017, Why You Should Never Use the Hodrick-Prescott Filter). However, it is worth taking a look at the filter as it gives an idea of what a cyclical component looks like in a trending series.
The HP filter chooses the trend estimates by minimizing an objective function that is based on a smoothing parameter, λ. The smaller the smoothing parameter, the smoother the trend. When λ is arbitrarily large, the resulting estimated trend in a linear trend. The choice for the smoothing parameter depends on the periodicity of the data. The recommended values are:
- 14400 for monthly data
- 1600 for quarterly data
- 100 for yearly data
MATLAB provides a built in function, hpfilter(), for applying the Hodrick-Prescott filter. When Smoothing = Inf, function hpfilter() applies maximum smoothing, and the estimated trend component is the linear time trend computed by least squares.
TT = fred_api("GDPC1", 'StartDate',datetime(2010,1,1), 'EndDate',datetime(2019,12,30));
[trend,cyclical] = hpfilter(TT{:,1},1600);
TC = array2timetable([trend cyclical],"RowTimes",TT.Time,"VariableNames",["Trend","Cyclical"]);
p = stackedplot([TT TC],{["GDPC1" "Trend"] "Cyclical"});
p.Title = 'HP-filter on US Real GDP';
p.DisplayLabels = {'GDP','Cycle'};
p.AxesProperties(1).LegendLabels = ["GDP","Trend"];
p.AxesProperties(1).LegendLocation = 'SouthEast';
[trend,cyclical] = hpfilter(TT{:,1},1600);
TC = array2timetable([trend cyclical],"RowTimes",TT.Time,"VariableNames",["Trend","Cyclical"]);
plot(TT.Time, TT{:,1}, TT.Time, TC{:,1});
title("HP-filter on US Real GDP"); legend(["GDP","Trend"]); axis tight
% Changing the Smoothing Parameter %
TT = fred_api("GDPC1", 'StartDate',datetime(2008,1,1));
smoothing = [1 50 100 1600 14400 Inf]; %grid for smoothing parameter lambda
for j = 1:numel(smoothing)
[trend,cyclical] = hpfilter(TT{:,1},smoothing(j));
TC = array2timetable([trend cyclical],"RowTimes",TT.Time,"VariableNames",["Trend","Cyclical"]);
title("\lambda = " + string(smoothing(j)));
plot(TT.Time, TT{:,1}, TT.Time, TC{:,1});
The trend variable can be thought of as the measure of potential GDP which is a notion used in macroeconomics (and the Phillips curve in particularly) to construct what is known as the output gap, which is the percent deviation of real GDP from potential output. Similarly, we can define the notion of unemployment gap which is the deviation of the unemployment rate from the natural rate of unemployment. These notions will become relevant later when we will deviate from the bivariate Phillips curve regression and use a data-rich environment to forecast inflation.
Take-home Exercise
Use the Hodrick-Prescott filter to de-trend the monthly Industrial Production (FRED series code: INDPRO). Then make a graph and add shaded areas reflecting the NBER Recession dates both on the cyclical component and on the trend graphs. Does the cyclical component capture business cycles (recessions-expansions, troughs and peaks)?
Week 4: Outlier Detection and Correction (Anomaly Detection II)
%Create noisy data with 3 outliers:
x = sin(z) + 0.5*(rand(N,1)-0.5);
x(ceil(N*rand(2,1))) = 3; %Two outilers (-the high value of the outlier is 3)
x(ceil(N*rand(1,1))) = 4; %1 more outlier (-taking the value of 4)
histogram(x, 'Normalization','pdf');
%-----------------------%
%%% Outlier Detection %%%
%-----------------------%
%A. Outliers defined as any value exceeding 2 St.Dev's from the median
%This method will ensure removal of the extreme 5% ONLY if normality can be
%Recall that for a normal population:
%- 68% of the population is within 1 std of the mean.
%- 95% of the population is within 2 std of the mean.
%- 99.7% of the population is within 3 std of the mean.
toobig = median(x) + w*std(x);
toosml = median(x) - w*std(x);
outlier = (x >= toobig) | (x <= toosml); %Indicator/logical vector for the detected outliers
%B. Outliers defined as observations that deviate from the sample
% median by more than 10 interquartile (IQR) ranges
iqr = prctile(x,75) - prctile(x,25); %Interquartile range (IQR). prctile() omits NaNs by default
xmm = abs( x - median(x,'omitnan') );
thr = 10; %threshold for outliers detection (i.e. multiples of IQR)
outlier = (xmm > thr*iqr);
%------------------------%
%%% Outlier Adjustment %%%
%------------------------%
%1. In case you only have a single series x
x_adj(outlier) = median(x,'omitnan');
%2. If the predictor-matrix X is of size T-by-N.
if sum(outlier)==0 %no outliers detected
if adjm==0 %set outliers to missing
x(outlier) = median(x,'omitnan');
Here, we used the median to correct the outliers in the series. Other corrections methods include taking the (rolling) median of the 5 preceding obs, or the Expectation-Maximization (EM) method of Stock-Watson (2002).
Week 5: Variable Selection & Variable Reduction: Penalized Regression
The main reason for using penalized regressions is to deal with the curse of dimensionlity (small T, large N problem), and to be able to accomodate a large number of potential predictors into our setup.
The shrinkage estimator of the linear regression model is given by:
The difference between the different penalized regressions goes down to the selection of the penalty function P.
Ridge Regression (RR) due to Hoerl and Kennard (1970):
Least Absolute Shrinkage and Selection Operator (LASSO) due to Tibshirani (1996):
Elastic Net (EN) due to Zou and Hastie (2005):
where , nestsing LASSO () and RR (). Both λ and α are hyperparameters.
Therefore, EN is a hybrid between the two abovementioned approaches, the Ridge Regression, and the LASSO regularization. The difference between LASSO and RR is that LASSO coefficients can be equal to 0 (with more 0's for larger λ values), but RR does not set coefficients exactly to 0 (regardless of the λ value), however, coefficients become very-very small. Note also that for small λ values the LASSO coefficients are close to the least-squares (OLS) estimates . Empirical studies suggest that the EN can outperform LASSO on data with highly correlated predictors. The FRED-MD Monthly Macroeconomic Dataset
In this section we will setup the forecasting problem with the objective to forecast US headline CPI inflation (one month ahead), using a standardized macroeconomic dataset that emulates a monthly set of indicators that have been widely adopted in the academic literature for forecasting (particularly inflation, but other response variables too, such as industrial production, non-farm payrolls etc). This dataset is known as 'the Stock-Watson dataset', and has been reproduced by McCracken-Ng (2016) and currently updated by Michael McCracken (known as the FRED-MD dataset).
The links to access the latest FRED-MD data are permanent. Therefore, we can download the dataset programmatically, without opening the internet browser. You can always access the current and past vintages via Michael McCracken's webpage. %web('https://research.stlouisfed.org/econ/mccracken/fred-databases/')
url_fredMD = 'https://files.stlouisfed.org/files/htdocs/fred-md/monthly/current.csv'; %FRED-MD
%url_fredQD = 'https://files.stlouisfed.org/files/htdocs/fred-md/quarterly/current.csv'; %FRED-QD
%Download and save the FRED-MD dataset
pathNname = ['.\Data\', filename];
dir_data = websave(pathNname, url_fredMD);
clear url_fred* filename dir_data
winopen('.\Data') %Check the file we downloaded
%Load the FRED-MD data in the workspace
Xt = readtimetable(pathNname); %load the csv into a timetable
%Xt = readtable(pathNname); %load the csv into a table
Xt.Properties.DimensionNames{1} = 'Time';
Xt.Time.Format = 'dd-MM-yyyy'; %Change the display format of dates
%Keep the date you downloaded the data
downdatefredXD = char(datetime('now', 'Format','yyyyMMdd'));
%Extract Transformation codes
%tcode = Xt(1, :); %creates a timetable
tcode = Xt{1, :}; %creates a numerical vector (double array)
Xt(1 ,:) = []; %Remove transformation codes from Xt
dates = Xt.Time; %Extract the dates from the timetable
%Select Target variable yt
Yname = "CPIAUCSL"; %Headline CPI
Yidx = find(Xt.Properties.VariableNames == Yname); %the column number of the LHS
y_untr = Xt(:,Yname); %Keep a copy of the untransformed LHS
y(2:end) = 12*100*diff(log(Xt{:,Yname})); %Annualized MoM% change
TTy = array2timetable(y(2:end), 'RowTimes',Xt.Time(2:end), 'VariableNames',Yname);
plot(TTy.Time, TTy{:,:}); title('US Headline CPI Inflation (Annualized MoM%)');
TTy(end-4:end,1) %Display the last 5 obs
Xt(:, Yname) = []; %Remove yt from X matrix
tcode(:, Yidx) = []; %Remove corresponding transformation code
The FRED-MD transformation codes (see McCracken & Ng, 2016) correspond to the following transformations:
- tcode=1 → x~I(0); No transformation: x(t)
- tcode=2 → x~I(1); First difference: x(t)-x(t-1)
- tcode=3 → x~I(2); Second difference: (x(t)-x(t-1))-(x(t-1)-x(t-2))
- tcode=4 → lnx~I(0); Natural log: ln(x)
- tcode=5 → lnx~I(1); First difference of natural log: ln(x)-ln(x-1)
- tcode=6 → lnx~I(2); Second difference of natural log: (ln(x)-ln(x-1))-(ln(x-1)-ln(x-2))
- tcode=7 → x~I(2); First difference of percent change: (x(t)/x(t-1)-1)-(x(t-1)/x(t-2)-1)
X = Xt{:,:}; %Convert timetable to array
%Which variables to log before transforming?
tolog = (tcode>=4 & tcode<=6);
small = 1e-6; % Value close to 0
canlog = (min(X) > small); %series NOT close to 0
takelog = tolog & canlog; %same as: tolog.*canlog
%Check if all X's selected to be logged, can actually be logged.
[tolog; canlog; takelog];
canlog(find(tolog==1)); %All these must be 1's
X(:,tolog) = log(X(:,takelog)); %Take the natural logarithm for the selected series
%Transform to Stationary based on provided transformation codes
if tc==1 || tc==4 % Level (i.e. no transformation): x(t)
elseif tc==2 || tc==5 % First difference
z(2:T) = x(2:T) - x(1:T-1);
elseif tc==3 || tc==6 % Second difference: (x(t)-x(t-1))-(x(t-1)-x(t-2))
z(3:T) = x(3:T) - 2*x(2:T-1) + x(1:T-2);
elseif tc==7 % First difference of percent change: (x(t)/x(t-1)-1)-(x(t-1)/x(t-2)-1)
y1(2:T)=(x(2:T)-x(1:T-1))./x(1:T-1);
z(3:T)=y1(3:T)-y1(2:T-1);
clear x tc z y1 i tolog canlog takelog small
X = Xst; %Make X the transformed matrix
%Check if we have achieved Stationarity with the proposed transformations
[h,pValue,~,~,ADFreg] = adftest(X(:,i), 'Model','ARD', 'Lags',0:5, 'Alpha',0.1);
allBIC = [ADFreg.BIC]; %The BIC's
[minval, idx] = min(allBIC);
%Keep the results of the ADF model with the (optimally) BIC-selected lag order
isStationary(i) = h(idx); %h=1 implies stationarity at the 10% level of significance
sum(isStationary); %check if all are stationary
Xt(:,find(isStationary==0)); %check any problematic series
tcode(find(isStationary==0)); %check suggested transformation code
clear h pValue ADFreg allBIC minval idx i
%------------------------------%
% Outlier Detection/Adjustment %
%------------------------------%
iqr = prctile(X,75) - prctile(X,25); %Interquartile range (IQR). prctile() omits NaNs by default
xmm = abs( X - median(X,'omitnan') );
thr = 10; %threshold for outliers detection (i.e. multiples of IQR)
outlier = (xmm > thr*iqr);
sum(outlier); %column-wise sum to check # of problematic obs per indicators
%Step 1: Replace detected outliers with missing (NaN)
if sum(out_i)==0 %no outliers detected
X(out_i,i) = NaN; %Replace with missing
%Before we fix the outliers we could remove the rows with too many NaNs
row_out = sum(isnan(X),2); %row-wise summation of units
idx = 12; %Remove the upper-edge up to this observation
table(dates, row_out); %NB: We will have a spike in the number of outliers/NaNs around the COVID-lockdowns period
dates = dates(idx+1:end);
%Step 2: Correct Outliers (by taking the median)
X(tofill,i) = median(X(:,i),'omitnan');
%Save the dataset for future use
save(fullfile([pathNname, '.mat'])); %saves the entire workspace
Forecasting Inflation
1. Forecsting with a simple bi-variate OLS model (-benchmark)
load(fullfile('./Data/fredMD.mat'));
%How do we forecast a variable of interest?
%Align Y and X so that to:
%1. Regress Y(T) on X(T-h), AND then
%2. Use X(T) to forecast Y(T+h).
%NB: In a pseudo OOS exercise, step 1 reflects the training set
% (from which we obtain b_train), and step 2 the test set.
%How to calculate Confidence Interval (for the OLS-based forecast)
%OUT-OF-SAMPLE FORECAST: yf/xT ~ N(.,.)
% where xT is the last row of matrix X
%i. E(yf/xT) = xT*b_ols <- (pseudo or actual) out-of-sample forecast
%ii. VCOV(yf/xT) = xT*VCOV(b_ols)*xT' + s_ols^2
%A (very) simple Phillips curve-resembling example using OLS
%(without gap variables: output gap and unemployment gap)
Xnames = Xt.Properties.VariableNames;
idx = find(Xnames == "UNRATE"); %UNRATE = Unemployment rate
h=1; %h-steps ahead forecast (h=1 -> 1-month ahead)
%To include a constant term in the model add a column of ones
b = regress(yh, [ones(T,1) Xh]); %OLS
yforc(1) = [1 X(end, idx)]*b; %Forecast
frcstDate = dates(end) + calmonths(h);
disp(['Annualized MoM% CPI Inflation forecast for ' char(frcstDate) ': ' num2str(round(yforc(1),4))])
Annualized MoM% CPI Inflation forecast for 01-09-2022: 3.7188
2. Forecsting with LASSO using the FRED-MD data (i.e. in a data-rich environment)
%Input 'Alpha' is a scalar in the interval (0,1] that defines the
%weight of lasso (L1) versus ridge (L2) optimization.
%- Alpha = 1 represents LASSO regression,
%- Alpha close to 0 approaches Ridge Regression,
%- other Alpha values represent ELASTIC NET optimization.
[B,S] = lasso(Xh,yh,'Alpha',1,'CV',5); %LASSO with 5-fold CV to choose lambda
CVmse = S.MSE'; %Cross-validated mean squared error (CV-MSE) for each lambda
lambdas = S.Lambda'; %-> the grid for hyperparameter lambda
%-----------------------------------------------------------------------------------------------------%
%%% Other Regularized Regression Models %%%
%1. Elastic Net with alfa=0.5
%[B,S] = lasso(Xh,yh,'Alpha',0.5,'CV',5); %Elastic Net with 5-fold CV to choose lambda
%scaled = 1; %Standardize the X's
%B = ridge(yh,Xh,lambdas,scaled); %Ridge without lambda optimization
%Mdl = fitrlinear(Xh,yh,'KFold',5,'Lambda',lambdas,'Learner','leastsquares','Regularization','ridge'); %Ridge with 5-fold CV to choose lambda
%-----------------------------------------------------------------------------------------------------%
B(:,:)~=0; %-> which of the betas are non-zero for each lambda
idx1 = S.IndexMinMSE; %CHOOSE the lambda that minimises the CV-MSE
idx2 = S.Index1SE; %CHOOSE the lambda that minimises the CV-MSE + 1 StErr (-sparsest solution)
b = B(:,idx2); %<- coefficients of the predictors
b0 = S.Intercept(idx2); %<- intercept
non0betas = (b~=0); %-> which of the betas are non-zero
Xnames(non0betas)' %LASSO variable-selection
'AWHMAN'
'ACOGNO'
'BUSINVx'
'ISRATIOx'
'M2REAL'
'GS1'
'GS5'
'COMPAPFFx'
'TB3SMFFM'
'AAAFFM'
Recall that larger penalization parameter values (λ's) result in less non-zero coefficients (i.e. more zeros).
Given that:
- idx1 = minimum CV-MSE,
- idx2 = minimum CV-MSE plus 1 standard error,
then, by construction, (idx2) will be larger than (idx1) . This implies that idx2 corresponds to: - a higher CV-MSE,
- sparsest model (i.e. less X's picked).
The recommended value for the LASSO hyperparameter is idx2. In this way we also end up with a more parsimonious model. Instead of the K-fold CV, we could also select the optimal λ by minimizing the BIC.
%%% PLOT 1: CV-MSE VS (decending) lambda
%Plot the cross-validated (CV) MSE for the different lambda values.
lassoPlot(B,S,'PlotType','CV','PredictorNames',Xnames); %-> plots CV MSE for each lambda
%%% PLOT 2: estimated betas VS (decending) lambda
%Plot the non-zero coefficients in the regression for various values of the Lambda
lassoPlot(B,S,'PlotType','Lambda','XScale','log','PredictorNames',Xnames);
legend('show', 'Location','bestoutside')
By default lasso() will center and scale (i.e. standardize) the predictors for , and will center (i.e. de-mean) the response . However, even though the LASSO betas are estimated using the standardized X's, most functions (including MATLAB's lasso() and glmnet()) will return the estimated LASSO betas on the original data scale. Keep in mind that if you come across any regularized regression function that returns the scaled betas instead, you should then standardize your X's or un-scale the estimated betas, before calculating your predictions. %When making predictions you must use the betas on the original scale (i.e. the ones provided by lasso()).
%To produce plots where the coefficients are displayed on the same scale you can (re-)calculate the betas that (would have) come from the standardized Xs.
b_scaled = b.*s; %The beats when X's are standarsized
Tb = array2table([b(non0betas) b_scaled(non0betas)], 'VariableNames',["Unscaled","Scaled"]); %Table with non-0 betas
Tb = [table(string(Xnames(non0betas)'),'VariableName',"Regressor") Tb];
ans = 12×3 table
| Regressor | Unscaled | Scaled |
---|
1 | "AWHMAN" | -0.8355 | -0.6200 |
---|
2 | "ACOGNO" | 5.5562 | 0.0912 |
---|
3 | "BUSINVx" | 143.2206 | 0.7884 |
---|
4 | "ISRATIOx" | -11.3428 | -0.2234 |
---|
5 | "M2REAL" | -77.4243 | -0.4002 |
---|
6 | "GS1" | 0.4087 | 0.1462 |
---|
7 | "GS5" | 0.7771 | 0.2463 |
---|
8 | "COMPAPFFx" | -1.4013 | -0.6008 |
---|
9 | "TB3SMFFM" | -0.1525 | -0.1069 |
---|
10 | "AAAFFM" | -0.2714 | -0.5170 |
---|
11 | "OILPRICEx" | 2.2048 | 0.1869 |
---|
12 | "CPITRNSL" | 32.1070 | 0.3755 |
---|
%LASSO betas are biased, but think of the OLS interpretation:
%- b -> 1 unit increase in X_i, increases y by b_i
%- b_scaled -> 1 s.d. increase in X_i, increases y by b_scaled_i
%Calculating fitted values
yhat = b0 + Xh*b; %-> fitted y
%%% Forecasting with LASSO %%%
yforc(2) = b0 + X(end, :)*b; %Forecast
frcstDate = dates(end) + calmonths(h);
disp(['Annualized MoM% CPI Inflation forecast for ' char(frcstDate) ': ' num2str(round(yforc(2),4))])
Annualized MoM% CPI Inflation forecast for 01-09-2022: 3.6188
clear yhat b0 b idx* B S lambdas CVmse
%---------------------------%
% OLS on LASSO-selected X's %
%---------------------------%
%Use the LASSO-selected variables and compute the least-squares (OLS) estimate.
%-> OLS will provide UNBIASED betas which we can use for inferencing
% (hypothesis testing and confidence intervals).
% However, doing so might lead to unrealistically small std. errors/p-values.
b = regress(yh, [ones(T,1) Xh(:,non0betas)]); %OLS
yforc(3) = [1 X(end, non0betas)]*b; %Forecast
frcstDate = dates(end)+calmonths(h);
disp(['Annualized MoM% CPI Inflation forecast for ' char(frcstDate) ': ' num2str(round(yforc(3),4))])
Annualized MoM% CPI Inflation forecast for 01-09-2022: 3.7167
%Recall the LHS transformation (annualized MoM% change):
% -> y_t = 12*100*diff(log(CPI_t)) = 12*100*log(CPI_t/CPI_t-1)
%Let's un-transform our forecast to go from inflation to CPI level
yforc_untr = exp(yforc/(12*100))*y_untr{end,1}
296.5375 296.5128 296.5370
%NB: The code for transforming the LHS back to levels will only work for h=1.
%Create a table with all the forecasts
Models = ["Bi-variate OLS","LASSO(CV)","LASSO-OLS"];
frcstDate = dates(end) + calmonths(h);
forcT = array2timetable(yforc, 'VariableNames',Models,'RowTimes',frcstDate)
clearvars -except forcT y_untr downdatefredXD
save(fullfile('./Data/forecasts.mat'));
Can we improve our forecasting model(s)?
Possibly yes. We can add period-dummies to dummy-out certain (extreme) periods of the LHS. We could also consider adding lags of all the predictors, as well as AR lags. Possibly, we could improve by taking different routes when detecting and/or correcting outliers. Finally, another improvement could come from trying different methods for choosing the optimal hyperparameters of the model.
ML Interpretability
Once we have successfully estimated our regularized regression model, then we can utilize its feature-selection capabilities and get a sense of what is important for forecasting the target variable, and how those underlying determinants change across time. In other words, feature-selection of ML methods allow us to open the black box and make the model interpretable. For instance, we can use word clouds to visualize the results (as reflected by the estimated LASSO betas. We will also see next week an example of making factors, obtained by PCA, 'interpretable'). Below I used the FRED-MD dataset to forecast next-month CPI inflation (i.e. 1-step ahead; h=1) using LASSO, and then I extracted the betas and summarized the variable descriptions (by year) in a word cloud.
Interestingly, based on the results from the LASSO variable-selection, and the resulting word clouds, one of the most pronounced words throughout the different years covered, is 'employee'. This points out the fact that one of the most important determinants for inflation is (the change in) the number of employees, which is in line with past and well established theoretical and empirical findings highlighting the importance and relevance of the 'Phillips curve', and specifically unemployment, in defining inflation dynamics.
Take-home Exercise
As it was mentioned earlier in the course, instead of the K-fold CV, we can select the optimal λ by minimizing the BIC. Fill the code below to select the LASSO hyperparameter by minimizing the BIC and/or AIC. Hint: In LASSO the degrees of freedom (df) are equal to the number of non-zero coefficients in the regression). You should have 1 BIC for each value of λ in the grid (S.Lambda).
load(fullfile('.\Data\fredMD.mat'));
[B,S] = lasso(Xh,yh,'Alpha',1,'CV',5); %LASSO with 5-fold CV to choose lambda
lambdas = S.Lambda'; %-> the grid for hyperparameter lambda
b_all = [S.Intercept; B]; %-> (N+1)xL
yhat = [ones(T,1) Xh]*b_all; %-> [Tx(N+1)]x[(N+1)xL]
e = repmat(yh,[1 L])-yhat; %Residuals
dfe = T-dof; %(Recall: constant is counted in dof)
%Complete the BIC formula and the minimize it to take the optimal lambda
%Hint: The BIC is a function of MSE and dof.
Week 6: Variable Selection & Variable Reduction: PCA and PLS
Principal Component Analysis (PCA)
PCA is an unsupervised machine learning approach for reducing the dimension of the data matrix X to tackle the 'curse of dimensionality'. Unsupervised (in the Machine Learning jargon) means that there is no target-variable y involved.
The underlying idea of applying PCA, in a large matrix X containing multiple economic time-series, is that all these series are driven by a small number of common unobserved (aka latent) factors F. The factors are also referred to as the (principal) components or scores. In an economics setting, where we use multiple macroeconomic variables, factors can be thought of as indices that capture the overall business-cycle activity, usually referred to as coincident indicators.
The k PCA factors of matrix X are simply the first k eigenvectors of the matrix , with Z denoting the normalized data matrix X. Alternatively, PCA factors and factor loadings can be computed by extracting the first k eigenvectors of the covariance matrix of the normalized dataset X. Denoting the i-th normalized indicator with for , and with Z, the matrix that contains all the normalized indicators, then, the covariance matrix is given by . In what follows we will be using MATLAB's built-in pca() function. However, it is worth noting that the basic PCA algorithm can be easily implemented using the following (essentially) 3-line code:
load(fullfile('.\Data\fredMD.mat'))
Z = (X - mean(X))./std(X); %-> X and Z is TxN
[U, S, V] = svd(Z*Z'); %-> U is TxT (singular value decomposition)
F = U(:, 1:k)*sqrt(T); %-> F is Txk
Applying PCA on a matrix X with dimensions , will result in: - T factors in total, if , or
- N factors in total, if .
This implies that the F matrix will be of dimensions or . However, our objective is to reduce the dimension of the problem be selecting the first k. One method for selecting the optimal number of principal components (), is by setting the desired level of the timeseries' variance explained. Alternatively, information criteria (ICs) can be used, such as the generalizations of Mallow's proposed by Bai & Ng (2002), which has over 4k citations. Factors are organized in decreasing order of their respective eigenvalues. This implies that the underlying ordering reflects their importance in explaining the set of predictors X. In other words, the 1st factor will explain more of the variability of X compared to the 2nd factor, and the 2nd more than the 3rd, and so on. For that reason, sometimes for simplicity we only consider the first factor.
However, note that those methods for selecting the optimal number of factors () do not take into account the response variable y. Once has been established, (say for instance ) we can then start involving the response variable y and conduct a cross-validation (CV; recursive or rolling) exercise or use information criteria in a regression-based framework (such as F-AR(p)) in order to choose which of the factors (and possibly which of the p lags of those factors) better forecast our variable of interest. Depending on how the CV exercise is set up, this can result in multiple factors being excluded. In other words, the pervious step only tells us the number of factors () we should consider, meaning that we do not necessarily have to include all factors into our final model. For example, the recursive-CV exercise or BIC might conclude that only the 1-st and 5-th factors out of the factors need to be included in our model when forecasting y. The factor-augmented regression (F-AR) framework takes the form: where and . Let us now download a set of monthly (and daily) series and see how we can forecast inflation (or any other variable) using PCA in a factor-augmented autoregressive (F-AR) framework. F-AR is sometimes referred to as Principal Components Regression (PCR). Since we download the most up-to-date data using the API, this is actual forecasting. Therefore, if we re-run this code every month or even every day, we will be getting real-time updated forecasts, exactly like professional forecasters (and meteorologists) in national and international organizations do.
A Real-Time Monthly Dataset
%Download a set of Monthly & Daily series via the FRED API
addpath(genpath(fullfile(pwd, 'MyFunctions'))); %Add folders to MATLAB search-path
%The 3rd-party MFE toolbox contains a function named pca().
%We need to remove it from the search-path in order to use MATLAB's built-in pca() function.
rmpath(fullfile('.\MyFunctions\mfe-toolbox-main\crosssection'))
%25 US Monthly Macro Series
%-> Downloaded in real-time using the FRED API
fred_codes = ["CONSUMER";"RPI";"DPCERA3M086SBEA";"INDPRO";"IPMANSICS";"TCU";"CUMFNS";"PAYEMS";"SRVPRD";"UNRATE";"HOUST";"CMRMTSPL";"PCEPI";"PPICMM";"CPIAUCSL";"CPILFESL";"MCOILWTICO";"FEDFUNDS";"TB3MS";"GS10";"TWEXAFEGSMTH";"EXUSUK";"EXUSEU";"UMCSENT";"USEPUINDXM"];
[TT, Meta] = fred_api(fred_codes, 'StartDate',datetime(1995,1,1));
%Take a look on the selected indicators
%Let's compare Headline and Core CPI inflation
%NB: Core CPI excludes food and energy prices
infl = ["CPIAUCSL"; "CPILFESL"];
y = 12*100*diff(log(TT{:,infl})); %Annualized MoM% change
TTinfl = array2timetable(y, 'RowTimes',TT.Time(2:end), 'VariableNames',infl);
plot(TTinfl.Time, TTinfl{:,:});
title('US CPI Inflation (Annualized MoM%)');
legend(["Headline","Core"]); axis tight
p.Title = 'US CPI Inflation (Annualized MoM%)';
p.DisplayLabels = ["Headline","Core"];
clear fred_codes infl y TTinfl p
%Keep the date we downloaded the data
downdatefredAPI = char(datetime('now', 'Format','yyyyMMdd'));
%Saving a row (unprocessed) version of the data is always a good idea when using APIs!
save(fullfile('./Data/fred_raw.mat'));
dates = Xt.Time; %Extract dates
%Select the target variable (yt) to forecast
Yname = "CPIAUCSL"; %Headline CPI
y_untr_api = Xt(:,Yname); %Keep a copy of the untransformed LHS
y(2:end) = 12*100*diff(log(Xt{:,Yname})); %Annualized MoM% change
TTy = array2timetable(y(2:end), 'RowTimes',Xt.Time(2:end), 'VariableNames',Yname);
TTy(end-4:end,1) %Display the last 5 obs
%Xt(:, Yname) = []; %NB: For PCA we do NOT have to remove yt from X matrix!
X = Xt{:,:}; %Convert timetable to array
%--------------------------%
% Treating the Ragged-edge %
%--------------------------%
%Unlike in the FRED-MD, there might be NaNs for 'CPIAUCSL' at the end of the sample
%depending on when we are downloading the data (because some of the other indicators
%are released with less delay than CPI). The missing values at the end of
%the sample are called 'ragged-edge' or 'jagged-edge'.
LHS_Tidx = find(~isnan(Xt{:,Yname}), 1,'Last');
LHS_T = Xt.Time(LHS_Tidx);
raggededge = T - LHS_Tidx; %How many NaNs there are at the end of my LHS
%Adjust/re-align the matricies to account for the ragged edge.
%Here I decided to retain the ragged-edge (i.e. the last few rows of X),
%because there is potentially important timely information in those last obs.
%Alternatively, I could have removed those last few rows containing NaNs.
%NB: We will now be forecasting yT+1 using XT+1 (instead of XT), or even using XT+2.
%NB: The code works even if there is no ragged-edge in 'CPIAUCSL', as it
% will set raggededge to 0, and therefore, no re-alignment will take place.
X = X(1+raggededge:end,:);
dates = dates(1:end-raggededge);
%Which variables to log before transforming?
%Recall: log(-3)=complex; log(0)=Inf; log(1)=0; xt/0=Inf
% => if x<=0 do NOT log()
%SW(2002) log all non-negative series that were not already in rates or percentage units.
%Rule 1: Series that are not negative or 0
small = 1e-6; % Value close to zero
canlog = (min(X) > small); %series NOT close to 0
%Rule 2: Series that are not already in rates or percentage units
%Simply go through the table to see which variables are already in %
Meta(:, ["SeriesID","Title","Units"])
ans = 25×3 table
| SeriesID | Title | Units |
---|
1 | "CONSUMER" | "Consumer Loans, All Commercial Banks" | "Billions of U.S. Dollars" |
---|
2 | "RPI" | "Real Personal Income" | "Billions of Chained 2012 Dollars" |
---|
3 | "DPCERA3M086SBEA" | "Real personal consumption expenditures (chain-type quantity index)" | "Index 2012=100" |
---|
4 | "INDPRO" | "Industrial Production: Total Index" | "Index 2017=100" |
---|
5 | "IPMANSICS" | "Industrial Production: Manufacturing (SIC)" | "Index 2017=100" |
---|
6 | "TCU" | "Capacity Utilization: Total Index" | "Percent of Capacity" |
---|
7 | "CUMFNS" | "Capacity Utilization: Manufacturing (SIC)" | "Percent of Capacity" |
---|
8 | "PAYEMS" | "All Employees, Total Nonfarm" | "Thousands of Persons" |
---|
9 | "SRVPRD" | "All Employees, Service-Providing" | "Thousands of Persons" |
---|
10 | "UNRATE" | "Unemployment Rate" | "Percent" |
---|
11 | "HOUST" | "New Privately-Owned Housing Units Started: Total Units" | "Thousands of Units" |
---|
12 | "CMRMTSPL" | "Real Manufacturing and Trade Industries Sales" | "Millions of Chained 2012 Dollars" |
---|
13 | "PCEPI" | "Personal Consumption Expenditures: Chain-type Price Index" | "Index 2012=100" |
---|
14 | "PPICMM" | "Producer Price Index by Commodity: Metals and Metal Products: Primary Nonferrous Metals" | "Index 1982=100" |
---|
⋮ |
---|
pos = [6 7 10 18 19 20]; %Manually add the indicies
idx = false(N,1); %Convert into logical vector
%Alternatively, use string fuctnion contains() to quantify this decision/selection
idx = contains(Meta.Units, "percent",'IgnoreCase',true);
notpct = ~idx'; %tilda gives the negation operation -> changes 1 to 0, and 0 to 1
takelog = notpct & canlog; %same as: notpct.*canlog
[notpct; canlog; takelog]; %Check the decisions and which X's will be logged.
X(:,takelog) = log(X(:,takelog)); %Take the natural logarithm for the selected series
clear small canlog pos idx notpct takelog
%Transform to Stationary based on repeating Unit Root (UR) tests
%(since we do not have any transformation codes given to us)
%NB: We dont know how many times we will need to difference each
%of the series until stationarity is achieved, therefore, we need
%to use a WHILE loop instead of FOR loop.
[h,~,~,~,ADFreg] = adftest(x, 'Model','ARD', 'Lags',0:5, 'Alpha',0.1);
allBIC = [ADFreg.BIC]; %The BIC's
%Keep the results of the ADF model with the BIC-selected lag order
isStationary = h(idx); %h=1 implies stationarity at the 10% level of significance
x = diff(x); %First differences
countdiffs(i) = countdiffs(i) + 1; %Count how many times we have differenced the series
Xst(1+countdiffs(i):end, i) = x;
countdiffs; %Take a look on how many differences it took to stationar-ize each indicator
clear h ADFreg allBIC idx i countdiffs x isStationary
X = Xst; %Make X the transformed matrix
%------------------------------%
% Outlier Detection/Adjustment %
%------------------------------%
iqr = prctile(X,75) - prctile(X,25); %Interquartile range (IQR). prctile() omits NaNs by default
xmm = abs( X - median(X,'omitnan') );
thr = 10; %threshold for outliers detection (i.e. multiples of IQR)
outlier = (xmm > thr*iqr);
sum(outlier); %column-wise sum to check # of problematic obs per indicators
%Step 1: Replace detected outliers with missing
if sum(out_i)==0 %no outliers detected
X(out_i,i) = NaN; %Replace with missing
%Before we fix the outliers we could remove the rows with too many NaNs
row_out = sum(isnan(X),2); %row-wise summation of units
table(dates, row_out); %NB: We will have a spike in the number of outliers/NaNs around the COVID-lockdowns period
idx = 2; %Remove the upper-edge up to the idx-th observation (inclusive)
dates = dates(idx+1:end);
%Step 2: Correct Outliers and the NaNs
X(tofill,i) = median(X(:,i),'omitnan');
%Now the dataset will be 'balanced' (meaning with no missing values).
%NB: Here we fixed the NaNs by using the median. Given the ragged-edge
% problem, and because we are estimating a FAR forecasting model,
% a better fix would be to impute the missing using EM ala SW(2002).
%Save the dataset for future use
save(fullfile('./Data/fred.mat')) %saves the entire workspace
Principal Component Analysis (PCA)
load(fullfile('./Data/fred.mat'));
%[coeff,score,latent,tsquar,explained,mu] = pca(X);
%If X is TxN (with T>N), then:
%- coeff (NxN): Principal component coefficients, also known as loadings
%- score (TxN): Principal component scores are the representations of X in the
% principal component space (i.e. the factors)
%- latent (Nx1): Latent are the principal component variances (i.e. the eigenvalues of the covariance matrix of X)
%- tsquar (Tx1): Hotelling's T-squared statistic for each observation in X
%- explained (Nx1): Percentage of the total variance explained by each principal component
%- mu (1xN): The estimated mean of each variable in X.
%NB: By default pca() centers X by subtracting the means of each column before
% computing the principal components. To apply PCA on the non-centered data, use:
% [coeff,score,latent,tsquared,explained,mu] = pca(Xf, 'Centered', false);
%%% Standardized X (i.e. instead of centered) %%%
%Because MATLAB only de-means (i.e. centers) the X matrix, I can standardize X before applying pca().
Z = (X - mean(X))./std(X);
[coeff,F,~,~,explained] = pca(Z, 'Centered',false);
%Factors can be thought of as weighted-averages of the individual series
%Let's compare the 1st PCA factor with the cross-sectional average using equal weights
plot(1:T, F(:,1), 1:T, mean(Z,2));
title("PCA factor Vs Averaged series"); axis tight
legend(["1st PCA factor", "Time-series Average (of Z)"], 'location','SW');
%Re-construct the (centered) X matrix (i.e. Zhat):
Xcentered = F*coeff'; %-> Zhat
%Let's take a look of yt vis-a-vis the first few factors
TTy = array2timetable(y, 'RowTimes',dates, 'VariableNames',Yname);
Fnames = "F"+ string(1:size(F,2));
TTf = array2timetable(F, 'RowTimes',dates, 'VariableNames',Fnames);
p = stackedplot(TT,{["F1" "F2"] Yname});
p.Title = 'Factors & CPI Inflation (Annualized MoM%)';
p.DisplayLabels = {'F','Infl.'};
p.AxesProperties(1).LegendLabels = ["F1","F2"];
p.AxesProperties(1).LegendLocation = 'SouthWest';
clear TTy Fnames TTf TT p
%-----------------------------------------------%
%%% Variability Explained by PCA and k Choice %%%
%-----------------------------------------------%
explained; %<- Displays the % of total variance explained by the principal components
%Table with Cummulative Variance Explained by each of the first 10 factors
array2table([(1:kk)' cumsum(explained(1:kk))], 'VariableNames', ["F", "Cummul. Var. Expl."])
ans = 10×2 table
| F | Cummul. Var. Expl. |
---|
1 | 1 | 17.5947 |
---|
2 | 2 | 31.6041 |
---|
3 | 3 | 41.9913 |
---|
4 | 4 | 50.8909 |
---|
5 | 5 | 57.5251 |
---|
6 | 6 | 63.6013 |
---|
7 | 7 | 68.5373 |
---|
8 | 8 | 72.7965 |
---|
9 | 9 | 76.3515 |
---|
10 | 10 | 79.8102 |
---|
%Make a plot of the incremental percent variability explained by each principal component.
xlabel('Principal Component')
ylabel('Variance Explained (%)')
%Find the number of factors (k) required to explain at least 95% variability.
sum_explained = sum_explained + explained(k);
k %You could use this figure as your k_star (-number of factors to consider in your FAR)
%Alternativly, find the k that minimizes the PCp2 information criterion by Bai-Ng (2002)
kmax = 8; %Select the max number of factor to consider
e = Z - Zhat; %Residuals from predicting Z using the factors
sigma = mean(sum(e.*e/T)); %Sum of squared residuals divided by NT
PCp2 = log(sigma) + (N+T/N*T)*log(min([N T]))*kz; %Criterion PC_p2
[minPCp2, kstr] = min(PCp2,[],2);
clear Xcentered explained sum_explained kmax Zhat e sigma kz PCp2 minPCp2
Forecasting with PCA Factors
%At this point we can choose to work with all the factors selected
%by the above procedure (and stored in 'k'), or we can simply
%focus on the 1st factor that has the largest proportion in
%explaining the variance of X. For simplicity we do the latter.
%-------------------------------------------------------------%
% Estimate a F-AR(p) (aka Diffusion Index) Model ala SW(2002) %
%-------------------------------------------------------------%
Xf = lagmatrix(F(:,1),0:P); %lags will create NaNs at the top
%Remove the missing values at the top (due to the lags)
h=1; %h-steps ahead forecast (h=1 -> 1-month ahead)
Xh = Xf(1:end-h, :); %X contains the factor & its P lags
%To include a constant term in the model add a column of ones
b = regress(yh, [ones(T,1) Xh]); %OLS <- This the FAR model
yforc = [1 Xf(end, :)]*b; %Forecast
fcstDate = LHS_T + calmonths(h);
disp(['Annualized MoM% CPI Inflation forecast for ' char(fcstDate) ': ' num2str(round(yforc,4))])
Annualized MoM% CPI Inflation forecast for 01/10/2022: 2.8431
clear idx yh Xh P N T h b
%Create a table with the forecast and add the competing forecasts
forcT_api = array2timetable(yforc, 'VariableNames',Models, 'RowTimes',fcstDate)
forcT_api = 1×2 timetable
clear Models fcstDate yforc
%Create a table containing all the previous forecasts
load(fullfile('./Data/forecasts.mat'),'forcT'); %load only table 'forcT'
forcT = synchronize(forcT,forcT_api,'union','fillwithmissing') %Merge the 2 timetables
w2save = ["forcT", "y_untr_api", "downdatefredAPI"]; %select which variables to save
save(fullfile('./Data/forecasts.mat'),w2save{:},'-append'); %save selected variables (append, don't overwrite)
clear forcT_api forcT w2save
Note that all of the forecasts in the above table were run on the same date (17/10/2022) reflecting the information contained in each of the two datasets (call them FRED-MD, and FRED-RT where RT stands for Real-Time) as of that date. Also, all forecasts were conducted for the 1-step ahead prediction. The difference in the resulting 1-step ahead forecasted period is due to an additional LHS observation at the end of the sample.
If we had run the forecasting exercise (using this exact same code) the first week of October (say around 5/10), then both datasets would have contained the same inflation observations for the target series CPIAUCSL, and the 1-step ahead forecasts would correspond to the same period (i.e. October 2022).
Despite calling them by their generic name, 'forecasts', the setup here allows us to distinguish between the backcasts, being the predictions that refer to a time that has passed, and the nowcast, referring to the current month's prediction. By setting we obtained the backcasts, using the FRED-MD dataset, and the nowcast using the FRED-RT. To obtain the nowcasts from the FRED-MD dataset, re-run the models for . Factor Interpretability
The PCA loadings describe how strongly each factor depends on the original variables and in what direction. Therefore, each factor can be given an interpretation by inspecting which variables it weights most heavily, by looking at the estimated loadings.
plot(1:N, coeff(:,1:kk),'-');
%stairs(1:N, coeff(:,1:kk),'-'); %Use this for a larger set of series N
title(['Loadings for the first ', num2str(kk), ' PCA factors'])
xlabel('Indicator'); ylabel('Loading');
legend("F"+string(1:kk),'location','NW');
Since the loadings act as weights in averaging the timeseries in X into the factors, consequently, if we classify our series into some categories reflecting their type (e.g. Prices, Output, Stock Market Indicators, Interest Rates, Labor Market etc.) we can then provide an interpretation of each factor by looking how each factor correlates with each one of the (variables classified in the different) categories. That is a similar, but more involved procedure to the one that we saw above. For example, for the FRED-MD data, the interpretation of the first 8 factors is shown in the following graph. Each one column stacks the explained variability attributed to each one of the 12 categories distinguished by McCracken-Ng (2016).
To see how this factor-interpretation graph was created, please refer to McCracken and Ng (2016).
Partial Least Squares (PLS)
The idea underlying the PLS methodology, is conceptually very similar to the extraction of factors using PCA, with the difference that in PLS we take into account the factors that are important for our LHS variable (= response variable = target variable) y. In other words, while PCA reduces the variation among all the time series into a component that explains the most of all their variation, PLS can be thought of as reducing the variation among the series down to a common component that is the most closely related to a prespecified “target” variable.
In terms of implementation, similarly to PCA, X has to be normalised (i.e. have zero mean and unit variance) and y has to be de-meaned. Because of the similarity of the two approaches we focused on the PCA. However, PLS could potentially provide better forecasts compared to PCA.
As a final remark, the loadings/weights in both the PCA and PLS procedures can be used as a variable selection tool, and pick only the variables that contribute most to each factor. Having explained the differences between PCA & PLS, it is worth noting that PCA weights might include large weights for variables that are not strongly correlated with y.
Take-home Exercise 1
As a final exercise, in order to put together everything we learnt in this course, estimate the following 3 models for a monthly target variable of your choice (inflation, unemployment, industrial production etc.) and then compare their forecasting performance with the error measure of your choice (RMSE, MAE etc) in a pseudo out-of-sample forecast evaluation exercise (i.e. CV) conducted in a recursive manner (i.e. use an expanding window):
- Iterated AR(p) (rather than direct) with the optimal lag order selected with K-fold CV
- LASSO
- F-AR(p) with 2 PCA factors, where p=1.
Take-home Exercise 2
Download a set of daily financial indicators and add them to set of predictors. Then, down-sample them to monthly and repeat the forecasting exercise (with FAR and PCA) using the augmented monthly dataset. Do you get a better forecast, or have you simply added noise?
load(fullfile('./Data/fred_raw.mat'));
fred_codes = Meta.SeriesID; %Re-donwload the monthly dataset to have everything up-to-date
clearvars -except fred_codes
[TT_m, Meta_m] = fred_api(fred_codes, 'StartDate',datetime(1995,1,1));
%8 Daily Stock Market Series
fred_codes = ["NIKKEI225";"NASDAQCOM";"SP500";"DJIA";"WILLSMLCAP";"VIXCLS";"RVXCLS";"GVZCLS"];
[TT_d, Meta_d] = fred_api(fred_codes, 'StartDate',datetime(1995,1,1));
TTd2m = retime(TT_d, TT_m.Time, 'fillwithmissing'); %Convert D to M (i.e. downsample/temporally-aggregate)
Also, calculate the confidence interval for the one-step ahead forecast obtained by the OLS-based FAR model with PCA factors. Hint: You will need two things to compute the CI's: (1) the standard errors of the coefficients (betas), and (2) the estimated error variance (OLS sigma-squared). You can obtain those via regress() by requesting additional outputs.
%[B,BINT,R,RINT,STATS] = regress(Y,X)
%BINT = matrix with 95% confidence intervals for B
%R = vector of residuals.
%RINT = matrix of intervals that can be used to diagnose outliers.
%STATS = a vector containing, the R-square, the F statistic and p value
% for the full model, and the error variance estimate.
References
Bai, J., and Ng, S. (2002). "Determining the number of factors in approximate factor models." Econometrica, 70(1), 191-221.
Bergmeir, C., Hyndman, R. J., and Koo, B. (2018). "A note on the validity of cross-validation for evaluating autoregressive time series prediction." Computational Statistics & Data Analysis, 120: pp. 70-83.
Cleveland RB., Cleveland WS., McRae JE., and Terpenning IJ. (1990), "STL: A seasonal-trend decomposition procedure based on loess." Journal of Official Statistics, 6(1): pp. 3–33
Dickey, D. A., and W. A. Fuller. (1979). "Distribution of the estimators for autoregressive time series with a unit root." Journal of the American Statistical Association 74: 427–431.
Findley, D. F., B. C. Monsell, W. R. Bell, M. C. Otto, and B.-C. Chen. (1998) “New Capabilities and Methods of the X-12-ARIMA Seasonal-Adjustment Program.” Journal of Business & Economic Statistics. Vol. 16 (2): pp. 127–152.
Hodrick, Robert J., and Edward C. Prescott. (1997) "Postwar U.S. Business Cycles: An Empirical Investigation." Journal of Money, Credit and Banking 29 (1): pp. 1–16.
Hoerl, A. and Kennard, R. (1970) "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics, 12, 55-67.
McCracken, M. and Ng, S., (2016), "FRED-MD: A Monthly Database for Macroeconomic Research," Journal of Business & Economic Statistics, 34(4): pp. 574-589.
Stock, J. H., & Watson, M. W. (2002). "Macroeconomic forecasting using diffusion indexes." Journal of Business & Economic Statistics, 20(2), 147-162.
Tibshirani, R. (1996). "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
Zou, H., and Hastie, T. (2005). "Regularization and variable selection via the elastic net." Journal of the royal statistical society: series B (statistical methodology), 67(2), 301-320.