Plain and Target-PCA Factors for Mixed-Frequency Data

using the MATLAB Nowcasting Suite

This example uses various pre-processing functions available in the Nowcasting Suite, with the goal of constructing a set of monthly target-relevant PCA factors, setting the quarterly US GDP growth rate as the target variable. The high-frequency factors can then be used in a MIDAS-type predictive regression to backcast, nowcast and/or forecast the low-frequency target in real time.
The Nowcasting Suite is tailored to macroeconomic forecasting with real-time, mixed-frequency datasets, using MATLAB's native time-stamped data format, the timetable. The pcaF() function can implement target-PCA with either hard- or soft-thresholding ala Bai-Ng (2008). The example presented below uses hard-thresholding. Furthermore, pcaF() can create target-factors for both single- and mixed-frequency data. That is, it can construct monthly target-relevant factors for a quarterly target by down-sampling the transformed high-frequency regressors to the target's lower frequency, conducting the pre-selection step at the target's sampling frequency, and then collecting the selected predictors at their highest available frequency.
Contents
1. Downloading Data in Real-time
2. Transforming the target variable
3. Time-Dummies for Target Variable (to control for irregular periods)
4. Pre-processing the Monthly Dataset
4.1.a. Log Transformations
4.1.b. Stationarity Transformations
4.2. Outlier Detection and Adjustment
4.3. Fill Missing Values using EM ala SW(2002)
5. Estimate Plain- and Target-PCA factors
5.1. Plain-PCA monthly factors
5.2. Target-PCA monthly factors with Hard-thresholding
5.3. Target-PCA quarterly factors with Hard-thresholding
6. Extracting Leads & Lags (MIDAS & single-frequency)
6.1 Leads for Quarterly Predictors (X's)
6.2 MIDAS Leads & Lags for Monthly Factors (F's)
6.3 Legendre MIDAS for Monthly Factors (F's)
List of Functions
1. fred_api()
2. transLHS(), StepsAhead4Nowcast(), trans()
3. LHStimeDummies()
4.
4.1.b. trans2I0(), trans2I0_ADF(), DFGLS()
4.2. outl_adj()
4.3. MissFill()
5. pcaF()
6.
6.1 DownSample(), trans2MIDAS()
6.2 AFTERtrans2MIDAS(), flatify()
6.3 trans2MIDASLegendre()
A short description of selected functions included in the Nowcasting Suite, together with a brief overview of their features, can be found here.
%Add the Nowcasting Suite folder to MATLAB's search-path
addpath(genpath(fullfile(pwd, 'MaFunctions')));
warning('off','all') %Suppress all MATLAB warnings

1. Downloading Data in Real-time

Let's start by downloading a set of mixed-frequency series (30 monthly and 20 quarterly, including the quarterly GDP target) in real time, using the FRED API, accessed via the fred_api() function. The series mnemonics to be downloaded are stored in codes.M and codes.Q.
%Dataset: 30 Monthly and 20 Quarterly Macro & Financial Series
 
%Sample start date (change this to alter the length of the series used to train/estimate the models)
strtDate = datetime(2010,1,1);
 
%%% 1) MONTHLY DATA Codes %%%
codes.M = ["AAAFFM"; "AWHNONAG"; "BAAFFM"; "BOGMBASE"; "IPFINAL"; "CES9093000001"; ...
"CPIAUCSL"; "CPIULFSL"; "CUSR0000SAD"; "DMANEMP"; "DTCOLNVHFNM"; "IPDCONGD"; ...
"GS5"; "INDPRO"; "IPDMAT"; "LNS12032194"; "LNS13023557"; "LNS13023569"; ...
"LNS14000026"; "NDMANEMP"; "IPFPNSS"; "PPIIDC"; "T1YFFM"; "T5YFFM"; ...
"TB3MS"; "TB6SMFFM"; "TOTRESNS"; "UEMP5TO14"; "USPBS"; "USTPU"];
 
%%% 2) QUARTERLY DATA Codes %%%
codes.Q = ["GDPC1"; "A014RE1Q156NBEA"; "B021RE1Q156NBEA"; "CNCF"; "COMPRMS"; ...
"DHCERG3Q086SBEA"; "DHUTRG3Q086SBEA"; "DONGRG3Q086SBEA"; "COMPRNFB"; ...
"SLCE"; "OPHMFG"; "OPHNFB"; "OPHPBS"; "OUTBS"; "PCESV"; "RCPHBS"; "TLBSNNCB"; ...
"TNWBSNNB"; "ULCNFB"; "Y033RC1Q027SBEA"];
 
%%% Categories of Downloaded Variables %%%
%- Monthly (codes.M)
cat.M = ["Interest Rates"; "Labor"; "Interest Rates"; "Money and Credit"; "Industrial Production"; ...
"Labor"; "Prices"; "Prices"; "Prices"; "Labor"; "Money and Credit"; "Industrial Production"; ...
"Interest Rates"; "Industrial Production"; "Industrial Production"; "Labor"; "Labor"; ...
"Labor"; "Labor"; "Labor"; "Industrial Production"; "Prices"; "Interest Rates"; "Interest Rates"; ...
"Interest Rates"; "Interest Rates"; "Money and Credit"; "Labor"; "Labor"; "Labor"];
 
%- Quarterly (codes.Q)
cat.Q = ["Output, income, consumption"; "Output, income, consumption"; "Output, income, consumption"; ...
"Non-Household Balance Sheets"; "Earnings & Productivity"; "Prices"; "Prices"; "Prices"; ...
"Earnings & Productivity"; "Output, income, consumption"; "Earnings & Productivity"; ...
"Earnings & Productivity"; "Earnings & Productivity"; "Output, income, consumption"; ...
"Output, income, consumption"; "Earnings & Productivity"; "Non-Household Balance Sheets"; ...
"Non-Household Balance Sheets"; "Earnings & Productivity"; "Output, income, consumption"];
The table below shows the distribution of the downloaded variables into the various economic categories, which closely follow those of McCracken and Ng in the FRED-MD dataset.
%ALL Categories (the classification system used consists of 11 categories; 8 of them appear in this dataset)
tabulate(categorical([cat.M; cat.Q]))
Value                           Count    Percent
Earnings & Productivity             7     14.00%
Industrial Production               5     10.00%
Interest Rates                      7     14.00%
Labor                              11     22.00%
Money and Credit                    3      6.00%
Non-Household Balance Sheets        3      6.00%
Output, income, consumption         7     14.00%
Prices                              7     14.00%
%-------------------%
%%% Download Data %%%
%-------------------%
 
%%% 1) MONTHLY DATA %%%
[X.M, Meta, release] = fred_api(codes.M, 'StartDate',strtDate);
%The reference date in FRED API is the 1st of the Month. Change it to LAST day of month.
X.M.Time = dateshift(X.M.Time,'end','month',0);
 
RelDates = table(Meta.SeriesID, release.lastupdate', release.lastdate', Meta.Frequency, 'VariableNames',["SeriesID","LastUpdate","LastOb","Freq"]);
The following two tables contain selected metadata for the monthly panel. Specifically, the first reports the FRED mnemonic, the description of each series, and the assigned economic category. The second table reports each series' measurement units, together with the publication lags and the resulting ragged-edge structure of the monthly dataset: the last available observation for each series, and the release date of that last observation.
[~,loc] = ismember(codes.M, Meta.SeriesID); %loc: index in RIGHT for each value in LEFT
[Meta(loc,"SeriesID") table(cat.M,'VariableNames',"Category") Meta(loc,["Title"])]
ans = 30x3 table
      SeriesID           Category                    Title
  1   "AAAFFM"           "Interest Rates"            "Moody's Seasoned Aaa Corporate Bond Minus Federal Funds Rate"
  2   "AWHNONAG"         "Labor"                     "Average Weekly Hours of Production and Nonsupervisory Employees, Total Private"
  3   "BAAFFM"           "Interest Rates"            "Moody's Seasoned Baa Corporate Bond Minus Federal Funds Rate"
  4   "BOGMBASE"         "Money and Credit"          "Monetary Base; Total"
  5   "IPFINAL"          "Industrial Production"     "Industrial Production: Final Products"
  6   "CES9093000001"    "Labor"                     "All Employees, Local Government"
  7   "CPIAUCSL"         "Prices"                    "Consumer Price Index for All Urban Consumers: All Items in U.S. City Average"
  8   "CPIULFSL"         "Prices"                    "Consumer Price Index for All Urban Consumers: All Items Less Food in U.S. City Average"
  9   "CUSR0000SAD"      "Prices"                    "Consumer Price Index for All Urban Consumers: Durables in U.S. City Average"
 10   "DMANEMP"          "Labor"                     "All Employees, Durable Goods"
 11   "DTCOLNVHFNM"      "Money and Credit"          "Consumer Motor Vehicle Loans Owned by Finance Companies, Level"
 12   "IPDCONGD"         "Industrial Production"     "Industrial Production: Durable Consumer Goods"
 13   "GS5"              "Interest Rates"            "Market Yield on U.S. Treasury Securities at 5-Year Constant Maturity, Quoted on an Investment Basis"
 14   "INDPRO"           "Industrial Production"     "Industrial Production: Total Index"
[Meta(:,["SeriesID","Units"]) RelDates(:,["LastUpdate","LastOb"])]
ans = 30x4 table
      SeriesID           Units                     LastUpdate     LastOb
  1   "AAAFFM"           "Percent"                 01-Feb-2024    31-Jan-2024
  2   "AWHNONAG"         "Hours"                   02-Feb-2024    31-Jan-2024
  3   "BAAFFM"           "Percent"                 01-Feb-2024    31-Jan-2024
  4   "BOGMBASE"         "Millions of Dollars"     23-Jan-2024    31-Dec-2023
  5   "IPFINAL"          "Index 2017=100"          15-Feb-2024    31-Jan-2024
  6   "CES9093000001"    "Thousands of Persons"    02-Feb-2024    31-Jan-2024
  7   "CPIAUCSL"         "Index 1982-1984=100"     13-Feb-2024    31-Jan-2024
  8   "CPIULFSL"         "Index 1982-1984=100"     13-Feb-2024    31-Jan-2024
  9   "CUSR0000SAD"      "Index 1982-1984=100"     13-Feb-2024    31-Jan-2024
 10   "DMANEMP"          "Thousands of Persons"    02-Feb-2024    31-Jan-2024
 11   "DTCOLNVHFNM"      "Millions of Dollars"     19-Jan-2024    30-Nov-2023
 12   "IPDCONGD"         "Index 2017=100"          15-Feb-2024    31-Jan-2024
 13   "GS5"              "Percent"                 01-Feb-2024    31-Jan-2024
 14   "INDPRO"           "Index 2017=100"          15-Feb-2024    31-Jan-2024
%%% 2) QUARTERLY DATA %%%
[X.Q, Meta_q, release_q] = fred_api(codes.Q, 'StartDate',strtDate);
%The reference date in FRED API is the 1st day of the Quarter. Change it to the LAST day.
X.Q.Time = dateshift(X.Q.Time,'end','quarter',0);
 
Meta = [Meta; Meta_q];
RelDates_q = table(Meta_q.SeriesID, release_q.lastupdate',release_q.lastdate', Meta_q.Frequency, 'VariableNames',["SeriesID","LastUpdate","LastOb","Freq"]); %Keep to see which series have been updated since last time
RelDates = [RelDates; RelDates_q];
clear Meta_q release_q RelDates_q release loc
 
Xraw = X; %Rename the dataset

2. Transforming the target variable

Function transLHS() applies the desired transformation to the target variable. The specs passed below (in struct 'LHS') will transform GDP from levels to annualized quarter-on-quarter (QoQ) discretely compounded growth rates.
Specifically, setting tcode=11 calculates percentage changes, while tcode=1 would take 1st differences, tcode=2 would take 2nd differences, and tcode=21 would transform by taking the 1st differences of the percentage changes.
Setting XoX=LP determines the base period, with LP corresponding to 'last period' (QoQ, MoM), as opposed to SPLY, or 'same period last year' (YoY). Apart from the non-h-specific transformations (QoQ/MoM/YoY), transLHS() can also accommodate h-specific transformations, such as h-quarter/h-month changes or percentage changes.
LHS.name = 'GDPC1'; %GDP growth (QoQ% Annualized; simple-rates)
LHS.f = 'Q';
LHS.tcode = 11; LHS.hSpecificY = false; LHS.Percent = true;
LHS.Annualize = true; LHS.log = false; LHS.XoX = 'LP';
LHS.log4lags = true; %Transformation for the AR-lags (output: LHS.y4lags)
LHS.DaysDelay = 65;
LHS.flow = true; %Stock/flow type refers to the variable in LEVELS (not the transformed version)
%Description (for plotting):
LHS.P.Title = 'US Real GDP growth';
LHS.P.Des = 'Annualized QoQ%';
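For reference, here is a minimal sketch of the arithmetic these specs imply (our assumed mapping of tcode=11 with Percent=true, Annualize=true, XoX='LP' and log=false at quarterly frequency; transLHS() is the authoritative implementation):
%Assumed arithmetic: y_t = 100*((GDP_t/GDP_{t-1})^4 - 1)
gdp  = X.Q{:,"GDPC1"};                           %GDP in levels
ychk = 100*((gdp(2:end)./gdp(1:end-1)).^4 - 1);  %annualized QoQ% (discrete rates)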
The GDP series will normally contain provisional 1st and 2nd estimates on the first and second months of the following quarter, respectively, before the final figure is released, on the last month of the quarter following the reference quarter. There are 2 modelling choices for those preliminary figures:
1) Keep them among the realized values, and simply forecast the next period.
2) Delete them from the sample, and forecast them.
transLHS() allows the user to automate the detection and removal of provisional values with the optional input 'RmvProvisional'. Below, we take the 2nd route by making use of 'RmvProvisional'. By setting DaysDelay=65, observations released within the first 65 days after the closing of the reference quarter are labeled as provisional and deleted from the resulting transformed target (LHS.y). This ensures that only GDP values released in the 3rd month of the quarter remain in the sample. To follow the 1st route, we can simply set RmvProvisional=[] or DaysDelay=[].
% Treating Provisional/Preliminary/Advance figures.
RelDates(RelDates.SeriesID==LHS.name,:) %The release date of the LAST ob of the series
ans = 1x4 table
      SeriesID    LastUpdate     LastOb         Freq
  1   "GDPC1"     25-Jan-2024    31-Dec-2023    "Quarterly"
lastObRelDate = RelDates{RelDates.SeriesID==LHS.name,"LastUpdate"};
rmvpro.DateReleased = lastObRelDate; %We have set LHS.DaysDelay=65
 
%NB: If I use input 'RmvProvisional' to remove Provisional obs from the LHS,
%we also need to remove these obs from the inputted matrix X for consistency
%(- that's the purpose of the 2nd output of transLHS()).
hmax = 3;
[LHS, Xraw.(LHS.f)] = transLHS(LHS,Xraw.(LHS.f),'hall',1:hmax,'RmvProvisional',rmvpro);
clear lastObRelDate rmvpro hmax
 
%Display the last 5 obs
LHS.pr.y(end-4:end,1);
LHS.y(end-4:end,1); %<- The last (provisional) observation has been removed
LHS.y4lags(end-4:end,1); %MoM change for AR-lags (-don't multiply by 100)
 
%Plot the transformed LHS series
p = stackedplot(LHS.y,{LHS.y.Properties.VariableNames(1)});
p.Title = [LHS.P.Title ' (' LHS.P.Des ')'];
p.DisplayLabels = {''};
clear p
Apart from the transformed LHS, function transLHS() also returns a separate transformed version of the target (in LHS.y4lags), to be used for the autoregressive (AR) terms. For target variables that are transformed into 1st or 2nd differences (not levels), regardless of the selected LHS transformation, the transformation for the AR lags is always:
(1) with respect to the previous period (i.e. MoM/QoQ, even if the user selected YoY for the LHS),
(2) non-annualized, and
(3) expressed as percentages in decimal form (not in '%' form, i.e. without dividing by 100).
tail(LHS.y4lags,5) %Transformation for AR terms
Time          GDPC1
30/09/2022    0.0065646
31/12/2022    0.0063341
31/03/2023    0.0055484
30/06/2023    0.0050982
30/09/2023    0.011868
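As a hedged cross-check of rules (1)-(3), and assuming that log4lags=true means the AR-lags are computed as log-differences, the values above should be reproducible as:
gdp    = X.Q{:,"GDPC1"};   %GDP in levels
arlags = diff(log(gdp));   %QoQ log-change: non-annualized, decimal form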
Conveniently, to help handle the mismatch of frequencies, transLHS() also stores the date corresponding to the last available observation of the LHS, in LHS.T.
LHS.T %Last LHS observation
ans = datetime 30/09/2023
Next, we use function StepsAhead4Nowcast() to define the number of h-steps ahead required to predict the current period (nowcasting). The resulting 'h' is measured in terms of the target variable's sampling frequency; here, 'h' therefore corresponds to quarters ahead.
%today's date
asofDate = datetime('today', 'Format','dd/MM/yyyy');
 
%Find the 'h' needed in order to calculate the NOWCAST.
%NB: If instead of the NOWCAST, you want the 2-quarters ahead FORECAST,
% then simply do: h_now=StepsAhead4Nowcast(), and then set h=h_now+2;
h = StepsAhead4Nowcast(LHS.T,'Now',asofDate,'Frequency',LHS.f);
 
%fcastDate: The DATE to be forecasted (i.e. the DATE corresponding to h-steps ahead)
fcstDate = dateshift(LHS.T,'end','quarter', h);
 
msg = ['As of ' char(asofDate) ', the last available observation for the ' LHS.P.Title ' is for ' char(LHS.T) '.' newline 'The nowcast (' char(fcstDate) ') is obtained by setting h=' num2str(h) '.'];
disp(msg)
As of 26/02/2024, the last available observation for the US Real GDP growth is for 30/09/2023. The nowcast (31/03/2024) is obtained by setting h=2.
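A minimal sketch of the quarters-ahead arithmetic presumably performed by StepsAhead4Nowcast() (an assumption; consult the function's documentation):
qLast = dateshift(LHS.T,   'start','quarter');  %quarter of the last LHS ob
qNow  = dateshift(asofDate,'start','quarter');  %quarter to be nowcasted
h_chk = split(between(qLast,qNow,'quarters'),'quarters'); %2023Q3 -> 2024Q1 gives 2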
Function transLHS() makes an internal call to another Nowcasting Suite function, trans(), which applies the desired transformation. Here is an example of using trans() to convert the headline and core CPI indices to MoM% annualized inflation in discretely compounded rates. Similarly to transLHS(), setting XoX='SPLY' would transform CPI to YoY% inflation, while setting Log=true would calculate MoM% annualized inflation in continuously compounded rates.
x0 = fred_api(["CPIAUCSL", "CPILFESL"], 'StartDate',datetime(2020,1,1));
x1 = trans(x0,11,'M','h',[],'hSpecificY',false,'Log',false,'Percent',true,'Annualize',true,'XoX','LP');
p = stackedplot(x1(:,:),{["CPIAUCSL", "CPILFESL"]});
p.DisplayLabels = {''};
p.Title = 'US CPI Inflation';
p.AxesProperties(1).LegendLabels = ["Headline","Core"];
clear p x0 x1 msg

3. Time-Dummies for Target Variable (to control for irregular periods)

In order to control for irregular periods (structural breaks, outliers) in the target variable, we can create a set of time-dummies capturing such periods, and subsequently control for them by adding the dummies as additional control variables in the corresponding predictive regressions. The same control dummies can accompany the thresholding regressions of the targeted-factors in the predictor pre-selection step. Function LHStimeDummies() uses IQR-based outlier detection to create a different set of time-dummies for each h-step-ahead version of the LHS. Automating the time-dummy creation process makes it easy to employ different target variables.
The table that follows shows that, with the threshold for the detection of irregular periods set to 1.8, the function labels a total of 7 observations as 'irregular'. Those 7 irregular observations are clustered by LHStimeDummies() into 4 periods of 'irregular' activity, each made of consecutive 'irregular' quarters, 2 of which span multiple quarters. Specifically, over the period 2010-2023 covered in this sample, the 4 periods where GDP growth took extreme values are: 2014Q1, 2020Q1-2020Q3, 2021Q2 and 2021Q4-2022Q1. The set of dummies created by function LHStimeDummies() will consist of 4 control dummies (some covering consecutive quarters), one for each of these periods; a sketch of the clustering logic follows the outlier table below.
%Detect irregular periods for Y
[Dh, PeriodsDetected, outliersY] = LHStimeDummies(LHS,'DetectOptions',struct('Threshold',1.8));
[outliersY(:,h) LHS.y(:,h)] %See the outliers (for the nowcast period)
ans = 54x2 timetable
      Time          out_GDPC1_h2    GDPC1_h2
  1   30/06/2010    0                3.9269
  2   30/09/2010    0                3.1202
  3   31/12/2010    0                2.1170
  4   31/03/2011    0               -0.9454
  5   30/06/2011    0                2.7339
  6   30/09/2011    0               -0.0892
  7   31/12/2011    0                4.5684
  8   31/03/2012    0                3.3968
  9   30/06/2012    0                1.7973
 10   30/09/2012    0                0.5774
 11   31/12/2012    0                0.4634
 12   31/03/2013    0                4.0050
 13   30/06/2013    0                1.0749
 14   30/09/2013    0                3.4493
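The clustering of consecutive outlier quarters into episodes can be sketched as follows (our illustration; LHStimeDummies() implements its own logic internally):
out       = (outliersY{:,h} ~= 0);         %logical outlier flags for horizon h
strt      = out & [true; ~out(1:end-1)];   %first quarter of each episode
episodeID = cumsum(strt).*out;             %0 = regular; 1,2,... = episode label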
Dh{h}(15:50,:)
ans = 36x4 timetable
      Time          D_20140331    D_20200331    D_20210630    D_20211231
  1   31/12/2013    0             0             0             0
  2   31/03/2014    1             0             0             0
  3   30/06/2014    0             0             0             0
  4   30/09/2014    0             0             0             0
  5   31/12/2014    0             0             0             0
  6   31/03/2015    0             0             0             0
  7   30/06/2015    0             0             0             0
  8   30/09/2015    0             0             0             0
  9   31/12/2015    0             0             0             0
 10   31/03/2016    0             0             0             0
 11   30/06/2016    0             0             0             0
 12   30/09/2016    0             0             0             0
 13   31/12/2016    0             0             0             0
 14   31/03/2017    0             0             0             0
clear PeriodsDetected outliersY

4. Pre-processing the Monthly Dataset

4.1. Log and Stationarity [I(0)] Transformations
4.2. Outlier Detection and Adjustment (i.e. set to NaN)
4.3. Fill Missing Values using EM ala SW(2002)
%%%%%%%%%%%%%%%%%%
% Pre-processing %
%%%%%%%%%%%%%%%%%%
f='M'; %Changing this to 'Q' would run the same pre-processing for panel Q.
Xt = Xraw.(f);
 
[~,locmeta] = ismember(Xt.Properties.VariableNames, Meta.SeriesID); %index in RIGHT for each value in LEFT
MetaF = Meta(locmeta,:); %Frequency-specific 'Meta'
clear locmeta Xraw

4.1. Log and Stationarity Transformations

4.1.a. Log Transformations

The first pre-processing step is to choose which of the predictors to log-transform before differencing, to ensure stationarity. Following Stock and Watson (2002), we take the natural logarithm of all non-negative series (rule 1) that are not already in rates or percentage units (rule 2).
To see which variables are already in rates or percentage units, we would need to read through the table containing the variable descriptions. Alternatively, to automate the selection under rule 2, we can search for and exclude all series that contain the word 'percent' in the Units variable of the monthly dataset's metadata table.
%1a) Log Transformations
 
%Rule 1: Series that are non-negative (+ive or 0)
small = 1e-6; % Value close to zero
canlog = (min(Xt{:,:}) > small); %series NOT close to 0
 
%Rule 2: Series that are not already in rates or percentage units
MetaF(:, ["SeriesID","Units"])
ans = 30x2 table
      SeriesID           Units
  1   "AAAFFM"           "Percent"
  2   "AWHNONAG"         "Hours"
  3   "BAAFFM"           "Percent"
  4   "BOGMBASE"         "Millions of Dollars"
  5   "IPFINAL"          "Index 2017=100"
  6   "CES9093000001"    "Thousands of Persons"
  7   "CPIAUCSL"         "Index 1982-1984=100"
  8   "CPIULFSL"         "Index 1982-1984=100"
  9   "CUSR0000SAD"      "Index 1982-1984=100"
 10   "DMANEMP"          "Thousands of Persons"
 11   "DTCOLNVHFNM"      "Millions of Dollars"
 12   "IPDCONGD"         "Index 2017=100"
 13   "GS5"              "Percent"
 14   "INDPRO"           "Index 2017=100"
notpct = ~contains(MetaF.Units, "percent",'IgnoreCase',true)'; %tilde (~) = negation operator
 
%Combine Rules 1 & 2
what2log = notpct & canlog;
 
%Check the decisions and which X's will be logged.
array2table([string(Xt.Properties.VariableNames); notpct; canlog; what2log]','VariableNames',["Series","Non-Rate","Non-negative","What2Log"])
ans = 30x4 table
      Series             Non-Rate    Non-negative    What2Log
  1   "AAAFFM"           "false"     "false"         "false"
  2   "AWHNONAG"         "true"      "true"          "true"
  3   "BAAFFM"           "false"     "true"          "false"
  4   "BOGMBASE"         "true"      "true"          "true"
  5   "IPFINAL"          "true"      "true"          "true"
  6   "CES9093000001"    "true"      "true"          "true"
  7   "CPIAUCSL"         "true"      "true"          "true"
  8   "CPIULFSL"         "true"      "true"          "true"
  9   "CUSR0000SAD"      "true"      "true"          "true"
 10   "DMANEMP"          "true"      "true"          "true"
 11   "DTCOLNVHFNM"      "true"      "true"          "true"
 12   "IPDCONGD"         "true"      "true"          "true"
 13   "GS5"              "false"     "true"          "false"
 14   "INDPRO"           "true"      "true"          "true"
tcode.(f).log = what2log;
clear small canlog pos idx notpct MetaF

4.1.b. Stationarity Transformations

The vector containing the decisions on which series to log-transform is passed as input to function trans2I0(), which implements an iterative unit-root testing procedure to determine the appropriate number of differences required to achieve stationarity for each series in the monthly panel. Prior to transforming the series, trans2I0() applies the log-transformations.
The unit-root testing procedure of trans2I0() is based on the DF-GLS test of Elliott, Rothenberg & Stock (1996). The function allows choosing the optimal lag length k* in the underlying ADF regressions with 3 different methods:
1. SIC,
2. Modified Akaike information criterion (MAIC), ala Ng & Perron (2001; Econometrica),
3. Sequential-t algorithm ala Ng & Perron (1995; JASA).
The stationarity transformations for the monthly panel below are based on the SIC, while the rejection of the null is set at the 5% significance level.
%1b) Stationarity [I(0)] Transformations
 
%Transform to Stationary based on iterative Unit Root (UR) tests
%(DFGLS with optimal truncation lag order selected via SIC, considering 0-Kmax lags)
opDFGLS = struct('Trend',1,'Kmin',0,'Kmax',[],'MethADF',1,'MethCVal',{'CL'},'MethLagOrder',{'SIC'},'SignifLevel',5);
[Xst, objDFGLS] = trans2I0(Xt,'How','URTest','What2Log',what2log,'TestOptions',opDFGLS,'MaxDiffs',3,'Diagnostics',true);
tcode.(f).D = objDFGLS.TransCodes;
clear what2log Xt
Function trans2I0() imposes a ceiling on the number of times a series is allowed to be differenced until it achieves stationarity, with the maximum set at 3. This choice of maximum differences is in line with the stationarity transformation codes of McCracken and Ng (2016). The first column of the following table shows how many times each series was differenced (0=levels; 1=1st diffs; 2=2nd diffs, etc.), and the second column shows whether the series managed to reject the null hypothesis of a unit root.
dfgls = array2table([objDFGLS.TransCodes' objDFGLS.URrejected'],'VariableNames',["TransCodes","URrejected"],'RowNames',Xst.Properties.VariableNames)
dfgls = 30x2 table
                       TransCodes    URrejected
  1   AAAFFM           1             1
  2   AWHNONAG         3             1
  3   BAAFFM           1             1
  4   BOGMBASE         1             1
  5   IPFINAL          1             1
  6   CES9093000001    1             1
  7   CPIAUCSL         1             1
  8   CPIULFSL         1             1
  9   CUSR0000SAD      1             1
 10   DMANEMP          1             1
 11   DTCOLNVHFNM      1             1
 12   IPDCONGD         0             1
 13   GS5              1             1
 14   INDPRO           1             1
Moreover, trans2I0() saves the output of the DFGLS test for each series in levels, as well as after each consecutive differencing, allowing an in-depth look at what happens to the series with each subsequent application of the test. The test's output closely follows the output of the 'dfgls' command in Stata. The next 4 tables report the DFGLS test results, one after each differencing applied to the 2nd series (AWHNONAG), until it achieved stationarity. The corresponding DFGLS test statistic is reported in the 3rd column of the test output. Recall that the decision rule for the rejection of the unit-root null was set at the 5% level of significance, with the optimal truncation lag order selected via the SIC (1st row of the test output table). We can see that in the 4th table, corresponding to the series differenced 3 times, the unit-root null is rejected even at the 1% level.
series = 2;
objDFGLS.TestResults{series}{:}
ans = 3x9 table
      Method      Opt. Lag    DF-GLS     Obs    RMSE      1% C.Val.    5% C.Val.    10% C.Val.    IC
 1    "SIC"        2          -1.7309    166    0.0031    -3.4972      -2.9392      -2.6508       -11.4285
 2    "MAIC"       2          -1.7309    166    0.0031    -3.4972      -2.9392      -2.6508       -11.4634
 3    "Seq. t"    11          -2.6600    157    0.0029    -3.4972      -2.8304      -2.5516         0.0004
ans = 3x9 table
      Method      Opt. Lag    DF-GLS     Obs    RMSE      1% C.Val.    5% C.Val.    10% C.Val.    IC
 1    "SIC"       10          -1.8019    157    0.0031    -3.4984      -2.8454      -2.5654       -11.1984
 2    "MAIC"      10          -1.8019    157    0.0031    -3.4984      -2.8454      -2.5654       -11.3252
 3    "Seq. t"    10          -1.8019    157    0.0031    -3.4984      -2.8454      -2.5654         0.0001
ans = 3x9 table
      Method      Opt. Lag    DF-GLS     Obs    RMSE      1% C.Val.    5% C.Val.    10% C.Val.    IC
 1    "SIC"       10          -2.7834    156    0.0038    -3.4996      -2.8452      -2.5653       -10.8103
 2    "MAIC"       9          -2.3777    157    0.0039    -3.4996      -2.8598      -2.5788       -10.8363
 3    "Seq. t"    10          -2.7834    156    0.0038    -3.4996      -2.8452      -2.5653         0.0045
ans = 3x9 table
      Method      Opt. Lag    DF-GLS     Obs    RMSE      1% C.Val.    5% C.Val.    10% C.Val.    IC
 1    "SIC"        8          -3.7646    157    0.0047    -3.5008      -2.8738      -2.5916       -10.4018
 2    "MAIC"       5          -1.9643    160    0.0057    -3.5008      -2.9111      -2.6256       -10.0336
 3    "Seq. t"    10          -3.9612    155    0.0047    -3.5008      -2.8450      -2.5652         0.0811
clear series
 
%CHECK: Stationarity using SIMPLE ADF test (NOT DFGLS)
[Xtest, objADF] = trans2I0_ADF(Xst,'What2Log',[],'Diagnostics',false); %NB: Do NOT log here
hasUR = (objADF.TransCodes ~= 0);
find(hasUR);
dfgls(hasUR,:);
%objADF.TransCodes>0 => series did NOT have its UR rejected/removed by the DFGLS
chck = array2table([objADF.TransCodes', objDFGLS.URrejected'],'VariableNames',["ADF_TransCodes","DFGLS_URrejected"],'RowNames',Xst.Properties.VariableNames);
chck(hasUR,:);
cases2check = ~(objADF.TransCodes==0 & objDFGLS.URrejected==1);
chck(cases2check,:);
dfgls(cases2check,:);
 
Xst(:,hasUR) = Xtest(:,hasUR); %Replace the UR-series with the ADF-stationary series
tcode.(f).D(hasUR) = objDFGLS.TransCodes(hasUR) + objADF.TransCodes(hasUR); %Final tcodes of series in Xst
clear Xtest hasUR opDFGLS cases2check chck objADF objDFGLS dfgls
trans2I0() makes an internal call to another function of the Nowcasting Suite, the DFGLS() function. DFGLS() conducts the unit-root test on individual series and returns the test output. Let's run an example using the GDP series in continuously compounded (that is, log-) growth rates. The settings selected below replicate Stata's default behaviour for its 'dfgls' command. Specifically:
- Setting Trend=1, the series is GLS-transformed (de-trended) using a linear trend prior to running the ADF regressions. This sets the alternative hypothesis (Ha) to the series being stationary around a linear trend. Alternatively, setting Trend=0, the GLS-transformation only de-means the variable and the Ha is set to Yt being stationary with no linear time trend.
- The minimum number of lags in the ADF regression is set to 1, while the maximum (Kmax=[]) is set to the value proposed by Schwert (1989): kmax = floor(12*(T/100)^(1/4)).
- The critical values for the DF-GLS (aka ERS) test statistic are based on Cheung & Lai (1995). The test output also reports the Elliott, Rothenberg & Stock (1996) critical values.
%(A)DF OLS regressions: dYt = rho*Yt-1 + c1*dYt-1 + c2*dYt-2 +...+ ckmax*dYt-kmax + et
%When k=0 the regression equation becomes the DF regression (i.e. no dYt lags).
%where k is the # of dYt-j terms in the (A)DF regression.
 
x0 = X.Q(:,"GDPC1"); %GDP (in Levels)
x1 = trans(x0,11,'Q','h',[],'hSpecificY',false,'Log',true,'Percent',false,'Annualize',false,'XoX','LP');
dfglsout = DFGLS(x1,'Trend',1,'Kmin',1,'Kmax',[],'MethADF',0,'MethCVal','CL')
The maximum lag chosen (using the Schwert criterion) is 10
dfglsout = 3x9 table
      Method      Opt. Lag    DF-GLS     Obs    RMSE      1% C.Val.    5% C.Val.    10% C.Val.    IC
 1    "SIC"       1           -6.4829    44     0.0164    -3.7510      -3.2019      -2.8976       -8.0507
 2    "MAIC"      1           -6.4829    44     0.0164    -3.7510      -3.2019      -2.8976       -2.4902
 3    "Seq. t"    NaN         NaN        NaN    NaN       NaN          NaN          NaN            NaN
clear dfglsout x0 x1
The NaN cells at the bottom of the output table, which refer to the Ng and Perron (1995) sequential-t algorithm, imply that the 'last' dYt-k term, at the different iterations over the lag order, was never statistically significant at the pre-selected 10% threshold.

4.2. Outlier Detection and Adjustment

Function outl_adj() detects outliers in the individual series based on each series' interquartile range (IQR), and adjusts them either by replacing them with the median, or by setting them to missing (NaN) to be filled at later pre-processing steps. A data point x is labeled as an outlier if its absolute deviation from the median is larger than a user-selected multiple (the threshold) of the series' IQR:
|x - median(X)| > threshold * IQR(X).
The threshold is normally set to 10.
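A minimal MATLAB sketch of this detection rule (illustrative only; outl_adj() is the suite's implementation):
x     = Xst{:,"INDPRO"};               %one transformed series
dev   = abs(x - median(x,'omitnan'));  %absolute deviation from the median
isOut = dev > 10*iqr(x);               %threshold = 10 (the usual choice)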
Optional input 'IncludeUp2' allows the user to select the date up to which to perform outlier detection and correction. This feature of outl_adj() can be useful if one wants to leave intact the observations at the bottom end of the sample that will be used to calculate the forecast. Such a practice can lead to superior forecasting accuracy when extreme economic conditions are unravelling and extreme values appear at the bottom end of the sample.
For example, setting IncludeUp2=dateshift(LHS.T,'end','month',-12) will exclude from the outlier-adjustment process any 'leading' observations, as well as any lagging observations pertaining to the 12 months preceding the last available observation of the target. Assuming that the forecast is estimated using V=12 lags of the monthly predictors (one year's worth of past information, plus any available leading observations), this will exclude all observations that will be used to calculate the forecast.
Alternatively, setting IncludeUp2=0 searches for and fixes outliers over the entire series, which is the standard practice. That is what is done below, in this example.
%------------------------------%
% Outlier Detection/Adjustment %
%------------------------------%
%Step 1: Detect outliers & replace with missing
inclup2 = 0;
thr = 10; %IQR multiple
adjm = 0; %replace detected outliers with NaNs
[Xst, outlier] = outl_adj(Xst, thr, adjm,'IncludeUp2',inclup2);
 
%row-wise sum to check # of problematic obs per period
array2timetable(sum(outlier{:,:},2), 'RowTimes',outlier.Time,'VariableNames', "Outliers")
ans = 169x1 timetable
      Time          Outliers
  1   31/01/2010    0
  2   28/02/2010    0
  3   31/03/2010    1
  4   30/04/2010    0
  5   31/05/2010    0
  6   30/06/2010    0
  7   31/07/2010    0
  8   31/08/2010    0
  9   30/09/2010    0
 10   31/10/2010    0
 11   30/11/2010    0
 12   31/12/2010    1
 13   31/01/2011    0
 14   28/02/2011    0
%column-wise sum to check # of problematic obs per indicators
array2table(sum(outlier{:,:},1)','VariableNames',"Outliers",'RowNames',Xst.Properties.VariableNames)
ans = 30x1 table
                       Outliers
  1   AAAFFM           0
  2   AWHNONAG         0
  3   BAAFFM           0
  4   BOGMBASE         0
  5   IPFINAL          1
  6   CES9093000001    2
  7   CPIAUCSL         0
  8   CPIULFSL         0
  9   CUSR0000SAD      1
 10   DMANEMP          2
 11   DTCOLNVHFNM      4
 12   IPDCONGD         0
 13   GS5              0
 14   INDPRO           1
clear outlier adjm thr inclup2
 
%Check ROWS with too many NaNs
array2timetable(sum(isnan(Xst{:,:}),2),'RowTimes',Xst.Time,'VariableNames', "NaNs");
%NB: We will have a spike in the number of outliers/NaNs around the COVID-lockdowns period & in the BOTTOM because of the ragged edge
 
X0.(f) = Xst; %X0.M = Xst;
clear Xst X

4.3. Fill Missing Values using EM ala SW(2002)

In this section, we use function MissFill() to implement the EM algorithm of SW(2002) in order to fill any missing values that are due to:
1) the restricted history of the series,
2) asynchronous publication releases and the implied ragged-edges, and
3) any outliers detected and set to missing by outl_adj(), in the steps above.
Before filling the missing values of the transformed monthly panel (X0.M), let's take a look at its structure.
R = 25;
tail(X0.M(:, 1:R),10)
Time        AAAFFM  AWHNONAG  BAAFFM  BOGMBASE  IPFINAL  CES9093000001  CPIAUCSL  CPIULFSL  CUSR0000SAD  DMANEMP  DTCOLNVHFNM  IPDCONGD  GS5  INDPRO  IPDMAT  LNS12032194  LNS13023557  LNS13023569  LNS14000026  NDMANEMP  IPFPNSS  PPIIDC  T1YFFM  T5YFFM  TB3MS
30/04/2023  -0.31  -0.0058997  -0.36  0.0038694  0.011934  0.0017278  0.0042594  0.0048814  0.005979  0.0012378  0.0079712  4.6858  -0.28  0.0047811  0.020251  -0.048839  0.054911  0.025367  3.1  -0.00020587  0.0074643  -0.00116  -0.18  -0.46  0.23
31/05/2023  -0.02  0.0059084  0.02  -0.0042105  -0.0064963  0.0017937  0.0010983  0.0009255  0.0019772  0.00012369  0.011805  4.6935  0.05  -0.002205  0.00048824  -0.041667  0.040237  0.015297  3.3  -0.0012361  -0.0051994  -0.011951  0.01  -0.17  0.22
30/06/2023  -0.05  -0.0029542  -0.05  0.0069601  -0.011523  0.0019281  0.0021009  0.0021772  -0.0018909  0.0023473  0.020129  4.658  0.36  -0.0060792  -0.00028333  0.11633  -0.041362  0.058949  3.1  -0.0020636  -0.0087547  0.0002877  0.3  0.33  0.02
31/07/2023  -0.03  0  -0.05  -0.016304  0.011757  0.0019244  0.0020538  0.0020731  -0.0028784  0.0011099  0.010767  4.6913  0.19  0.0086603  0.0036842  -0.046316  0.050505  -0.045754  3.1  -0.0031033  0.0088713  0.0021374  0.09  0.15  0.09
31/08/2023  0.08  0  0.07  0.007457  -0.00095221  0.0014409  0.0051047  0.0055183  -0.0030131  0.00036971  0.012299  4.6664  0.17  -0.00073581  -0.0069698  0.05178  0.03317  0.10311  3.2  -0.00020723  8.8863e-06  0.019667  -0.21  -0.04  0.05
30/09/2023  0.18  0  0.14  0.001474  -0.0034292  0.0018495  0.0035894  0.0038546  -0.0034978  0.0011083  0.0086896  4.6761  0.18  0.00092573  0.0041578  -0.036675  0.047038  -0.010187  3.1  0.00082867  -0.0018441  0.0064029  0.07  0.18  0.02
31/10/2023  0.48  0  0.47  0.0060885  -0.0080607  0.002051  0.00079048  0.00045946  -0.0036294  -0.0040698  0.0063631  4.6241  0.28  -0.0081467  -0.01737  0.05149  -0.079672  0.028597  3.2  0.00041408  -0.0055976  -0.015882  -0.02  0.28  0.02
30/11/2023  -0.33  -0.002963  -0.34  0.022979  0.0065739  0.0015696  0.0016018  0.0015868  -0.0027726  0.004562  0.0059471  4.6563  -0.28  0.0029145  0.0076218  -0.070094  -0.053859  -0.035447  3.1  -0.002487  0.0040404  -0.010533  -0.14  -0.28  -0.07
31/12/2023  -0.54  0.0059259  -0.65  0.016559  0.00067555  0.0023838  0.0023283  0.0023657  -0.0036848  0.0015979  NaN  4.6758  -0.49  4.8699e-06  0.00056341  0.052907  -0.017085  0.045348  3.3  -0.0010381  -0.001021  -0.013899  -0.32  -0.49  -0.03
31/01/2024  0.13  -0.0089154  0.04  NaN  0.0043495  0.0010199  0.0030497  0.0029272  -0.0046151  0.00049116  NaN  4.6782  -0.02  -0.00095106  -0.003113  0.048892  0.05204  -0.1019  3.2  0.0039391  0.001637  0.0057311  -0.17  -0.02  -0.02
%%% Fill the Missing Values (outliers & top NaNs) using EM ala SW(2002) %%%
%Step 2: Correct Outliers & the NaNs
%I am filling up to the LHS i.e. excluding the ragged-edge (fillup2=LHS.T)
opK = struct('Method','BN2002','Kmax',8,'Criterion',2);
X.M = MissFill(X0.M,'Method','EM','Options',opK,'FillUp2',LHS.T);
 
%Fill the monthly panel that will be used to construct the monthly PCA factors.
%Setting FillUp2=0 will fill the entire matrix. This will include any
%available obs ahead of the forecast period (LHS.T+h).
X4pca.M = MissFill(X0.M,'Method','EM','Options',opK,'FillUp2',0);
 
clear opK
%NB: The 'X4pca' dataset will be 'balanced' (meaning with no missing values), but X.M will have ragged-edges (unbalanced).
Both of the tables below show the bottom end of the monthly predictors' matrix. The first table shows a version of the predictors matrix where we have filled the missing values (ragged-edges and any detected outliers) all the way up to the last available observation of the LHS, partly leaving the ragged-edge intact.
The first dataset (X.M) can be used to assess the (leading) informational content of the series that become timely available, while the second dataset (X4pca.M) is used below to apply targeted-PCA and extract a set of monthly factors. The first table will therefore show the ragged-edge of the selected predictors, while the second has its ragged-edge filled.
tail(X.M(:, 1:R),10)
Time        AAAFFM  AWHNONAG  BAAFFM  BOGMBASE  IPFINAL  CES9093000001  CPIAUCSL  CPIULFSL  CUSR0000SAD  DMANEMP  DTCOLNVHFNM  IPDCONGD  GS5  INDPRO  IPDMAT  LNS12032194  LNS13023557  LNS13023569  LNS14000026  NDMANEMP  IPFPNSS  PPIIDC  T1YFFM  T5YFFM  TB3MS
30/04/2023  -0.31  -0.0058997  -0.36  0.0038694  0.011934  0.0017278  0.0042594  0.0048814  0.005979  0.0012378  0.0079712  4.6858  -0.28  0.0047811  0.020251  -0.048839  0.054911  0.025367  3.1  -0.00020587  0.0074643  -0.00116  -0.18  -0.46  0.23
31/05/2023  -0.02  0.0059084  0.02  -0.0042105  -0.0064963  0.0017937  0.0010983  0.0009255  0.0019772  0.00012369  0.011805  4.6935  0.05  -0.002205  0.00048824  -0.041667  0.040237  0.015297  3.3  -0.0012361  -0.0051994  -0.011951  0.01  -0.17  0.22
30/06/2023  -0.05  -0.0029542  -0.05  0.0069601  -0.011523  0.0019281  0.0021009  0.0021772  -0.0018909  0.0023473  0.020129  4.658  0.36  -0.0060792  -0.00028333  0.11633  -0.041362  0.058949  3.1  -0.0020636  -0.0087547  0.0002877  0.3  0.33  0.02
31/07/2023  -0.03  0  -0.05  -0.016304  0.011757  0.0019244  0.0020538  0.0020731  -0.0028784  0.0011099  0.010767  4.6913  0.19  0.0086603  0.0036842  -0.046316  0.050505  -0.045754  3.1  -0.0031033  0.0088713  0.0021374  0.09  0.15  0.09
31/08/2023  0.08  0  0.07  0.007457  -0.00095221  0.0014409  0.0051047  0.0055183  -0.0030131  0.00036971  0.012299  4.6664  0.17  -0.00073581  -0.0069698  0.05178  0.03317  0.10311  3.2  -0.00020723  8.8863e-06  0.019667  -0.21  -0.04  0.05
30/09/2023  0.18  0  0.14  0.001474  -0.0034292  0.0018495  0.0035894  0.0038546  -0.0034978  0.0011083  0.0086896  4.6761  0.18  0.00092573  0.0041578  -0.036675  0.047038  -0.010187  3.1  0.00082867  -0.0018441  0.0064029  0.07  0.18  0.02
31/10/2023  0.48  0  0.47  0.0060885  -0.0080607  0.002051  0.00079048  0.00045946  -0.0036294  -0.0040698  0.0063631  4.6241  0.28  -0.0081467  -0.01737  0.05149  -0.079672  0.028597  3.2  0.00041408  -0.0055976  -0.015882  -0.02  0.28  0.02
30/11/2023  -0.33  -0.002963  -0.34  0.022979  0.0065739  0.0015696  0.0016018  0.0015868  -0.0027726  0.004562  0.0059471  4.6563  -0.28  0.0029145  0.0076218  -0.070094  -0.053859  -0.035447  3.1  -0.002487  0.0040404  -0.010533  -0.14  -0.28  -0.07
31/12/2023  -0.54  0.0059259  -0.65  0.016559  0.00067555  0.0023838  0.0023283  0.0023657  -0.0036848  0.0015979  NaN  4.6758  -0.49  4.8699e-06  0.00056341  0.052907  -0.017085  0.045348  3.3  -0.0010381  -0.001021  -0.013899  -0.32  -0.49  -0.03
31/01/2024  0.13  -0.0089154  0.04  NaN  0.0043495  0.0010199  0.0030497  0.0029272  -0.0046151  0.00049116  NaN  4.6782  -0.02  -0.00095106  -0.003113  0.048892  0.05204  -0.1019  3.2  0.0039391  0.001637  0.0057311  -0.17  -0.02  -0.02
tail(X4pca.M(:, 1:R),10)
Time        AAAFFM  AWHNONAG  BAAFFM  BOGMBASE  IPFINAL  CES9093000001  CPIAUCSL  CPIULFSL  CUSR0000SAD  DMANEMP  DTCOLNVHFNM  IPDCONGD  GS5  INDPRO  IPDMAT  LNS12032194  LNS13023557  LNS13023569  LNS14000026  NDMANEMP  IPFPNSS  PPIIDC  T1YFFM  T5YFFM  TB3MS
30/04/2023  -0.31  -0.0058997  -0.36  0.0038694  0.011934  0.0017278  0.0042594  0.0048814  0.005979  0.0012378  0.0079712  4.6858  -0.28  0.0047811  0.020251  -0.048839  0.054911  0.025367  3.1  -0.00020587  0.0074643  -0.00116  -0.18  -0.46  0.23
31/05/2023  -0.02  0.0059084  0.02  -0.0042105  -0.0064963  0.0017937  0.0010983  0.0009255  0.0019772  0.00012369  0.011805  4.6935  0.05  -0.002205  0.00048824  -0.041667  0.040237  0.015297  3.3  -0.0012361  -0.0051994  -0.011951  0.01  -0.17  0.22
30/06/2023  -0.05  -0.0029542  -0.05  0.0069601  -0.011523  0.0019281  0.0021009  0.0021772  -0.0018909  0.0023473  0.020129  4.658  0.36  -0.0060792  -0.00028333  0.11633  -0.041362  0.058949  3.1  -0.0020636  -0.0087547  0.0002877  0.3  0.33  0.02
31/07/2023  -0.03  0  -0.05  -0.016304  0.011757  0.0019244  0.0020538  0.0020731  -0.0028784  0.0011099  0.010767  4.6913  0.19  0.0086603  0.0036842  -0.046316  0.050505  -0.045754  3.1  -0.0031033  0.0088713  0.0021374  0.09  0.15  0.09
31/08/2023  0.08  0  0.07  0.007457  -0.00095221  0.0014409  0.0051047  0.0055183  -0.0030131  0.00036971  0.012299  4.6664  0.17  -0.00073581  -0.0069698  0.05178  0.03317  0.10311  3.2  -0.00020723  8.8863e-06  0.019667  -0.21  -0.04  0.05
30/09/2023  0.18  0  0.14  0.001474  -0.0034292  0.0018495  0.0035894  0.0038546  -0.0034978  0.0011083  0.0086896  4.6761  0.18  0.00092573  0.0041578  -0.036675  0.047038  -0.010187  3.1  0.00082867  -0.0018441  0.0064029  0.07  0.18  0.02
31/10/2023  0.48  0  0.47  0.0060885  -0.0080607  0.002051  0.00079048  0.00045946  -0.0036294  -0.0040698  0.0063631  4.6241  0.28  -0.0081467  -0.01737  0.05149  -0.079672  0.028597  3.2  0.00041408  -0.0055976  -0.015882  -0.02  0.28  0.02
30/11/2023  -0.33  -0.002963  -0.34  0.022979  0.0065739  0.0015696  0.0016018  0.0015868  -0.0027726  0.004562  0.0059471  4.6563  -0.28  0.0029145  0.0076218  -0.070094  -0.053859  -0.035447  3.1  -0.002487  0.0040404  -0.010533  -0.14  -0.28  -0.07
31/12/2023  -0.54  0.0059259  -0.65  0.016559  0.00067555  0.0023838  0.0023283  0.0023657  -0.0036848  0.0015979  0.0096841  4.6758  -0.49  4.8699e-06  0.00056341  0.052907  -0.017085  0.045348  3.3  -0.0010381  -0.001021  -0.013899  -0.32  -0.49  -0.03
31/01/2024  0.13  -0.0089154  0.04  -0.084674  0.0043495  0.0010199  0.0030497  0.0029272  -0.0046151  0.00049116  -0.010961  4.6782  -0.02  -0.00095106  -0.003113  0.048892  0.05204  -0.1019  3.2  0.0039391  0.001637  0.0057311  -0.17  -0.02  -0.02
One might sensibly want to exclude from the sample any timely observations available for periods beyond the period to be forecasted (> LHS.T + h). In other words, if the task is to backcast 2024Q1, you might want to exclude monthly observations for 2024M4, 2024M5, etc. That would make the predictors' matrix horizon-specific. For the predictors' matrix (X.M) we can achieve this by simply deleting the observations ahead of the forecast period (LHS.T+h), i.e. by executing X.M(X.M.Time > fcstDate,:)=[]. Similarly, if we don't want the estimated PCA factors to include observations ahead of the forecast period, we can do so when filling the matrix over which we will apply PCA (X4pca.M), by setting MissFill(.,FillUp2=fcstDate).
MissFill() can also fill missing values using iterated (as opposed to direct) multistep univariate AR(P) models for each of the series, where the optimal lag order P* can be selected with a number of methods, including BIC, AIC, as well as K-fold CV ala Bergmeir et al (2018). Because of the time-series regression nature of the underlying model, this method can only fill missing values in the middle of a series (due to outliers) and at the bottom (due to publication lags), but not at the top. A self-contained sketch of the idea follows.
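The sketch below uses a simplified AR(2)-by-OLS fit iterated into a trailing gap (it assumes the missing values sit at the tail of the series; MissFill() selects P* and handles the general case):
x  = X0.M{:,"DTCOLNVHFNM"};                    %a series with trailing NaNs (see table above)
p  = 2;  t1 = find(isnan(x),1);                %first missing position
Y  = x(p+1:t1-1);                              %AR(2) by OLS on the observed part
Z  = [ones(numel(Y),1) x(p:t1-2) x(p-1:t1-3)];
b  = Z\Y;
for t = t1:numel(x)                            %iterate 1-step-ahead forecasts forward
    x(t) = b(1) + b(2)*x(t-1) + b(3)*x(t-2);
end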

5. Estimate Plain- and Target-PCA factors

In this section we use the filled monthly panel (X4pca.M) to create two sets of monthly factors. First, we estimate a standard set of factors; then, we use hard-thresholding, ala Bai-Ng (2008), to select target-relevant predictors over which PCA is subsequently applied. The mismatch in frequencies between the target and the predictors is handled by temporally aggregating the stationary monthly panel to the sampling frequency of the LHS. The target-relevant predictors selected during the pre-selection step are kept at their highest available frequency, and applying PCA over them results in the 2nd set of monthly target-relevant factors.
In both PCA applications we use the information criterion of Bai-Ng (2002) to determine the optimal number of factors, setting the maximum allowed number of factors to Kmax=8.
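A minimal sketch of the Bai & Ng (2002) selection (we assume 'Criterion',2 in the options below corresponds to their ICp2 variant; pcaF() performs this search internally):
Z = zscore(X4pca.M{:,:});  [T,N] = size(Z);   %balanced, standardized panel
[~,Fh] = pca(Z);  Kmax = 8;  IC = nan(Kmax,1);
for k = 1:Kmax
    E = Z - Fh(:,1:k)*(Fh(:,1:k)\Z);          %residuals given k factors
    IC(k) = log(sum(E(:).^2)/(N*T)) + k*((N+T)/(N*T))*log(min(N,T));
end
[~,Kstar] = min(IC);                          %optimal number of factors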

5.1. Plain-PCA monthly factors

clear X0
%Set the options for selecting the optimal number of factors (K*)
opK = struct('Method','BN2002','Kmax',8,'Criterion',2);
 
%- MONTHLY plain-Factors
opTrgt = []; %<- Do NOT implement Target-factors
[F.p, objF.p] = pcaF(X4pca.M,'treatX','M','KstarOptions',opK,'KstarAllMethods',false,'Groups',cat.M,'Diagnostics',true,'addGraph',true,'ExcludefromX',[],'Target',opTrgt);
Function pcaF() creates a number of post-estimation graphs and tables that can help the user better understand the underlying structure of the estimated factors, both in terms of their explanatory power and in terms of factor interpretability.
The first graph is a Pareto chart showing the incremental (and cumulative) variance explained by each (additional) factor. Since factors are organized in decreasing order of their respective eigenvalues, the underlying ordering of the factors reflects their importance in explaining the set of predictors X. As such, the 1st factor explains more of the variability of X than the 2nd factor, the 2nd more than the 3rd, and so on.
The second graph provides interpretability for each factor based on the economic category to which each series has been assigned, and the explanatory power of that particular factor for the series, as measured by its mR2. Specifically, mR2(i,k) measures the incremental explanatory power of the k-th factor for the i-th series.
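For concreteness, a sketch of the assumed mR2 computation (our reading of the incremental-R2 definition above; pcaF() computes and stores its own version):
Z = zscore(X4pca.M{:,:});  K = 8;
[~,Fh] = pca(Z,'NumComponents',K);   %PCA factor estimates
mR2 = nan(size(Z,2),K);
for i = 1:size(Z,2)
    r2prev = 0;
    for k = 1:K
        e  = Z(:,i) - Fh(:,1:k)*(Fh(:,1:k)\Z(:,i)); %residual on first k factors
        r2 = 1 - sum(e.^2)/sum(Z(:,i).^2);
        mR2(i,k) = r2 - r2prev;  r2prev = r2;       %incremental R2 of factor k
    end
end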
The two tables that follow (PostPCA.T2 & PostPCA.T3) are the table versions of graphs 1 & 2, respectively. The first shows the variance explained by each factor, as well as the cumulative explained variance. The second table reflects each factor's interpretation. Specifically, for each factor Fk, it calculates the contributions of the various categories by summing the mR2's of the regressors assigned to the same category, and then dividing by the sum of all mR2(i,k)'s for that k.
objF.p.PostPCA.T2
ans = 8x2 table
          Variance Explained    Cumulative VE
 1   F1   23.5000               23.5000
 2   F2   15.7000               39.2000
 3   F3   14.1000               53.3000
 4   F4    7                    60.3000
 5   F5    6                    66.3000
 6   F6    5                    71.3000
 7   F7    4                    75.3000
 8   F8    3.7000               79
objF.p.PostPCA.T3
ans = 5x8 table
                                 F1         F2         F3         F4         F5         F6         F7         F8
 1   Industrial Production       42.8800    17.1600     2.3900     8.9900     2.2700    10.1900     8.4100     0.5200
 2   Interest Rates               3.5500    10.4800    87.0800    39.7600     5.5600    16.1200    11.8600    13.1500
 3   Labor                       44.0600    29.9400     1.8000    26.9300    42.8200    33.1500    63.4400    47.8900
 4   Money and Credit             1.9900    18.7200     1.1300     3.2500    27.7000    17.3000    11.4400    29.8900
 5   Prices                       7.5200    23.7000     7.6100    21.0700    21.6500    23.2400     4.8400     8.5500
The next post-estimation table (PostPCA.T1) sorts the series with respect to their mR2, for each of the factors. The table below shows the 10 most important series for each factor.
head(objF.p.PostPCA.T1,10)
F1 Series  F1 mR2 | F2 Series  F2 mR2 | F3 Series  F3 mR2 | F4 Series  F4 mR2 | F5 Series  F5 mR2 | F6 Series  F6 mR2 | F7 Series  F7 mR2 | F8 Series  F8 mR2
"IPFPNSS" 0.79304 | "IPDCONGD" 0.7095 | "T5YFFM" 0.8501 | "TB6SMFFM" 0.25509 | "CUSR0000SAD" 0.32261 | "DTCOLNVHFNM" 0.25147 | "AWHNONAG" 0.49539 | "LNS13023569" 0.28288
"IPFINAL" 0.75719 | "LNS14000026" 0.58915 | "AAAFFM" 0.68638 | "TB3MS" 0.25299 | "CES9093000001" 0.24659 | "CES9093000001" 0.206 | "LNS12032194" 0.17578 | "DTCOLNVHFNM" 0.2175
"IPDMAT" 0.74855 | "UEMP5TO14" 0.5459 | "BAAFFM" 0.63814 | "PPIIDC" 0.15845 | "TOTRESNS" 0.23064 | "IPDCONGD" 0.13991 | "DTCOLNVHFNM" 0.1026 | "AWHNONAG" 0.10875
"INDPRO" 0.72047 | "CPIULFSL" 0.45451 | "T1YFFM" 0.62147 | "T1YFFM" 0.15429 | "BOGMBASE" 0.23011 | "CPIAUCSL" 0.12408 | "LNS13023569" 0.050865 | "TB3MS" 0.066639
"USPBS" 0.61097 | "BOGMBASE" 0.45069 | "GS5" 0.44548 | "LNS13023569" 0.11338 | "LNS13023569" 0.1481 | "CPIULFSL" 0.11477 | "TB3MS" 0.042965 | "NDMANEMP" 0.063662
"USTPU" 0.60677 | "CPIAUCSL" 0.4311 | "TB6SMFFM" 0.43824 | "CUSR0000SAD" 0.10313 | "LNS13023557" 0.14112 | "PPIIDC" 0.10152 | "INDPRO" 0.035259 | "TOTRESNS" 0.057376
"DMANEMP" 0.57578 | "TOTRESNS" 0.42874 | "PPIIDC" 0.12381 | "CPIULFSL" 0.094261 | "AWHNONAG" 0.094517 | "TB3MS" 0.084521 | "PPIIDC" 0.033903 | "GS5" 0.056896
"NDMANEMP" 0.43347 | "TB3MS" 0.24372 | "CPIAUCSL" 0.086045 | "LNS13023557" 0.091013 | "LNS12032194" 0.05843 | "USTPU" 0.081826 | "BAAFFM" 0.028959 | "BOGMBASE" 0.055366
"UEMP5TO14" 0.28087 | "PPIIDC" 0.17325 | "CPIULFSL" 0.082261 | "CPIAUCSL" 0.086897 | "GS5" 0.038911 | "AAAFFM" 0.071449 | "AAAFFM" 0.028597 | "CUSR0000SAD" 0.047334
"PPIIDC" 0.21498 | "CES9093000001" 0.12979 | "IPFINAL" 0.040786 | "AWHNONAG" 0.082295 | "CPIAUCSL" 0.03627 | "BAAFFM" 0.069567 | "IPFPNSS" 0.025914 | "USPBS" 0.039049
Finally, pcaF() stores the NxK* table with the factor loadings (Lambda), and two TxN timetables containing the standardized (Uz) and un-standardized (Ux) idiosyncratic components of the factor model, Xz = F*Lambda' + Uz (with Ux defined analogously for the un-standardized data).
objF.p.Lambda %Table with the estimated loadings
ans = 30x8 table
                       F1         F2         F3         F4         F5         F6         F7         F8
  1   AAAFFM           -0.1883    -0.3988     2.2031    -0.9487    -0.6113     1.1936     0.8457    -0.1462
  2   AWHNONAG          0.0126    -0.2875     0.0508     1.0839     1.2593     0.1057     3.5200     1.7184
  3   BAAFFM           -0.4636    -0.5833     2.1243    -0.6565    -0.3992     1.1778     0.8511    -0.4022
  4   BOGMBASE         -0.3741    -1.6957     0.4085    -0.6703     1.9649    -0.3065    -0.6848     1.2261
  5   IPFINAL           1.7954    -0.2746    -0.5370     0.7357    -0.0911     0.3484     0.7445     0.1059
  6   CES9093000001     0.2295     0.9100    -0.1541     0.4496     2.0340     2.0267     0.2044    -0.1292
  7   CPIAUCSL          0.7839     1.6584     0.7800    -1.1137     0.7801    -1.5730     0.3212     0.6935
  8   CPIULFSL          0.7536     1.7029     0.7627    -1.1600     0.6882    -1.5128     0.4280     0.6778
  9   CUSR0000SAD       0.3982     0.5948     0.4651    -1.2134     2.3265    -0.4289    -0.5646    -1.1337
 10   DMANEMP           1.5656    -0.3215    -0.3435    -0.5317    -0.7290     0.2119    -0.4245     0.1644
 11   DTCOLNVHFNM       0.5320     0.0653    -0.0754     0.1159     0.7618     2.2393    -1.6019     2.4301
 12   IPDCONGD         -0.1017     2.1276    -0.0981     0.2989     0.6964     1.6703    -0.1122    -0.1931
 13   GS5               0.4721     0.8515     1.7749     1.0217    -0.8080    -0.2621    -0.6040     1.2429
 14   INDPRO            1.7513    -0.3286    -0.3417     0.9473     0.4156     0.1227     0.9391    -0.0657
tail(objF.p.Ux,10) %Timetable of the idiosyncratic components
Time        AAAFFM  AWHNONAG  BAAFFM  BOGMBASE  IPFINAL  CES9093000001  CPIAUCSL  CPIULFSL  CUSR0000SAD  DMANEMP  DTCOLNVHFNM  IPDCONGD  GS5  INDPRO  IPDMAT  LNS12032194  LNS13023557  LNS13023569  LNS14000026  NDMANEMP  IPFPNSS  PPIIDC  T1YFFM  T5YFFM  TB3MS  TB6SMFFM  TOTRESNS  UEMP5TO14  USPBS  USTPU
30/04/2023  0.16385  -0.0056488  0.17393  0.0066697  0.0056168  -0.0005414  0.000209  0.00069115  0.0028057  -0.0011723  0.00069139  0.01801  -0.14603  -0.00018935  0.014236  -0.018796  0.054252  -0.0021382  -1.0201  -0.0013672  0.002216  -0.0047072  0.016396  -0.0474  0.074067  -0.023873  0.011085  -0.040384  -0.0015882  -0.0015197
31/05/2023  0.09586  0.0059989  0.13207  -0.0043723  -0.0051759  -0.00054287  -0.00015112  -0.00016045  0.0030127  -0.00026179  0.0015749  0.015782  -0.043364  -0.0014278  0.0027332  -0.028816  0.068283  -0.008895  0.16453  -0.0016714  -0.0041662  -0.0051505  0.0088063  -0.12614  0.10276  0.011006  -0.0075481  0.041151  0.0010086  0.00053976
30/06/2023  -0.22998  -0.0055684  -0.23609  -0.0074276  -0.0045732  0.00042675  0.00036039  0.0005404  0.002001  0.0025954  0.0049075  0.0016585  -0.0073682  -0.00053418  0.0063874  0.1059  -0.0034093  -0.039595  0.095395  -0.001586  -0.0031214  0.0030399  0.16737  0.058875  -0.11403  0.045015  -0.011385  0.042917  0.00048348  -0.0008796
31/07/2023  0.023964  -0.0073113  0.0060068  0.00064104  0.0064334  8.4513e-05  0.00026241  0.00034661  0.00025673  0.00017388  0.0036026  0.030599  -0.020665  0.0037575  -0.0016175  -0.035861  0.069165  -0.049144  -0.2951  -0.0038001  0.0039105  0.0043187  0.012872  0.083352  -0.074843  -0.027669  0.0040119  -0.065333  -0.0020829  -0.00057446
31/08/2023  0.046005  -0.0069685  0.10139  0.0050564  0.0036399  0.00045599  -4.8826e-06  -8.5005e-05  -0.001943  -0.00085153  0.0020133  0.005358  0.072295  0.004347  -0.0013606  -0.019976  0.011313  -0.042586  -0.063513  -0.0002158  0.0042646  0.0053429  -0.070706  0.022273  -0.0064197  -0.047919  0.011848  0.035245  -0.00097526  -0.0021386
30/09/2023  0.034747  -0.0017348  0.0015935  0.0055608  -0.0023882  0.00035403  0.00058999  0.00074653  -0.0032104  0.00016032  0.0015282  0.024854  -0.030848  0.0013173  0.0052076  -0.041962  0.053293  -0.041891  -0.61902  -0.00042723  -0.0012358  0.0015667  -0.0030817  0.012554  -0.041401  -0.011442  0.012912  0.065127  -0.0020823  -0.00027747
31/10/2023  0.14193  -0.0034625  0.074525  -0.0026898  0.0042569  0.0003221  0.00081066  0.00064867  -0.00054119  -0.0023242  -0.0019356  -0.032031  0.094401  0.0010748  -0.0034829  0.0013522  -0.045547  0.0033203  0.4954  0.00081136  0.0047719  -0.0085731  -0.12077  0.00744  0.063366  -0.038664  -0.0013265  0.097111  3.0866e-05  0.00044878
30/11/2023  0.050767  -0.0053279  0.030197  0.0039258  0.0051499  -0.00036011  0.0010763  0.0013755  -0.0036706  0.0038862  6.3697e-05  0.057164  -0.077993  0.00083233  0.0057249  -0.04836  -0.020087  -0.011916  -1.7008  -0.0020524  0.0027791  -0.0003915  -0.0033032  0.058189  -0.13984  -0.01684  0.010832  -0.036909  -0.0011405  -0.00216
31/12/2023  -0.052317  -0.0011488  -0.14015  0.0022282  -0.00061307  -0.00047186  0.0014695  0.0016403  -0.0038447  0.00074047  3.2997e-06  0.016808  -0.14032  -0.00044852  0.0012077  0.021622  -0.0065701  0.0060923  -0.39864  -0.0002751  -0.0017002  -0.0048451  -0.0038009  0.069751  -0.072892  0.10593  0.0042385  -0.084174  0.00044886  0.0012461
31/01/2024  0.061835  -0.0027749  -0.017254  0.00040868  0.0018428  0.00064072  0.00078791  0.00028371  -0.0032416  -0.001117  8.9594e-05  -0.015135  0.060058  -0.001894  -0.00038985  -0.00018184  -0.018877  -0.036451  0.55225  0.00035745  0.00030043  -0.0013313  -0.096827  0.065427  0.029028  -0.017464  0.00066009  0.056911  -3.6382e-05  0.00095145
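As a quick, hedged sanity check of the stored objects (assuming the convention Uz = Xz - F*Lambda' on the standardized data; verify against the suite's documentation):
Zc   = zscore(X4pca.M{:,:});                %standardized predictors
Uchk = Zc - F.p{:,:}*objF.p.Lambda{:,:}';   %should closely match objF.p.Uz{:,:}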

5.2. Target-PCA monthly factors with Hard-thresholding

Target-relevant factors are obtained by using a pre-selection step which determines the predictors that are most relevant to the variable that we want to predict. Since standard PCA is an unsupervised learning technique, this additional pre-selection step is done in an effort to improve the forecasting accuracy of models based on standard PCA factors, such as the factor-augmented autoregressive (F-AR) model, also commonly referred to as the Diffusion Index (DI) model, popularized by Stock-Watson (2002).
Hard-thresholding was introduced by Bai-Ng (2008). The pre-selection step involves estimating an OLS regression for each predictor, and picking the predictors whose test statistic lies above a pre-defined statistical significance threshold. pcaF() allows the user to:
1) introduce optional control variables, such as time-dummies and AR-lags in the pre-selection regressions, and to
2) estimate horizon-specific pre-selection regressions, so that the selected predictors are better suited for the period one wants to forecast.
The OLS regression of the hard-thresholding pre-selection step takes the form:
y(t+h) = a + rho(L)*y(t) [AR-lags] + d'*D(t) [time-dummies] + b_i*x_i(t) [contemporaneous predictor] + e(t+h)
The Nowcasting Suite works as an ecosystem of related functions, where the output of one function can directly become the input of another. Conveniently, outputs LHS.y4lags of transLHS() and Dh of LHStimeDummies(), containing the AR-lags and time-dummies corresponding to the LHS of interest, can be passed directly to pcaF() to be added as control variables in the pre-selection, without further processing or the need to manually re-align/time-sync the data matrices.
In hard-thresholding, the threshold corresponds to the critical value that defines the rejection region for the hypothesis that the i-th predictor is insignificant. For example, it is common to set the threshold to 1.96, which is the critical value of the Standard Normal (and Student-t with df→∞) distribution corresponding to an upper-tail (1-sided) probability of 2.5%. This in turn will pick the X's with a t-statistic higher than 1.96 or lower than -1.96, implying that PCA will be applied to all those predictors that are statistically significant at the 5% level of significance, or in other words, those that reject the null H0: b_i=0 against a 2-sided alternative, with 95% confidence. Finally, pcaF() will by default include the LHS among the pre-selected predictors over which it applies PCA. To exclude the LHS, we can simply add it to 'ExcludefromX'. A simplified sketch of one pre-selection regression follows.
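The sketch below runs the thresholding rule for a single predictor (no AR-lags or dummies, and with an assumed within-quarter averaging for the down-sampling; pcaF() handles all of this internally):
xq = retime(X4pca.M(:,"INDPRO"),'quarterly','mean');  %average months within quarter
xq.Time = dateshift(xq.Time,'end','quarter',0);       %align to end-of-quarter stamps
d  = rmmissing(synchronize(LHS.y(:,h), xq, 'intersection'));
T  = height(d);  Z = [ones(T,1) d{:,2}];
b  = Z\d{:,1};   e = d{:,1} - Z*b;                    %OLS fit & residuals
se = sqrt(diag(inv(Z'*Z))*(e'*e)/(T-2));              %classical standard errors
keep = abs(b(2)/se(2)) > 1.96;                        %hard-threshold at 1.96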
%- MONTHLY target-Factors
 
%Thresholding settings for target-PCA
prescrn = struct('How','Individual','Statistic','t');
topx = struct('Method','Threshold','k',1.96);
P=2; %Add 2 AR-lags as controlling variables
rollwin = []; %Set estimation window for the pre-selection OLS regression
opTrgt = struct('LHS',LHS,'h',h,'YLags',P,'Dummies',{Dh},'RollingWindow',rollwin,'What2Screen',prescrn,'Screen',topx);
[F.t, objF.t] = pcaF(X4pca.M,'treatX','M','KstarOptions',opK,'KstarAllMethods',false,'Groups',cat.M,'Diagnostics',true,'addGraph',true,'ExcludefromX',[],'Target',opTrgt);
After implementing target-PCA with hard-thresholding, pcaF() stores a table containing the t-statistics and the corresponding p-values for all the predictors in the monthly panel.
objF.t.Target.stat
ans = 30x2 table
                       stat1       stat2
  1   AAAFFM           -1.9260     0.0610
  2   AWHNONAG         -0.2120     0.8330
  3   BAAFFM           -2.6800     0.0100
  4   BOGMBASE          2.0950     0.0420
  5   IPFINAL          11.6210     0
  6   CES9093000001    -1.7270     0.0910
  7   CPIAUCSL         -0.9280     0.3590
  8   CPIULFSL         -1.3440     0.1860
  9   CUSR0000SAD      -1.5840     0.1210
 10   DMANEMP           7.3920     0
 11   DTCOLNVHFNM       2.9240     0.0050
 12   IPDCONGD         -3.1830     0.0030
 13   GS5               1.0920     0.2810
 14   INDPRO            8.0030     0
Another table stored by pcaF() contains only the pre-selected predictors, ranked in descending order of their absolute t-statistics, from the most to the least target-relevant.
objF.t.Target.seriesrnk
ans = 17x2 table
                      stat1       stat2
  1   IPFINAL         11.6210     0
  2   IPFPNSS         11.3040     0
  3   IPDMAT           9.9540     0
  4   INDPRO           8.0030     0
  5   DMANEMP          7.3920     0
  6   USTPU            6.9400     0
  7   LNS12032194     -6.1350     0
  8   UEMP5TO14        5.4160     0
  9   USPBS            4.8660     0
 10   LNS14000026      3.6750     0.0010
 11   NDMANEMP         3.4540     0.0010
 12   LNS13023557      3.1830     0.0030
 13   IPDCONGD        -3.1830     0.0030
 14   DTCOLNVHFNM      2.9240     0.0050
Before extracting the leads and lags for the selected factors, let us see in a plot how the two sets of factors compare. Below we plot the 1st principal component of the monthly plain- and target-PCA factors, together with GDP growth. The figure also contains shaded areas representing the recession periods, as determined by the NBER Business Cycle Dating Committee. To add the quarterly GDP growth and the monthly factors to a single plot, we fill the months within each quarter by repeating the corresponding quarterly value.
%Fill the quarterly series by repeating within-period values
y2m = retime(LHS.y(:,h), F.t.Time, 'next'); %Fill within-quarter months with next value
 
%Print the last 10 rows
tail(y2m(:, :),10)
    Time          GDPC1_h2
    __________    ________
    30/04/2023     2.0602
    31/05/2023     2.0602
    30/06/2023     2.0602
    31/07/2023     4.8617
    31/08/2023     4.8617
    30/09/2023     4.8617
    31/10/2023        NaN
    30/11/2023        NaN
    31/12/2023        NaN
    31/01/2024        NaN
 
%Standardize the GDP to have comparable scale with the F's
y2m{:,:} = zscore4nan(y2m{:,:});
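zscore4nan() is a Nowcasting Suite helper that standardizes each column while skipping missing values; conceptually (a sketch, not necessarily the suite's exact implementation) it amounts to:
z = (x - mean(x,1,'omitnan')) ./ std(x,0,1,'omitnan');  %column-wise z-scores, ignoring NaNs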
 
F.t.Properties.VariableNames = F.t.Properties.VariableNames + "T";
F.p.Properties.VariableNames = F.p.Properties.VariableNames + "P";
 
TTm = [y2m F.t F.p]; %Combine GDP & Factors in a single timetable
 
%----------------------%
% NBER recession Dates %
%----------------------%
%To add recession shaded areas in the graph
%0) Download the (monthly) NBER recession dates from FRED (mnemonic: USREC)
%1) Create an event table
%2) Pass the event table as a property to the timetable (to be plotted).
% When you display the timetable, you can also see the event labels.
%3) Passing the timetable (with the attached event table) to stackedplot()
% will plot events as shaded regions. (since R2023b)
 
recessions = fred_api("USREC", 'StartDate',datetime(1980,1,1));
recessions{:,:} = dum_me2(recessions{:,:}); % dum_me2() assigns a running index (1,2,...) to each recession episode
rvec = recessions{:,:};
recessionStrt = recessions.Time(arrayfun(@(z) find(rvec==z,1,'first'),1:max(rvec)));
recessionStrt = dateshift(recessionStrt,'end','month',0); %Convert to END-of-month
recessionEnd = recessions.Time(arrayfun(@(z) find(rvec==z,1,'last'),1:max(rvec)));
recessionEnd = dateshift(recessionEnd,'end','month',0);
recDates = array2table([recessionStrt recessionEnd],'VariableNames',["Start","End"])
recDates = 6x2 table
          Start         End
    1     29/02/1980    31/07/1980
    2     31/08/1981    30/11/1982
    3     31/08/1990    31/03/1991
    4     30/04/2001    30/11/2001
    5     31/01/2008    30/06/2009
    6     31/03/2020    30/04/2020
EventDates = eventtable(recDates{:,1},'EventEnds',recDates{:,2}, 'EventLabels',"Recession"); %event table
TTm.Properties.Events = EventDates; %Pass the events table as a property to the timetable
clear recessions rvec recessionStrt recessionEnd recDates EventDates
 
%Plot the 1st target- and plain-Factor, against the GDP growth series
p = stackedplot(TTm,{["F1P" "F1T" LHS.name+"_h"+h]});
p.AxesProperties(1).LegendLabels = ["Plain-F1","Target-F1","GDP growth"];
p.AxesProperties(1).LegendLocation = 'SouthWest';
p.DisplayLabels = {''};
p.Title = 'US GDP growth & Monthly PCA Factors';
p.LineProperties(1).LineStyle = {'-','-', '-.'};
p.LineProperties(1).Marker = {'none','none','none'};
p.LineProperties(1).LineWidth = [0.5 0.5 1.2];
p.LineProperties(1).Color = [0.96, 0.49, 0.04; 0.06 0.55 0.91; 0 0 0]; %1=orange; 2=blue; 3=black
The following two tables report the correlation coefficients for the 3 series plotted above. Since the monthly GDP series is constructed by repeating quarterly values, its correlation coefficients with the factors will naturally be low. The second table excludes the COVID period.
vnm = ["GDPC1_h2","F1P","F1T"];
array2table(corr(TTm{:,vnm},'rows','pairwise'), 'VariableNames',vnm)
ans = 3x3 table
         GDPC1_h2    F1P        F1T
    1     1          -0.0897    -0.1828
    2    -0.0897      1          0.9623
    3    -0.1828      0.9623     1
exclcovid = (TTm.Time<=datetime(2020,1,1) | TTm.Time>=datetime(2022,7,1));
array2table(corr(TTm{exclcovid,vnm},'rows','pairwise'), 'VariableNames',vnm)
ans = 3x3 table
         GDPC1_h2    F1P        F1T
    1     1          -0.0473    -0.0551
    2    -0.0473      1          0.9155
    3    -0.0551      0.9155     1
clear p TTm y2m exclcovid vnm prescrn topx

5.3. Target-PCA quarterly factors with Hard-thresholding

Finally, pcaF() stores a copy of the temporally-aggregated (that is, monthly-averaged-to-quarterly) predictor matrix used in the pre-selection step. We can use it to create a set of quarterly target-factors. The table below shows the quarterly versions of the pre-selected predictors.
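For reference, this down-sampling is a within-quarter average of the monthly observations; with a monthly timetable Xm (an illustrative name, since pcaF() performs the aggregation internally), it could be sketched as:
Xq = retime(Xm,'quarterly','mean');            %average the months within each calendar quarter
Xq.Time = dateshift(Xq.Time,'end','quarter');  %stamp at quarter-end, matching the target's timeline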
Xq_targt = objF.t.Target.X4ps(:,objF.t.Target.series); %The pre-selected predictors in quarterly frequency
tail(Xq_targt,10)
    Time          IPFINAL        IPFPNSS        IPDMAT         INDPRO        DMANEMP        USTPU          LNS12032194    UEMP5TO14
    __________    ___________    ___________    ___________    __________    ___________    ___________    ___________    _________
    30/09/2021     0.00071768     0.0017262     -0.00081651    -0.0015048     0.004422       0.002958      -0.012838      7.5075
    31/12/2021     0.0044744      0.0047433      0.0072683      0.0060838     0.0033424      0.0041283     -0.041437      7.4462
    31/03/2022     0.0070441      0.0062718      0.0034087      0.005219      0.0037732      0.0047895      0.017808      7.438
    30/06/2022     9.8421e-07    -0.00072889    -0.0011347      0.00067686    0.0020174      0.0010008     -0.04451       7.397
    30/09/2022     0.0018801      0.0016959      0.0045704      0.0027221     0.0027541      0.00051089     0.019338      7.4743
    31/12/2022    -0.0027342     -0.0043087     -0.0090745     -0.0066453     0.0022366     -0.00027857     0.0024155     7.4401
    31/03/2023    -0.00035805     0.00032875     0.00042475     0.0038352    -4.1282e-05     0.0013096      0.017823      7.495
    30/06/2023    -0.0020285     -0.0021633      0.0068186     -0.0011677     0.0012362      0.00047389     0.0086063     7.5377
    30/09/2023     0.0024586      0.0023453      0.00029072     0.0029501     0.00086265     0.000254      -0.010404      7.5469
    31/12/2023    -0.00027042    -0.00085939    -0.0030615     -0.0017424     0.00069671     0.00021921     0.011434      7.5454

    Time          USPBS          LNS14000026    NDMANEMP       LNS13023557    IPDCONGD    DTCOLNVHFNM    TB6SMFFM      BAAFFM        BOGMBASE
    __________    ___________    ___________    ___________    ___________    ________    ___________    __________    __________    __________
    30/09/2021     0.0054219     4.6667          0.0037093     -0.011065      4.6406       0.0035151      0             -0.07          0.019432
    31/12/2021     0.007956      3.9             0.0040892     -0.028159      4.6678      -6.664e-05      0.033333       0.023333      0.0012654
    31/03/2022     0.0053885     3.5667          0.0036933     -0.0011446     4.6773      -0.0013085      0.18667        0.29         -0.01481
    30/06/2022     0.0010673     3.3333          0.0035843     -0.0024661     4.6833       0.00031423     0.096667      -0.0033333    -0.035994
    30/09/2022     0.0023009     3.1333          0.0010951     -0.039364      4.6748       0.003594       0.043333      -0.32333      -0.0058379
    31/12/2022     0.00080744    3.3            -0.0013009     -0.0016755     4.6648       0.0077741     -0.19667       -0.53667      -0.000302
    31/03/2023     0.00093711    3.2            -6.8608e-05    -0.0198        4.6546       0.0077323     -0.11333       -0.14667       0.010052
    30/06/2023     0.0012551     3.1667         -0.0011685      0.017929      4.6791       0.013302      -0.0033333     -0.13          0.0022063
    30/09/2023    -0.00027689    3.1333         -0.0008273      0.043571      4.6779       0.010585      -0.043333      0.053333      -0.0024577
    31/12/2023     0.00058265    3.2            -0.001037      -0.050205      4.6521       0.0073314     -0.063333      -0.17333       0.015209
 
%Apply plain PCA to the quarterly version of the predictors selected by the previous pcaF() call
opTrgt = []; %Empty 'Target' options: no further pre-selection is needed
[~,loc] = ismember(Xq_targt.Properties.VariableNames, codes.M); %loc: index in RIGHT for each value in LEFT
[Fq, objFq] = pcaF(Xq_targt,'treatX','Q','KstarOptions',opK,'KstarAllMethods',false,'Groups',cat.M(loc),'Diagnostics',true,'addGraph',true,'ExcludefromX',[],'Target',opTrgt);
%Combine GDP & Quarterly Factors in a single timetable
yz = zscore4nan(LHS.y(:,h)); %Standardize the GDP
TTq = synchronize(yz, Fq, 'union','fillwithmissing');
vnm = ["GDPC1_h2","F1"];
array2table(corr(TTq{:,vnm},'rows','pairwise'), 'VariableNames',vnm)
ans = 2x2 table
         GDPC1_h2    F1
    1     1          -0.2622
    2    -0.2622      1
 
p = stackedplot(TTq,{["F1" LHS.name+"_h"+h]});
p.AxesProperties(1).LegendLabels = ["Target-F1","GDP growth"];
p.AxesProperties(1).LegendLocation = 'SouthWest';
p.DisplayLabels = {''};
p.Title = 'US GDP growth & Quarterly PCA Factor';
p.LineProperties(1).LineStyle = {'-', '-.'};
p.LineProperties(1).LineWidth = [0.5 1.2];
p.LineProperties(1).Color = [0.96, 0.49 0.32; 0 0 0];
clear loc opTrgt objFq TTq p yz vnm Xq_targt opK Fq

6. Extracting Leads & Lags (MIDAS & single-frequency)

6.1 Leads for Quarterly Predictors (X's)
6.2 MIDAS Leads & Lags for Monthly Factors (F's)
6.3 Legendre MIDAS for Monthly Factors (F's)
In the nowcasting literature, 'leads' refer to the extracted time-series lags that fall within 'leading periods', obtained by applying the time-series lag operator to series that become available in a more timely fashion than the target variable, given the latter's publication release schedule. These lags, being comprised of leading periods, contain ahead-in-time information that can potentially contribute to better predicting the relatively back-in-time target variable.
Using the trans2MIDAS() function of the Nowcasting Suite, the forecaster can distinguish between the leads and the lags of each predictor, allowing them to experiment with different structures of the MIDAS polynomials.
Optional inputs allow the researcher to choose between two methods for extracting the leads & lags of each variable: (1) the standard method of extracting V leads + lags in total (i.e., applying the MIDAS time-series operator to obtain the first V lagged series of the high-frequency covariate); (2) considering ALL leads + V lags, ala Andreou, Ghysels and Kourtellos (2013, JBES). The latter (AGK2013) method could be argued to be preferable to the standard way of constructing the MIDAS conditioning set, for the following reasons:
  1. If, for some Xi, we have many leads, the first method might end up including little or NO lagged information at all for that Xi.
  2. The AGK2013 method ensures that the lagged information is the same for all the X's, meaning that it covers the exact same period prior to the last available LHS observation. For instance, assuming a quarterly-monthly mix, one quarter of past monthly information (prior to the last observed LHS) will be included if we set V=3, an entire year if V=12, two years if V=24, etc. To illustrate this point (see also the sketch after this list), assume that the target variable's last observed value corresponds to Q4 (i.e., 31 Dec 2022). Also, say that for X1 we have 0 leads (i.e., the last observed value of X1 is on 31/12/2022), while for X2 we have 5 leading months, that is, 5 monthly observations ahead of the last observed LHS value (the last observed value of X2 is on 31/5/2023). Extracting the last V=12 leads/lags with the standard method yields a year's worth of information for both X1 and X2, but only for X1 does that information correspond to the entire last year of the LHS variable.
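The difference between the two conventions can be sketched as follows for a single monthly predictor stored in a column vector x. Here t0 (the high-frequency index aligned with the last observed LHS value) and all variable names are illustrative assumptions, not the trans2MIDAS() interface:
t = numel(x);              %index of the last available monthly observation
nLead = t - t0;            %months of x observed beyond the last LHS observation
V = 12;                    %lag-window length
%(1) Standard: the most recent V observations of x, counting back from t
mid1 = x(t:-1:t-V+1);      %mixes nLead leads with only V-nLead lags (or none, if nLead >= V)
%(2) AGK2013: ALL nLead leads, plus exactly V lags counting back from the last LHS-aligned month
mid2 = [x(t:-1:t0+1); x(t0:-1:t0-V+1)];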
For example, assuming that we have predictors at monthly and daily frequency (and a quarterly target), one might consider adding to the same model a restricted lead and lag (e.g. Almon) polynomial for the daily indicator to ensure parsimony, and an unrestricted polynomial for the leads of the monthly indicator, while excluding any monthly lags altogether. Furthermore, being able to distinguish between what comprises leading and what lagging information allows us to see whether potential forecasting gains are due to timely information flowing into the model or if it's just the outcome of flexibility in the temporal-aggregation functions that summarize the lagged information.
trans2MIDASLegendre() works in the same manner as trans2MIDAS(), but employs the newly introduced Legendre MIDAS polynomials, further reducing the dimensionality of the covariates while also adding all the benefits of orthogonal covariates.
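To illustrate the idea, here is a minimal sketch (under simplifying assumptions, not the trans2MIDASLegendre() API): the V high-frequency lags of a series x are projected onto shifted Legendre polynomials of degree 0 to d, so each series contributes only d+1 mutually orthogonal covariates:
V = 12; d = 2;                            %12 monthly lags, polynomial degree 2
u = ((1:V)' - 1)/(V - 1);                 %lag positions mapped onto [0,1]
L = [ones(V,1), 2*u-1, 6*u.^2-6*u+1];     %shifted Legendre basis, degrees 0,1,2
Xlags = lagmatrix(x, 0:V-1);              %T x V matrix of high-frequency lags (NaN presample)
Z = Xlags*L/V;                            %T x (d+1) Legendre-MIDAS covariates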
Finally, the two functions also synchronize/align the various conditioning variables (X, D, LY) with each other, and with the target variable (y), so that they correspond to the appropriate periods for estimating the (direct-forecasting) model that produces the desired h-steps-ahead forecast.
%adding code soon
References
Andreou, E., Ghysels, E., and Kourtellos, A. (2013). "Should Macroeconomic Forecasters Use Daily Financial Data and How?" Journal of Business & Economic Statistics, 31(2): 240-251.
Bai, J., and Ng, S. (2002). "Determining the Number of Factors in Approximate Factor Models." Econometrica, 70(1): 191-221.
Bergmeir, C., Hyndman, R. J., and Koo, B. (2018). "A Note on the Validity of Cross-Validation for Evaluating Autoregressive Time Series Prediction." Computational Statistics & Data Analysis, 120: 70-83.
Cheung, Y.-W., and Lai, K. S. (1995). "Lag Order and Critical Values of a Modified Dickey-Fuller Test." Oxford Bulletin of Economics and Statistics, 57: 411-419.
Elliott, G., Rothenberg, T. J., and Stock, J. H. (1996). "Efficient Tests for an Autoregressive Unit Root." Econometrica, 64(4): 813-836.
McCracken, M. W., and Ng, S. (2016). "FRED-MD: A Monthly Database for Macroeconomic Research." Journal of Business & Economic Statistics, 34(4): 574-589.
Ng, S., and Perron, P. (1995). "Unit Root Tests in ARMA Models with Data-Dependent Methods for the Selection of the Truncation Lag." Journal of the American Statistical Association, 90: 268-281.
Ng, S., and Perron, P. (2001). "Lag Length Selection and the Construction of Unit Root Tests with Good Size and Power." Econometrica, 69: 1519-1554.
Schwert, G. W. (1989). "Tests for Unit Roots: A Monte Carlo Investigation." Journal of Business & Economic Statistics, 7(2): 147-159.
Stock, J. H., and Watson, M. W. (2002). "Macroeconomic Forecasting Using Diffusion Indexes." Journal of Business & Economic Statistics, 20(2): 147-162.