
isat treats each initial block as GUM and hence often does not search at all if diagnostics of that "GUM" don't pass #39

@jkurle


Hi all,

I have encountered a major issue with isat(). Each block search starts from the model that contains all indicators from that block. This model is treated as the GUM, and the indicators are only selected over if that "GUM" passes all diagnostic tests.

As an example: Suppose we have a sample of 100 observations and we want to do IIS. In this case, 4 blocks of 25 indicators each are used as the starting points for the search. For example, the first block includes indicators iis1-iis25, the second one iis26-iis50, and so on. The problem is that each of these starting models (regressors + set of indicators) is internally treated as the GUM in getsFun(). getsFun() only starts its search, however, if all diagnostic tests are passed. That means that some blocks are not even searched.
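To make the block structure in this example concrete, here is a small base-R sketch of the partition described above. It only illustrates the 100-observation, four-block case from the text; the names `n_blocks`, `blocks` and `block_gum_regressors` are made up for illustration and are not taken from the gets source.

```r
# Illustration only: partition 100 impulse indicators into 4 chronological
# blocks of 25, as in the example above. Names are hypothetical.
n <- 100
n_blocks <- 4

# Observation indices belonging to each block: 1-25, 26-50, 51-75, 76-100
blocks <- split(seq_len(n), rep(seq_len(n_blocks), each = n / n_blocks))

# Full set of impulse indicators: column j is 1 at observation j, 0 elsewhere
iis <- diag(n)
colnames(iis) <- paste0("iis", seq_len(n))

# Regressors of each block's starting model, i.e. the model treated as "GUM"
block_gum_regressors <- lapply(blocks, function(idx) colnames(iis)[idx])

# First and last indicator of each block
sapply(block_gum_regressors, function(v) paste(v[1], v[length(v)], sep = "-"))
```

Each element of `block_gum_regressors` is the indicator set that is added to the regressors before diagnostics are checked for that block.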

In the following minimal reproducible example I have added two outliers to the sample. They cause the normality test to reject for each of the blocks. So when indicators iis1-iis25 are included, the outlier at observation 100 causes the normality test of that "GUM" to fail, and hence none of the iis1-iis25 indicators are actually selected over. The same happens for the other three blocks, so that in effect no paths are searched. In other examples I have encountered less extreme versions, but it has happened to me, even with data under the null (no contamination), that some of the blocks were not searched at all. It also occurs for less extreme outliers: e.g. you can change the outliers to only 3 or 2.5 and still observe this behaviour.

library(gets)
# the issue actually also arises for other seeds that I randomly tried, e.g. also seeds 1-10
# for seed 12345, no search is undertaken for the two middle blocks but at least the ones with contamination are searched
set.seed(11)
u <- rnorm(100)
# create deterministic outliers at observations 1 and 100
u[100] <- 4 # alternatively try 3 or 2.5
u[1] <- 4 # alternatively try 3 or 2.5
x <- rnorm(100)
y <- 2*x + u
# no search is conducted
isat(y = y, mxreg = x, iis = TRUE, sis = FALSE, t.pval = 1/100, normality.JarqueB = 0.05)
# to visualise the outliers
model <- lm(y ~ x, data = data.frame(cbind(y, x)))
plot(model$residuals)
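The per-block diagnostic check can be approximated outside of gets with base R. The sketch below fits each block's starting model with lm() and computes a textbook Jarque-Bera statistic on its residuals. This is an assumption-laden approximation, not the code path used inside getsFun(): the helper `jarque_bera()` is hand-written, it includes the (numerically zero) residuals of the dummied-out observations, and gets may standardise or trim residuals differently.

```r
# Hand-written Jarque-Bera test (hypothetical helper, not part of gets)
jarque_bera <- function(e) {
  n <- length(e)
  e <- e - mean(e)
  s2 <- mean(e^2)
  skew <- mean(e^3) / s2^1.5
  kurt <- mean(e^4) / s2^2
  stat <- n * (skew^2 / 6 + (kurt - 3)^2 / 24)
  c(statistic = stat, p.value = pchisq(stat, df = 2, lower.tail = FALSE))
}

# Same data as in the example above
set.seed(11)
u <- rnorm(100)
u[c(1, 100)] <- 4
x <- rnorm(100)
y <- 2 * x + u

# Four chronological blocks of 25 indicators each
blocks <- split(seq_len(100), rep(1:4, each = 25))

# For each block: regressors + that block's indicators, then JB on residuals
block_jb <- t(sapply(blocks, function(idx) {
  d <- diag(100)[, idx]
  jarque_bera(residuals(lm(y ~ x + d)))
}))
block_jb
```

A block whose p-value falls below the chosen normality.JarqueB level would correspond to a starting model that fails the normality diagnostic under these assumptions.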

I think it is concerning that even small contamination of only 2% of the sample causes the whole procedure to break down. I guess this is why Autometrics searches over more block compositions rather than "chronologically".

It is not clear what the actual GUM should be in our case. At least with indicator saturation, we have (potentially many, many) more regressors than observations, so we cannot estimate the most general model and check it for misspecification. I am therefore suggesting that we turn off diagnostics for the initial path searches and only select indicators based on statistical significance. The diagnostics could then be turned on at the final selection (when all retained IIS, SIS, TIS etc. are added together). Alternatively, we could turn the diagnostics on a bit earlier, when the final selection of a specific indicator type is made. By that I mean that, e.g., after all IIS blocks have been searched and the final selection of IIS is made, we could turn on diagnostics.
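A very rough base-R sketch of the proposed ordering is given below: indicators are chosen within each block on significance alone, and a normality check is run only once on the final model. This is not the gets multi-path search. The selection step is a crude one-cut rule (keep indicators with p-value below t_pval), and `select_indicators()` and `jarque_bera()` are hypothetical helpers written for this illustration.

```r
# Hand-written Jarque-Bera p-value (hypothetical helper, not part of gets)
jarque_bera <- function(e) {
  n <- length(e)
  e <- e - mean(e)
  s2 <- mean(e^2)
  stat <- n * ((mean(e^3) / s2^1.5)^2 / 6 + (mean(e^4) / s2^2 - 3)^2 / 24)
  pchisq(stat, df = 2, lower.tail = FALSE)
}

# Same data as in the example above
set.seed(11)
n <- 100
u <- rnorm(n)
u[c(1, n)] <- 4
x <- rnorm(n)
y <- 2 * x + u
t_pval <- 1 / 100

# One-cut selection: keep indicators whose p-value is below t_pval.
# No diagnostics are consulted at this stage (the proposed change).
select_indicators <- function(idx) {
  if (length(idx) == 0) return(integer(0))
  d <- diag(n)[, idx, drop = FALSE]
  cf <- summary(lm(y ~ x + d))$coefficients
  pv <- cf[-(1:2), "Pr(>|t|)"]  # rows 1-2 are the intercept and x
  idx[pv < t_pval]
}

# Step 1: search every block on significance only
blocks <- split(seq_len(n), rep(1:4, each = n / 4))
candidates <- sort(unlist(lapply(blocks, select_indicators), use.names = FALSE))

# Step 2: final selection over the union of retained indicators
final <- select_indicators(candidates)

# Step 3: diagnostics are applied only now, to the final model
final_fit <- if (length(final) > 0) {
  lm(y ~ x + diag(n)[, final, drop = FALSE])
} else {
  lm(y ~ x)
}
final_jb_pval <- jarque_bera(residuals(final_fit))
list(retained = final, normality_pval = final_jb_pval)
```

The point of the ordering is that a failed diagnostic in one block's starting model can no longer prevent that block from being searched; whether the final model then passes is checked separately.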

I don't want to turn off diagnostics completely for selecting indicators. Sometimes an observation can be an "outlier" (unusual) not in terms of the size of its error but because it does not match the more general pattern of the data, such as homoskedasticity, ARCH, etc.
