P1.T2.23.1. Machine-learning splits and sub-samples

Nicole Seaman

Director of CFA & FRM Operations
Learning objectives: Discuss the philosophical and practical differences between machine-learning techniques and classical econometrics. Explain the differences among the training, validation, and test data sub-samples and how each is used.

Questions:

23.1.1. Peter is developing a quantitative model that will predict asset prices in an exotic asset class. His initial decision is whether to rely on classical econometric techniques, which are more familiar to him, or to utilize more modern machine-learning techniques, which are less familiar.

Among the following, in which scenario is he likely to prefer machine-learning techniques?

a. Due to the audience, he prefers an approach that is highly interpretable and easy to explain
b. He has developed a hypothesis and prefers, as the first step in the process, to begin with the refinement of the hypothesis; further, his hypothesis includes a data-generating process based on financial theory
c. He is not interested in causality, but he does need to make predictions, and he does require flexibility in assumptions such as linearity and error properties
d. With respect to the bias-variance trade-off, the more important criterion is to minimize the bias of the coefficients (aka, unbiased estimates) of the explanatory variables


23.1.2. Stephanie's work experience was in conventional econometrics, where she routinely distinguished between in-sample estimation and out-of-sample testing. Put simply, she routinely split the overall sample into two parts. However, now that she is engaged with machine-learning techniques, she notices that the overall data sample is often split into three parts: training, validation, and testing.

Which of the following is the best description of the purpose of the validation set?

a. The validation set is employed to build her model and fit its final parameters
b. The validation set helps avoid overfitting by selecting among candidate models and/or tuning hyperparameters so her model generalizes to new data
c. The validation set is the pure holdout set that is retained to determine the final chosen model's effectiveness at the end of the analysis
d. The validation set is the list of patterns, typically specified in regular expressions (aka, regex), that validate the input data
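For readers who want to see the three-way split in code, here is a minimal sketch (assuming a generic NumPy dataset and scikit-learn's train_test_split; the 60/20/20 proportions and variable names are illustrative assumptions, not part of the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 observations, 3 features (assumed for this sketch)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=100)

# First carve out the hold-out test set ...
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# ... then split the remainder into training and validation sub-samples
X_train, X_val, y_train, y_val = train_test_split(X_trval, y_trval, test_size=0.25, random_state=0)

# The result is three disjoint sub-samples; which purpose each one serves is
# precisely what the question above asks about.
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```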


23.1.3. Charles has a small sample of only 16 monthly observations. Because the sample is small and he is undecided about which type of model to utilize, he wants to apply the technique of k-fold cross-validation. His ultimate hold-out (testing) set will include four months, such that his cross-training set (aka, training plus validation) will include only 12 observations. His first model is a simple univariate linear regression, and that analysis is displayed below. He selects k = 3 folds such that each fold in his 3-fold cross-validation includes 12/3 = 4 observations (excluding the hold-out testing set of 4 months). The three versions of the linear regression (M1, M2, and M3) are each trained on eight observations (see the green cells). This enables him to compute their respective out-of-sample sums of squared residuals (aka, CV RSS, in the blue cells); similarly, the mean squared error (MSE) for each model is displayed.

[Figure P1-T2-23-1-3.png: 3-fold cross-validation of the univariate regression, showing the training folds (green), CV RSS (blue), and MSE for M1, M2, and M3]


Charles observes that the third version of the univariate linear regression (i.e., M3) has the lowest CV RSS.

Given this context, which of the following statements is TRUE about his 3-fold cross-validation?

a. A proper k-fold cross-validation technique computes the average MSE; specifically, (2.02 + 1.98 + 0.94)/3 ≈ 1.65; and compares this performance to that of his alternative models
b. His mistake is that, given the sample size is only 12, he should instead select k = 4 folds with 12/4 = 3 observations per fold
c. The drawback of his approach is that the bias-variance tradeoff implies that the model with the lowest residual sum of squares (RSS) is likely to have the highest mean squared error (MSE), and this subjectivity is a disadvantage of k-fold cross-validation
d. The mistake in his methodology is that each of his three models should instead train on only four observations (rather than eight) so that each can validate on the next four observations and test on the remaining four observations; i.e., 4 train + 4 validate + 4 test = 12 total observations
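For readers who want to replicate the mechanics described in the stem, here is a minimal sketch of 3-fold cross-validation on a 12-observation cross-training set (the data, scikit-learn usage, and random seed are illustrative assumptions and do not reproduce the numbers in Charles's exhibit):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Illustrative cross-training set of 12 monthly observations; the 4-month
# hold-out test set is assumed to have been set aside already.
rng = np.random.default_rng(42)
x = rng.normal(size=(12, 1))
y = 2.0 * x.ravel() + rng.normal(scale=0.5, size=12)

# k = 3 folds of 12/3 = 4 observations each: each model version (M1, M2, M3)
# trains on 8 observations and is validated on the remaining 4.
kf = KFold(n_splits=3, shuffle=False)
fold_mse = []
for train_idx, val_idx in kf.split(x):
    model = LinearRegression().fit(x[train_idx], y[train_idx])
    preds = model.predict(x[val_idx])
    fold_mse.append(mean_squared_error(y[val_idx], preds))

# One out-of-sample MSE per fold; how these fold-level results should be
# combined and compared across candidate models is the point of the question.
print([round(m, 3) for m in fold_mse])
```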

Answers here:
 