# P1.T2.23.2 Principal component analysis (PCA) and model complexity

#### Nicole Seaman

##### Director of CFA & FRM Operations
Learning objectives: Understand the differences between and consequences of underfitting and overfitting, and propose potential remedies for each. Use principal components analysis to reduce the dimensionality of a set of features.

Questions:

23.2.1. Patricia is building a factor model for her firm's primary equity portfolio. Her database includes several dozen candidate common factors, aka features. She considers employing principal component analysis (PCA) for the task. Each of the following statements about PCA is true EXCEPT which is false?

a. PCA is an unsupervised learning technique
b. The primary benefit of PCA is dimensionality reduction
c. PCA translates correlated features into a linear combination of uncorrelated components
d. PCA is an ensemble technique that combines models where each model partitions the data into hierarchical nodes with branches

23.2.2. Oliver is conducting principal component analysis (PCA) for risk management purposes: his goal is to isolate the primary drivers of price variability in the firm's portfolios. His dataset includes many features, and his general procedure includes the following steps (but various sub-steps are excluded):
1. Standardize the data, which has n features
2. Compute covariance matrix
3. Retrieve eigenvalues and eigenvectors
4. Sort the eigenvectors by their eigenvalues and choose the components with a scree plot
Which of the following is a drawback of his PCA technique?

a. Variables with larger scales will dominate the analysis
b. It may be hard to interpret the meaning of the components
c. It will require that he label the observations, which is a time-consuming process
d. Depending on the interior relationships, his final product will generate n*(n-1)/2 principal components
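Oliver's four steps can be sketched in NumPy. This is a minimal illustration on hypothetical random data, not his actual dataset; the variable names and the injected correlation are assumptions for demonstration only.

```python
import numpy as np

# Hypothetical data: 200 observations, n = 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]   # inject correlation between two features

# 1. Standardize the data (zero mean, unit variance per feature)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute the covariance matrix (n x n)
cov = np.cov(Z, rowvar=False)

# 3. Retrieve eigenvalues and eigenvectors (eigh: symmetric matrix)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort eigenvectors by descending eigenvalue; a scree plot of
#    explained_variance would guide how many components to retain
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained_variance = eigvals / eigvals.sum()

# Project onto the eigenvectors: the resulting principal components
# are uncorrelated linear combinations of the original features
components = Z @ eigvecs
```

Note that standardizing in step 1 is what prevents the drawback in choice (a): without it, the analysis would effectively be run on the covariance of raw data, and large-scale variables would dominate.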

23.2.3. For purposes of fraud detection, Sarah is experimenting with the following machine learning techniques:
• Decision tree
• Random forest
• K-nearest neighbor (KNN)
• A high-order (aka, high-degree) polynomial regression
In each experiment, she holds out a testing set and then splits the remaining data into a training set and a validation set. Across all of her models she makes the same observation: as she increases the complexity of the model, the model's error on the training set decreases; i.e., its performance on the training set improves. Among the following choices, which is TRUE as the BEST advice to give her?

a. She should reduce complexity with the goal of minimizing the variance of the prediction in the training and validation set
b. Her models exhibit the classic performance-complexity symptom because she is mistakenly using unsupervised learning models such that she should switch to supervised learning models
c. She should increase complexity until the model's error begins to increase (aka, deteriorates) for the validation set because further complexity is likely to produce predictions with low bias but excessive variance
d. She should increase complexity until the model's error begins to increase (aka, deteriorates) for the validation set because further complexity is likely to produce predictions with low variance but excessive bias
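Sarah's observation for the polynomial regression case can be reproduced with a small sketch. The data, the degree range, and the train/validation split below are hypothetical, assumed only to illustrate the bias-variance pattern: training error keeps falling as degree (complexity) rises, while validation error eventually deteriorates.

```python
import numpy as np

# Hypothetical noisy data: y = sin(3x) + noise
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=60)

x_tr, y_tr = x[:40], y[:40]    # training set
x_va, y_va = x[40:], y[40:]    # validation set

def mse(deg):
    """Fit a degree-`deg` polynomial on the training set; return
    (training MSE, validation MSE)."""
    coeffs = np.polyfit(x_tr, y_tr, deg)
    tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    va = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    return tr, va

errors = {deg: mse(deg) for deg in range(1, 16)}

# Training error is (weakly) decreasing in degree, as Sarah observed;
# the right stopping point is the degree that minimizes *validation*
# error, before further complexity trades bias for excess variance.
best = min(errors, key=lambda d: errors[d][1])
```

This is why choice (c) is the best advice: the validation set, not the training set, signals where added complexity stops helping.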