P1.T2.23.3 The k-means algorithm and natural language processing (NLP) steps

Nicole Seaman

Director of CFA & FRM Operations
Learning objectives: Describe how the K-means algorithm separates a sample into clusters. Be aware of natural language processing and how it is used.

Questions:

23.3.1. For each of three banks (Azure Bank, Breeze Bank, and Coastal Bank), Stella collected the following data on three features:

P1.T2.23.3.1.png


Statistics displayed (top-right panel) include the mean, sample standard deviation (STDEV.S), and population standard deviation (STDEV.P). The standardized features (lower-left panel) are based on the sample standard deviation.

With respect to the standardized features (not the raw values), which of the following is nearest to the Euclidean distance between Breeze Bank and Coastal Bank?

a. 2.40
b. 4.15
c. 93.63
d. 119.50
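
As a rough illustration of the mechanics behind this question (not its solution), the sketch below standardizes each feature with the sample standard deviation and then computes a Euclidean distance on the standardized values. The bank names and numbers are hypothetical placeholders, not the values in the exhibit, and the helper names `standardize` and `euclidean` are invented for this example.

```python
import math
import statistics

# Hypothetical raw features (NOT the exhibit's data): three made-up values per bank.
raw = {
    "Bank X": [10.0, 200.0, 3.0],
    "Bank Y": [20.0, 400.0, 6.0],
    "Bank Z": [30.0, 300.0, 9.0],
}


def standardize(data):
    """Return per-feature z-scores using the sample standard deviation."""
    names = list(data)
    # Transpose rows (banks) into columns (features).
    columns = list(zip(*(data[name] for name in names)))
    means = [statistics.mean(col) for col in columns]
    # statistics.stdev divides by n - 1, the same convention as STDEV.S.
    stdevs = [statistics.stdev(col) for col in columns]
    return {
        name: [(x - m) / s for x, m, s in zip(data[name], means, stdevs)]
        for name in names
    }


def euclidean(a, b):
    """Square root of the sum of squared differences across features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


z = standardize(raw)
distance_yz = euclidean(z["Bank Y"], z["Bank Z"])
print(round(distance_yz, 4))
```

Swapping `statistics.stdev` for `statistics.pstdev` would instead use the population standard deviation (the STDEV.P convention), which changes every standardized value and therefore the distance.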


23.3.2. Peter has a list of countries and their features. He wants to employ the k-means algorithm in order to cluster the countries into risk buckets, e.g., very high political risk. His general procedure will include the following steps:
  1. Choose the number of clusters (k) and randomly locate the initial centroids
  2. Assign each observation (aka, data point) to its nearest centroid
  3. Recalculate (aka, update) the centroid positions
  4. Repeat steps #2 and #3 until centroids no longer change
  5. Finalize clustering: Once convergence is reached, the algorithm has created stable clusters. The data points are now grouped into k clusters, and the centroids represent the center points of these clusters.
  6. Evaluate the model's performance
  7. Communicate findings to management, which includes an audience unfamiliar with machine learning
Which of the following is truly a limitation (aka drawback) of this k-means approach?

a. He needs to choose the number of clusters; i.e., Step #1
b. His final centroids will lack interpretability; i.e., Step #7
c. He needs to label each observation, given this is a supervised technique
d. He cannot quantitatively evaluate the model's performance due to the visual nature of scatterplots; i.e., Step #6
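
Steps #1 to #4 of Peter's procedure can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a production implementation: the function name `kmeans`, its `seed` and `max_iter` parameters, and the two-dimensional sample points are all hypothetical, and the initial centroids are simply drawn at random from the data points. The function also returns the inertia (within-cluster sum of squared distances), which is one common numerical summary of a clustering.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: k is chosen by the caller; initial centroids are random data points.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each observation to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster))
            if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Inertia: sum of squared distances from each point to its nearest centroid.
    inertia = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, clusters, inertia


# Hypothetical two-feature observations forming two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters, inertia = kmeans(points, k=2)
print(centroids, round(inertia, 4))
```

Note that `k` is an input to the function rather than something the algorithm discovers; rerunning with a different `k` and comparing the resulting inertia values is one way to explore that choice.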


23.3.3. Brenda is conducting a sentiment analysis of a public company's quarterly (10Q) and annual (10K) reports. Her plan is to employ a basic version of natural language processing (NLP); in general, she will perform the following steps:

I. Tokenize the document with regular expressions, aka regex. Tokenization splits the text into words or possibly smaller (or even larger) units.
II. Remove stop words
III. Replace some words with their stems or lemmas (aka stemming and lemmatization)
V. Create n-grams
VI. Generate a bag-of-words model

Each of the following statements about NLP is true EXCEPT which one?

a. NLP is fast and can be applied to written language or spoken speech
b. Stop words are removed because they tend to cause the algorithm's for and while loops to execute indefinitely, which effectively requires a break
c. The goal of both stemming and lemmatization is to reduce words to their common base (aka, root) form
d. An n-gram is a contiguous cluster of n words (or n tokens) with a specific meaning when considered as a whole
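
The preprocessing steps Brenda lists can be sketched with the Python standard library alone. Everything here is a simplified assumption for illustration: the stop-word set is a tiny hand-picked list, `crude_stem` is a naive suffix stripper (not a real stemming algorithm such as Porter's, and not a lemmatizer), and the sample sentence is invented rather than taken from any filing.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was"}


def tokenize(text):
    """Split text into lowercase alphabetic tokens with a regular expression."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop very common words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Naive stemming: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n):
    """Return contiguous runs of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "Revenue increased and margins increased in the quarter."
tokens = tokenize(text)
filtered = remove_stop_words(tokens)
stems = [crude_stem(t) for t in filtered]
bigrams = ngrams(stems, 2)
# Bag of words: counts of each unigram and bigram, ignoring word order.
bag = Counter(stems + bigrams)
print(bag.most_common(3))
```

A sentiment score could then be built by comparing the counts in `bag` against lists of positive and negative terms, which this sketch does not attempt.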

Answers here:
 