denominator (n-1) of sample variance

Sergio Guerrero

New Member
What if the sample size is 200? Should we still divide by (n-1)? Is a portfolio of S&P 500 stocks a sample or a population? Thank you very much.
 

David Harper CFA FRM

Subscriber
Hi @Sergio Guerrero It is still a sample; it's just a large sample. By convention, when the sample is sufficiently large (typically n > 30, so definitely when n = 200), it's fine to use the so-called population variance (i.e., divide by n) because it closely approximates the so-called sample variance. But more deeply, the statistical issue is that, even given n = 200, we are calculating a sample statistic, properly called an estimator (the sample variance), of the unknown population variance, which is the population parameter. (Technically, the estimator is the formula or "recipe", e.g., the sum of squared deviations from the mean divided by (n-1), and the estimate is the value it generates, e.g., 0.0050, for the unknown population parameter.)

As an estimator, either is okay because estimators can have different properties.
  • When we divide the sum of squared deviations from mean by (n-1), for any sample size, we generate an estimate of the variance that is unbiased (an "unbiased estimator" see https://en.wikipedia.org/wiki/Bias_of_an_estimator). So this can't be wrong.
  • If we divide the sum of squared deviations by (n), for any sample size, we generate an estimate of the variance that is biased. Because 1/(n-1)*(n-1)/n = 1/n, the expected value of this estimator is (n-1)/n times the true variance; i.e., it is slightly less than the true value, and the bias is -(true variance)/n. If the data are normally distributed, dividing by (n) gives us the maximum likelihood estimator (MLE), as opposed to the unbiased estimator. As Hull writes in Ch 23 footnote 3: "Replacing (m-1) [ie, in the denominator] by (m) moves us from an unbiased estimate of the variance to a maximum likelihood estimate. Maximum likelihood estimates are discussed later in the chapter." This also is not wrong; it raises the question of which property (or properties) we prefer in a statistical estimator.
In short, then, (n-1) is easy and good because it can't really ever be wrong at any sample size! :) At the same time, if n = 200, we can also defend (n) as a convenient alternative, just like Hull does. I hope that helps!
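For anyone who wants to see the bias rather than take it on faith, here is a minimal simulation sketch. It assumes normally distributed data with a true variance of 1.0; the function name average_estimates, the seed and the trial counts are my own illustrative choices, not anything from Hull.

```python
import random

def average_estimates(n, trials, sigma=1.0, seed=42):
    """Average the (n-1) and (n) variance estimators over many simulated samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sum_unbiased = 0.0
    sum_mle = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
        sum_unbiased += ss / (n - 1)  # so-called sample variance
        sum_mle += ss / n             # so-called population variance
    return sum_unbiased / trials, sum_mle / trials

# Small sample: the gap between the two estimators is easy to see.
print("n=5:  ", average_estimates(5, 20000))
# Large sample: both averages sit close to the true variance of 1.0.
print("n=200:", average_estimates(200, 2000))
```

On average the (n-1) version should land near the true variance at either sample size, while the (n) version should land near (n-1)/n of it, which is a visible shortfall at n = 5 and a very small one at n = 200.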
 

emilioalzamora1

Well-Known Member
It depends. Does 200 comprise the whole population, or is it just a snapshot (a piece of the whole cake)?

1.) If 200 is the whole population, then we simply divide by n
2.) If 200 is a sample out of larger population (e.g. 2000 stocks), then we use the sample variance which implies dividing by (n-1).

ad S&P500 portfolio: it depends again:

1.) if you form a portfolio of all stocks in the S&P500 (for example, allocating your wealth in proportion to market cap), then you have a population and divide by (n)
2.) if you form a portfolio of only some of them (let's say only the utility and financial stocks within the S&P500), then you have a sample from the S&P500 and divide by (n-1)

You can easily check this yourself, for example in Excel, by applying the following functions {population (P) and sample (S)} to different sample sizes:

VAR.P or STDEV.P

VAR.S or STDEV.S
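If you prefer Python to Excel, the standard library's statistics module has what I believe are the direct counterparts: pvariance/pstdev divide by n (like VAR.P/STDEV.P) and variance/stdev divide by (n-1) (like VAR.S/STDEV.S). A minimal sketch, with made-up data values:

```python
import statistics

# Made-up values for illustration: n = 8, mean = 5, sum of squared deviations = 32
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop_var = statistics.pvariance(data)   # divide by n, like VAR.P      -> 32/8 = 4.0
samp_var = statistics.variance(data)   # divide by (n-1), like VAR.S  -> 32/7
pop_sd = statistics.pstdev(data)       # like STDEV.P
samp_sd = statistics.stdev(data)       # like STDEV.S

print(pop_var, samp_var)
print(pop_sd, samp_sd)
```

The sample versions always come out larger than the population versions on the same data, because the same sum of squared deviations is divided by a smaller number.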

As a general rule, for a large sample (n > 30) we tend to use the normal distribution; for a small sample (n < 30) we use the t-distribution. (David has an excellent summary about this in one of the fundamental readings in the section 'Quant. Methods'; he covers all the main distributions and discusses the pros/cons of each. Highly recommend it.)

What we can infer from modelling sample variance and sample standard deviation is the fact that the larger the sample size becomes, the more negligible it is whether we divide by (n) or (n-1).
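To put numbers on "negligible": on any given data set the two estimates share the same sum of squared deviations, so the divide-by-(n) figure is exactly (n-1)/n of the divide-by-(n-1) figure. A small sketch of that arithmetic (relative_gap is just an illustrative helper name, and the sample sizes are arbitrary):

```python
def relative_gap(n):
    """Fraction by which the divide-by-n estimate falls short of the (n-1) estimate."""
    # (ss / n) / (ss / (n - 1)) = (n - 1) / n, so the shortfall is exactly 1/n
    return 1.0 - (n - 1) / n

for n in (5, 30, 200, 2000):
    print(f"n = {n:4d}: shortfall = {relative_gap(n):.2%}")
```

So at n = 200 the choice of denominator moves the variance estimate by half a percent, which is usually far smaller than the sampling error in the estimate itself.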

This is somewhat related to the central limit theorem (CLT), even though the CLT centers on the sample mean.

What happens to the sampling distribution of the mean as the sample size becomes large? A larger sample size leads to an estimate of the mean that is on average closer to the population mean.
If a random variable X has mean mu and variance sigma^2, then the sampling distribution of the sample mean of N independent observations of X becomes approx. normal with mean mu and variance sigma^2/N as N increases.

It is the CLT that provides a theoretical justification for assuming that the sample mean (our estimator of mu) is approximately normally distributed.
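Here is a rough way to watch the CLT at work, as a sketch only. I assume a strongly skewed population, Exponential(1), which has mean 1 and variance 1, and look at the means of many samples of size 50. The function name, seed, sample size and trial count are arbitrary choices of mine.

```python
import math
import random
import statistics

def sample_means(n, trials, seed=7):
    """Means of `trials` independent samples of size n drawn from Exponential(1)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(trials)]

n = 50
means = sample_means(n, trials=5000)

center = statistics.fmean(means)      # should be near mu = 1
spread = statistics.pvariance(means)  # should be near sigma^2 / n = 0.02

# If the sample means are roughly normal, about 95% of them should fall
# within 1.96 standard errors of mu.
half_width = 1.96 * math.sqrt(1.0 / n)
share_inside = sum(abs(m - 1.0) < half_width for m in means) / len(means)

print(center, spread, share_inside)
```

The population itself is nowhere near normal here; it is only the distribution of the sample means that tightens around mu with variance close to sigma^2/N.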

Means, variances and covariances can be measured with certainty ONLY if we know about all possible outcomes (the population), but when we make a study, we have only a sample of a population.

A reasonable choice for an estimator of the variance of a random variable is

1/N * sum_i (X_i - Xbar)^2 or written as { sum_i (X_i - Xbar)^2 } / N

where Xbar is the sample mean and the sum runs over the N observations. However, this estimator is biased and we would like to have an unbiased estimator of the variance

1/(N-1) * sum_i (X_i - Xbar)^2 or written as { sum_i (X_i - Xbar)^2 } / (N-1)
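
For those who want to see where the (N-1) comes from, here is a short derivation sketch. It assumes the observations X_1, ..., X_N are independent draws with mean \mu and variance \sigma^2, and it writes the sample mean as \bar{X}; the second line uses the fact that the variance of \bar{X} is \sigma^2/N.

```latex
\begin{align*}
\sum_{i=1}^{N} (X_i - \bar{X})^2
  &= \sum_{i=1}^{N} (X_i - \mu)^2 - N(\bar{X} - \mu)^2 \\
E\left[\sum_{i=1}^{N} (X_i - \bar{X})^2\right]
  &= N\sigma^2 - N \cdot \frac{\sigma^2}{N} = (N-1)\,\sigma^2 \\
E\left[\frac{1}{N}\sum_{i=1}^{N} (X_i - \bar{X})^2\right]
  &= \frac{N-1}{N}\,\sigma^2,
  \qquad
  E\left[\frac{1}{N-1}\sum_{i=1}^{N} (X_i - \bar{X})^2\right] = \sigma^2
\end{align*}
```

In words: measuring deviations from the sample mean rather than the true mean uses up one degree of freedom, so dividing by N understates the variance on average by the factor (N-1)/N, and dividing by (N-1) removes that shortfall.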

Ruppert (Statistics and Finance) makes the following argument:

'There is a common misunderstanding about the CLT. It is misunderstood that a large population is approximately normally distributed.
The CLT says nothing about the distribution of a population; it is only a statement about the distribution of a sample mean. Also, the CLT does not assume that the population is large; it is the size of the sample that is converging to infinity.
Assuming that the sampling is with replacement, the population could be quite small, in fact, with only two elements.


Although the CLT was first discovered for the sample mean, other estimators are now known to also have approximate normal distributions for large sample sizes. In particular, there are central limit theorems for the maximum likelihood estimators. This is very important, since most estimators we use will be maximum likelihood estimators or least squares estimators. So, if we have a reasonably large sample, we can assume that these estimators have an approximately normal distribution and the normal distribution can be used for testing and constructing confidence intervals.'

For further background reading I do refer you to my all-time favourite which I have cited quite often already in this forum, some would say an 'ancient' book, but still sublime in many fields:

https://www.amazon.com/dp/0079132928/?tag=bt077d-20
 