Statistics: Variance, Standard Deviation and Coefficient of Variation

Stats: Measures of Variation- Jerry Wu

Mode
The mode of a variable is the value or category with the highest frequency in the data. It is most commonly used with qualitative data.

Median
The middle value of the data when the data are arranged from lowest to highest. If the sample size n is odd, the median is the (n+1)/2-th value; if n is even, the median is the average of the (n/2)-th and (n/2 + 1)-th values.

Mean = Mu
The sum of the measurements taken on a variable divided by the number of measurements. It is only meaningful for quantitative data. The population mean is written mu; the sample mean is written Xbar.
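
As a quick illustration, the three measures above can be computed with Python's standard `statistics` module. The data list is made up for this sketch and does not come from the notes.

```python
import statistics

# Hypothetical data, chosen only for illustration
data = [2, 3, 3, 5, 7, 10]

mode = statistics.mode(data)      # most frequent value
median = statistics.median(data)  # n = 6 is even: average of the 3rd and 4th values
mean = statistics.mean(data)      # sum of the values divided by n

print(mode, median, mean)  # 3 4.0 5
```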

Range

The range is the simplest measure of variation to find. It is simply the highest value minus the lowest value.

RANGE = MAXIMUM - MINIMUM

Since the range only uses the largest and smallest values, it is greatly affected by extreme values; that is, it is not a resistant measure.
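
A small sketch of this sensitivity, using made-up scores (an assumption for illustration only): adding a single extreme value changes the range drastically.

```python
# Hypothetical data, chosen only for illustration
scores = [70, 72, 75, 78, 80]
with_outlier = scores + [150]  # same data plus one extreme value


def value_range(values):
    """Return the range: maximum minus minimum."""
    return max(values) - min(values)


print(value_range(scores))        # 10
print(value_range(with_outlier))  # 80
```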

Variance
"Average Deviation"
The range only involves the smallest and largest numbers, and it would be desirable to have a statistic which involved all of the data values.

A first attempt might be the average deviation from the mean, defined as:

Average deviation =

(1)
$$\frac{\sum (X - \mu)}{N}$$

The problem is that this sum is always zero: the sum of the values equals N times mu, so the positive and negative deviations cancel exactly.
So the average deviation is always zero, which is why it is never used as a measure of variation.
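
A numeric check of this cancellation, sketched with made-up data (an assumption for illustration). Exact fractions are used so that floating-point rounding does not obscure the result.

```python
from fractions import Fraction

# Hypothetical data, chosen only for illustration
data = [3, 8, 4, 10, 6]

# Exact arithmetic avoids floating-point rounding in the check
mu = Fraction(sum(data), len(data))
deviations = [x - mu for x in data]
total = sum(deviations)

print(mu, total)  # 31/5 0
```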

Population Variance
To keep the sum from being zero, each deviation from the mean is squared before averaging. The average of these squared deviations from the mean is called the variance.

Population Variance=

(2)
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

Unbiased Estimate of the Population Variance
One would expect the sample variance to be the population variance formula with the population mean replaced by the sample mean. However, one of the major uses of a statistic is to estimate the corresponding population parameter, and that formula is biased: on average it underestimates the population variance. To counteract this, the sum of the squared deviations is divided by one less than the sample size.

Sample Variance=

(3)
$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
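
The effect of the divisor can be checked by brute force. The sketch below assumes a tiny made-up population and lists every possible sample of size 2 drawn with replacement; both choices are assumptions for illustration. Averaged over all samples, dividing by n - 1 recovers the population variance, while dividing by n falls short.

```python
from itertools import product
from statistics import mean, pvariance, variance

# Hypothetical population, chosen only for illustration
population = [1, 2, 3, 4]
sigma2 = pvariance(population)  # population variance, divides by N

# Every possible sample of size 2, drawn with replacement
samples = list(product(population, repeat=2))

avg_div_n = mean(pvariance(s) for s in samples)         # divisor n
avg_div_n_minus_1 = mean(variance(s) for s in samples)  # divisor n - 1

print(sigma2, avg_div_n, avg_div_n_minus_1)  # 1.25 0.625 1.25
```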

Standard Deviation
There is a problem with variances. Recall that the deviations were squared. That means that the units were also squared. To get the units back the same as the original data values, the square root must be taken.

Population Standard Deviation=

(4)
$$\sigma = \sqrt{\sigma^2}$$

Sample Standard Deviation=

(5)
$$S = \sqrt{S^2}$$

The sample standard deviation is not an unbiased estimator of the population standard deviation, even though the sample variance is an unbiased estimator of the population variance.
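
A brute-force sketch of this bias, assuming a tiny made-up population and all samples of size 2 drawn with replacement (both are assumptions for illustration): the average of the sample standard deviations comes out below the population standard deviation.

```python
from itertools import product
from statistics import mean, pstdev, stdev

# Hypothetical population, chosen only for illustration
population = [1, 2, 3, 4]
sigma = pstdev(population)  # population standard deviation

# Every possible sample of size 2, drawn with replacement
samples = list(product(population, repeat=2))

# Average of the sample standard deviations (divisor n - 1) over all samples
avg_sample_sd = mean(stdev(s) for s in samples)

print(round(sigma, 3), round(avg_sample_sd, 3))  # 1.118 0.884
```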

Example:

Joseph's midterm grades are: Statistics 96 points, Math 90 points, English 85 points, Geography 78 points, History 92 points, Chemistry 67 points. What is the variance of Joseph's midterm grades? What is the standard deviation?
Solution:
We first note that the data are the entire population (all of Joseph's midterm grades), not a sample, so the population variance formula above applies. First, the mean:

(96+90+85+78+92+67)/6 = 508/6 = 84.67 (rounded)
Average grade (mean = mu) = 84.67

Therefore the variance is:

(6)
$$\sigma^2 = \frac{(96-84.67)^2 + (90-84.67)^2 + (85-84.67)^2 + \cdots + (67-84.67)^2}{6} \approx 94.56$$

The Standard deviation is:

(7)
$$\sigma = \sqrt{94.56} \approx 9.72$$
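
The arithmetic above can be checked with Python's standard `statistics` module. This is only a verification sketch; `pvariance` and `pstdev` are the population versions, which divide by N.

```python
from statistics import mean, pstdev, pvariance

# Joseph's six midterm grades from the example
grades = [96, 90, 85, 78, 92, 67]

mu = mean(grades)        # 508 / 6
var = pvariance(grades)  # population variance, divides by N = 6
sd = pstdev(grades)      # square root of the population variance

print(round(mu, 2), round(var, 2), round(sd, 2))  # 84.67 94.56 9.72
```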

The calculator does not have a variance key, but it does have a standard deviation key. To find the variance, square the standard deviation.

Sum of Squares (shortcuts)

The sum of the squares of the deviations from the mean is given a shortcut notation and several alternative formulas.

(8)
$$SS(X) = \sum (X - \bar{X})^2$$

A little algebraic simplification returns:

(9)
$$SS(X) = \sum X^2 - \frac{(\sum X)^2}{n}$$
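
A short check that the two forms agree, sketched with Joseph's grades from the earlier example. The variable names are chosen for this illustration only.

```python
# Joseph's six midterm grades from the earlier example
grades = [96, 90, 85, 78, 92, 67]
n = len(grades)
x_bar = sum(grades) / n

# Definition: sum of squared deviations from the mean
ss_definition = sum((x - x_bar) ** 2 for x in grades)

# Shortcut: sum of squares minus (square of the sum) / n
ss_shortcut = sum(x * x for x in grades) - sum(grades) ** 2 / n

print(round(ss_definition, 2), round(ss_shortcut, 2))  # 567.33 567.33
```

Dividing this sum of squares by 6 gives the population variance of 94.56 found in the example.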

Coefficient of Variation

The coefficient of variation (CV), also known as “relative variability”, equals the standard deviation divided by the mean. CV is often presented as the given ratio multiplied by 100. The CV for a single variable aims to describe the dispersion of the variable in a way that does not depend on the variable's measurement unit. The higher the CV, the greater the dispersion in the variable. The CV for a model aims to describe the model fit in terms of the relative sizes of the squared residuals and outcome values. The lower the CV, the smaller the residuals relative to the predicted value. This is suggestive of a good model fit.

(10)
$$CV = \frac{S}{\bar{X}} \ \text{(sample)}, \qquad CV = \frac{\sigma}{\mu} \ \text{(population)}$$

Example:

The heights and weights of 5 students are given below. Compare the dispersion of the two variables.
n = 5
Height: 172, 168, 164, 170, 176 (cm)
Weight: 62, 57, 58, 64, 64 (kg)

Since the two kinds of data have different units, we compare their dispersion by calculating the coefficient of variation of both the height and the weight.

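Coefficient of Variation for Height of 5 students
(4.47/170) x 100% = 2.63%

Here the five students are treated as a sample, so the standard deviation uses the divisor n - 1 = 4:

(11)
$$\bar{X} = \frac{164+168+170+172+176}{5} = 170, \quad S = \sqrt{\frac{(164-170)^2 + (168-170)^2 + (170-170)^2 + (172-170)^2 + (176-170)^2}{5-1}} = \sqrt{20} \approx 4.47$$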

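Coefficient of Variation for Weight of 5 students
(3.32/61) x 100% = 5.4%

The sample standard deviation (divisor n - 1 = 4) is:

(12)
$$\bar{X} = \frac{57+58+62+64+64}{5} = 61, \quad S = \sqrt{\frac{(57-61)^2 + (58-61)^2 + (62-61)^2 + (64-61)^2 + (64-61)^2}{5-1}} = \sqrt{11} \approx 3.32$$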

Comparing the CVs of the height and the weight, we see that the weight has the larger relative dispersion (5.4% versus 2.63%).
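
The comparison can be reproduced with a few lines of Python. This sketch uses `statistics.stdev`, the sample standard deviation with divisor n - 1, because that is what reproduces the percentages quoted above. The helper name `cv_percent` is hypothetical.

```python
from statistics import mean, stdev

# Heights and weights of the five students from the example
height_cm = [172, 168, 164, 170, 176]
weight_kg = [62, 57, 58, 64, 64]


def cv_percent(values):
    """Coefficient of variation as a percentage (sample SD / mean)."""
    return stdev(values) / mean(values) * 100


print(round(cv_percent(height_cm), 2))  # 2.63
print(round(cv_percent(weight_kg), 2))  # 5.44
```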

page revision: 21, last edited: 26 May 2010 14:58