I. Descriptive Statistics

1. How do we represent and summarize a variable? Why do we calculate the mean and standard deviation of a variable? What if the variable is skewed or discrete?

The most faithful way to represent a variable is to plot its distribution as a histogram, which stays close to the raw data. We usually use the mean (\(\mu\)) to quantify the central tendency (typical value) of the distribution and the standard deviation (\(\sigma\)) to characterize its dispersion (spread). In addition, we use quantiles to mark locations within the distribution.
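As a minimal sketch of these three summaries, the snippet below uses Python's standard `statistics` module. The sample `values` is invented purely for illustration, and `pstdev` is chosen on the assumption that the data are treated as a whole population (\(\sigma\)); `stdev` would give the sample estimate instead.

```python
import statistics

# Hypothetical sample, invented for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Central tendency: the arithmetic mean.
mean = statistics.mean(values)

# Dispersion: population standard deviation (divides by n, not n - 1).
sd = statistics.pstdev(values)

# Location: the three quartile cut points (25%, 50%, 75%).
quartiles = statistics.quantiles(values, n=4)

print(f"mean = {mean}, sd = {sd}, quartiles = {quartiles}")
```

The middle quartile is the median, so it can be cross-checked against `statistics.median(values)`.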

Notes:

  • The mean is only one measure of central tendency. Two other common measures are the mode and the median. Which measure is most appropriate depends on the question and on the shape of the distribution; for a strongly skewed variable, for example, the median is often a better typical value than the mean.

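  • The standard deviation is defined relative to the mean: it summarizes how far values lie from the mean. If the mean is not well defined, neither is the standard deviation. Because the standard deviation is expressed in the same units and scale as the mean, comparing raw standard deviations across data sets with different means or units can be misleading. To compare dispersion across groups, we can use the coefficient of variation, \(CV = \sigma/\mu\), which is meaningful when the data are on a ratio scale and the mean is not close to zero.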

  • If the distribution can be described mathematically by a known distribution formula, the parameters of that formula are the most precise summary of the variable.
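The notes above can be illustrated with a short sketch, again using only the standard `statistics` module. All data are hypothetical, and `coefficient_of_variation` is a helper name introduced here for illustration. The first part shows how the mean, median, and mode diverge on a right-skewed sample; the second shows that the same measurements expressed in two units have different standard deviations but the same CV.

```python
import statistics


def coefficient_of_variation(data):
    """Return population sd / mean; assumes ratio-scale data, non-zero mean."""
    mu = statistics.mean(data)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(data) / mu


# Hypothetical right-skewed sample: one large value pulls the mean upward.
skewed = [1, 1, 2, 2, 2, 3, 4, 20]
center = {
    "mean": statistics.mean(skewed),
    "median": statistics.median(skewed),
    "mode": statistics.mode(skewed),
}

# Hypothetical heights, recorded once in centimetres and once in metres.
heights_cm = [150, 160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]

sd_cm = statistics.pstdev(heights_cm)
sd_m = statistics.pstdev(heights_m)
cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(center)
print(f"sd: {sd_cm:.3f} cm vs {sd_m:.5f} m")
print(f"CV: {cv_cm:.4f} vs {cv_m:.4f}")
```

The standard deviations differ by the unit conversion factor, while the CV is unit-free, which is why it is the more suitable quantity for comparing spread across groups.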

For example, a Normal distribution (bell curve) can be described as \(X \sim {\sf Norm}(\mu,\sigma)\), while a Poisson distribution (a discrete distribution that is right-skewed when \(\lambda\) is small) can be described as \(X \sim {\sf Pois}(\lambda)\). In a Normal distribution, the mean and standard deviation are the two parameters that determine the distribution. Because approximately Normal data are common in practice, these two parameters are widely used to characterize data sets (represented as distribution histograms). In a Poisson distribution, \(\lambda\) is the only parameter (rate parameter), which is the total number of events (