### 1. How do we represent and summarize a variable? Why do we calculate the mean and standard deviation of a variable? What if the variable is skewed or discrete?

The most faithful way to represent a variable is to plot its distribution as a histogram, since the distribution stays close to the raw data. To summarize it, we usually use the *mean* (\(\mu\)) to quantify the central tendency (the typical value) of the distribution and the *standard deviation* (\(\sigma\)) to characterize its dispersion (spread). In addition, we use quantiles, such as the median and the quartiles, to mark locations within the distribution.
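As a minimal sketch of these summaries, the snippet below computes the mean, standard deviation, and quartiles with Python's standard `statistics` module. The `values` list is made-up data for illustration, and the sample standard deviation (denominator \(n-1\)) is used as an assumption; `statistics.pstdev` would give the population version.

```python
import statistics

# Hypothetical sample, invented purely for illustration.
values = [4.1, 5.3, 5.8, 6.0, 6.2, 6.9, 7.4, 8.0, 8.8, 9.5]

# Central tendency: the arithmetic mean.
mean = statistics.mean(values)

# Dispersion: the sample standard deviation (n - 1 denominator).
sd = statistics.stdev(values)

# Location: the three cut points that split the data into quartiles.
q1, q2, q3 = statistics.quantiles(values, n=4)

print(f"mean = {mean:.2f}, sd = {sd:.2f}")
print(f"quartiles = {q1:.2f}, {q2:.2f}, {q3:.2f}")
```

The middle quartile `q2` is the median, so quantiles and central tendency are closely related ways of describing the same distribution.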

Notes:

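The *mean* is just one measure of central tendency. Two other common measures are the *mode* and the *median*. Which measure is most appropriate depends on the question being asked and on the shape of the distribution. For a strongly skewed variable, for example, the *median* is often a better description of the typical value because it is less sensitive to extreme values.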
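To illustrate how the three measures can disagree on skewed data, here is a small sketch using a hypothetical right-skewed sample (the numbers are invented). A single large value pulls the mean upward while the median and mode stay near the bulk of the data.

```python
import statistics

# Hypothetical right-skewed sample: one extreme value in the right tail.
incomes = [20, 22, 22, 25, 27, 30, 32, 35, 40, 250]

mean_income = statistics.mean(incomes)      # sensitive to the extreme value
median_income = statistics.median(incomes)  # middle of the sorted data
mode_income = statistics.mode(incomes)      # most frequent value

print(f"mean   = {mean_income}")
print(f"median = {median_income}")
print(f"mode   = {mode_income}")
```

In this sample the mean lies above nine of the ten observations, which is why the median is usually preferred for summarizing skewed variables.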

The *standard deviation* is defined relative to the *mean*: it measures how far values typically deviate from the mean. If the *mean* is not well defined, neither is the *standard deviation*. Because the two are tied together, comparing raw *standard deviations* across data sets with different means or units can be misleading. To compare dispersion across groups, we can use the *coefficient of variation*: \(CV = \sigma/\mu\), which expresses the spread relative to the mean and is meaningful when the mean is nonzero.
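The following sketch shows why the coefficient of variation is easier to compare than the raw standard deviation. The helper `coefficient_of_variation` and the height data are hypothetical, and the sample standard deviation stands in for \(\sigma\) as an assumption. The same measurements expressed in two units have different standard deviations but the same CV.

```python
import statistics

def coefficient_of_variation(values):
    """Return sd / mean for data with a positive mean (illustrative helper)."""
    mu = statistics.mean(values)
    if mu <= 0:
        raise ValueError("CV is only computed here for data with a positive mean")
    return statistics.stdev(values) / mu

# Hypothetical heights, recorded in centimetres and converted to metres.
heights_cm = [160, 165, 170, 175, 180]
heights_m = [h / 100 for h in heights_cm]

sd_cm = statistics.stdev(heights_cm)
sd_m = statistics.stdev(heights_m)
cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(f"sd: {sd_cm:.4f} cm vs {sd_m:.4f} m")  # differs by the unit factor
print(f"CV: {cv_cm:.4f} vs {cv_m:.4f}")        # unit-free, so they agree
```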

If the distribution curve can be described mathematically by a known distribution formula, the parameters of that formula are the most precise way to summarize it.

For example, a Normal distribution (bell curve) can be written as \(X \sim {\sf Norm}(\mu,\sigma)\), while a Poisson distribution (a discrete, right-skewed distribution) can be written as \(X \sim {\sf Pois}(\lambda)\). In a Normal distribution, the *mean* and *standard deviation* are the two parameters that determine the distribution. Because the Normal distribution is so common in real-world data, these two parameters are the usual measures for characterizing data sets (represented as distribution histograms). In a Poisson distribution, \(\lambda\) is the only parameter (rate parameter), which is the total number of events (