Cases
Individuals who have a specific disease.
Complication Rate
The complication rate is the percentage of patients, out of all the patients who received a drug, who had adverse and/or serious adverse events. Dindo et al. (2004) discuss this in detail.
Confidence interval (CI)
A confidence interval (CI) is a type of interval estimate of a population parameter. It is an observed interval (i.e. it is calculated from the observations), in principle different from sample to sample, that frequently includes the parameter of interest if the experiment is repeated. How frequently the observed interval contains the parameter is determined by the confidence level or confidence coefficient. More specifically, the meaning of the term "confidence level" is that, if confidence intervals are constructed across many separate data analyses of repeated (and possibly different) experiments, the proportion of such intervals that contain the true value of the parameter will match the confidence level.
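As a worked illustration with hypothetical numbers: for a sample mean \bar{x} = 50 with estimated standard error 2, an approximate 95% confidence interval for the population mean is

\bar{x} \pm 1.96 \times SE = 50 \pm 1.96 \times 2 = (46.1, 53.9)

and about 95% of intervals constructed this way over repeated samples would contain the true mean.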
Confounding variable
A variable which is not a factor being considered in an observational study or experiment, but may be at least partially responsible for the observed outcomes. Experimental design methods use randomisation to minimize the effect of confounding variables, but that is not possible in observational studies. The possibility of one or more confounding variables is one of the biggest problems in trying to make inferences based on observational studies.
Control group
The group that experiences the control treatment (standard treatment against which new treatments are compared).
Covariate
A variable that has an effect that is of no direct interest. The analysis of the variable of interest is made more accurate by controlling for variation in the covariate.
Cox proportional hazards regression
Whereas the Kaplan-Meier method with the log-rank test is useful for comparing survival curves in two or more groups, Cox proportional hazards regression allows the effect of several risk factors on survival to be analysed. The probability of the endpoint (death, or any other event of interest, e.g. recurrence of disease) is called the hazard.
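A minimal sketch of fitting such a model in Python, assuming the third-party lifelines package; the data frame and column names below are hypothetical:

    import pandas as pd
    from lifelines import CoxPHFitter

    # Hypothetical survival data: follow-up time, event indicator, covariate.
    df = pd.DataFrame({
        "duration": [5, 8, 12, 3, 9, 11],   # time to event or censoring
        "event":    [1, 0, 1, 1, 0, 1],     # 1 = event observed, 0 = censored
        "age":      [60, 55, 72, 48, 66, 70],
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="duration", event_col="event")
    cph.print_summary()  # exp(coef) is the hazard ratio for each covariate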


Demography
The study of the human population and its composition.
Difference (Statistical)
The quantity by which one quantity differs from another; the remainder left after subtracting one quantity from another. A new drug is substantially different from the standard in some quantifiable way.


Effect Size
Effect size is a value indicating how much your independent variable has affected the dependent variable in an experimental study (i.e. how much variance in your dependent variable was a result of the independent variable). When using effect size with ANOVA, we use \eta^{2}.
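For a one-way ANOVA, \eta^{2} is the ratio of the between-groups sum of squares to the total sum of squares; with hypothetical values SS_{between} = 20 and SS_{total} = 80:

\eta^{2} = \frac{SS_{between}}{SS_{total}} = \frac{20}{80} = 0.25

i.e. 25% of the variance in the dependent variable is attributable to the independent variable.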
Equivalence
A new drug's effect is close to that delivered by the standard. Being slightly inferior is still acceptable.


Gene Expression
The process by which information from a gene (DNA) is used to synthesise functional gene products.
Generalised Estimating Equation (GEE) model
Used to estimate the parameters of a generalized linear model with a possible unknown correlation between outcomes. Parameter estimates from the GEE are consistent even when the covariance structure is misspecified, under mild regularity conditions. The focus of the GEE is on estimating the average response over the population ("population-averaged" effects) rather than the regression parameters that would enable prediction of the effect of changing one or more covariates on a given individual. GEEs are a popular alternative to the likelihood-based generalized linear mixed model, which is more sensitive to variance structure specification. They are commonly used in large epidemiological studies, especially multi-site cohort studies, because they can handle many types of unmeasured dependence between outcomes.
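A minimal sketch of fitting a GEE in Python, assuming the statsmodels package; the toy data, variable names and exchangeable working correlation are hypothetical choices for illustration:

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Hypothetical repeated measurements: two visits per subject.
    df = pd.DataFrame({
        "id":        [1, 1, 2, 2, 3, 3, 4, 4],
        "time":      [0, 1, 0, 1, 0, 1, 0, 1],
        "treatment": [0, 0, 0, 0, 1, 1, 1, 1],
        "y":         [2.1, 2.3, 1.9, 2.2, 3.0, 3.4, 2.8, 3.3],
    })

    # Population-averaged model with an exchangeable working correlation
    # within subjects; estimates stay consistent even if this is misspecified.
    model = smf.gee("y ~ treatment + time", groups="id", data=df,
                    family=sm.families.Gaussian(),
                    cov_struct=sm.cov_struct.Exchangeable())
    print(model.fit().summary())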


Hazard Ratio (HR)
In survival analysis, the hazard ratio (HR) is the ratio of the hazard rates corresponding to the conditions described by two levels of an explanatory variable. For example, in a drug study, the treated population may die at twice the rate per unit time as the control population. The hazard ratio would be 2, indicating higher hazard of death from the treatment. Or in another study, men receiving the same treatment may suffer a certain complication ten times more frequently per unit time than women, giving a hazard ratio of 10.

Hazard ratios differ from relative risks in that the latter are cumulative over an entire study, using a defined endpoint, while the former represent instantaneous risk over the study time period, or some subset thereof. Hazard ratios suffer somewhat less from selection bias with respect to the endpoints chosen and can indicate risks that happen before the endpoint.
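Formally, under the proportional hazards assumption the hazard ratio of group 1 to group 2 is constant over time:

HR = \frac{h_1(t)}{h_2(t)}

where h_i(t) is the hazard rate in group i at time t.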


Incidence
The incidence is a measure of the probability of occurrence of a given disease in a population within a specified period of time. We can distinguish cumulative incidence (the number of new cases of a given disease within a specified time period divided by the size of the population initially at risk) from the incidence rate (the number of new cases divided by the person-time at risk over a given time period). If each participant can be followed throughout the period of the study, the incidence rate is the same as the cumulative incidence. They can be used to calculate the prevalence of a disease in the study population.
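As a hypothetical illustration: if 12 of 1,000 initially disease-free people develop the disease during two years of complete follow-up, the cumulative incidence is 12/1000 = 1.2% over two years, and the incidence rate is 12/(1000 × 2) = 0.6 cases per 100 person-years.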


Log rank test
The log-rank test is a hypothesis test to compare the survival distributions of two samples. It is a nonparametric test and appropriate to use when the data are right skewed and censored (technically, the censoring must be non-informative). It is widely used in clinical trials to establish the efficacy of a new treatment in comparison with a control treatment when the measurement is the time to event (such as the time from initial treatment to a heart attack).
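A minimal sketch of the test in Python, assuming the third-party lifelines package; the durations and event indicators below are hypothetical:

    from lifelines.statistics import logrank_test

    durations_A = [5, 8, 12, 3, 9]    # times to event or censoring, group A
    events_A    = [1, 0, 1, 1, 1]     # 1 = event observed, 0 = censored
    durations_B = [6, 10, 14, 7, 11]
    events_B    = [1, 1, 0, 1, 1]

    result = logrank_test(durations_A, durations_B,
                          event_observed_A=events_A,
                          event_observed_B=events_B)
    print(result.p_value)  # small p-value suggests the survival curves differ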
Longitudinal
Involving information about an individual or group at different times throughout a long period.


Mean
The mean of a set of n items of data, x_1, x_2, ..., x_n, is (\sum_{j=1}^n x_j)/n, which is the arithmetic mean of the numbers x_1, x_2, ..., x_n. The mean is usually denoted by placing a bar over the symbol for the variable being measured. If the variable is x the mean is denoted by x̄. If the data constitute a sample from a population, then x̄ may be referred to as the sample mean; it is an unbiased estimate of the population mean.
Median
Suppose that the observations in a set of numerical data are sorted in ascending order. Then the (sample) median is the middle observation. If there is an even number of observations, it is the average of the two middle observations.

So, if there are n observations sorted in ascending order as x_{(1)}, x_{(2)}, ..., x_{(n)}, when n is odd: median = x_{(\frac{1}{2}(n+1))}

When n is even: median = \frac{x_{(\frac{1}{2}n)} + x_{(\frac{1}{2}n+1)}}{2}
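For example, for the ordered data 3, 5, 8, 12, 15 (n = 5, odd) the median is x_{(3)} = 8; appending 20 to give 3, 5, 8, 12, 15, 20 (n = 6, even) yields median = (8 + 12)/2 = 10.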

Mixed Effects model
A mixed model is a statistical model containing both fixed effects and random effects, that is mixed effects. They are particularly useful in settings where repeated measurements are made on the same statistical units (longitudinal study), or where measurements are made on clusters of related statistical units. Because of their advantage to deal with missing values, mixed effects models are often preferred over more traditional approaches such as repeated measures ANOVA.
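A minimal sketch of such a fit in Python, assuming the statsmodels package; the toy longitudinal data and the choice of a random intercept per subject are hypothetical illustrations:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical longitudinal data: two measurements per subject.
    df = pd.DataFrame({
        "subject": [1, 1, 2, 2, 3, 3, 4, 4],
        "time":    [0, 1, 0, 1, 0, 1, 0, 1],
        "y":       [2.0, 2.4, 2.6, 3.1, 2.2, 2.5, 2.9, 3.2],
    })

    # Fixed effect: time. Random effect: an intercept varying by subject,
    # which models the within-subject correlation of repeated measures.
    model = smf.mixedlm("y ~ time", data=df, groups="subject")
    print(model.fit().summary())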
Mortality rate
The mortality rate is a measure of the number of deaths in a population, rather than the number of new cases of disease, per unit of time.


Non-inferiority
A new drug is not inferior to the standard. The new drug must be as good as, or superior to, the standard, even if only slightly so.
Non-parametric test
A test that makes no distributional assumptions about the population under investigation. The adjective ‘non-parametric’ was introduced by Wolfowitz in 1942. Non-parametric tests are usually very simple to perform and often make use of ranks (examples include the Mann–Whitney and Wilcoxon signed-rank tests).
Normal distribution

The distribution of a random variable X for which the probability density function f is given by f(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp(-\frac{(x-\mu)^2}{2\sigma^2}), -\infty < x < \infty. The parameters μ and σ² are, respectively, the mean and variance of the distribution. The distribution is denoted by N(\mu, \sigma^2). If the random variable X has such a distribution, then this is denoted by X ∼ N(\mu, \sigma^2) and the random variable may be referred to as a normal variable.

The graph of f(x) approaches the x-axis extremely quickly, and is effectively zero if |x−\mu| > 3\sigma (hence the three-sigma rule). In fact, P(|X−\mu| < 2\sigma) ≈ 95.5% and P(|X−\mu| < 3\sigma) ≈ 99.7%. The first derivation of the form of f is believed to be that of de Moivre in 1733. The description 'normal distribution' was used by Galton in 1889, whereas 'Gaussian distribution' was used by Karl Pearson in 1905.

The normal distribution is the basis of a large proportion of statistical analysis. Its importance and ubiquity are largely a consequence of the Central Limit Theorem, which implies that averaging almost always leads to a bell-shaped distribution (hence the name 'normal').
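The quoted probabilities can be checked numerically; a minimal sketch using only the Python standard library, where for X ∼ N(\mu, \sigma^2) we have P(|X−\mu| < k\sigma) = erf(k/\sqrt{2}):

    import math

    # P(|X - mu| < k*sigma) for a normal variable, via the error function.
    for k in (1, 2, 3):
        p = math.erf(k / math.sqrt(2))
        print(f"P(|X - mu| < {k} sigma) = {p:.4f}")
    # prints approximately 0.6827, 0.9545 and 0.9973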

Null hypothesis
In statistical inference on observational data, the null hypothesis refers to a general statement or default position that there is no relationship between two measured phenomena. Rejecting or disproving the null hypothesis (and thus concluding that there are grounds for believing that there is a relationship between two phenomena, e.g. that a potential treatment has a measurable effect) is a central task in the modern practice of science, and gives a precise sense in which a claim is capable of being proven false. The null hypothesis is generally assumed to be true until evidence indicates otherwise. It is often denoted H0.


Objective
A thing or class of things external to or independent of the mind.
Odds ratio
Defined as the ratio of the odds of an event happening in two distinct groups. If the probabilities of the event in the two groups are p and q then the odds are p:(1‐p) and q:(1‐q) and the odds ratio is:

\frac{p/(1-p)}{q/(1-q)} = \frac{p(1-q)}{q(1-p)}
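For example, with hypothetical probabilities p = 0.2 and q = 0.1, the odds are 0.2/0.8 = 0.25 and 0.1/0.9 ≈ 0.111, giving an odds ratio of 0.25/0.111 = 2.25; note that this is larger than the relative risk p/q = 2.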


P-value
The p-value is a function of the observed sample results that is used for testing a statistical hypothesis. Before performing the test, a threshold value is chosen, called the significance level of the test, traditionally 5% and denoted by α. If the p-value is equal to or smaller than the significance level (α), the observed data are deemed inconsistent with the assumption that the null hypothesis is true, and the null hypothesis is rejected in favour of the alternative hypothesis. When the p-value is calculated correctly, such a test is guaranteed to control the Type I error rate to be no greater than α.
Parametric test
A statistical test that depends upon assumption(s) about the distribution of the data (e.g. that these are normally distributed). Parametric statistics is a branch of statistics which assumes that the data have come from a type of probability distribution and makes inferences about the parameters of the distribution.
Paired data
Two data sets are "paired" when observations are performed on the same samples or subjects, that is, when a one-to-one relationship exists between values in the two data sets. An example of paired data would be a before-and-after drug test. The researcher might record the blood pressure of each subject in the study before a drug is administered and then measure again after administration. These measurements would be paired data, since each "before" measure is related only to the "after" measure from the same subject.
Placebo
A medication or procedure that is inert (i.e. one having no pharmacological effect) but intended to give patients the perception that they are receiving treatment or assistance for their complaint. From Latin placebo: "I shall please."
Poisson regression
Poisson regression is a form of regression analysis used to model count data and contingency tables. Poisson regression assumes the response variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modelled by a linear combination of unknown parameters. A Poisson regression model is sometimes known as a log-linear model, especially when used to model contingency tables.
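With a single explanatory variable x, the model can be written as

\log(E[Y \mid x]) = \beta_0 + \beta_1 x

so that E[Y \mid x] = e^{\beta_0} e^{\beta_1 x}, and e^{\beta_1} is the multiplicative change in the expected count for a one-unit increase in x.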
Prospective
Prospective literally means "looking forward". It can also refer to an event that is likely or expected to happen in the future. A prospective cohort study is a cohort study that follows over time a group of similar individuals (cohorts) who differ with respect to certain factors under study, to determine how these factors affect rates of a certain outcome.
Power
The probability of accepting the alternative hypothesis when it is in fact true.


Qualitative
Of or relating to quality or qualities; measuring, or measured by, the quality of something. In later use often contrasted with quantitative. A nominal or categorical variable is a variable whose values are not numerical. Examples include gender (male, female), paint colour (red, white, blue) and type of bird (duck, goose, owl). The categories are not measured but rather named or ranked to distinguish them from each other.
Quantitative
That is, or may be, measured or assessed with respect to or on the basis of quantity; that may be expressed in terms of quantity; quantifiable.


Randomisation
The process of making something random. This often applies to the random assignment of participants to different experimental conditions.
Recruitment
The action or process of enlisting new patients or subjects.
Relative Risk
In statistics and epidemiology, relative risk (RR) is the ratio of the probability of an event occurring (for example, developing a disease, being injured) in an exposed group to the probability of the event occurring in a comparison, non-exposed group. Relative risk includes two important features: (i) a comparison of risk between two "exposures" puts risks in context, and (ii) "exposure" is ensured by having proper denominators for each group representing the exposure.
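As a hypothetical illustration: if 10% of an exposed group and 5% of a non-exposed group develop the disease, the relative risk is 0.10/0.05 = 2, i.e. the exposed group is twice as likely to develop the disease.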
Retention
The action or fact of continuing to observe, treat, or recognize a patient or subject.
Retrospective
Retrospective generally means to take a look back at events that have already taken place (retrospective cohort study, also called a historic cohort study).
Risk difference
Risk difference is a way of measuring the size of a difference between two treatments. It is defined from the 2×2 table of outcome by treatment group:

             Group 1   Group 2
Cases           a         b
Non-cases       c         d

as the risk of the outcome in group 1 minus the risk in group 2:

\frac{a}{a+c} - \frac{b}{b+d}
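With hypothetical counts a = 10, c = 90 in group 1 and b = 5, d = 95 in group 2, the risks are 10/100 = 0.10 and 5/100 = 0.05, giving a risk difference of 0.10 − 0.05 = 0.05, i.e. 5 percentage points.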


Sample Size
The number of observations in a sample.
Standard Deviation
The square root of the variance of observed results.
Standard Error (SE)
Standard Error (SE) is the standard deviation of the mean rather than of the observed results. More rigorously, the standard error is the square root of the variance of a statistic. For example, the standard error of the mean of a sample of n observations taken from a population with variance σ² is \frac{\sigma}{\sqrt{n}}. The same term is used for the corresponding sample estimate, \frac{s}{\sqrt{n}}. For n > 1 the standard error of the mean is always smaller than the standard deviation.
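For example, a sample of n = 25 observations with sample standard deviation s = 10 has an estimated standard error of the mean of 10/\sqrt{25} = 2.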
Standardised incidence ratio
The standardised incidence ratio is used for analysis of the regional distribution of disease frequency, in addition to age-standardised incidences. It is calculated as the quotient of the observed and the expected number of cases.
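For example, with hypothetical figures: if 30 cases are observed in a region where 24 would be expected from age-standardised reference rates, the standardised incidence ratio is 30/24 = 1.25.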
A group of people who associate for some reason or purpose other than personal preference.
Standardised mortality ratio
The standardised mortality ratio (SMR) is a quantity, expressed as either a ratio or a percentage, quantifying the increase or decrease in mortality of a study cohort with respect to the general population.
Statistical Significance
The statistical significance (or statistically significant result) is attained when a p-value is less than a predetermined threshold, usually 0.05 (5%).
Stratification
The separation of a population sample in a study into defined groups, such as by age or sex.
Subjective
Of, relating to, or proceeding from an individual's thoughts, views, etc.; derived from or expressing a person's individuality or idiosyncrasy; not impartial or literal; personal, individual.


Type-I error
A type I error is the incorrect rejection of a true null hypothesis (a "false positive"), while a type II error is the failure to reject a false null hypothesis (a "false negative"). More simply stated, a type I error is detecting an effect that is not present, while a type II error is failing to detect an effect that is present.
Type-II error
A type II error is the failure to reject H0 when H0 is false; its probability is denoted by β. The power of a test is 1 − β: the probability of correctly rejecting H0 when it is false, which also translates as the probability of correctly accepting the alternative hypothesis H1 when it is true.


Variance
A measure of the variability of the data set. For data x_1, x_2, ..., x_n, the variance, \sigma^{2}, is defined to be:

\frac{1}{n}\sum_{j=1}^{n}(x_{j}-\bar{x})^{2}=\frac{1}{n}(\sum_{j=1}^{n}x_{j}^{2} - n\bar{x}^{2}) = \frac{1}{n}\{\sum_{j=1}^{n}x_{j}^{2}-\frac{1}{n}(\sum_{j=1}^{n}x_{j})^{2}\}

where \bar{x} is the mean given by:

\bar{x} = \frac{1}{n}\sum_{j=1}^{n}x_{j}

The variance is never negative and can be zero only if all the data values are the same.

This is correct if the data set effectively constitutes the entire population; for example, if the values x_1, x_2,... are the diameters of the planets of the solar system, or the lifetimes of all known patients with a rare disease. However, if the data constitute a random sample from a population, and we are interested in the variance of the values in the population, as opposed to the variance of the values in the sample, then it is appropriate to use the divisor (n−1), since this leads to an unbiased estimate of the population variance. This sample variance is given by:

s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_j-\bar{x})^2


If the data are summarised as a frequency table, with the distinct values x_1, x_2, ..., x_m occurring with frequencies f_1, f_2, ..., f_m (so that \sum_{j=1}^{m} f_j = n), the sample variance becomes:

s^2 = \frac{1}{n-1}\sum_{j=1}^{m}f_j(x_j-\bar{x})^2
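Both divisors are available in the Python standard library; a minimal sketch with hypothetical data:

    import statistics

    data = [4.0, 7.0, 9.0, 12.0]        # mean = 8.0
    print(statistics.pvariance(data))   # divisor n: population variance, 8.5
    print(statistics.variance(data))    # divisor n - 1: sample variance, 11.333...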