Beyond normality: the bootstrap method for hypothesis testing

tl;dr: Parametric bootstrap methods can be used to test hypothesis and calculate p values while assuming any particular population distribution we may want. Non-parametric bootstrapping methods can be used to test hypotheses and calculate p values without having to assume any particular population as long as the sample can be assumed to be representative of the population and one can transform the data adequately to take into account the null hypothesis. The p values from bootstrap methods may differ from those from classical methods, especially when the assumptions of the classical methods do not hold. The different methods of calculation can push a p value beyond the 0.05 threshold which means that statements of statistical significance are sensitive to all the assumptions used in the test.


In this article I show how to use parametric and non-parametric bootstrapping to test null hypotheses, with special emphasis on situations when the assumption of normality may not hold. To make it more relevant, I will use real data (from my own research) illustrate the application of these methods. If you get lost somewhere in this article, you may want to take a look at my previous post, where I introduced the basic concepts behind hypothesis testing and sampling distributions. As in the previous post, the analysis will be done in R, so before we get into the details, it is important to properly setup our R session:

for(name in c("ggplot2", "plotly","furrr", "distr6"))
  library(name, character.only = TRUE)
plan(multiprocess) # Turns on parallel computation
set.seed(2019) # Reproducible Monte Carlo simulation

The data I will use consists of measurements of individual plant biomass (i.e. the weight of a plant after we have remove all the water) exposed to a control treatment (C), drought (D), high temperature (HT) and high temperature and drought (HTD). First, let’s take a look at the data:

Biomass = data.frame(Treatment = rep(c("C", "D", "HT", "HTD"), each = 18),
                 Biomass = c(2.03,  4.49,   3.84,   2.66,   7.4,    3.04,   2.63,   7,  5.84,   6.99,   4.15,   5.74,   10.49,  23.3,   14.21,  16.97,  11.56,  17.94, 6.01,    6.94,   6.05,   5.23,   2.47,   6.24,   3.96,   4.47,   2.35,   4.37,   3.33,   6.04,   7.98,   11.44,  10.02,  9.64,   11.19,  12.71, 5.22,    4.61,   7.58,   4.7,    6.68,   4.88,   4.11,   4.28,   5.77,   1.54,   2.79,   7.64,   8.68,   7.68,   12, 7.06,   9.9,    17.94, 3.8, 3.8,    5.14,   6.06,   2.78,   2.63,   3.91,   4.65,   5.62,   4.5,    4.45,   5.44,   8.53,   5.59,   6.14,   4.92,   6.54,   7.01))

p = ggplot(data = Biomass, aes(x = Treatment, y = Biomass, colour = Treatment)) + geom_point() + geom_jitter(width = 0.25) + geom_violin(fill = NA) + theme(legend.position = "none")

I want to answer the classic question of whether the treatments are decreasing the individual plant biomass. There are different ways of addressing this question but, for the purpose of this article, I will focus on hypothesis testing as this is the most common approach in plant research. Remember that in hypothesis testing the skeptical position (i.e. no effect) is taken as the null hypothesis and the goal is to calculate the probability that, if the null hypothesis is true, we will observe the data collected in the experiment.

The main issue with these data is that they do not appear to be samples from Normal distributions. To be more formal, we can use a normality test to check this claim, such as the Shapiro-Wilk test:

group_by(Biomass, Treatment) %>% summarise(test = shapiro.test(Biomass)$p.value)
## # A tibble: 4 x 2
##   Treatment   test
##   <fct>      <dbl>
## 1 C         0.0154
## 2 D         0.269 
## 3 HT        0.0275
## 4 HTD       0.924

We can see that, at least for the control and high temperature treatment, it is not very likely to get these kind of data if the variation across replicates was normally distributed (i.e. the assumption of normality does not hold). This actually makes sense as these data (i) cannot be negative (while the Normal distribution allows for negative values) and (ii) the measurements were performed on young plants in the exponential phase of growth, which would tend to produce Log-Normal distributions. A Log-Normal distribution is the distribution of a random variable which logarithm is normally distributed (in other words, a Log-Normal is the result of applying an exponential function to a Normal distribution). Indeed, if I log-transform the data and run the tests again, the p-values are much higher:

group_by(Biomass, Treatment) %>% summarise(test = shapiro.test(log(Biomass))$p.value)
## # A tibble: 4 x 2
##   Treatment  test
##   <fct>     <dbl>
## 1 C         0.704
## 2 D         0.556
## 3 HT        0.751
## 4 HTD       0.829

which means that the log-transformed data is compatible with the assumption of normality. At this point the usual path is to either work with the orignal scale and address the question using t test (hoping that it is robust enough) or to log-transform the data and apply the t test on the log scale. As will be shown below, whther we test on the original scale or the log scale can lead to differences in the p values.

In this article I will show alternative methods to test differences in means. These methods are known as bootstrapping methods and they come in many flavours. Here I will focus on the parametric bootstrap and non-parametric bootstrap (when people just bootstrap, without adjective, it generally means the non-parametric version). In the final section, the result from these methods will be compared to the t applied to the original and log scale to illustrate the impact that these choices can have on the results.

Parametric bootstrap

The idea of the parametric bootstrap was discussed in the previous post. The basic idea is that one needs to (i) assume a particular distribution to describe the population from which the data could have been sampled, (ii) estimate the values of the parameters of the distribution using the observed data and null hypothesis and (iii) generate the sampling distribution of any statistic through Monte Carlo simulation.

As discussed above, a reasonable distribution for the individual plant biomass would be a Log-Normal distribution. If a random variable \(x\) follows a Log-Normal distribution, then \(\text{ln}(x)\) follows a Normal distribution. Therefore, the easiest way to implement this distribution (and therefore how it is generally done) is to take the function for the Normal distribution and substitute \(x\) for \(\text{ln}(x)\). The problem is that the resulting function is parameterized in terms of mean and standard deviation of \(\text{ln}(x)\), but we are interested in the mean of \(x\). To express the Log-Normal in terms of the \(x\) we need the following reparameterization:

\[ \begin{align} \mu_x &= \text{exp}\left(\mu_{lx} + \sigma_{lx}^2/2 \right) \\ \sigma_x^2 &= \left[\text{exp}(\sigma_{lx}^2) - 1 \right]\text{exp}(2\mu_{lx} + \sigma_{lx}^2) \end{align} \]

where the subscript “\(_{x}\)” and “\(_{lx}\)” refer to the original scale and the log scale of \(x\), respectively. As discussed in the previous post, different methods may be used to estimate the parameters of a distribution. In this case, the estimators for \(\mu_x\) and \(\sigma_x\) can be derived from the maximum likelihood estimators of \(\mu_{lx}\) and \(\sigma_{lx}\) by just applying the reparameterization, that is:

\[ \begin{align} \hat{\mu_x} &= \text{exp}\left(\hat{\mu_{lx}} + \hat{\sigma_{lx}}^2/2 \right) \\ \hat{\sigma_x}^2 &= \left[\text{exp}(\hat{\sigma_{lx}}^2) - 1 \right]\text{exp}(2\hat{\mu_{lx}} + \hat{\sigma_{lx}}^2) \end{align} \]


\[ \begin{align} \hat{\mu_{lx}} &= \frac{\sum_k \text{ln} x_k}{n}\\ \hat{\sigma_{lx}} &= \frac{\left( \text{ln} x_k - \hat{\mu}\right)^2}{n} \end{align} \]

The R implementation of these formulae are as follows:

mu_x_hat = function(x) {
  mu_hat = mu_hat(x)
  sigma_hat = sigma_hat(x)
  exp(mu_hat + sigma_hat^2/2)
sigma_x_hat = function(x) {
  mu_hat = mu_hat(x)
  sigma_hat = sigma_hat(x)
  sqrt((exp(sigma_hat^2) - 1)*exp(2*mu_hat + sigma_hat^2))
mu_hat= function(x) mean(log(x))
sigma_hat = function(x) sd(log(x))

Note that this code calculates the maximum likelihood estimators for the mean and standard deviation of the Log-Normal distribution. If we had use another distribution we would have to use different formulae. You can find these formulae online (Wikipedia often has good information on many distributions) or you can use an optmization approach for more complex cases (but that is for another post).

The null hypothesis specifies that the mean of the populations from which the data were sampled should be equal. Since we are comparing each treatment to the control, it makes sense to calculate this mean from the control data (an alternative approach is to use the mean of each pair of treatments being compared). Hence, the estimates of the parameters for the population associated to the control are:

Biomass_C = filter(Biomass, Treatment == "C")$Biomass
mu_x_hat_C = mu_x_hat(Biomass_C)
sigma_x_hat_C = sigma_x_hat(Biomass_C)

The other populations will have the mean set to mu_x_hat_C, but the standard deviation is calculated from the samples:

Biomass_D = filter(Biomass, Treatment == "D")$Biomass
sigma_x_hat_D = sigma_x_hat(Biomass_D)
Biomass_HT = filter(Biomass, Treatment == "HT")$Biomass
sigma_x_hat_HT = sigma_x_hat(Biomass_HT)
Biomass_HTD = filter(Biomass, Treatment == "HTD")$Biomass
sigma_x_hat_HTD = sigma_x_hat(Biomass_HTD)

We now construct the population models using the distr6 package:

popC = Lognormal$new(mean = mu_x_hat_C, sd = sigma_x_hat_C)
popD = Lognormal$new(mean = mu_x_hat_C, sd = sigma_x_hat_D)
popHT = Lognormal$new(mean = mu_x_hat_C, sd = sigma_x_hat_HT)
popHTD = Lognormal$new(mean = mu_x_hat_C, sd = sigma_x_hat_HTD)

By sampling from each population we can compute the sampling distribution of the difference of sample means (the future_map_dbl is equivalent to a for loop in parallel):

N = 5e4
sample_size = 18
sample_mean_C = future_map_dbl(1:N, ~ mean(popC$rand(sample_size)))
sample_mean_D = future_map_dbl(1:N, ~ mean(popD$rand(sample_size)))
sample_mean_HT = future_map_dbl(1:N, ~ mean(popHT$rand(sample_size)))
sample_mean_HTD = future_map_dbl(1:N, ~ mean(popHTD$rand(sample_size)))

Which are then compared to the differences observed in the experiment:

effect_mean_D = mean(Biomass_D) - mean(Biomass_C)
effect_mean_HT = mean(Biomass_HT) - mean(Biomass_C)
effect_mean_HTD = mean(Biomass_HTD) - mean(Biomass_C)

And the p values are calculated as the fraction of casers where the different is more extreme:

pval_param_effect_mean_D = sum(sample_mean_D - sample_mean_C <= effect_mean_D)/N
pval_param_effect_mean_HT = sum(sample_mean_HT - sample_mean_C <= effect_mean_HT)/N
pval_param_effect_mean_HTD = sum(sample_mean_HTD - sample_mean_C <= effect_mean_HTD)/N

Non-parametric bootstrap

In non-parametric boostrap methods one tries to approximate the sampling distribution directly from the data, without making any explicit assumption about the distribution of the population from which the data could have been sampled. The procedure is to create alternative samples with the same size as the one observed by randomly sampling from the current sample with replacement (i.e. you allow every value to be sample more than once). This is not a capricious choice: if you do not replace the values that are being sampled you will just end up with the original sample in a different order!

There is an important implicit assumption in bootstrapping: that the sample is representative of the population in terms of its statistical properties (e.g. mean, variance, kurtosis, etc). This means that bootstrapping is more reliable the larger the sample size, as statistical properties will tend to stabilize as sample size increases.

Bootstrapping is most often used to calculated confidence intervals of parameters (e.g. coefficients of a regression model), but they can also be used for hypothesis testing provided that we do not introduce in the calculations any more assumptions than the null hypothesis and the assumptions mentioned above. Since the null hypothesis specifies that the data are being sampled from populations with the same mean, we need to transform the data such that the bootstrap method can produce sampling distributions in agreement with this hypothesis. One way to achieve this is by substracting from each sample its own mean and adding the pooled mean. That is:

mean_Biomass_C_D = mean(c(Biomass_C, Biomass_D))
mean_Biomass_C_HT = mean(c(Biomass_C, Biomass_HT))
mean_Biomass_C_HTD = mean(c(Biomass_C, Biomass_HTD))

boot_mean_Biomass_C_D = Biomass_C - mean(Biomass_C) + mean_Biomass_C_D
boot_mean_Biomass_C_HT = Biomass_C - mean(Biomass_C) + mean_Biomass_C_HT
boot_mean_Biomass_C_HTD = Biomass_C - mean(Biomass_C) + mean_Biomass_C_HTD
boot_mean_Biomass_D = Biomass_D - mean(Biomass_D) + mean_Biomass_C_D
boot_mean_Biomass_HT = Biomass_HT - mean(Biomass_HT) + mean_Biomass_C_HT
boot_mean_Biomass_HTD = Biomass_HTD - mean(Biomass_HTD) + mean_Biomass_C_HTD

Notice that there are now three corrected means for the control as we are doing three tests. The function sample gives a sample with replacement of a given size. In order to compute the three sampling distributions we need to apply bootstrapping six times (one for every corrected sample):

N = 5e4
boot_sample_mean_C_D = future_map_dbl(1:N, ~ mean(sample(boot_mean_Biomass_C_D, sample_size,replace = TRUE)))
boot_sample_mean_C_HT = future_map_dbl(1:N, ~ mean(sample(boot_mean_Biomass_C_HT, sample_size,replace = TRUE)))
boot_sample_mean_C_HTD = future_map_dbl(1:N, ~ mean(sample(boot_mean_Biomass_C_HTD, sample_size,replace = TRUE)))
boot_sample_mean_D = future_map_dbl(1:N, ~ mean(sample(boot_mean_Biomass_D, sample_size,replace = TRUE)))
boot_sample_mean_HT = future_map_dbl(1:N, ~ mean(sample(boot_mean_Biomass_HT, sample_size,replace = TRUE)))
boot_sample_mean_HTD = future_map_dbl(1:N, ~ mean(sample(boot_mean_Biomass_HTD, sample_size,replace = TRUE)))

And the p values are calculated as in the parametric case but using the new samples from the non-parametric bootstrap:

pval_nonparam_effect_mean_D = sum(boot_sample_mean_D - boot_sample_mean_C_D <= effect_mean_D)/N
pval_nonparam_effect_mean_HT = sum(boot_sample_mean_HT - boot_sample_mean_C_HT <= effect_mean_HT)/N
pval_nonparam_effect_mean_HTD = sum(boot_sample_mean_HTD - boot_sample_mean_C_HTD <= effect_mean_HTD)/N

Comparison with classical methods

The p values for the one sided Welch two sample t test applied to original and log transformed scales can be calculated with the t.test function:

pval_t_D = t.test(Biomass_C, Biomass_D, alternative = "greater")$p.value
pval_t_HT = t.test(Biomass_C, Biomass_HT, alternative = "greater")$p.value
pval_t_HTD = t.test(Biomass_C, Biomass_HTD, alternative = "greater")$p.value
pval_t_log_D = t.test(log(Biomass_C), log(Biomass_D), alternative = "greater")$p.value
pval_t_log_HT = t.test(log(Biomass_C), log(Biomass_HT), alternative = "greater")$p.value
pval_t_log_HTD = t.test(log(Biomass_C), log(Biomass_HTD), alternative = "greater")$p.value

Finally, the p values for the different methods and treatments are:

Parametric Non Parametric Original t Log t
D 0.177 0.151 0.159 0.337
HT 0.202 0.178 0.190 0.342
HTD 0.042 0.016 0.021 0.065

As can be seen above, the t test is actually quite robust to deviations from normality and the results in the original scale are quite clsoe to the results from both bootstrapping methods. This is less the case when applied to log tranformed data, where p values were generally larger, even though we did the transformation to “help” the t test.

Regarding the two bootstrapping methods, the parametric one is usually more flexible in calculating p values as it is always possible to implement the null hypothesis in terms of parameters of the population distribution whereas in the non-parametric case this is more difficult. Also, the non-parametric approach generally requires more data than the parametric approach to produce reliable estimates. Of course, the price we pay is that the parametric method makes more assumptions about the data. If the assumptions are not reasonable, results can be less reliable than for the non-parametric case.

Some remarks about p < 0.05

Notice than in the high temperature + drought treatment (HTD), the choice of test and whether to transform the data or not resulted in p values on different sides of the magical threshold of 0.05. This is one of the many reasons why it is not a great idea to dichotomize your thinking according to whether the p value happens to fall below a threshold or not. The problem is that small technical details (that are just assumptions, so they cannot be proven to be right or wrong and are often a matter of personal preference) will push the p value across the threshold. Therefore, if you base your scientific conclusion about an effect on whether p < 0.05 or not, your scientific reasoning is as robust or objective as you may think. Also, p hacking your way into significance becomes very tempting, specially when getting significant results has an impact on your career. So, perhaps, you should stop interpreting your results in terms of whether p < 0.05 or not and also stop using the term statistical significance. That is actually the opinion of the American Statistical Association (see statements from 2016 and 2019).

Then, what should we use as alternative? It is not at all clear but most suggestions point at the need to have a more quantitative and continuous view of statistical inference. One option is to supplement the reported p values with estimates of the effect of the treatment (i.e. by how much was biomass decreased by each treatment) and, very importantly, the uncertainty in these estimates. As with p values, confidence intervals can also be calculated using classical mathematical formulae, parametric and non-parametric bootstrapping methods. I will show how to do this in a future post, stay tuned!

Alejandro Morales' Blog
Alejandro Morales' Blog
Docent Computational Crop Ecophysiology