To generate a sample from a distribution we use functions with names that start with “r”.
Run the following code line to read the description of the functions.
?rnorm
?rbinom
?rt
?runif
For example, the following code will generate a random sample of size \(10\) from normal distribution \(N(10,5^2)\)
rnorm(n = 10, mean = 10, sd = 5)
## [1] 13.526189 7.407752 9.747115 8.478279 10.737673 11.421818 12.787931
## [8] 11.078982 18.177772 8.676007
and the following code will generate a random sample of size \(5\) from uniform distribution \(Unif(-5,5)\).
runif(n = 5, min = -5, max = 5)
## [1] 4.7700697 -0.9578752 1.9721777 -2.8716686 4.3980985
Now, try to generate a normal sample several times. Note that the generation process in R is random, so every time you get a new sample!
rnorm(n = 10, mean = 10, sd = 5)
## [1] -0.2861615 5.6289344 0.4543516 14.9408064 6.4048491 14.0050222
## [7] 9.6469391 1.8574426 20.1448480 9.3619645
rnorm(n = 10, mean = 10, sd = 5)
## [1] -1.494979 4.612990 14.476594 16.594731 9.018690 7.131331 12.952268
## [8] 8.826710 20.168714 12.100100
rnorm(n = 10, mean = 10, sd = 5)
## [1] 9.801491 12.520682 4.442353 11.609474 5.677155 2.491124 8.858177
## [8] 7.751763 3.511956 8.032030
To create a random sample that can be reproduced we use
set.seed(...)
function, which sets the “seed” of the random
number generator in R. Run the following code and make sure that you get
exactly the same results as in this tutorial.
set.seed(0)
rnorm(n = 10, mean = 10, sd = 5)
## [1] 16.314771 8.368833 16.648996 16.362147 12.073207 2.300250 5.357165
## [8] 8.526398 9.971164 22.023267
In this part we will illustrate the Central Limit Theorem for samples generated from uniform distribution \(Unif(0,10)\).
Recall that the density curve of this distribution looks like a “step” of width \(10\) and height \(0.1\).
Let’s generate one large sample (of size \(1000\)) form \(Unif(0,10)\) and plot the distribution histogram of the sample. It should look similar to the theoretical density.
sample = runif(n = 1000, min = 0, max = 10)
hist(sample, freq = FALSE, breaks = 10, xlab = "value", ylab = "density")
Also, recall that the theoretical values of the expectation and variance for a random variable \(X\sim Unif(0,10)\) are \(\mu=E(X) = \frac{10-0}{2} = 5\) and \(\sigma^2=Var(X) = \frac{(10-0)^2}{12} = 8.3333\).
mu = (10 - 0)/2
mu
## [1] 5
sigma2 = (10 - 0)^2/12
sigma2
## [1] 8.333333
Let’s check if these values are consistent with the sample mean and variance.
mean(sample)
## [1] 5.000792
var(sample)
## [1] 8.355622
According to CLT, for large \(n\) the distribution of the sample mean \(\bar{X}\) is approximately normal \(N(\mu, \frac{\sigma^2}{n})\), which, in our case, should be \(N(5, \frac{8.3333}{1000})\).
In general, to find the values of a density curve, use functions that start with “d”, e.g.
?dnorm
?dbinom
?dt
?dunif
To plot the theoretical CLT distribution we first select a grid of values where the density curve will be evaluated.
grid = seq(4.5, 5.5, 0.01)
We use dnorm(...)
function to find the value of the
density curve of \(N(5,
\frac{8.3333}{1000})\) at each point of the grid.
density = dnorm(x = grid, mean = mu, sd = sqrt(sigma2/1000))
Finally, we can plot the density curve.
plot(x = grid, y = density, type = 'l', xlab = "value", ylab = "density")
Let’s use the random sample generation to test the CLT statement for \(n = 1000\). The following code generates one sample of size \(1000\) form \(Unif(0,10)\), and computes the mean of the sample.
mean(runif(n = 1000, min = 0, max = 10))
## [1] 4.877456
Again, if you re-run the above line multiple times you will get different values of the sample mean (because every time R generates a sample at random), but all of these values will be close to the theoretical expectation \(\mu = 5\).
mean(runif(n = 1000, min = 0, max = 10))
## [1] 4.917563
mean(runif(n = 1000, min = 0, max = 10))
## [1] 4.958312
mean(runif(n = 1000, min = 0, max = 10))
## [1] 5.076849
Instead of copy-pasting the code three times, we can use
replicate(...)
function. Read the description of this
function first:
?replicate
Essentially, replicate(n = ..., expr = ...)
function
re-runs the code in the expr
argument n
times.
For instance, the following line will return 500 sample means, where
each of the sample was generated from the uniform distribution.
sample_means = replicate(n = 500, expr = mean(runif(n = 1000, min = 0, max = 10)))
sample_means
## [1] 5.217300 5.165422 4.843427 5.018413 4.933646 5.076321 4.967200 4.927972
## [9] 4.961912 5.005764 5.085939 4.775069 5.064276 5.088605 5.104516 4.983642
## [17] 4.997396 4.905318 4.942314 4.872744 4.944496 4.980214 5.134307 4.843706
## [25] 5.024972 5.011016 5.072923 5.041315 5.077371 5.019578 5.013767 4.930861
## [33] 5.063024 4.846047 4.950929 5.192180 5.078764 5.103997 4.992412 4.921227
## [41] 5.016025 4.968691 5.075156 5.132415 5.030373 5.094945 4.976904 4.915074
## [49] 5.008027 4.982038 5.083267 5.101863 5.025579 4.954598 4.898961 5.084326
## [57] 5.008723 5.047986 5.159624 5.033511 5.036293 5.059429 4.934431 4.933351
## [65] 4.924253 5.069015 5.052752 5.057084 4.876199 4.925773 4.908488 4.998751
## [73] 5.036396 4.947899 4.905555 4.940761 4.970123 4.938482 4.963757 5.101469
## [81] 5.004796 4.953499 5.076905 4.921984 4.954352 4.936210 4.752722 5.027816
## [89] 5.121969 4.921153 5.125787 4.938181 4.988424 4.905817 4.993227 4.937918
## [97] 5.121506 4.925657 5.233869 4.945638 4.918882 4.996280 5.007090 4.941755
## [105] 4.965421 5.027092 5.152457 5.021611 4.769466 4.927260 4.914808 5.138121
## [113] 5.062017 4.921519 5.003714 5.038364 5.027586 5.065542 5.065774 4.968919
## [121] 4.935020 5.054431 5.005871 4.908535 4.924019 4.855042 5.076179 5.105352
## [129] 5.236473 5.016073 5.070181 4.864323 4.978121 5.090352 4.885587 5.000323
## [137] 5.027487 4.992319 4.832015 5.188923 5.131564 4.965701 4.908943 5.066627
## [145] 5.143136 5.026201 4.887138 5.081536 4.834420 4.992977 5.056549 4.952011
## [153] 5.007318 4.999229 5.100085 4.974104 4.928915 4.898238 4.971568 4.932503
## [161] 4.973324 4.819234 4.993121 4.975160 5.000618 5.070849 4.922183 4.961617
## [169] 5.133215 4.956828 4.909355 5.087964 4.932740 5.043739 5.195457 4.926687
## [177] 5.167399 4.994013 4.972398 5.055593 5.172230 4.932467 4.892969 5.118069
## [185] 5.043270 4.988469 5.070476 5.085567 5.012015 5.111663 4.909064 5.002360
## [193] 4.935074 5.093028 5.082830 4.892612 5.023616 5.092829 4.925345 4.854136
## [201] 5.047619 4.871046 5.076407 4.863017 5.153168 4.985827 5.125810 4.960321
## [209] 4.897078 5.001057 4.992500 5.077621 5.067592 4.966761 4.930875 4.947610
## [217] 4.947559 4.854102 5.006722 5.019524 4.893176 5.099437 5.078081 5.140764
## [225] 5.036665 4.957833 5.064297 5.091125 5.051014 5.072620 4.903325 4.861176
## [233] 5.054072 4.982208 4.954110 5.065653 5.022430 4.933200 4.974203 4.986053
## [241] 4.980907 4.845894 4.948996 4.915864 4.968373 5.041304 4.894291 4.992567
## [249] 4.918798 4.949758 5.000819 5.021279 4.994410 5.077585 5.044712 5.139652
## [257] 4.964550 4.953792 4.947632 5.020363 4.926986 4.857973 4.911075 5.070090
## [265] 5.021562 4.981289 4.971433 4.895640 4.827018 5.040485 4.889009 5.243656
## [273] 5.110604 4.923836 5.053839 5.105188 4.985393 4.966379 5.126044 4.939572
## [281] 4.995708 5.093679 4.921617 4.997481 5.005913 5.041243 4.970092 4.945560
## [289] 4.980700 5.066500 5.144286 5.094730 4.916318 4.981146 4.969164 5.051020
## [297] 5.016804 5.018206 5.036075 4.982061 4.952599 4.874060 5.139154 4.933238
## [305] 5.016511 4.946952 4.939957 5.170887 4.908628 5.004941 5.116198 4.984626
## [313] 5.118785 5.009919 4.873520 5.082115 4.949654 4.995518 5.030741 4.894851
## [321] 4.931442 4.865323 4.923871 5.013920 4.977341 5.117683 4.927689 4.994578
## [329] 5.028538 4.761615 5.034992 4.984235 5.042829 5.012016 4.990987 5.146849
## [337] 5.038041 5.007875 4.888911 4.978769 5.125958 4.863964 4.893115 5.033412
## [345] 5.016475 4.825253 5.049961 4.968168 5.026785 4.994738 5.014517 5.014009
## [353] 5.051825 5.057331 4.868613 4.944798 4.873318 4.988611 4.867318 4.940251
## [361] 4.924199 5.015969 4.943485 5.107968 4.867433 5.046453 5.057596 5.006065
## [369] 4.975251 4.983702 4.987819 4.955362 4.796693 4.896598 4.889347 5.075057
## [377] 4.973138 5.059176 5.124781 5.016600 5.105762 4.890326 5.174030 5.014188
## [385] 5.034884 5.012346 5.060907 5.082269 4.797307 4.911095 5.026603 5.111104
## [393] 5.018674 4.989036 5.093952 5.127597 5.045923 5.040408 4.988382 4.941080
## [401] 5.028816 4.955356 4.931682 4.914562 4.993173 4.839455 4.912091 5.117946
## [409] 4.879543 4.996856 4.889823 4.977463 5.033807 5.057309 5.032523 5.099290
## [417] 4.984115 4.859653 4.938532 4.890057 4.846291 4.857286 5.105378 5.040327
## [425] 5.116357 5.015412 5.126495 4.980972 4.906177 4.776651 5.024350 4.964632
## [433] 5.235392 4.957219 4.846627 4.986407 4.786100 4.904838 4.834393 5.064527
## [441] 4.986290 4.741013 5.072512 5.100946 4.973761 5.037719 4.935505 5.095819
## [449] 4.800759 4.881825 4.939409 5.046525 5.036902 4.959067 5.084667 4.887347
## [457] 4.887914 5.069749 4.812720 4.791191 5.087173 5.069333 4.993685 5.064367
## [465] 5.168320 5.067669 5.095691 4.820447 5.104252 5.004527 4.992027 5.128721
## [473] 4.846729 4.983092 4.969123 5.111080 4.945395 4.995962 5.035266 5.189729
## [481] 4.894512 5.064435 4.962645 4.908236 5.176564 5.054324 5.026688 5.058799
## [489] 5.009483 5.043745 4.856548 5.074596 5.169184 5.056804 5.006073 5.005020
## [497] 5.101635 5.025548 5.063826 4.984970
Let’s plot the density histogram for the sample means (we use
xlim =
argument to set x-axis limits and make them
compatible to the ones in the density curve plot).
hist(sample_means, freq = FALSE, breaks = 15, xlim = c(4.5, 5.5))
You can also add the CLT density curve to the plot using
lines(...)
function.
hist(sample_means, freq = FALSE, breaks = 15, xlim = c(4.5, 5.5))
lines(x = grid, y = density)
The density histogram looks like a normal distribution!
Finally, we check that the expectation and variance of the distribution histogram are consistent with \(\mu =5\) and \(\frac{\sigma^2}n=\frac{8.3333}{1000}\).
mean(sample_means)
## [1] 4.996577
var(sample_means)
## [1] 0.008113006
To find a probability \(P(X\leq q)\) we use functions that start with “p”.
?pnorm
?pbinom
?pt
?punif
For example, the following function will find the probability \(P(X\leq -1)\) where \(X\sim N(10, 5^2).\) (Use the normal distribution table form Quercus to check this result)
pnorm(q = -1, mean = 10, sd = 5)
## [1] 0.01390345
If you want to find \(P(X\geq q)\)
instead, set lower.tail = FALSE
.
pnorm(q = -1, mean = 10, sd = 5, lower.tail = FALSE)
## [1] 0.9860966
To solve the reverse problem (find a quantile of a distribution, e.g. value \(a\) such that \(P(X\leq a) = 0.25\)) we use functions that start with “q”.
?qnorm
?qbinom
?qt
?qunif
For example, the following function will find the 25-th percentile of t-distribution with 10 degrees of freedom. (Use the t-distribution table form Quercus to find this quantile and compare this result to the R output)
qt(0.25, df = 10)
## [1] -0.6998121
In this part we will learn how to do statistical testing in R.
First, let’s generate a sample.
set.seed(0)
sample = rnorm(30, 0, 1)
sample
## [1] 1.262954285 -0.326233361 1.329799263 1.272429321 0.414641434
## [6] -1.539950042 -0.928567035 -0.294720447 -0.005767173 2.404653389
## [11] 0.763593461 -0.799009249 -1.147657009 -0.289461574 -0.299215118
## [16] -0.411510833 0.252223448 -0.891921127 0.435683299 -1.237538422
## [21] -0.224267885 0.377395646 0.133336361 0.804189510 -0.057106774
## [26] 0.503607972 1.085769362 -0.690953840 -1.284599354 0.046726172
To conduct statistical testing for the population mean \(\mu\) we use t.test(...)
function.
?t.test
The following code will use sample
to check \(H_0: \mu = 0\) vs. \(H_a: \mu \neq 0\).
Note that you can use
alternative =
argument to switch between one- and
two-sided alternatives;conf.level =
argument to change the significance level
\(\alpha\).t.test(x = sample, alternative = "two.sided", mu = 0, conf.level = 0.95)
##
## One Sample t-test
##
## data: sample
## t = 0.13152, df = 29, p-value = 0.8963
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.3193943 0.3632959
## sample estimates:
## mean of x
## 0.02195079
In the output of t.test(...)
function you get the
following information:
To run statistical testing for proportions we use
binom.test(...)
function. Note that, in this case, argument
x =
corresponds to the “number of successes” and
n =
is the “number of trials”.
For example, if we got 65 successful outcomes out of 100 trials and we want to test \(H_0: p = 0.5\) vs. \(H_a: p > 0.5\) at the significance level \(\alpha = 0.8\) we would run the following code:
binom.test(x = 65, n = 100, p = 0.5, alternative = "greater", conf.level = 0.8)
##
## Exact binomial test
##
## data: 65 and 100
## number of successes = 65, number of trials = 100, p-value = 0.001759
## alternative hypothesis: true probability of success is greater than 0.5
## 80 percent confidence interval:
## 0.6037246 1.0000000
## sample estimates:
## probability of success
## 0.65
Tn the output of binom.test(...)
function you get: