Conditional probability is a measure of the probability of an event occurring given that another event has (by assumption, presumption, assertion or evidence) occurred.
Given two events $A$ and $B$, with the unconditional probability of $B$ (that is, of the event $B$ occurring) being greater than zero, $P(B) > 0$, the conditional probability of $A$ given $B$ is defined as the quotient of the probability of the joint of events $A$ and $B$, and the probability of $B$:
$$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} $$
where $P(A \cap B)$ is the probability that both events $A$ and $B$ occur.
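As a quick illustration of this definition, here is a small R sketch using a fair six-sided die as a toy example (the die is our own illustration, not part of the analyses below): let $A$ be "the roll is even" and $B$ be "the roll is greater than 3".
rolls <- 1:6                                    # the six equally likely outcomes
p_B <- mean(rolls > 3)                          # P(B) = 3/6
p_A_and_B <- mean(rolls %% 2 == 0 & rolls > 3)  # P(A and B) = 2/6
p_A_given_B <- p_A_and_B / p_B                  # P(A | B) = 2/3
p_A_given_B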
Mathematical notation:
Union: The union of two sets $A$ and $B$ is denoted by $A \cup B$. This is the set of all distinct elements that are in $A$ or $B$. A useful way to remember the symbol is that $\cup$ looks like the first letter of "union". We can define the union of a collection of sets as the set of all distinct elements that are in any of these sets.
Intersection: The intersection of two sets $A$ and $B$ is denoted by $A \cap B$. This is the set of all distinct elements that are in both $A$ and $B$. A useful way to remember the symbol is that $\cap$ is the union symbol $\cup$ flipped upside down. We define the intersection of a collection of sets as the set of all distinct elements that are in all of these sets.
Example: If $A = \{ 1, 3, 5, 7, 9 \}$ and $B = \{ 2, 3, 5, 7 \}$, what are $A \cup B$ and $A \cap B$?
We have:
$$ A \cup B = \{ 1, 2, 3, 5, 7, 9 \} $$
$$ A \cap B = \{ 3, 5, 7 \} $$
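In R, the base functions union() and intersect() compute the same sets, so the example above can be checked directly:
A <- c(1, 3, 5, 7, 9)
B <- c(2, 3, 5, 7)
union(A, B)      # 1 3 5 7 9 2 -- all distinct elements that are in A or B
intersect(A, B)  # 3 5 7       -- elements that are in both A and B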
The probability rule of sum gives the situations in which the probability of a union of events can be calculated by summing probabilities together. It is often used for mutually exclusive events, meaning events that cannot both happen at the same time.
Let $A$ and $B$ be mutually exclusive events. Then the probability of the union of those events is:
$$ P(A \cup B) = P(A) + P(B) $$
The rule of product is a guideline as to when probabilities can be multiplied to produce another meaningful probability. Specifically, the rule of product is used to find the probability of an intersection of events:
Let $A$ and $B$ be independent events. Then,
$$ P(A \cap B) = P(A) \times P(B) $$
Bayesian inference is a method for figuring out unobservable quantities given known facts that uses probability to describe the uncertainty over what the values of the unknown quantities could be. The role of probability distributions in Bayesian data analysis is to represent uncertainty, and the role of Bayesian inference is to update probability distributions to reflect what has been learned from data.
At the heart of this updating sits Bayes' theorem:
$$ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $$
where $A$ and $B$ are events and $P(B) \neq 0$.
$P(A\mid B)$ is a conditional probability: the likelihood of event $A$ occurring given that $B$ is true.
$P(B\mid A)$ is also a conditional probability: the likelihood of event $B$ occurring given that $A$ is true.
$P(A)$ and $P(B)$ are the probabilities of observing $A$ and $B$ respectively; they are known as the marginal probabilities.
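As a small numerical illustration (the numbers are made up for this sketch and not taken from the analyses that follow), Bayes' theorem can be coded directly in R, with the marginal $P(B)$ obtained from the law of total probability:
p_A <- 0.01              # P(A): prior probability of event A
p_B_given_A <- 0.90      # P(B | A)
p_B_given_notA <- 0.05   # P(B | not A)
p_B <- p_B_given_A * p_A + p_B_given_notA * (1 - p_A)  # marginal P(B)
p_A_given_B <- p_B_given_A * p_A / p_B                 # Bayes' theorem
p_A_given_B  # approximately 0.15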
# Suppress warnings and load the packages used throughout this analysis
options(warn = -1)
library(ggplot2)
library(ggjoy)
library(ggExtra)
library(devtools)
library(magrittr)
library(lattice)
library(multipanelfigure)
library(purrr)
library(dplyr)
library(forcats)
library(BEST)
The function prop_model implements a Bayesian model that assumes that the data is a vector of successes and failures represented by 1s and 0s, that there is an unknown underlying proportion of success, and that, prior to seeing any data, any underlying proportion of success is equally likely.
prop_model <- function(data = c(), prior_prop = c(1, 1), n_draws = 10000,
                       show_plot = TRUE) {
  data <- as.logical(data)
  # Grid of proportions over which the posterior is evaluated
  proportion_success <- c(0, seq(0, 1, length.out = 100), 1)
  data_indices <- round(seq(0, length(data), length.out = min(length(data) + 1, 20)))
  # Posterior curve after each data point: a Beta(prior + successes, prior + failures)
  post_curves <- map_dfr(data_indices, function(i) {
    value <- ifelse(i == 0, "Prior", ifelse(data[i], "Success", "Failure"))
    label <- paste0("n=", i)
    probability <- dbeta(proportion_success,
                         prior_prop[1] + sum(data[seq_len(i)]),
                         prior_prop[2] + sum(!data[seq_len(i)]))
    probability <- probability / max(probability)
    data_frame(value, label, proportion_success, probability)
  })
  post_curves$label <- fct_rev(factor(post_curves$label,
                                      levels = paste0("n=", data_indices)))
  post_curves$value <- factor(post_curves$value,
                              levels = c("Prior", "Success", "Failure"))
  # Ridgeline plot of how the posterior evolves as data points are added
  p <- ggplot(post_curves, aes(x = proportion_success, y = label,
                               height = probability, fill = value)) +
    geom_joy(stat = "identity", color = "white", alpha = 0.8,
             panel_scaling = TRUE, size = 1) +
    scale_y_discrete("", expand = c(0.01, 0)) +
    scale_x_continuous("Underlying proportion of success") +
    scale_fill_manual(values = hcl(120 * 2:0 + 15, 100, 65), name = "",
                      drop = FALSE, labels = c("Prior ", "Success ", "Failure ")) +
    theme_light(base_size = 18) +
    theme(legend.position = "top")
  if (show_plot) {
    print(p)
  }
  # Invisibly return a sample from the final posterior
  invisible(rbeta(n_draws, prior_prop[1] + sum(data), prior_prop[2] + sum(!data)))
}
Assume you just flipped a coin four times and the result was heads, tails, tails, heads. If you code heads as a success and tails as a failure, then the following R code runs prop_model with this data:
data <- c(1, 0, 0, 1)
The output of prop_model is a plot showing what the model learns about the underlying proportion of success from each data point in the order you entered them. At n=0 there is no data, and all the model knows is that it's equally probable that the proportion of success is anything from 0% to 100%. At n=4 all data has been added, and the model knows a little bit more.
prop_model(data)
If we really were interested in the underlying proportion of heads of this coin then prop_model isn't particularly useful. Since it assumes that any underlying proportion of success is equally likely prior to seeing any data, it will take a lot of coin flipping to convince prop_model that the coin is fair. This model is more appropriate in a situation where we have little background knowledge about the underlying proportion of success.
Let's say the zombie apocalypse is upon us and we have come up with a new experimental drug to cure zombieism. We have no clue how effective it's going to be, but when we gave it to 13 zombies, two of them turned human again.
# Update the data and rerun prop_model
data = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
prop_model(data)
Here again is the prop_model function which has been given the data from our zombie experiment where two out of 13 zombies got cured. In addition to producing a plot, prop_model also returns a large random sample from the posterior over the underlying proportion of success.
data = c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
# Extract and explore the posterior
posterior <- prop_model(data)
head(posterior)
Looking at these first few samples confirms what is already shown in the plot: the underlying proportion of cured zombies is likely somewhere between 5% and 50%. But these were just the first six samples in posterior, which currently contains 10,000 samples (the default of prop_model).
hist(posterior, breaks = 30, xlim = c(0, 1), col = "lightgreen")
The point of working with samples from a probability distribution is that it makes it easy to calculate new measures of interest.
A point estimate is a single number used to summarize what's known about a parameter of interest. It can be seen as a "best guess" of the value of the parameter. A commonly used point estimate is the median of the posterior. It's the midpoint of the distribution, and it's equally probable for the parameter value to be larger than the median as it is to be smaller than it.
# Calculate some measures of interest using the posterior
median(posterior)
So, a best guess is that the drug would cure around 19% of all zombies. Another common summary is to report an interval that includes the parameter of interest with a certain probability. This is called a credible interval (CI). With a posterior represented as a vector of samples you can calculate a CI using the quantile() function.
# Calculate a 90% credible interval
quantile(posterior, c(0.05, 0.95))
According to the credible interval, there is a 90% probability that the proportion of zombies the drug would cure is between 6% and 39%. (Here we have to be careful to remember that the percentage of cured zombies and the percentage of probability are two different things.)
Now, there is a rival zombie laboratory that is also working on a drug. They claim that they are certain that their drug cures 7% of the zombies it's administered to. Can we calculate how probable it is that our drug is better?
# Use sum to count how many samples in posterior are larger than 7%
sum(posterior > 0.07)
To turn this count into a probability we now need to normalize it, that is, divide it by the total number of samples in posterior.
# Calculate the probability
sum(posterior > 0.07) / length(posterior)
It seems there is a large probability (93%) that our zombie drug is better!
Given the data of 2 cured and 11 relapsed zombies, and using the Bayesian model described before, there is a 90% probability that our drug cures between 6% and 39% of treated zombies. Further, there is a 93% probability that our drug cures zombies at a higher rate than the current state-of-the-art drug.
A generative model is any type of computer program, mathematical expression, or set of rules that you can feed fixed parameter values and use to generate simulated data.
Assume the underlying proportion of success of curing a zombie is 42% and that you administer the drug to 100 zombies.
# The generative zombie drug model
# Set parameters
prop_success <- 0.42
n_zombies <- 100
# Simulating data
data <- c()
for(zombie in 1:n_zombies) {
  data[zombie] <- runif(1, min = 0, max = 1) < prop_success
}
data <- as.numeric(data)
data
# Count cured zombies
data <- sum(data == 1)
data
It turns out that the generative model we ran last has a name. It's called the binomial process or the binomial distribution. In R you can use the rbinom function to simulate data from a binomial distribution. The rbinom function takes three arguments:
n: the number of times you want to run the generative model.
size: the number of trials. (For example, the number of zombies you're giving the drug.)
prob: the underlying proportion of success as a number between 0.0 and 1.0.
# Try out rbinom
rbinom(n = 1, size = 100, prob = 0.42)
# Change the parameters to run the simulation 200 times
rbinom(n = 200, size = 100, prob = 0.42)
To get more visitors to your website you are considering paying for an ad to be shown 100 times on a popular social media site. According to the social media site, their ads get clicked on 10% of the time.
Assume that 10% is a reasonable number, and assume that the binomial distribution is a reasonable generative model for how people click on ads since it is a binary decision.
# Fill in the parameters
n_samples <- 100000
n_ads_shown <- 100
proportion_clicks <- 0.10
n_visitors <- rbinom(n_samples, size = n_ads_shown, prob = proportion_clicks)
# Visualize n_visitors
hist(n_visitors)
# Probability of 5 or more visitors
sum(n_visitors >= 5) / length(n_visitors)
We're not so sure that your ad will get clicked on exactly 10% of the time. Instead of assigning proportion_clicks a single value we are now going to assign it a large number of values drawn from a probability distribution.
For now, we are going to assume that it's equally likely that proportion_clicks could be as low as 0% or as high as 20%. These assumptions translate into a uniform distribution.
n_samples <- 100000
n_ads_shown <- 100
# Update proportion_clicks
proportion_clicks <- runif(n = n_samples, min = 0.0, max = 0.2)
n_visitors <- rbinom(n = n_samples, size = n_ads_shown, prob = proportion_clicks)
# Visualize proportion clicks
hist(proportion_clicks)
# Visualize n_visitors
hist(n_visitors)
This looks very different from the histogram of n_visitors we got when proportion_clicks was exactly 0.1. With the added uncertainty in proportion_clicks, the uncertainty over the number of visitors we'll get also increased.
# Probability of 5 or more visitors
sum(n_visitors >= 5) / length(n_visitors)
You ran your ad campaign, and 13 people clicked and visited your site when the ad was shown 100 times. You would now like to use this new information to update the Bayesian model.
The model we put together previously resulted in two vectors: (1) proportion_clicks, which represents the uncertainty regarding the underlying proportion of clicks, and (2) n_visitors, which represents the uncertainty regarding the number of visitors you would get.
# Create the prior data frame
prior <- data.frame(proportion_clicks, n_visitors)
# Examine the prior data frame
head(prior)
p <- ggplot(prior, aes(n_visitors, proportion_clicks)) + geom_point(alpha = 0.3)
ggMarginal(p, type = "histogram")
The reason we've called it prior is that it represents the uncertainty before (that is, prior to) having included the information in the data.
prior$n_visitors represented the uncertainty over how many visitors you would get because of the ad campaign. But now you know you got exactly 13 visitors.
We update prior to include this information by conditioning on this data. That is, filtering away all rows where prior$n_visitors isn't equal to 13.
# Create the posterior data frame
posterior <- prior[prior$n_visitors == 13, ]
# Visualize posterior proportion clicks
hist(posterior$proportion_clicks)
Previously we updated the probability distribution over the underlying proportion of clicks (proportion_clicks) using new data. Now we want to use this updated proportion_clicks to predict how many visitors we would get if we reran the ad campaign.
# Assign posterior to a new variable called prior
prior <- posterior
head(prior)
This will represent the uncertainty regarding the new ad campaign you haven't run yet.
While proportion_clicks represents the uncertainty regarding the underlying proportion of clicks for the next ad campaign, n_visitors has not been updated yet.
# Replace prior$n_visitors with a new sample and visualize the result
n_samples <- nrow(prior)
n_ads_shown <- 100
prior$n_visitors <- rbinom(n_samples, size = n_ads_shown, prob = prior$proportion_clicks)
hist(prior$n_visitors)
The plot shows a probability distribution over the number of site visitors a new ad campaign would bring in. It now looks pretty likely that there would be more than, say, 5 visitors because of the campaign.
# Calculate the probability that you will get 5 or more visitors
sum(prior$n_visitors >= 5) / length(prior$n_visitors)
The Beta distribution is a useful probability distribution when you want to model uncertainty over a parameter bounded between 0 and 1. Here we will explore how the two parameters of the Beta distribution determine its shape.
# Explore using the rbeta function
beta_sample <- rbeta(n = 1000000, shape1 = 1, shape2 = 1) # Beta(1,1)
# Visualize the results
hist(beta_sample)
A Beta(1,1) distribution is the same as a uniform distribution between 0 and 1. It is useful as a so-called non-informative prior, as it expresses that any value from 0 to 1 is equally likely.
# Modify the parameters
beta_sample <- rbeta(n = 1000000, shape1 = 100, shape2 = 100)
# Visualize the results
hist(beta_sample)
The larger the shape parameters are, the more concentrated the Beta distribution becomes. When used as a prior, this Beta distribution encodes the information that the parameter is most likely close to 0.5.
# Modify the parameters
beta_sample <- rbeta(n = 1000000, shape1 = 100, shape2 = 20)
# Visualize the results
hist(beta_sample)
The larger the shape1 parameter is, the closer the resulting distribution is to 1.0, and the larger shape2 is, the closer it is to 0.
# Modify the parameters
beta_sample <- rbeta(n = 1000000, shape1 = 10, shape2 = 150)
# Visualize the results
hist(beta_sample)
beta_sample1 <- rbeta(n = 10000, shape1 = 3, shape2 = 12)
beta_sample2 <- rbeta(n = 10000, shape1 = 15, shape2 = 75)
beta_sample3 <- rbeta(n = 10000, shape1 = 5, shape2 = 95)
beta_sample4 <- rbeta(n = 10000, shape1 = 30, shape2 = 10)
figure1 <- multi_panel_figure(columns = 2, rows = 2, panel_label_type = "none")
q1 <- ggplot() + geom_histogram(aes(beta_sample1), bins=40, colour="black", fill="blue", alpha=0.5) +
theme_joy() + xlim(c(0,1))
q2 <- ggplot() + geom_histogram(aes(beta_sample2), bins=40, colour="black", fill="green", alpha=0.5) +
theme_joy() + xlim(c(0,1))
q3 <- ggplot() + geom_histogram(aes(beta_sample3), bins=40, colour="black", fill="red", alpha=0.5) +
theme_joy() + xlim(c(0,1))
q4 <- ggplot() + geom_histogram(aes(beta_sample4), bins=40, colour="black", fill="yellow", alpha=0.5) +
theme_joy() + xlim(c(0,1))
Most ads get clicked on 5% of the time, but for some ads it is as low as 2% and for others as high as 8%.
There are many different probability distributions that one can argue capture this information. Out of the four Beta distributions shown below, which captures this information the best?
figure1 %<>%
fill_panel(q1, column = 1, row = 1) %<>%
fill_panel(q2, column = 2, row = 1) %<>%
fill_panel(q3, column = 1, row = 2) %<>%
fill_panel(q4, column = 2, row = 2)
figure1
The best distribution would be beta_sample3, that is, rbeta(shape1 = 5, shape2 = 95).
n_draws <- 100000
n_ads_shown <- 100
proportion_clicks <- runif(n_draws, min = 0.0, max = 0.2)
n_visitors <- rbinom(n_draws, size = n_ads_shown, prob = proportion_clicks)
prior <- data.frame(proportion_clicks, n_visitors)
posterior <- prior[prior$n_visitors == 13, ]
# This plots the prior and the posterior in the same plot
par(mfcol = c(2, 1))
hist(prior$proportion_clicks,
xlim = c(0, 0.25))
hist(posterior$proportion_clicks,
xlim = c(0, 0.25))
n_draws <- 100000
n_ads_shown <- 100
# Change the prior on proportion_clicks
proportion_clicks <- rbeta(n_draws, shape1 = 5, shape2 = 95)
n_visitors <- rbinom(n_draws, size = n_ads_shown, prob = proportion_clicks)
prior <- data.frame(proportion_clicks, n_visitors)
posterior <- prior[prior$n_visitors == 13, ]
# This plots the prior and the posterior in the same plot
par(mfcol = c(2, 1))
hist(prior$proportion_clicks,
xlim = c(0, 0.25))
hist(posterior$proportion_clicks,
xlim = c(0, 0.25))
Let's fit the binomial model to both the video ad data (13 out of 100 clicked) and the new text ad data (6 out of 100 clicked).
# Define parameters
n_draws <- 100000
n_ads_shown <- 100
proportion_clicks <- runif(n_draws, min = 0.0, max = 0.2)
n_visitors <- rbinom(n = n_draws, size = n_ads_shown, prob = proportion_clicks)
prior <- data.frame(proportion_clicks, n_visitors)
# Create the posteriors for video and text ads
posterior_video <- prior[prior$n_visitors == 13, ]
posterior_text <- prior[prior$n_visitors == 6, ]
# Visualize the posteriors
hist(posterior_video$proportion_clicks, xlim = c(0, 0.25))
hist(posterior_text$proportion_clicks, xlim = c(0, 0.25))
The posterior proportion_clicks for the video and text ads have been put into a single posterior data frame. The reason for [1:4000] is that these proportion_clicks vectors are not necessarily of the same length, which they need to be when put into a data frame.
Now it's time to calculate the posterior probability distribution over what the difference in proportion of clicks might be between the video ad and the text ad.
posterior <- data.frame(
  video_prop = posterior_video$proportion_clicks[1:4000],
  text_prop = posterior_text$proportion_clicks[1:4000])
# Calculate the posterior difference: video_prop - text_prop
posterior$prop_diff <- posterior$video_prop - posterior$text_prop
# Visualize prop_diff
hist(posterior$prop_diff, xlim = c(-0.15, 0.2))
# Calculate the median of prop_diff
median(posterior$prop_diff)
Calculate the probability that the proportion of clicks is larger for the video ad than for the text ad. That is, calculate the proportion of samples in posterior$prop_diff that are more than zero.
# Calculate the proportion
sum(posterior$prop_diff > 0) / length(posterior$prop_diff)
It's pretty likely that the video ad is better.
Each visitor spends $2.53 on average, a video ad costs $0.25, and a text ad costs $0.05. Let's figure out the probable profit when using video ads and text ads!
visitor_spend <- 2.53
video_cost <- 0.25
text_cost <- 0.05
# Add the column posterior$video_profit
posterior$video_profit <- posterior$video_prop * visitor_spend - video_cost
# Add the column posterior$text_profit
posterior$text_profit <- posterior$text_prop * visitor_spend - text_cost
# Visualize the video_profit and text_profit columns
hist(posterior$video_profit)
hist(posterior$text_profit)
Using the columns video_profit and text_profit that we added to posterior previously, let's conclude the decision analysis.
# Add the column posterior$profit_diff
posterior$profit_diff <- posterior$video_profit - posterior$text_profit
# Visualize posterior$profit_diff
hist(posterior$profit_diff)
# Calculate a "best guess" for the difference in profits
median(posterior$profit_diff)
# Calculate the probability that text ads are better than video ads
sum(posterior$profit_diff < 0) / length(posterior$profit_diff)
Even though text ads get a lower proportion of clicks, they are also much cheaper. And, as we have calculated, there is a 63% probability that text ads are better.
The Poisson distribution simulates a process where the outcome is a number of occurrences per day/year/area/unit/etc.
The Poisson distribution has one parameter, the average number of events per unit. In R you can simulate from a Poisson distribution using rpois, where lambda is the average number of occurrences:
# Simulate from a Poisson distribution and visualize the result
x <- rpois(n = 10000, lambda = 3)
# Visualize x using a histogram
hist(x)
When you put up a banner on your friend's site, you got 19 clicks in a day. How many daily clicks should you expect this banner to generate on average? Now, modify your model, one piece at a time, to calculate this. To accommodate the new banner data we are now going to change the model so that it uses a Poisson distribution instead.
# Change the model according to instructions
n_draws <- 100000
mean_clicks <- runif(n_draws, min = 0, max = 80)
n_visitors <- rpois(n = n_draws, mean_clicks)
prior <- data.frame(mean_clicks, n_visitors)
posterior <- prior[prior$n_visitors == 19, ]
# Visualize mean_clicks
hist(prior$mean_clicks)
hist(posterior$mean_clicks)
# 95% credible interval
quantile(posterior$mean_clicks, c(0.025, 0.975))
A standard French-suited deck of playing cards contains 52 cards; 13 each of hearts (♥), spades (♠), clubs (♣), and diamonds (♦). Assuming that we have a well-shuffled deck in front of us, the probability of drawing any given card is 1/52 ≈ 1.92%.
# Calculate the probability of drawing any of the four aces
prob_to_draw_ace <- 1/52 + 1/52 + 1/52 + 1/52
prob_to_draw_ace
Again, assuming that we have a well-shuffled deck in front of us, the probability of drawing any given card is 1/52 ≈ 1.92%. The probability of drawing any of the four aces is 1/52 + 1/52 + 1/52 + 1/52 = 4/52. Once an ace has been drawn, the probability of picking any of the remaining three is 3/51. If another ace is drawn, the probability of picking any of the remaining two is 2/50, and so on.
We will use the product rule to calculate the probability of picking the four aces in a row from the top of a well-shuffled deck.
# Calculate the probability of picking four aces in a row
prob_to_draw_four_aces <- 4/52 * 3/51 * 2/50 * 1/49
format(prob_to_draw_four_aces, scientific = F)
The code below:
Simulates the number of clicks/visitors (n_visitors) from 100 shown ads using the rbinom function, given that the underlying proportion of clicks is 10%.
Calculates the probability of getting 13 visitors (prob_13_visitors).
That is, in probability notation it's calculating $P$(n_visitors = $13$ $\mid$ proportion_clicks = $10\%$).
# Rewrite this code so that it uses dbinom instead of rbinom
n_ads_shown <- 100
proportion_clicks <- 0.1
n_visitors <- rbinom(n = 99999, size = n_ads_shown, prob = proportion_clicks)
prob_13_visitors <- sum(n_visitors == 13) / length(n_visitors)
prob_13_visitors
# The same calculation rewritten to use dbinom instead of rbinom
n_ads_shown <- 100
proportion_clicks <- 0.1
prob_13_visitors <- dbinom(x = 13, size = n_ads_shown, prob = proportion_clicks)
prob_13_visitors
We get a similar result as with rbinom, but dbinom is much more efficient!
Calculate $P$(n_visitors $\mid$ proportion_clicks = $10\%$): the probability distribution over all possible numbers of visitors.
# Explore using dbinom to calculate probability distributions
n_ads_shown <- 100
proportion_clicks <- 0.1
n_visitors <- seq(0, 100)
prob <- dbinom(n_visitors, size = n_ads_shown, prob = proportion_clicks)
# Plot the distribution
plot(n_visitors, prob, type = "h")
It seems that with proportion_clicks = 10% the most probable n_visitors values are between 1 and 20.
Now, let's instead fix n_visitors to 13 and calculate the likelihood over a fine grid of values of proportion_clicks.
n_ads_shown <- 100
proportion_clicks <- seq(0, 1, by = 0.01)
n_visitors <- 13
prob <- dbinom(n_visitors, size = n_ads_shown, prob = proportion_clicks)
plot(proportion_clicks, prob, type = "h")
The plot we produced almost looks like it shows the probability distribution over different values of proportion_clicks, but it does not. For one, the values in prob do not sum up to one. What we have calculated is the likelihood of different values of proportion_clicks to result in n_visitors = 13. What we have found by eyeballing this graph is the so-called maximum likelihood estimate of proportion_clicks.
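Rather than eyeballing the graph, the maximum likelihood estimate can be read off directly from the prob vector computed above:
# The value of proportion_clicks that maximizes the likelihood of n_visitors = 13
proportion_clicks[which.max(prob)]  # 0.13, i.e. 13 successes out of 100 trials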
We know: we want to show 100 ads.
We don't know how many visitors we will get: anywhere from 0 to at most 100.
We don't know the proportion of clicks we will get from the ads. This is continuous from 0 to 1, so we cannot try every combination; instead we will use a fine grid of values.
For these 2 unknown parameters we then calculate: the prior over proportion_clicks; the likelihood of n_visitors given proportion_clicks; and the probability of proportion_clicks and n_visitors combined, which is a combination of the likelihood of the data given the parameter value and how likely that parameter value was to begin with. Finally, we condition on the data by removing all rows where n_visitors is not the observed number of clicks out of the 100 ads.
Let's resurrect the zombie site example where we tested text ads. Out of 100 impressions of the text ad, 6 people clicked and visited your site.
n_ads_shown <- 100
proportion_clicks <- seq(0, 1, by = 0.01)
n_visitors <- seq(0, 100, by = 1)
pars <- expand.grid(proportion_clicks = proportion_clicks, n_visitors = n_visitors)
pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
pars$likelihood <- dbinom(pars$n_visitors, size = n_ads_shown, prob = pars$proportion_clicks)
pars$probability <- pars$likelihood * pars$prior
pars$probability <- pars$probability / sum(pars$probability)
# Condition on the data
pars <- pars[pars$n_visitors == 6, ]
# Normalize again
pars$probability <- pars$probability / sum(pars$probability)
# Plot the posterior pars$probability
plot(pars$proportion_clicks, pars$probability, type = "h")
Great, we've now done some Bayesian computation without doing any simulation! The plot we produced should be similar to the posterior distribution previously calculated. However, it required an awful lot of code. Isn't there anything we can cut?
Yes, there is! We can directly condition on the data; there is no need to first create the joint distribution.
# Simplify the code below by directly conditioning on the data
n_ads_shown <- 100
proportion_clicks <- seq(0, 1, by = 0.01)
n_visitors <- 6
pars <- expand.grid(proportion_clicks = proportion_clicks, n_visitors = n_visitors)
pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
pars$likelihood <- dbinom(pars$n_visitors, size = n_ads_shown, prob = pars$proportion_clicks)
pars$probability <- pars$likelihood * pars$prior
pars$probability <- pars$probability / sum(pars$probability)
plot(pars$proportion_clicks, pars$probability, type = "h")
Tilde-notation:
$$ n_{ads} = 100 $$
$$ P_{clicks} \sim \text{Uniform}(0.0, 0.2) $$
$$ n_{visitors} \sim \text{Binomial}(n_{ads}, P_{clicks}) $$
This notation is useful when we want to learn about an underlying proportion of success and want to predict the number of successes in future data.
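As a rough sketch (using ad hoc variable names of our own), this tilde notation translates almost line by line into R's random number functions, here drawing a single simulation from the model:
n_ads <- 100
p_clicks <- runif(n = 1, min = 0.0, max = 0.2)              # P_clicks ~ Uniform(0.0, 0.2)
n_visitors <- rbinom(n = 1, size = n_ads, prob = p_clicks)  # n_visitors ~ Binomial(n_ads, P_clicks)
n_visitors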
Here is a small data set with the birth weights of six newborn babies in grams.
Let's assume that the Normal distribution is a decent model of birth weight data.
new_borns <- c(3164, 3362, 4435, 3542, 3578, 4529)
# Assign mu and sigma
mu <- as.integer(mean(new_borns))
sigma <- as.integer(sd(new_borns))
weight_distr <- rnorm(n = 100000, mean = mu, sd = sigma)
hist(weight_distr, 60, xlim = c(0, 6000), col = "lightgreen")
The resulting histogram gives us a sense of the uncertainty over the birth weight of newborns. Let's recreate this plot, but calculating the distribution using dnorm instead of simulating using rnorm.
# Create weight
weight <- seq(0, 6000, by = 100)
# Calculate likelihood
likelihood <- dnorm(weight, mu, sigma)
# Plot the distribution of weight
plot(weight, likelihood, type="h")
Zombies are stupid, and you and your colleagues at the National Zombie Research Laboratory are interested in how stupid they are. What we're interested in is how much we can learn about the mean zombie IQ from this data.
# The IQ of a bunch of zombies
iq <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 46)
# Defining the parameter grid
pars <- expand.grid(mu = seq(0, 150, length.out = 100), sigma = seq(0.1, 50, length.out = 100))
# Defining and calculating the prior density for each parameter combination
pars$mu_prior <- dnorm(pars$mu, mean = 100, sd = 100)
pars$sigma_prior <- dunif(pars$sigma, min = 0.1, max = 50)
pars$prior <- pars$mu_prior * pars$sigma_prior
# Calculating the likelihood for each parameter combination
for(i in 1:nrow(pars)) {
  likelihoods <- dnorm(iq, pars$mu[i], pars$sigma[i])
  pars$likelihood[i] <- prod(likelihoods)
}
# Calculate the probability of each parameter combination
pars$probability <- pars$likelihood * pars$prior
pars$probability <- pars$probability / sum(pars$probability)
We computed the probability for each mean (mu) and SD (sigma) combination:
levelplot(probability ~ mu * sigma, data = pars)
Again, pars contains the data frame representing the posterior zombie IQ distribution we calculated earlier. We will draw sample_indices: a sample of row numbers (a.k.a. indices) from the posterior. Now, let's sample from pars to calculate some new measures!
head(pars)
# Sample from pars
sample_indices <- sample(nrow(pars), size = 10000, replace = TRUE, prob = pars$probability)
head(sample_indices)
# Sample from pars to calculate some mu and sigma
pars_sample <- pars[sample_indices, c("mu", "sigma")]
# Visualize the mean IQ
hist(pars_sample$mu, 100)
# Calculate quantiles
quantile(pars_sample$mu, c(0.025, 0.5, 0.975))
The 50% quantile we just calculated is the same as the median and a good candidate for a "best guess" for the mean IQ of a zombie, and the 2.5% and 97.5% quantiles form a 95% credible interval.
So we have an idea about what the mean zombie IQ is but what range of zombie IQs should we expect? And how likely is it that the next zombie you encounter is, at least, moderately intelligent?
head(pars_sample)
pred_iq <- rnorm(10000, mean = pars_sample$mu, sd = pars_sample$sigma)
# Visualize pred_iq
hist(pred_iq, 50)
# Calculate the probability of a zombie being "smart" (+60 IQ)
sum(pred_iq >= 60) / length(pred_iq)
The pred_iq distribution can be interpreted as the uncertainty over what IQ the next zombie you'll meet will have.
MCMC methods are used to approximate the posterior distribution of a parameter of interest by random sampling in a probabilistic space. MCMC methods allow us to estimate the shape of a posterior distribution in case we can’t compute it directly.
The first element to understanding MCMC methods is Monte Carlo simulation, which is just a way of estimating a fixed parameter by repeatedly generating random numbers. By taking the random numbers generated and doing some computation on them, Monte Carlo simulations provide an approximation of a parameter where calculating it directly is impossible or computationally expensive.
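For instance (a toy example of our own, not part of the analysis), the probability that a fair die roll exceeds 4 can be approximated by pure Monte Carlo simulation:
# Estimate P(die roll > 4) by simulating many rolls; the exact answer is 2/6
rolls <- sample(1:6, size = 100000, replace = TRUE)
mean(rolls > 4)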
The second element to understanding MCMC methods is Markov chains. These are simply sequences of events that are probabilistically related to one another. Each event comes from a set of outcomes, and each outcome determines which outcome occurs next, according to a fixed set of probabilities. An important feature of Markov chains is that they are memoryless: everything that you would possibly need to predict the next event is available in the current state, and no new information comes from knowing historical events.
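As a tiny illustration (again our own toy example), here is a two-state Markov chain where the next state depends only on the current one:
# A two-state Markov chain: rows are the current state, columns the next state
transition <- matrix(c(0.9, 0.1,
                       0.5, 0.5), nrow = 2, byrow = TRUE)
state <- 1
chain <- numeric(100)
for (i in 1:100) {
  state <- sample(1:2, size = 1, prob = transition[state, ])  # memoryless step
  chain[i] <- state
}
table(chain)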
MCMC methods pick a random parameter value to consider. The simulation continues to generate random values (this is the Monte Carlo part), but subject to some rule for determining what makes a good parameter value. The trick is that, for a pair of parameter values, it is possible to compute which is the better parameter value by computing how likely each value is to explain the data, given our prior beliefs. If a randomly generated parameter value is better than the last one, it is added to the chain of parameter values; if it is worse, it is accepted only with a certain probability determined by how much worse it is (this is the Markov chain part).
Any statistic calculated on the set of samples generated by MCMC simulations is our best guess of that statistic on the true posterior distribution.
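The description above stays conceptual, so here is a deliberately minimal sketch of a Metropolis sampler (one common MCMC algorithm) applied to the earlier ad example: 13 visitors out of 100 shown ads with a Uniform(0, 0.2) prior on the proportion of clicks. The post_density helper, the proposal standard deviation, and the starting value are choices made for this sketch only; in practice you would typically rely on dedicated software such as the BESTmcmc function used below.
# A minimal Metropolis sampler for P(proportion_clicks | 13 visitors out of 100 ads)
# Unnormalized posterior density: likelihood * prior (zero outside the prior's range)
post_density <- function(p) {
  if (p < 0 || p > 0.2) return(0)
  dbinom(13, size = 100, prob = p) * dunif(p, min = 0, max = 0.2)
}
n_iter <- 10000
mcmc_samples <- numeric(n_iter)
current <- 0.1  # starting value
for (i in 1:n_iter) {
  proposal <- current + rnorm(1, mean = 0, sd = 0.02)  # random-walk proposal
  # Accept with probability min(1, posterior ratio); better values are always accepted
  accept_prob <- min(1, post_density(proposal) / post_density(current))
  if (runif(1) < accept_prob) current <- proposal
  mcmc_samples[i] <- current
}
# The samples approximate the posterior over proportion_clicks
hist(mcmc_samples)
quantile(mcmc_samples, c(0.025, 0.975))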
The t-test is a classical statistical procedure used to compare the means of two data sets. In 2013, John Kruschke developed a souped-up Bayesian version of the t-test which he named BEST (standing for Bayesian Estimation Supersedes the t-test).
Zombies are stupid, but you and your colleagues at the National Zombie Research Laboratory are interested in how diet affects zombie intelligence. We have done a small experiment where we measured the IQ of 10 zombies on a regular diet and 10 zombies on a brain-based diet. The hypothesis is that zombies that eat more brains perform better on IQ tests. The data from the experiment is put into the variables iq_brains and iq_regular.
# The IQ of zombies on a regular diet and a brain based diet.
iq_brains <- c(44, 52, 42, 66, 53, 42, 55, 57, 56, 51)
iq_regular <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 46)
# Calculate the mean difference in IQ between the two groups
print('Mean difference between groups')
mean(iq_brains) - mean(iq_regular)
# Fit the BEST model to the data from both groups
best_posterior <- BESTmcmc(iq_brains, iq_regular)
# Plot the model result
plot(best_posterior)
This plot shows the posterior probability distribution over the difference in means between iq_brains and iq_regular. On top of this we get:
The mean of the posterior as a "best guess" for the difference.
A 95% credible interval (called a 95% Highest Density Interval in the plot).
The amount of probability below and above zero difference.
There is some evidence that eating brains makes zombies smarter, but it's uncertain by how much.
The Bayesian model behind BEST assumes that the generative model for the data is a t-distribution; a more flexible distribution than the normal distribution as it assumes that data points might be outliers to some degree. This makes BEST's estimate of the mean difference robust to outliers in the data.
Assume that a super smart mutant zombie (IQ = 150) got into the iq_regular group by mistake. This might mess up the results as you and your colleagues really were interested in how diet affects normal zombies.
# The IQ of zombies given a regular diet and a brain based diet.
iq_brains <- c(44, 52, 42, 66, 53, 42, 55, 57, 56, 51)
iq_regular <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 150) # <- Mutant zombie
# Modify the data above and calculate the difference in means
mean(iq_brains) - mean(iq_regular)
# Fit the BEST model to the modified data and plot the result
best_posterior <- BESTmcmc(iq_brains, iq_regular)
# Plot the model result
plot(best_posterior)
We see that the mutant zombie data point has made BEST more uncertain to some degree. But since BEST is robust to outliers, it still estimates that brain-eating zombies are more likely to have a higher IQ than zombies on a regular diet.