Data
How do we get answers?
Approach
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid", color_codes=True)
People = pd.read_csv('weightheight.csv')
People.info()
People.head()
Data contains patterns
Dealing with this
Null hypothesis (H0)
A = B
Alternative hypothesis (H1)
A != B
pvalue
100%
sure that are pattern couldn't have emerged due to random chance, but we can quantify the probability that random chance will produce a given pattern. The smaller the pvalue the less likely it is that the null hypothesis can account for our observations.alpha
0.05
: reject null hypothesisThe ttest tells you how significant the differences between groups are; In other words it lets you know if those differences (measured in means or averages
) could have happened by random chance.
Two basic types:
Onesample:
Mean of population different from a given value?Twosample:
Two population means equal?We want to compare the mean heights in cm of the Sample_A
group with a given value. We want to see whether the mean weight of the people in this sample is significantly different from the chosen cutoff point of 65 kg
. We will use a onesample ttest, which allows us to compare the mean of a sample with a chosen value. We will perform this test on the sample provided versus the crucial value of 65 kg
, and test its significance by comparing the value of alpha to the pvalue.
# Get Sample
Sample_A = People.Height.values
sns.kdeplot(Sample_A, shade=True, color="b")
plt.legend(('Height'))
plt.show()
from scipy import stats
# Perform ttest and print result
t_result = stats.ttest_1samp(Sample_A, 65)
# Test significance
alpha = 0.05
if (t_result[1] < alpha):
print(f"mean value of Sample A differs from given value \npvalue: {t_result[1]}")
else:
print(f"No significant difference found \npvalue: {t_result[1]}")
Now we'll compare two sets of samples. Again, we'll be looking at the weight
(weight of males and females) but this time we're going to compare the weight of males
with females
. Does the mean weight differ between the two genders? A twosample ttest can tell us whether the means of two samples differ significantly.
# Get Sample
Sample_A = People[People['Gender']=='Male']['Weight']
Sample_B = People[People['Gender']=='Female']['Weight']
sns.kdeplot(Sample_A, shade=True, color="b")
sns.kdeplot(Sample_B, shade=True, color="r")
plt.legend(('Male Weight', 'Female Weight'))
plt.show()
t_result = stats.ttest_ind(Sample_A, Sample_B)
alpha = 0.05
if (t_result[1] < alpha):
print(f"mean value of Sample A differs from Sample_B \npvalue: {t_result[1]}")
else:
print(f"No significant difference found \npvalue: {t_result[1]}")
A chisquare goodness of fit test determines if a sample data matches a population.
A chisquare test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests to see whether distributions of categorical variables differ from each another.
small chi square test statistic
means that your observed data fits your expected data (there is a relationship).large chi square test statistic
means that the data does not fit very well (there isnâ€™t a relationship).# Categorical Dataset
df = pd.read_csv('bankfull.csv', sep=';')
df.head()
The table was called a contingency table
, by Karl Pearson, because the intent is to help determine whether one variable is contingent upon or depends upon the other variable. For example, does an Education
in primary, secondary, tertiary, or unknown depend on Job
, or are they independent?
We can summarize the collected observations in a table with one variable corresponding to columns and another variable corresponding to rows. Each cell in the table corresponds to the count or frequency of observations that correspond to the row and column categories.
Historically, a table summarization of two categorical variables in this form is called a contingency table.
For example, the Education = rows
and Job = columns
table with contrived counts might look as follows:
table = pd.crosstab(df.job, df.education)
table
The ChiSquared test is a statistical hypothesis test that assumes (the null hypothesis) that the observed frequencies for a categorical variable match the expected frequencies for the categorical variable. The test calculates a statistic that has a chisquared distribution, named for the Greek capital letter Chi (X)
pronounced ki
as in kite.
Given the Job/Education example above, the number of observations for a category may or may not be the same. Nevertheless, we can calculate the expected frequency of observations in each Education group and see whether the partitioning of Education by Job results in similar or different frequencies.
The ChiSquared test does this for a contingency table, first calculating the expected frequencies for the groups, then determining whether the division of the groups, called the observed frequencies, matches the expected frequencies.
The result of the test is a test statistic that has a chisquared distribution and can be interpreted to reject or fail to reject the assumption or null hypothesis that the observed and expected frequencies are the same.
We can interpret the test statistic
in the context of the chisquared distribution with the requisite number of degress of freedom as follows:
If Statistic >= Critical Value:
significant result, reject null hypothesis (H0)
, dependent.If Statistic <= Critical Value:
not significant result, fail to reject null hypothesis (H0)
, independent.Degrees of Freedom:
1
from the number of items. Letâ€™s say you were finding the mean weight loss for a lowcarb diet. You could use 4
people, giving 3
degrees of freedom (4 â€“ 1 = 3)
, or you could use 100
people with df = 99.
The degrees of freedom
for the chisquared distribution is calculated based on the size of the contingency table as:
degrees of freedom = (rows  1) * (cols  1)
In terms of a pvalue
and a chosen significance level (alpha)
, the test can be interpreted as follows:
If pvalue <= alpha:
significant result, reject null hypothesis (H0)
, dependent.If pvalue > alpha:
not significant result, fail to reject null hypothesis (H0)
, independent.For the test to be effective, at least five observations
are required in each cell of the contingency table.
The Pearsonâ€™s chisquared test for independence can be calculated in Python using the chi2_contingency()
SciPy
function.
The function takes an array as input representing the contingency table for the categorical variables. It returns the calculated statistic and pvalue for interpretation as well as the calculated degrees of freedom and table of expected frequencies.
# chisquared test with similar proportions
from scipy.stats import chi2_contingency
from scipy.stats import chi2
stat, p_val, dof, expected = chi2_contingency(table)
print(table)
print(f'\ndof = {dof}')
We can interpret the statistic by retrieving the critical value from the chisquared distribution for the probability and number of degrees of freedom.
For example, a probability of 95%
can be used, suggesting that the finding of the test is quite likely given the assumption of the test that the variable is independent. If the statistic is less than or equal to the critical value, we can fail to reject this assumption (independent), otherwise it can be rejected (dependent).
# interpret teststatistic
prob = 0.95
critical = chi2.ppf(prob, dof)
print(f'Probability = {prob}, Critical Value = {round(critical,3)}, Test Statistic = {round(stat,3)}')
if abs(stat) >= critical:
print('\nDependent (reject H0)')
else:
print('\nIndependent (fail to reject H0)')
We can also interpret the pvalue by comparing it to a chosen significance level, which would be 5%
, calculated by inverting the 95%
probability used in the critical value interpretation.
# interpret pvalue
alpha = 1.0  prob
print(f'Significance Level = {round(alpha,3)}, PValue = {round(p_val,3)}')
if p_val <= alpha:
print('\nDependent (reject H0)')
else:
print('\nIndependent (fail to reject H0)')
Fisher's exact test is a statistical significance test used in the analysis of contingency tables. Although in practice it is employed when sample sizes are small, it is valid for all sample sizes. Use the Fisher's exact test of independence when you have two nominal (categorical) variables and want to find out if proportions for one nominal variable are different among values of the other nominal variable.
Test distinguishes between:
Now, we will work with the Olympics dataset to look at the relative success of the American swimming and athletics teams. Whether each athlete received a medal is coded as True or False
in the MedalTF
column of athletes. Do a larger proportion of swimming or athletics participants come home with medals? A Fisher exact test is a useful way to compare proportions of samples falling into discrete categories. To test this, we will need to perform a Fisher exact test on MedalTF in relation to Sport.
athletes = pd.read_csv('https://assets.datacamp.com/production/repositories/4371/datasets/8fd0a14bfbc5f13719d92334eaf77b23f2e914d6/olyathswim.csv')
athletes.head()
athletes['Medal'].unique()
b = {'Bronze': True, 'Silver': True, 'Gold': True}
athletes['MedalTF'] = athletes['Medal'].map(b)
athletes['MedalTF'] = athletes['MedalTF'].fillna(False)
athletes['Medal'] = athletes['Medal'].fillna('None')
athletes = athletes.dropna()
athletes['MedalTF'].unique()
# Create a table of crosstabulations
table = pd.crosstab(athletes.MedalTF, athletes.Sport)
print(table)
# Perform the Fisher exact test
fisher = stats.fisher_exact(table, alternative='twosided')
# Is the result significant?
alpha = 0.05
if fisher[1] <= alpha:
print(f"\nProportions of medal winners differ significantly \npvalue: {fisher[1]}")
else:
print(f"\nNo significant difference in proportions of medal winners found \npvalue: {fisher[1]}")
According to these results, a higher proportion of swimmers come home with medals than athletics competitors.
In this exercise, we're going to look at random sampling. For the purposes of this exercise, we are interested in differences between the body Weight of competitors in swimming and athletics. In order to test this, we will be using a twosample ttest. However, we will be performing this test on a random sample of the data. By playing with the random subset chosen, we will see how randomness affects the results.
# Define random seed
seed = 420
# Create two subsets, one for the athletics competitors and one for the swimmers
subsetathl = athletes[athletes.Sport == "Athletics"].dropna().sample(n=100, random_state = seed)
subsetswim = athletes[athletes.Sport == "Swimming"].dropna().sample(n=100, random_state = seed)
sns.kdeplot(subsetathl.Weight, shade=True, color="b")
sns.kdeplot(subsetswim.Weight, shade=True, color="r")
plt.legend(('Athletics Weight', 'Swimming Weight'))
plt.show()
# Perform the twosample ttest
t_result = stats.ttest_ind(subsetathl.Weight, subsetswim.Weight)
print(f'Test Statistic = {round(t_result[0],2)}, Pvalue = {round(t_result[1],2)}')
Change the seed value to 1234
and rerun the code
# Define random seed
seed = 1234
# Create two subsets, one for the athletics competitors and one for the swimmers
subsetathl = athletes[athletes.Sport == "Athletics"].dropna().sample(n=100, random_state = seed)
subsetswim = athletes[athletes.Sport == "Swimming"].dropna().sample(n=100, random_state = seed)
# Perform the twosample ttest
t_result = stats.ttest_ind(subsetathl.Weight, subsetswim.Weight)
print(f'Test Statistic = {round(t_result[0],2)}, Pvalue = {round(t_result[1],2)}')
Now we are familiar with randomizing samples and you can see how choosing different random samples can give you different results. Did you notice that our pvalue changed when a different random sample was chosen? Improving our approach to create a balanced dataset would help avoid this problem.
Confounding variable
Blocking = Solution to confounding
We're going to have another look at the same data but, this time, we'll use blocking to improve our approach. Like last time, we will be using a twosample ttest on athlete Weight within the DataFrame, athletes. This time, however, we will control for Sex as a blocking factor, sampling equally from male and female participants. We will need to extract a random subset of athletes from both events to run our test.
seed = 9000
# Create subset blocks
subsetathlm = athletes[(athletes.Sport == "Athletics") & (athletes.Sex == "M")].sample(n=100, random_state= seed)
subsetathlf = athletes[(athletes.Sport == "Athletics") & (athletes.Sex == "F")].sample(n=100, random_state= seed)
subsetswimm = athletes[(athletes.Sport == "Swimming") & (athletes.Sex == "M")].sample(n=100, random_state= seed)
subsetswimf = athletes[(athletes.Sport == "Swimming") & (athletes.Sex == "F")].sample(n=100, random_state= seed)
sns.boxplot(x="Sport", y="Weight", hue="Sex",
data=athletes);
# Combine blocks
subsetathl = pd.concat([subsetathlm, subsetathlf])
subsetswim = pd.concat([subsetswimm, subsetswimf])
# Perform the twosample ttest
t_result = stats.ttest_ind(subsetathl.Weight, subsetswim.Weight)
print(f'Test Statistic = {round(t_result[0],2)}, Pvalue = {round(t_result[1],2)}')
Now our ttest is significant, with a pvalue under 0.05
. This blocked design has resolved the issue we had. You can see how this type of blocking approach can be a useful way to improve your experimental design when a confounding variable is present.
A paired ttest is used to determine whether the mean difference between two sets of observations is zero. In a paired sample ttest, each subject or entity is measured twice, resulting in pairs of observations.
Here, we have a small DataFrame (podataframe)
containing information on 10
Fields. We are interested in potato yield in tons/hectare. For each Field, we have a value for its Yield2018
, before the application of a new fertilizer, and its Yield2019
, after the application of the new fertilizer. We will perform two ttests, a standard twosample test, and a paired ttest. A paired ttest will control for the variation between fields
.
yields2018 = [60.2, 12, 13.8, 91.8, 50, 27, 71.1, 41, 31, 55.9]
yields2019 = [63.2, 15.6, 14.8, 96.7, 53, 29.9, 73, 44.4, 38, 57.3]
podataframe = pd.DataFrame({'Yield2018': yields2018,
'Yield2019' : yields2019})
print(podataframe)
sns.kdeplot(yields2018, shade=True, color="b")
sns.kdeplot(yields2019, shade=True, color="r")
plt.legend(('yields2018', 'yields2019'))
plt.show()
# Perform independent ttest
ttestind = stats.ttest_ind(podataframe.Yield2018, podataframe.Yield2019)
print(f'Test Statistic = {round(ttestind[0],2)}, Pvalue = {round(ttestind[1],2)}')
# Perform paired ttest
ttestpair = stats.ttest_rel(podataframe.Yield2018, podataframe.Yield2019)
print(f'Test Statistic = {round(ttestpair[0],2)}, Pvalue = {round(ttestpair[1],2)}')
Recall, the paired test is more sensitive than the independent test and can pick up a difference that the independent test can't detect. This is because the difference within the samples each year (individual field effect) is quite large in comparison to the difference between the two years (effect of treatment). Paired tests are useful when a large variability exists.
Analysis of variance (ANOVA) is used to analyze the differences among group means in a sample. ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the ttest beyond two means.
ANOVA uses Ftests
to statistically test the equality of means. Fstatistics are based on the ratio of mean squares. It is simply an estimate of population variance that accounts for the degrees of freedom (DF) used to calculate that estimate.
Test types:
One Way ANOVA
Two Way ANOVA
Let's have another look at some data from our Olympic dataset. How does the Weight of athletes vary between teams from different countries? We going to use a oneway ANOVA to check for the presence of significant variation in Weight of Olympic athletes. A oneway ANOVA will allow us to see whether any differences between these groups of values are significant.
sns.boxplot(x="Team", y="Weight",
data=athletes[athletes['Team'].isin(['United States','China', 'France'])]);
# Create arrays
France_athletes = athletes[athletes.Team == 'France'].Weight
US_athletes = athletes[athletes.Team == 'United States'].Weight
China_athletes = athletes[athletes.Team == 'China'].Weight
# Perform oneway ANOVA
anova = stats.f_oneway(France_athletes, US_athletes, China_athletes)
print(f'Test Statistic = {round(anova[0],2)}, Pvalue = {round(anova[1],3)}')
According to this test, significant differences exist between the weights of athletes for each country.
Let's have another look at some data from our Olympic dataset. How does the Weight of athletes vary between teams from different countries and of different sexes? We are going to use a twoway ANOVA to check for the presence of significant variation in the Weight of Olympic sprinters. A twoway ANOVA will allow you to see which of these two factors, Sex and Team, have a significant effect on Weight.
sns.boxplot(x="Team", y="Weight", hue="Sex",
data=athletes[athletes['Team'].isin(['United States','China'])]);
import statsmodels.api as sm
# Create model
formula = 'Weight ~ Sex + Team'
model = sm.formula.ols(formula, data=athletes).fit()
# Perform ANOVA and print table
aov_table = sm.stats.anova_lm(model, typ=2)
print(aov_table)
According to the results of your ANOVA, Sex and Team has a significant effect.
Interaction effects represent the combined effects of factors on the dependent measure. When an interaction effect is present, the impact of one factor depends on the level of the other factor. Part of the power of ANOVA is the ability to estimate and test interaction effects. As Pedhazur and Schmelkin note, the idea that multiple effects should be studied in research rather than the isolated effects of single variables is one of the important contributions of Sir Ronald Fisher. When interaction effects are present, it means that interpretation of the main effects is incomplete or misleading. For example, the relationship between condiments and enjoyment probably depends on the type of food.
Once again, we are going to look at our dataset of Olympic athletes. As in previous exercises, we will be looking at the variation in athlete Weight. We are going to look at athletes of either Sex competing in one of two Events: the 100
meter and 10,000
meter run. An ANOVA will allow us to work out which of these variables affect Weight and whether an interactive effect is present.
df = athletes[athletes['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Men's 100 metres",
"Athletics Women's 10,000 metres",
"Athletics Women's 100 metres"])]
ax = sns.boxplot(data=df, x="Event", y="Weight", hue="Sex")
ax.set_xticklabels(['10,000m M','100m M','10,000m W','100m W']);
# Create model
formula = 'Weight ~ Sex + Event + Sex:Event'
model = sm.formula.ols(formula, data=athletes).fit()
# Perform ANOVA and print table
aov_table = sm.stats.anova_lm(model)
print(aov_table)
Look at the pvalues for the effects of Sex, Event, and Sex:Event. Using a standard alpha of 0.05
, Sex, Event, and Sex:Event have a significant effect on Weight. This means that both factors influence Weight, and the effect of one factor is dependent on the other.
# Create model
formula = 'Weight ~ Sex + Event + Age + Sex:Event + Sex:Age + Sex:Event:Age'
model = sm.formula.ols(formula, data=athletes).fit()
# Perform ANOVA and print table
aov_table = sm.stats.anova_lm(model)
pd.DataFrame(aov_table)
When we run a test:
Real effect present 
No real effect present 


Effect found (positive : alternative hypothesis) 
True Positive 
False Positive 
No effect found (negative: null hypothesis) 
False Negative 
True Negative 
Basis of tests:
Consider:
Picking a single result can be misleading. Account for multiple tests.
Correction methods: Choose method based on independence of tests.
Bonferroni correction: Better to use when tests are not independent from each other.
Î±
value by the number of analyses on the dependent variable.Å ÃdÃ¡k correction: Better to use when tests are independent from each other.
Î±
when the tests are independent from each other and all null hypotheses are true. For positivelycorrelated tests the procedure is conservative and it is liberal for negativelycorrelated tests.Improved nutrition and healthcare has lead to increased human heights in most societies over the past century. But is this trend also reflected amongst elite athletes? To examine this, we'll be looking at another slice from our Olympic dataset and performing multiple tests.
We will slice the athletes dataset containing information about the American male Olympic athletes from three years: 1924
, 1952
, and 2016
. We will perform twosample ttests to compare the three timepoints, seen in boxplots. Between which times do significant differences exist? We will be performing multiple nonindependent tests, so we will need to perform Bonferroni correction on the results.
df = athletes[athletes['Year'].isin(["1924",
"1952",
"2016"])]
sns.boxplot(data=df, x="Year", y="Height");
# Perform three twosample ttests
t_result_1924v2016 = stats.ttest_ind(athletes[athletes.Year == 1924].Height,
athletes[athletes.Year == 2016].Height)
t_result_1952v2016 = stats.ttest_ind(athletes[athletes.Year == 1952].Height,
athletes[athletes.Year == 2016].Height)
t_result_1924v1952 = stats.ttest_ind(athletes[athletes.Year == 1924].Height,
athletes[athletes.Year == 1952].Height)
# Create an array of pvalue results
pvals_array = [t_result_1924v2016[1],t_result_1952v2016[1],t_result_1924v1952[1]]
print('Pvalues:')
print(pd.DataFrame(pvals_array))
# Perform Bonferroni correction
adjustedvalues = sm.stats.multipletests(pvals_array, alpha=0.05, method='b')
print('\nCorrected Pvalues:')
print(pd.DataFrame(adjustedvalues[1]))
We're looking at how the Height of Olympic athletes from the athletes dataset has changed over time. In this exercise, we're considering three events, the 100 meter
, the High Jump
, and the Marathon
. We will be examining the correlation between Height
and Year
separately for each Event. As we did before, we will need to correct for multiple hypothesis tests, but, since these tests are independent, we can use the lessstrict Å ÃdÃ¡k correction.
fig, ax = plt.subplots(3,1, sharex=True, sharey=True, figsize=(8,6))
ax[0].scatter(x=athletes[athletes['Event'].isin(["Athletics Men's 100 metres",
"Athletics Women's 100 metres"])].Year,
y=athletes[athletes['Event'].isin(["Athletics Men's 100 metres",
"Athletics Women's 100 metres"])].Height,
color='red', s=10, alpha=0.5)
ax[1].scatter(x=athletes[athletes['Event'].isin(["Athletics Men's High Jump",
"Athletics Women's High Jump"])].Year,
y=athletes[athletes['Event'].isin(["Athletics Men's High Jump",
"Athletics Women's High Jump"])].Height,
color='lime', s=10, alpha=0.5)
ax[2].scatter(x=athletes[athletes['Event'].isin(["Athletics Men's Marathon",
"Athletics Women's Marathon"])].Year,
y=athletes[athletes['Event'].isin(["Athletics Men's Marathon",
"Athletics Women's Marathon"])].Height,
color='blue', s=10, alpha=0.5)
ax[2].set_xlabel('Date')
ax[0].set_title('100 Meter')
ax[1].set_title('High Jump')
ax[2].set_title('Marathon')
fig.text(0.001, 0.5, 'Height (meters)', va='center', rotation='vertical')
plt.tight_layout()
plt.show()
# Perform Pearson correlations
pearson100 = stats.pearsonr(athletes[athletes['Event'].isin(["Athletics Men's 100 metres",
"Athletics Women's 100 metres"])].Height,
athletes[athletes['Event'].isin(["Athletics Men's 100 metres",
"Athletics Women's 100 metres"])].Year)
pearsonHigh = stats.pearsonr(athletes[athletes['Event'].isin(["Athletics Men's High Jump",
"Athletics Women's High Jump"])].Height,
athletes[athletes['Event'].isin(["Athletics Men's High Jump",
"Athletics Women's High Jump"])].Year)
pearsonMara = stats.pearsonr(athletes[athletes['Event'].isin(["Athletics Men's Marathon",
"Athletics Women's Marathon"])].Height,
athletes[athletes['Event'].isin(["Athletics Men's Marathon",
"Athletics Women's Marathon"])].Year)
# Create array of pvalues
pvals_array = [pearson100[1],pearsonHigh[1],pearsonMara[1]]
print('Pvalues:')
print(pd.DataFrame(pvals_array))
# Perform Å ÃdÃ¡k correction
adjustedvalues= sm.stats.multipletests(pvals_array, alpha=0.05, method='s')
print('\nCorrected Pvalues:')
print(pd.DataFrame(adjustedvalues[1]))
The power of a binary hypothesis test is the probability that the test rejects the null hypothesis (H0)
when a specific alternative hypothesis (H1)
is true. The statistical power ranges from 0
to 1
, and as statistical power increases, the probability of making a type II error (failing to find difference that does exist) decreases. For a type II error probability of Î²
, the corresponding statistical power is 1 âˆ’ Î²
. For example, if experiment 1
has a statistical power of 0.7
, and experiment 2
has a statistical power of 0.95
, then there is a stronger probability that experiment 1
had a type II error than experiment 2
, and experiment 2
is more reliable than experiment 1
due to the reduction in probability of a type II error. It can be equivalently thought of as the probability of accepting the alternative hypothesis (H1)
when it is trueâ€”that is, the ability of a test to detect a specific effect, if that specific effect actually exists. That is,
Power Analysis:
alpha
. Often set to 5%
or0.05
.Measures of effect size:
For a given experiment with Significance level (alpha): 5% or 0.05
, Effect Size: of at least 0.80
, Statistical Power: 80% or 0.80
. We may be interested in estimating a suitable sample size. That is, how many observations are required from each sample in order to at least detect an effect of 0.80
with an 80%
chance of detecting the effect if it is true (20%
of a Type II error) and a 5%
chance of detecting an effect if there is no such effect (Type I error).
The function solve_power()
can be used to calculate one of the four parameters in a power analysis. In our case, we are interested in calculating the sample size. We can use the function by providing the three pieces of information we know (alpha, effect, and power)
and setting the size of argument we wish to calculate the answer of (nobs1)
to None
. This tells the function what to calculate.
The function has an argument called ratio that is the ratio of the number of samples in one sample to the other. If both samples are expected to have the same number of observations, then the ratio is 1.0
. If, for example, the second sample is expected to have half as many observations, then the ratio would be 0.5
.
Now we'll explore the effect of sample size on the results of statistical tests. Here, we'll be comparing the Weight of American Athletics and Swimming competitors in the athletes dataset. The boxplot show the difference between these two groups.
sns.boxplot(x="Sport", y="Weight", data=athletes);
Using a defined seed and varying sample sizes, we will perform ttests comparing the Weight of samples from both Sports.
# Create a random subset of 1000 samples, then use this to do a ttest to compare Weight between Sports.
subset = athletes.sample(n=1000, random_state= 1007)
ttestind = stats.ttest_ind(subset[subset.Sport == "Athletics"].Weight, subset[subset.Sport == "Swimming"].Weight)
print(f'Test Statistic = {round(ttestind[0],2)}, Pvalue = {round(ttestind[1],2)}')
# Change the sample size to 200, repeat the ttest to compare Weight between Sports.
subset = athletes.sample(n=200, random_state= 1007)
ttestind = stats.ttest_ind(subset[subset.Sport == "Athletics"].Weight, subset[subset.Sport == "Swimming"].Weight)
print(f'Test Statistic = {round(ttestind[0],2)}, Pvalue = {round(ttestind[1],2)}')
# Change the sample size to 50, repeat the ttest to compare Weight between Sports.
subset = athletes.sample(n=50, random_state= 1007)
ttestind = stats.ttest_ind(subset[subset.Sport == "Athletics"].Weight, subset[subset.Sport == "Swimming"].Weight)
print(f'Test Statistic = {round(ttestind[0],2)}, Pvalue = {round(ttestind[1],2)}')
Look how smaller samples affected your pvalue. With sample sizes of 1000
, 200
and 50
, we got pvalues of 0.07
, 0.0
and 0.57
, respectively. The smallest sample size didn't give a significant result. Now you see how much difference sample size can make.
Now that we've seen the importance of sample size, let's have another look at the same athletes dataset and see if we can determine the sample size we would need to get a significant result. We will determine the sample size we would need to have an 80%
chance of detecting a small (0.4)
difference between these two samples.
from statsmodels.stats import power as pwr
# Set parameters
effect = 0.4
power = 0.8
alpha = 0.05
# Calculate ratio
swimmercount = float(len(athletes[athletes.Sport == "Swimming"].index))
athletecount = float(len(athletes[athletes.Sport == "Athletics"].index))
ratio = swimmercount/athletecount
# Initialize analysis and calculate sample size
analysis = pwr.TTestIndPower()
ssresult = analysis.solve_power(effect_size=effect, power=power, alpha=alpha, nobs1=None, ratio=ratio)
print(f'Sample Size: {round(ssresult,3)}')
Now, we're going to have a look at effect sizes. We are examining the same slice from the atheletes dataset, comparing the body weights of competitors from swimming and athletics events. We are going to calculate the effect sizes we are able to detect for a couple of different sampling strategies. First, we will determine the smallest effect size detectable using the complete dataset. Then, in Step 2
, we will determine the size effect we could detect using 300
samples from each Sport.
# Set parameters for entire dataset
alpha = 0.05
power = 0.8
ratio = float(len(athletes[athletes.Sport == "Swimming"])) / len(athletes[athletes.Sport == "Athletics"])
samp_size = len(athletes[athletes.Sport == "Athletics"])
# Initialize analysis & calculate effect size
analysis = pwr.TTestIndPower()
esresult = analysis.solve_power(effect_size = None,
power = power,
nobs1 = samp_size,
ratio = ratio,
alpha = alpha)
print(f'Minimum detectable effect size: {round(esresult,2)}')
# Set parameters for the sample size found from the power calculation
alpha = 0.05
power = 0.8
ratio = float(len(athletes[athletes.Sport == "Swimming"])) / len(athletes[athletes.Sport == "Athletics"])
samp_size = round(ssresult,0)
# Initialize analysis & calculate sample size
analysis = pwr.TTestIndPower()
esresult = analysis.solve_power(effect_size=None, power=power, nobs1=samp_size, ratio=ratio, alpha=alpha)
print(f'Minimum detectable effect size: {round(esresult,2)}')
Cohen's d is an effect size used to indicate the standardised difference between two means. It can be used, for example, to accompany reporting of ttest and ANOVA results.For the independent samples Ttest, Cohen's d is determined by calculating the mean difference between your two groups, and then dividing the result by the pooled standard deviation.
Now, using the same comparison of Weight difference between Sports, let's calculate the actual effect size for this comparison. We will calculate Cohen's d for the difference in Weight between each Sport.
# sebset DataFrame with sample size found from the power calculation
subset = athletes.sample(n=int(round(ssresult,0)), random_state=2020)
# Create series
athl = athletes[athletes.Sport == "Athletics"].Weight
swim = athletes[athletes.Sport == "Swimming"].Weight
# Calculate difference between means and pooled standard deviation
diff = swim.mean()  athl.mean()
pooledstdev = np.sqrt((athl.std()**2 + swim.std()**2)/2)
# Calculate Cohen's d
cohend = diff / pooledstdev
print(f"Cohen's d: {round(cohend,3)}")
This is a moderate sized effect size. As the effect size is bigger than the minimum effect size calculated in the last exercise, you would expect that the parameters from the last exercise would allow you to detect a difference similar to this output.
In this exercise, we will use the athletes dataset to examine whether American athletes are more successful in athletics or in swimming events. We will make use of the MedalTF column, which gives a True or False value for whether each athlete won any type of medal. First, we will crosstabulate MedalTF with Sport. Then perform a Fisher exact test. Finally, we will examine the significance and effect size of this result.
# Create a table of crosstabulations
table = pd.crosstab(athletes.MedalTF,athletes.Sport)
print(table)
# Perform the Fisher exact test
chi = stats.fisher_exact(table, alternative='twosided')
# Print pvalue
print(f"\npvalue: {round(chi[1], 5)}")
# Print odds ratio
print(f"\nOdds ratio between groups: {round(chi[0],1)}")
Now we're going to look at effect sizes for Pearson correlations. In a previous exercise, we used Pearson correlation to examine the link between Weight and Height for athletes from our Olympic dataset. Now, we're going to focus a single event, the 10,000
meter run, and compare the results we obtain for Pearson tests of correlation for competitors from two Teams, Kenya (ken DataFrame) and Ethiopia (eth DataFrame). Which has a higher effect strength?
ken = athletes[athletes['Team'] == "Kenya"]
eth = athletes[athletes['Team'] == "Ethiopia"]
fig, ax = plt.subplots(2,1, sharex=True, sharey=True, figsize=(8,6))
ax[0].scatter(x=ken[ken['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Height,
y=ken[ken['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Weight,
color='red', s=25)
ax[1].scatter(x=eth[eth['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Height,
y=eth[eth['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Weight,
color='blue', s=25)
ax[1].set_xlabel('Height')
ax[0].set_title('Kenya 10,000 metres')
ax[1].set_title('Ethiopia 10,000 metres')
fig.text(0.001, 0.5, 'Weight', va='center', rotation='vertical')
plt.tight_layout()
plt.show()
# Perform Pearson correlation Kenya
pearsonken = stats.pearsonr(ken[ken['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Weight,
ken[ken['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Height)
print(f'Kenya Correlation: {round(pearsonken[0],3)}, Pvalue: {round(pearsonken[1],3)}')
# Perform Pearson correlation Ethiopia
pearsoneth = stats.pearsonr(eth[eth['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Weight,
eth[eth['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Height)
print(f'Ethiopia Correlation: {round(pearsoneth[0],3)}, Pvalue: {round(pearsoneth[1],3)}')
The Kenyan dataset gives a higher value than that obtained for the Ethiopian samples. This makes sense, as the scatterplot for the former looks much less noisy.
Now we're going to have another look at the same example, drawn from our Olympic dataset (Weight difference between Sports). Knowing the effect size of the difference between these two groups, we will work out what the odds are that our ttest will pick up this difference.
# Set parameters
effect_size = esresult
alpha = 0.05
samp_size = round(ssresult,0)
ratio = 1
# Initialize analysis & calculate power
analysis = pwr.TTestIndPower()
pwresult = analysis.solve_power(effect_size=effect_size, power=None, alpha=alpha, nobs1=samp_size, ratio=ratio)
print(f'Power: {round(pwresult,3)}')
Power curves are line plots that show how the change in variables, such as effect size and sample size, impact the power of the statistical test.
The plot_power()
function can be used to create power curves. The dependent variable (xaxis) must be specified by name in the â€˜dep_varâ€˜ argument. Arrays of values can then be specified for the sample size (nobs), effect size (effect_size), and significance (alpha) parameters. One or multiple curves will then be plotted showing the impact on statistical power.
# parameters for power analysis
effect_sizes = np.array([0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.95])
sample_sizes = np.array(range(2, 151))
sample_sizes0 = np.array([10, 20, 30, 40, 50, 60, 70, 100])
# calculate power curves from multiple power analyses
analysis = pwr.TTestIndPower()
fig, ax = plt.subplots(2,1, figsize=(10,7))
analysis.plot_power(dep_var='nobs', nobs=sample_sizes, effect_size=effect_sizes, alpha=0.05, ax=ax[0])
analysis.plot_power(dep_var='es', nobs=sample_sizes0, effect_size=effect_sizes, alpha=0.05, ax=ax[1])
ax[1].set_title('')
ax[0].legend(bbox_to_anchor=(1, 1.05))
ax[1].legend(bbox_to_anchor=(1, 1.05))
ax[0].margins(0.015)
ax[1].margins(0.015)
fig.text(0.001, 0.5, 'Power', va='center', rotation='vertical')
plt.tight_layout()
plt.show()
ShapiroWilk test:
The ShapiroWilk test is a way to tell if a set of observations comes from a normal distribution. The test gives you a $W$ value; small values indicate your sample is not normally distributed (you can reject the null hypothesis that your population is normally distributed if your values are under a certain threshold).
pd.set_option('display.max_columns', None)
countrydata = pd.read_csv('https://assets.datacamp.com/production/repositories/4371/datasets/f5c1016b818f97ec200236fb161ae711944fb2cb/undata_country_profile_variables.csv')
countrydata.info()
countrydata['Unemployment (% of labour force)'].replace('...', np.nan, inplace=True)
countrydata['Unemployment (% of labour force)'] = countrydata['Unemployment (% of labour force)'].astype(float)
Now, we will look at at countrylevel Unemployment and GDP per capita (GDP_per_cap) data. We will use a ShapiroWilk test to examine whether the distribution of values seen in these samples, as seen in the QQ plots below, departs significantly from the normal distribution. This test tells us how closely a given sample fits the patterns expected from a normal distribution.
from scipy.stats import probplot
import warnings
warnings.filterwarnings('ignore')
fig, ax = plt.subplots(2,2, figsize=(10,7))
sns.kdeplot(countrydata['Unemployment (% of labour force)'].dropna(), ax=ax[0][0], color='red', shade=True, legend=False)
sns.kdeplot(countrydata['GDP per capita (current US$)'].dropna(), ax=ax[0][1], color='blue', shade=True, legend=False)
probplot(countrydata['Unemployment (% of labour force)'].dropna(), plot=ax[1][0])
probplot(countrydata['GDP per capita (current US$)'].dropna(), plot=ax[1][1])
ax[0][0].set_title('Unemployment (% of labour force)')
ax[0][1].set_title('GDP per capita (current US$)')
ax[1][0].set_title('')
ax[1][1].set_title('')
plt.tight_layout()
plt.show()
# Perform ShapiroWilk test on Unemployment and print result
shapiroUnem = stats.shapiro(countrydata['Unemployment (% of labour force)'].dropna())
print(f'W: {round(shapiroUnem[0],3)}, Pvalue: {round(shapiroUnem[1],3)}')
# Perform ShapiroWilk test on GDP_per_cap and print result
shapiroGDP = stats.shapiro(countrydata['GDP per capita (current US$)'].dropna())
print(f'W: {round(shapiroGDP[0],3)}, Pvalue: {round(shapiroGDP[1],3)}')
Both distributions are nonnormal.
Tests whether a data sample has a Gaussian distribution. The test is based on transformations of the sample kurtosis and skewness, and has power only against the alternatives that the distribution is skewed and/or kurtic.
Interpretation:
# Perform D'Agostino's K^2 test on Unemployment and print result
k2_Unem, k2_Unem_p = stats.normaltest(countrydata['Unemployment (% of labour force)'].dropna())
print('stat = %.3f, p_val = %.3f' % (k2_Unem, k2_Unem_p))
if k2_Unem_p >= 0.05:
print('\nProbably Gaussian')
else:
print('\nProbably not Gaussian')
# Perform D'Agostino's K^2 test on GDP_per_cap and print result
k2_GDP, k2_GDP_p = stats.normaltest(countrydata['GDP per capita (current US$)'].dropna())
print('stat = %.3f, p_val = %.3f' % (k2_GDP, k2_GDP_p))
if k2_GDP_p >= 0.05:
print('\nProbably Gaussian')
else:
print('\nProbably not Gaussian')
The AndersonDarling Goodness of Fit Test (ADTest) is a measure of how well your data fits a specified distribution. Itâ€™s commonly used as a test for normality.
Assumptions:
# Example of the AndersonDarling Normality Test
data = countrydata['GDP per capita (current US$)'].dropna()
result = stats.anderson(data)
print('stat=%.3f' % (result.statistic))
for i in range(len(result.critical_values)):
sl, cv = result.significance_level[i], result.critical_values[i]
if result.statistic < cv:
print('\nProbably Gaussian at the %.1f%% level' % (sl))
else:
print('\nProbably not Gaussian at the %.1f%% level' % (sl))
# Example of the AndersonDarling Normality Test
data = countrydata['Unemployment (% of labour force)'].dropna()
result = stats.anderson(data)
print('stat=%.3f' % (result.statistic))
for i in range(len(result.critical_values)):
sl, cv = result.significance_level[i], result.critical_values[i]
if result.statistic < cv:
print('\nProbably Gaussian at the %.1f%% level' % (sl))
else:
print('\nProbably not Gaussian at the %.1f%% level' % (sl))
A parametric statistical test makes an assumption about the population parameters and the distributions that the data came from. These types of test includes Studentâ€™s T tests and ANOVA tests, which assume data is from a normal distribution.
The opposite is a nonparametric test, which doesnâ€™t assume anything about the population parameters. Nonparametric tests include chisquare, Fisherâ€™s exact test and the MannWhitney test.
Every parametric test has a nonparametric equivalent. For example, if you have parametric data from two independent groups, you can run a 2 sample t test to compare means. If you have nonparametric data, you can run a Wilcoxon ranksum test to compare means.
Wilcoxon ranksum (Mannâ€“Whitney U test):
The Wilcoxon ranksum test is a nonparametric test of the null hypothesis that it is equally likely that a randomly selected value from one population will be less than or greater than a randomly selected value from a second population. This test can be used to investigate whether two independent samples were selected from populations having the same distribution.
Wilcoxon signedrank:
The Wilcoxon signedrank test is a nonparametric statistical hypothesis test used to compare two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ (i.e. it is a paired difference test). It can be used as an alternative to the paired Student's ttest when the distribution of the differences between the two samples cannot be assumed to be normally distributed. A Wilcoxon signedrank test is a nonparametric test that can be used to determine whether two dependent samples were selected from populations having the same distribution.
Here we have the sex ratio of European Union and the United States (US_Sex_ratio) with the sex ratio of Asian countries (EU_Sex_ratio).
gender = pd.read_csv('sexratioatbirth.csv')
gender.columns = ['Country', 'Code', 'Year', 'Sex_ratio']
gender.head()
US_Sex_ratio = gender[gender.Country == "United States"].Sex_ratio
EU_Sex_ratio = gender[gender.Country == "European Union"].Sex_ratio
sns.kdeplot(US_Sex_ratio, shade=True, color='blue')
sns.kdeplot(EU_Sex_ratio, shade=True, color='red')
plt.legend(('United States','European Union'))
plt.show()
Given that these samples are not normally distributed, a Wilcoxon ranksum test would be appropriate. Does the sex ratio differ between countries?
# Perform the twosample ttest
t_result = stats.ttest_ind(US_Sex_ratio, EU_Sex_ratio)
# Print ttest result
print(f'Test Statistic = {t_result[0]}, Pvalue = {t_result[1]}')
# Perform Wilcoxon ranksum test
wilc = stats.ranksums(US_Sex_ratio, EU_Sex_ratio)
# Print Wilcoxon ranksum testresult
print(f'\nTest Statistic = {wilc[0]}, Pvalue = {wilc[1]}')
We are interested in potato yield in tons/hectare, as seen in previous exercises. For each Field, we have a value for Yield2018, before the application of a new fertilizer, and Yield2019, after the application of the new fertilizer. Does the yield differ significantly between years?
yields2018 = [60.2, 12.0, 13.8, 91.8, 50.0, 45.0, 32.0, 87.5, 60.1, 88.0, 11.3, 10.7, 12.7, 15.9, 16.0]
yields2019 = [63.2, 15.6, 14.8, 96.7, 53.0, 47.0, 31.3, 89.8, 67.8, 90.0, 14.2, 12.0, 13.9, 17.8, 20.0]
podataframe = pd.DataFrame({'Yield2018': yields2018,
'Yield2019' : yields2019})
print(podataframe)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12,5))
sns.boxplot(data=podataframe, ax=ax[0])
sns.kdeplot(podataframe.Yield2018, shade=True, color='blue', ax=ax[1])
sns.kdeplot(podataframe.Yield2019, shade=True, color='red', ax=ax[1])
ax[1].legend(('Yield2018', 'Yield2019'))
ax[1].set_xlabel('Yeild')
plt.tight_layout()
plt.show()
# Perform the twosample ttest
t_result = stats.ttest_ind(podataframe.Yield2018, podataframe.Yield2019)
print(f'TTest Test Statistic = {t_result[0]}, Pvalue = {t_result[1]}')
# Perform ShapiroWilk test on Yield2018 and print result
shapiroYield2018 = stats.shapiro(podataframe.Yield2018)
print(f'\nYield2018 W: {round(shapiroYield2018[0],3)}, Pvalue: {round(shapiroYield2018[1],3)}')
# Perform ShapiroWilk test on Yield2019 and print result
shapiroYield2019 = stats.shapiro(podataframe.Yield2019)
print(f'\nYield2019 W: {round(shapiroYield2019[0],3)}, Pvalue: {round(shapiroYield2019[1],3)}')
# Perform Wilcoxon SignedRank test
wilcsr = stats.wilcoxon(podataframe.Yield2018, podataframe.Yield2019)
print(f'\nWilcoxon SignedRank Test Statistic = {wilcsr[0]}, Pvalue = {wilcsr[1]}')
Note that both of your ShapiroWilks tests gave significant results, indicating nonnormal distributions. The Wilcoxon test gives a higher pvalue for the same data, due to its lower sensitivity. However, due to the low sample sizes and nonnormal distributions, the Wilcoxon test is a better choice of test for this dataset.
Once again we'll be using the Olympic dataset. Here we're going to compare the Height of athletes from both the Norwegian and Chinese Teams. Do they differ?
# Separate the heights by country
NorwayHeights = athletes[athletes['Team'] == "Norway"].Height
ChinaHeights = athletes[athletes['Team'] == "China"].Height
sns.kdeplot(NorwayHeights, shade=True, color='blue')
sns.kdeplot(ChinaHeights, shade=True, color='red')
plt.legend(('Norway', 'China'))
plt.xlabel('Height')
plt.show()