Data
How do we get answers?
Approach
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid", color_codes=True)
People = pd.read_csv('weight-height.csv')
People.info()
People.head()
Data contains patterns
Dealing with this
Null hypothesis (H0): A = B
Alternative hypothesis (H1): A != B
p-value: we can never be 100% sure that a pattern couldn't have emerged due to random chance, but we can quantify the probability that random chance would produce a given pattern. The smaller the p-value, the less likely it is that the null hypothesis can account for our observations.
alpha: a chosen significance level, commonly 0.05. If the p-value is below alpha, we reject the null hypothesis.

The t-test tells you how significant the differences between groups are; in other words, it lets you know whether those differences (measured in means or averages) could have happened by random chance.
Two basic types:
One-sample: is the mean of a population different from a given value?
Two-sample: are two population means equal?

We want to compare the mean Height of the Sample_A group with a given value. We want to see whether the mean height of the people in this sample is significantly different from the chosen cut-off point of 65. We will use a one-sample t-test, which allows us to compare the mean of a sample with a chosen value. We will perform this test on the sample provided versus the cut-off value of 65, and test its significance by comparing the p-value to alpha.
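For reference, the one-sample t statistic compares the sample mean against the hypothesized value: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, where $\bar{x}$ is the sample mean, $\mu_0$ the value being tested against, $s$ the sample standard deviation, and $n$ the sample size.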
# Get Sample
Sample_A = People.Height.values
sns.kdeplot(Sample_A, shade=True, color="b")
plt.legend(['Height'])
plt.show()
from scipy import stats
# Perform t-test and print result
t_result = stats.ttest_1samp(Sample_A, 65)
# Test significance
alpha = 0.05
if (t_result[1] < alpha):
    print(f"mean value of Sample A differs from given value \np-value: {t_result[1]}")
else:
    print(f"No significant difference found \np-value: {t_result[1]}")
Now we'll compare two sets of samples. Again we'll be looking at weight, but this time we'll compare the weight of males with the weight of females. Does the mean weight differ between the two genders? A two-sample t-test can tell us whether the means of two samples differ significantly.
# Get Sample
Sample_A = People[People['Gender']=='Male']['Weight']
Sample_B = People[People['Gender']=='Female']['Weight']
sns.kdeplot(Sample_A, shade=True, color="b")
sns.kdeplot(Sample_B, shade=True, color="r")
plt.legend(('Male Weight', 'Female Weight'))
plt.show()
t_result = stats.ttest_ind(Sample_A, Sample_B)
alpha = 0.05
if (t_result[1] < alpha):
    print(f"mean value of Sample A differs from Sample_B \np-value: {t_result[1]}")
else:
    print(f"No significant difference found \np-value: {t_result[1]}")
A chi-square goodness-of-fit test determines whether sample data match an expected (population) distribution.
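The goodness-of-fit version isn't used elsewhere in this notebook, but as a quick, made-up sketch using SciPy's chisquare: do 120 die rolls look consistent with a fair die?
# Hypothetical counts from 120 rolls of a die versus a fair-die expectation
observed = [25, 18, 15, 22, 19, 21]
expected = [20, 20, 20, 20, 20, 20]
gof_stat, gof_p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f'stat = {round(gof_stat, 3)}, p-value = {round(gof_p, 3)}')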
A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests to see whether distributions of categorical variables differ from each another.
A small chi-square test statistic means that your observed data fit the expected data well (the data are consistent with the null hypothesis). A large chi-square test statistic means that the observed data do not fit the expected data well (the null hypothesis is likely to be rejected).
# Categorical Dataset
df = pd.read_csv('bank-full.csv', sep=';')
df.head()
The table was called a contingency table by Karl Pearson because the intent is to help determine whether one variable is contingent upon, or depends upon, the other variable. For example, does Education (primary, secondary, tertiary, or unknown) depend on Job, or are they independent?
We can summarize the collected observations in a table with one variable corresponding to columns and another variable corresponding to rows. Each cell in the table corresponds to the count or frequency of observations that correspond to the row and column categories.
Historically, a table summarization of two categorical variables in this form is called a contingency table.
For example, the table with Job as rows and Education as columns, built from our bank dataset, looks as follows:
table = pd.crosstab(df.job, df.education)
table
The Chi-Squared test is a statistical hypothesis test that assumes (the null hypothesis) that the observed frequencies for a categorical variable match the expected frequencies for the categorical variable. The test calculates a statistic that has a chi-squared distribution, named for the Greek capital letter Chi (X)
pronounced ki
as in kite.
Given the Job/Education example above, the number of observations for a category may or may not be the same. Nevertheless, we can calculate the expected frequency of observations in each Education group and see whether the partitioning of Education by Job results in similar or different frequencies.
The Chi-Squared test does this for a contingency table, first calculating the expected frequencies for the groups, then determining whether the division of the groups, called the observed frequencies, matches the expected frequencies.
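To make the expected frequencies concrete, here is a minimal sketch of how they can be derived from the row and column totals of the crosstab built above (this mirrors what chi2_contingency computes internally):
# Expected count for each cell = (row total * column total) / grand total
row_totals = table.sum(axis=1).values
col_totals = table.sum(axis=0).values
grand_total = table.values.sum()
expected_manual = np.outer(row_totals, col_totals) / grand_total
print(expected_manual.round(1))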
The result of the test is a test statistic that has a chi-squared distribution and can be interpreted to reject or fail to reject the assumption or null hypothesis that the observed and expected frequencies are the same.
We can interpret the test statistic
in the context of the chi-squared distribution with the requisite number of degrees of freedom as follows:
If Statistic >= Critical Value: significant result, reject the null hypothesis (H0), dependent.
If Statistic < Critical Value: not a significant result, fail to reject the null hypothesis (H0), independent.

Degrees of Freedom: in general, subtract 1 from the number of items. Let's say you were finding the mean weight loss for a low-carb diet. You could use 4 people, giving 3 degrees of freedom (4 - 1 = 3), or you could use 100 people with df = 99.
The degrees of freedom
for the chi-squared distribution is calculated based on the size of the contingency table as:
degrees of freedom = (rows - 1) * (cols - 1)
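Applied to the Job/Education crosstab built above:
# Degrees of freedom from the table's shape
rows, cols = table.shape
print(f'degrees of freedom = ({rows} - 1) * ({cols} - 1) = {(rows - 1) * (cols - 1)}')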
In terms of a p-value and a chosen significance level (alpha), the test can be interpreted as follows:
If p-value <= alpha: significant result, reject the null hypothesis (H0), dependent.
If p-value > alpha: not a significant result, fail to reject the null hypothesis (H0), independent.

For the test to be effective, at least five observations are required in each cell of the contingency table.
The Pearson’s chi-squared test for independence can be calculated in Python using the chi2_contingency()
SciPy
function.
The function takes an array as input representing the contingency table for the categorical variables. It returns the calculated statistic and p-value for interpretation as well as the calculated degrees of freedom and table of expected frequencies.
# chi-squared test with similar proportions
from scipy.stats import chi2_contingency
from scipy.stats import chi2
stat, p_val, dof, expected = chi2_contingency(table)
print(table)
print(f'\ndof = {dof}')
We can interpret the statistic by retrieving the critical value from the chi-squared distribution for the probability and number of degrees of freedom.
For example, a probability of 95%
can be used, suggesting that the finding of the test is quite likely given the assumption of the test that the variable is independent. If the statistic is less than or equal to the critical value, we can fail to reject this assumption (independent), otherwise it can be rejected (dependent).
# interpret test-statistic
prob = 0.95
critical = chi2.ppf(prob, dof)
print(f'Probability = {prob}, Critical Value = {round(critical,3)}, Test Statistic = {round(stat,3)}')
if abs(stat) >= critical:
    print('\nDependent (reject H0)')
else:
    print('\nIndependent (fail to reject H0)')
We can also interpret the p-value by comparing it to a chosen significance level, which would be 5%
, calculated by inverting the 95%
probability used in the critical value interpretation.
# interpret p-value
alpha = 1.0 - prob
print(f'Significance Level = {round(alpha,3)}, P-Value = {round(p_val,3)}')
if p_val <= alpha:
    print('\nDependent (reject H0)')
else:
    print('\nIndependent (fail to reject H0)')
Fisher's exact test is a statistical significance test used in the analysis of contingency tables. Although in practice it is employed when sample sizes are small, it is valid for all sample sizes. Use the Fisher's exact test of independence when you have two nominal (categorical) variables and want to find out if proportions for one nominal variable are different among values of the other nominal variable.
Now, we will work with the Olympics dataset to look at the relative success of the American swimming and athletics teams. Whether each athlete received a medal is coded as True or False
in the MedalTF
column of athletes. Do a larger proportion of swimming or athletics participants come home with medals? A Fisher exact test is a useful way to compare proportions of samples falling into discrete categories. To test this, we will need to perform a Fisher exact test on MedalTF in relation to Sport.
athletes = pd.read_csv('https://assets.datacamp.com/production/repositories/4371/datasets/8fd0a14bfbc5f13719d92334eaf77b23f2e914d6/olyathswim.csv')
athletes.head()
athletes['Medal'].unique()
b = {'Bronze': True, 'Silver': True, 'Gold': True}
athletes['MedalTF'] = athletes['Medal'].map(b)
athletes['MedalTF'] = athletes['MedalTF'].fillna(False)
athletes['Medal'] = athletes['Medal'].fillna('None')
athletes = athletes.dropna()
athletes['MedalTF'].unique()
# Create a table of cross-tabulations
table = pd.crosstab(athletes.MedalTF, athletes.Sport)
print(table)
# Perform the Fisher exact test
fisher = stats.fisher_exact(table, alternative='two-sided')
# Is the result significant?
alpha = 0.05
if fisher[1] <= alpha:
    print(f"\nProportions of medal winners differ significantly \np-value: {fisher[1]}")
else:
    print(f"\nNo significant difference in proportions of medal winners found \np-value: {fisher[1]}")
According to these results, a higher proportion of swimmers come home with medals than athletics competitors.
In this exercise, we're going to look at random sampling. For the purposes of this exercise, we are interested in differences between the body Weight of competitors in swimming and athletics. In order to test this, we will be using a two-sample t-test. However, we will be performing this test on a random sample of the data. By playing with the random subset chosen, we will see how randomness affects the results.
# Define random seed
seed = 420
# Create two subsets, one for the athletics competitors and one for the swimmers
subsetathl = athletes[athletes.Sport == "Athletics"].dropna().sample(n=100, random_state = seed)
subsetswim = athletes[athletes.Sport == "Swimming"].dropna().sample(n=100, random_state = seed)
sns.kdeplot(subsetathl.Weight, shade=True, color="b")
sns.kdeplot(subsetswim.Weight, shade=True, color="r")
plt.legend(('Athletics Weight', 'Swimming Weight'))
plt.show()
# Perform the two-sample t-test
t_result = stats.ttest_ind(subsetathl.Weight, subsetswim.Weight)
print(f'Test Statistic = {round(t_result[0],2)}, P-value = {round(t_result[1],2)}')
Change the seed value to 1234 and re-run the code.
# Define random seed
seed = 1234
# Create two subsets, one for the athletics competitors and one for the swimmers
subsetathl = athletes[athletes.Sport == "Athletics"].dropna().sample(n=100, random_state = seed)
subsetswim = athletes[athletes.Sport == "Swimming"].dropna().sample(n=100, random_state = seed)
# Perform the two-sample t-test
t_result = stats.ttest_ind(subsetathl.Weight, subsetswim.Weight)
print(f'Test Statistic = {round(t_result[0],2)}, P-value = {round(t_result[1],2)}')
Now that we are familiar with randomized sampling, you can see how choosing different random samples can give you different results. Did you notice that our p-value changed when a different random sample was chosen? Improving our approach to create a balanced dataset would help avoid this problem.
Confounding variable
Blocking = Solution to confounding
We're going to have another look at the same data but, this time, we'll use blocking to improve our approach. Like last time, we will be using a two-sample t-test on athlete Weight within the DataFrame, athletes. This time, however, we will control for Sex as a blocking factor, sampling equally from male and female participants. We will need to extract a random subset of athletes from both events to run our test.
seed = 9000
# Create subset blocks
subsetathlm = athletes[(athletes.Sport == "Athletics") & (athletes.Sex == "M")].sample(n=100, random_state= seed)
subsetathlf = athletes[(athletes.Sport == "Athletics") & (athletes.Sex == "F")].sample(n=100, random_state= seed)
subsetswimm = athletes[(athletes.Sport == "Swimming") & (athletes.Sex == "M")].sample(n=100, random_state= seed)
subsetswimf = athletes[(athletes.Sport == "Swimming") & (athletes.Sex == "F")].sample(n=100, random_state= seed)
sns.boxplot(x="Sport", y="Weight", hue="Sex",
data=athletes);
# Combine blocks
subsetathl = pd.concat([subsetathlm, subsetathlf])
subsetswim = pd.concat([subsetswimm, subsetswimf])
# Perform the two-sample t-test
t_result = stats.ttest_ind(subsetathl.Weight, subsetswim.Weight)
print(f'Test Statistic = {round(t_result[0],2)}, P-value = {round(t_result[1],2)}')
Now our t-test is significant, with a p-value under 0.05
. This blocked design has resolved the issue we had. You can see how this type of blocking approach can be a useful way to improve your experimental design when a confounding variable is present.
A paired t-test is used to determine whether the mean difference between two sets of observations is zero. In a paired sample t-test, each subject or entity is measured twice, resulting in pairs of observations.
Here, we have a small DataFrame (podataframe)
containing information on 10
Fields. We are interested in potato yield in tons/hectare. For each Field, we have a value for its Yield2018
, before the application of a new fertilizer, and its Yield2019
, after the application of the new fertilizer. We will perform two t-tests, a standard two-sample test, and a paired t-test. A paired t-test will control for the variation between fields
.
yields2018 = [60.2, 12, 13.8, 91.8, 50, 27, 71.1, 41, 31, 55.9]
yields2019 = [63.2, 15.6, 14.8, 96.7, 53, 29.9, 73, 44.4, 38, 57.3]
podataframe = pd.DataFrame({'Yield2018': yields2018,
'Yield2019' : yields2019})
print(podataframe)
sns.kdeplot(yields2018, shade=True, color="b")
sns.kdeplot(yields2019, shade=True, color="r")
plt.legend(('yields2018', 'yields2019'))
plt.show()
# Perform independent t-test
ttestind = stats.ttest_ind(podataframe.Yield2018, podataframe.Yield2019)
print(f'Test Statistic = {round(ttestind[0],2)}, P-value = {round(ttestind[1],2)}')
# Perform paired t-test
ttestpair = stats.ttest_rel(podataframe.Yield2018, podataframe.Yield2019)
print(f'Test Statistic = {round(ttestpair[0],2)}, P-value = {round(ttestpair[1],2)}')
Recall that the paired test is more sensitive than the independent test and can pick up a difference that the independent test can't detect. This is because the variation among fields within each year (the individual field effect) is large compared with the difference between the two years (the effect of the treatment). Paired tests are useful when there is large variability between subjects.
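To see this in our data, a quick check (reusing podataframe from above) compares the spread across fields with the spread of the paired per-field differences:
# Field-to-field variation dwarfs the year-to-year change within each field
diffs = podataframe.Yield2019 - podataframe.Yield2018
print(f'Std across fields (2018 yields): {round(podataframe.Yield2018.std(), 2)}')
print(f'Std of paired differences: {round(diffs.std(), 2)}')
print(f'Mean paired difference: {round(diffs.mean(), 2)}')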
Analysis of variance (ANOVA) is used to analyze the differences among group means in a sample. ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means.
ANOVA uses F-tests to statistically test the equality of means. An F-statistic is the ratio of two mean squares; a mean square is an estimate of a population variance that accounts for the degrees of freedom (DF) used to calculate it.
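As a toy sketch (made-up numbers, not from our dataset), the F-statistic can be computed by hand as a ratio of mean squares and cross-checked against SciPy's one-way ANOVA:
# Three made-up groups
groups = [np.array([4.0, 5.0, 6.0]), np.array([6.0, 7.0, 8.0]), np.array([9.0, 10.0, 11.0])]
k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total observations
grand_mean = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f'Manual F = {round(f_manual, 3)}')
print(f'SciPy F  = {round(stats.f_oneway(*groups)[0], 3)}')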
Test types:
One Way ANOVA
Two Way ANOVA
Let's have another look at some data from our Olympic dataset. How does the Weight of athletes vary between teams from different countries? We're going to use a one-way ANOVA to check for the presence of significant variation in the Weight of Olympic athletes. A one-way ANOVA will allow us to see whether any differences between these groups of values are significant.
sns.boxplot(x="Team", y="Weight",
data=athletes[athletes['Team'].isin(['United States','China', 'France'])]);
# Create arrays
France_athletes = athletes[athletes.Team == 'France'].Weight
US_athletes = athletes[athletes.Team == 'United States'].Weight
China_athletes = athletes[athletes.Team == 'China'].Weight
# Perform one-way ANOVA
anova = stats.f_oneway(France_athletes, US_athletes, China_athletes)
print(f'Test Statistic = {round(anova[0],2)}, P-value = {round(anova[1],3)}')
According to this test, significant differences exist between the weights of athletes for each country.
Let's have another look at some data from our Olympic dataset. How does the Weight of athletes vary between teams from different countries and between the sexes? We are going to use a two-way ANOVA to check for the presence of significant variation in the Weight of Olympic athletes. A two-way ANOVA will allow us to see which of the two factors, Sex and Team, has a significant effect on Weight.
sns.boxplot(x="Team", y="Weight", hue="Sex",
data=athletes[athletes['Team'].isin(['United States','China'])]);
import statsmodels.api as sm
# Create model
formula = 'Weight ~ Sex + Team'
model = sm.formula.ols(formula, data=athletes).fit()
# Perform ANOVA and print table
aov_table = sm.stats.anova_lm(model, typ=2)
print(aov_table)
According to the results of your ANOVA, Sex and Team each have a significant effect.
Interaction effects represent the combined effects of factors on the dependent measure. When an interaction effect is present, the impact of one factor depends on the level of the other factor. Part of the power of ANOVA is the ability to estimate and test interaction effects. As Pedhazur and Schmelkin note, the idea that multiple effects should be studied in research rather than the isolated effects of single variables is one of the important contributions of Sir Ronald Fisher. When interaction effects are present, it means that interpretation of the main effects is incomplete or misleading. For example, the relationship between condiments and enjoyment probably depends on the type of food.
Once again, we are going to look at our dataset of Olympic athletes. As in previous exercises, we will be looking at the variation in athlete Weight. We are going to look at athletes of either Sex competing in one of two Events: the 100
meter and 10,000
meter run. An ANOVA will allow us to work out which of these variables affect Weight and whether an interactive effect is present.
df = athletes[athletes['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Men's 100 metres",
"Athletics Women's 10,000 metres",
"Athletics Women's 100 metres"])]
ax = sns.boxplot(data=df, x="Event", y="Weight", hue="Sex")
ax.set_xticklabels(['10,000m M','100m M','10,000m W','100m W']);
# Create model
formula = 'Weight ~ Sex + Event + Sex:Event'
# Fit the model on the filtered 100 m / 10,000 m subset (df) created above
model = sm.formula.ols(formula, data=df).fit()
# Perform ANOVA and print table
aov_table = sm.stats.anova_lm(model)
print(aov_table)
Look at the p-values for the effects of Sex, Event, and Sex:Event. Using a standard alpha of 0.05
, Sex, Event, and Sex:Event have a significant effect on Weight. This means that both factors influence Weight, and the effect of one factor is dependent on the other.
# Create model
formula = 'Weight ~ Sex + Event + Age + Sex:Event + Sex:Age + Sex:Event:Age'
model = sm.formula.ols(formula, data=df).fit()
# Perform ANOVA and print table
aov_table = sm.stats.anova_lm(model)
pd.DataFrame(aov_table)
When we run a test:

| | Real effect present | No real effect present |
|---|---|---|
| Effect found (positive: alternative hypothesis) | True Positive | False Positive |
| No effect found (negative: null hypothesis) | False Negative | True Negative |
Consider: picking out a single significant result from many tests can be misleading, so we need to account for multiple testing.
Correction methods: choose a method based on whether the tests are independent of each other.

Bonferroni correction: better to use when the tests are not independent of each other. Divide the α-value by the number of analyses on the dependent variable (the adjusted level is α/m for m tests).

Šídák correction: better to use when the tests are independent of each other. The adjusted level is 1 − (1 − α)^(1/m), which guarantees a family-wise error rate of exactly α when the tests are independent and all null hypotheses are true. For positively-correlated tests the procedure is conservative, and for negatively-correlated tests it is liberal.

Improved nutrition and healthcare have led to increased human heights in most societies over the past century. But is this trend also reflected amongst elite athletes? To examine this, we'll be looking at another slice from our Olympic dataset and performing multiple tests.
We will slice the athletes dataset to three years: 1924, 1952, and 2016. We will perform two-sample t-tests to compare the three time points, shown in the boxplot below. Between which years do significant differences exist? We will be performing multiple non-independent tests, so we will need to apply a Bonferroni correction to the results.
df = athletes[athletes['Year'].isin([1924, 1952, 2016])]
sns.boxplot(data=df, x="Year", y="Height");
# Perform three two-sample t-tests
t_result_1924v2016 = stats.ttest_ind(athletes[athletes.Year == 1924].Height,
athletes[athletes.Year == 2016].Height)
t_result_1952v2016 = stats.ttest_ind(athletes[athletes.Year == 1952].Height,
athletes[athletes.Year == 2016].Height)
t_result_1924v1952 = stats.ttest_ind(athletes[athletes.Year == 1924].Height,
athletes[athletes.Year == 1952].Height)
# Create an array of p-value results
pvals_array = [t_result_1924v2016[1],t_result_1952v2016[1],t_result_1924v1952[1]]
print('P-values:')
print(pd.DataFrame(pvals_array))
# Perform Bonferroni correction
adjustedvalues = sm.stats.multipletests(pvals_array, alpha=0.05, method='b')
print('\nCorrected P-values:')
print(pd.DataFrame(adjustedvalues[1]))
We're looking at how the Height of Olympic athletes from the athletes dataset has changed over time. In this exercise, we're considering three events, the 100 meter
, the High Jump
, and the Marathon
. We will be examining the correlation between Height
and Year
separately for each Event. As we did before, we will need to correct for multiple hypothesis tests, but, since these tests are independent, we can use the less-strict Šídák correction.
fig, ax = plt.subplots(3,1, sharex=True, sharey=True, figsize=(8,6))
ax[0].scatter(x=athletes[athletes['Event'].isin(["Athletics Men's 100 metres",
"Athletics Women's 100 metres"])].Year,
y=athletes[athletes['Event'].isin(["Athletics Men's 100 metres",
"Athletics Women's 100 metres"])].Height,
color='red', s=10, alpha=0.5)
ax[1].scatter(x=athletes[athletes['Event'].isin(["Athletics Men's High Jump",
"Athletics Women's High Jump"])].Year,
y=athletes[athletes['Event'].isin(["Athletics Men's High Jump",
"Athletics Women's High Jump"])].Height,
color='lime', s=10, alpha=0.5)
ax[2].scatter(x=athletes[athletes['Event'].isin(["Athletics Men's Marathon",
"Athletics Women's Marathon"])].Year,
y=athletes[athletes['Event'].isin(["Athletics Men's Marathon",
"Athletics Women's Marathon"])].Height,
color='blue', s=10, alpha=0.5)
ax[2].set_xlabel('Date')
ax[0].set_title('100 Meter')
ax[1].set_title('High Jump')
ax[2].set_title('Marathon')
fig.text(0.001, 0.5, 'Height (cm)', va='center', rotation='vertical')
plt.tight_layout()
plt.show()
# Perform Pearson correlations
pearson100 = stats.pearsonr(athletes[athletes['Event'].isin(["Athletics Men's 100 metres",
"Athletics Women's 100 metres"])].Height,
athletes[athletes['Event'].isin(["Athletics Men's 100 metres",
"Athletics Women's 100 metres"])].Year)
pearsonHigh = stats.pearsonr(athletes[athletes['Event'].isin(["Athletics Men's High Jump",
"Athletics Women's High Jump"])].Height,
athletes[athletes['Event'].isin(["Athletics Men's High Jump",
"Athletics Women's High Jump"])].Year)
pearsonMara = stats.pearsonr(athletes[athletes['Event'].isin(["Athletics Men's Marathon",
"Athletics Women's Marathon"])].Height,
athletes[athletes['Event'].isin(["Athletics Men's Marathon",
"Athletics Women's Marathon"])].Year)
# Create array of p-values
pvals_array = [pearson100[1],pearsonHigh[1],pearsonMara[1]]
print('P-values:')
print(pd.DataFrame(pvals_array))
# Perform Šídák correction
adjustedvalues= sm.stats.multipletests(pvals_array, alpha=0.05, method='s')
print('\nCorrected P-values:')
print(pd.DataFrame(adjustedvalues[1]))
The power of a binary hypothesis test is the probability that the test rejects the null hypothesis (H0)
when a specific alternative hypothesis (H1)
is true. The statistical power ranges from 0
to 1
, and as statistical power increases, the probability of making a type II error (failing to find a difference that does exist) decreases. For a type II error probability of β
, the corresponding statistical power is 1 − β
. For example, if experiment 1
has a statistical power of 0.7
, and experiment 2
has a statistical power of 0.95
, then there is a stronger probability that experiment 1
had a type II error than experiment 2
, and experiment 2
is more reliable than experiment 1
due to the reduction in probability of a type II error. It can be equivalently thought of as the probability of accepting the alternative hypothesis (H1)
when it is true; that is, the ability of a test to detect a specific effect, if that specific effect actually exists. In symbols, $\text{power} = \Pr(\text{reject } H_0 \mid H_1 \text{ is true}) = 1 - \beta$.
Power analysis involves four related quantities: effect size, sample size, significance level (alpha, often set to 5% or 0.05), and statistical power. Given any three of these, a power analysis lets us solve for the fourth. Common measures of effect size include Cohen's d and Pearson's correlation coefficient r.
For a given experiment with a significance level (alpha) of 5% (0.05), an effect size of at least 0.80, and statistical power of 80% (0.80), we may be interested in estimating a suitable sample size. That is, how many observations are required from each sample in order to detect an effect of at least 0.80 with an 80% chance of detecting the effect if it is true (a 20% chance of a Type II error) and a 5% chance of detecting an effect if there is no such effect (a Type I error)?
The function solve_power() can be used to calculate any one of the four parameters in a power analysis. In our case, we are interested in calculating the sample size. We can use the function by providing the three pieces of information we know (alpha, effect size, and power) and setting the argument we wish to calculate (nobs1) to None. This tells the function what to calculate.
The function has an argument called ratio that is the ratio of the number of samples in one sample to the other. If both samples are expected to have the same number of observations, then the ratio is 1.0
. If, for example, the second sample is expected to have half as many observations, then the ratio would be 0.5
.
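As a quick illustration of the scenario described above (alpha of 0.05, power of 0.80, a minimum effect size of 0.80, and equal group sizes), here is a minimal sketch using statsmodels; the result is the required number of observations per group.
from statsmodels.stats import power as pwr
# Solve for sample size; nobs1=None tells solve_power which quantity to compute
analysis = pwr.TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.80, power=0.80, alpha=0.05,
                                   nobs1=None, ratio=1.0)
print(f'Required observations per group: {round(n_per_group, 1)}')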
Now we'll explore the effect of sample size on the results of statistical tests. Here, we'll be comparing the Weight of Athletics and Swimming competitors in the athletes dataset. The boxplot below shows the difference between these two groups.
sns.boxplot(x="Sport", y="Weight", data=athletes);
Using a defined seed and varying sample sizes, we will perform t-tests comparing the Weight of samples from both Sports.
# Create a random subset of 1000 samples, then use this to do a t-test to compare Weight between Sports.
subset = athletes.sample(n=1000, random_state= 1007)
ttestind = stats.ttest_ind(subset[subset.Sport == "Athletics"].Weight, subset[subset.Sport == "Swimming"].Weight)
print(f'Test Statistic = {round(ttestind[0],2)}, P-value = {round(ttestind[1],2)}')
# Change the sample size to 200, repeat the t-test to compare Weight between Sports.
subset = athletes.sample(n=200, random_state= 1007)
ttestind = stats.ttest_ind(subset[subset.Sport == "Athletics"].Weight, subset[subset.Sport == "Swimming"].Weight)
print(f'Test Statistic = {round(ttestind[0],2)}, P-value = {round(ttestind[1],2)}')
# Change the sample size to 50, repeat the t-test to compare Weight between Sports.
subset = athletes.sample(n=50, random_state= 1007)
ttestind = stats.ttest_ind(subset[subset.Sport == "Athletics"].Weight, subset[subset.Sport == "Swimming"].Weight)
print(f'Test Statistic = {round(ttestind[0],2)}, P-value = {round(ttestind[1],2)}')
Look how smaller samples affected your p-value. With sample sizes of 1000
, 200
and 50
, we got p-values of 0.07
, 0.0
and 0.57
, respectively. The smallest sample size didn't give a significant result. Now you see how much difference sample size can make.
Now that we've seen the importance of sample size, let's have another look at the same athletes dataset and see if we can determine the sample size we would need to get a significant result. We will determine the sample size we would need to have an 80%
chance of detecting a small (0.4)
difference between these two samples.
from statsmodels.stats import power as pwr
# Set parameters
effect = 0.4
power = 0.8
alpha = 0.05
# Calculate ratio
swimmercount = float(len(athletes[athletes.Sport == "Swimming"].index))
athletecount = float(len(athletes[athletes.Sport == "Athletics"].index))
ratio = swimmercount/athletecount
# Initialize analysis and calculate sample size
analysis = pwr.TTestIndPower()
ssresult = analysis.solve_power(effect_size=effect, power=power, alpha=alpha, nobs1=None, ratio=ratio)
print(f'Sample Size: {round(ssresult,3)}')
Now, we're going to have a look at effect sizes. We are examining the same slice of the athletes dataset, comparing the body weights of competitors from swimming and athletics events. We are going to calculate the effect sizes we are able to detect for a couple of different sampling strategies. First, we will determine the smallest effect size detectable using the complete dataset. Then, in Step 2, we will determine the effect size we could detect using the sample size found from the earlier power calculation.
# Set parameters for entire dataset
alpha = 0.05
power = 0.8
ratio = float(len(athletes[athletes.Sport == "Swimming"])) / len(athletes[athletes.Sport == "Athletics"])
samp_size = len(athletes[athletes.Sport == "Athletics"])
# Initialize analysis & calculate effect size
analysis = pwr.TTestIndPower()
esresult = analysis.solve_power(effect_size = None,
power = power,
nobs1 = samp_size,
ratio = ratio,
alpha = alpha)
print(f'Minimum detectable effect size: {round(esresult,2)}')
# Set parameters for the sample size found from the power calculation
alpha = 0.05
power = 0.8
ratio = float(len(athletes[athletes.Sport == "Swimming"])) / len(athletes[athletes.Sport == "Athletics"])
samp_size = round(ssresult,0)
# Initialize analysis & calculate sample size
analysis = pwr.TTestIndPower()
esresult = analysis.solve_power(effect_size=None, power=power, nobs1=samp_size, ratio=ratio, alpha=alpha)
print(f'Minimum detectable effect size: {round(esresult,2)}')
Cohen's d is an effect size used to indicate the standardised difference between two means. It can be used, for example, to accompany the reporting of t-test and ANOVA results. For the independent samples t-test, Cohen's d is determined by calculating the mean difference between your two groups and then dividing the result by the pooled standard deviation.
Now, using the same comparison of Weight difference between Sports, let's calculate the actual effect size for this comparison. We will calculate Cohen's d for the difference in Weight between each Sport.
# Subset DataFrame with the sample size found from the power calculation
subset = athletes.sample(n=int(round(ssresult,0)), random_state=2020)
# Create series
athl = athletes[athletes.Sport == "Athletics"].Weight
swim = athletes[athletes.Sport == "Swimming"].Weight
# Calculate difference between means and pooled standard deviation
diff = swim.mean() - athl.mean()
pooledstdev = np.sqrt((athl.std()**2 + swim.std()**2)/2)
# Calculate Cohen's d
cohend = diff / pooledstdev
print(f"Cohen's d: {round(cohend,3)}")
This is a moderately sized effect. As it is bigger than the minimum detectable effect size calculated in the last exercise, you would expect that the parameters from the last exercise would allow you to detect a difference of this magnitude.
In this exercise, we will use the athletes dataset to examine whether American athletes are more successful in athletics or in swimming events. We will make use of the MedalTF column, which gives a True or False value for whether each athlete won any type of medal. First, we will cross-tabulate MedalTF with Sport. Then perform a Fisher exact test. Finally, we will examine the significance and effect size of this result.
# Create a table of cross-tabulations
table = pd.crosstab(athletes.MedalTF,athletes.Sport)
print(table)
# Perform the Fisher exact test
chi = stats.fisher_exact(table, alternative='two-sided')
# Print p-value
print(f"\np-value: {round(chi[1], 5)}")
# Print odds ratio
print(f"\nOdds ratio between groups: {round(chi[0],1)}")
Now we're going to look at effect sizes for Pearson correlations. In a previous exercise, we used Pearson correlations on athletes from our Olympic dataset. Now, we're going to focus on a single event, the 10,000 meter run, and compare the results we obtain from Pearson correlations of Height and Weight for competitors from two Teams, Kenya (the ken DataFrame) and Ethiopia (the eth DataFrame). Which has the higher effect strength?
ken = athletes[athletes['Team'] == "Kenya"]
eth = athletes[athletes['Team'] == "Ethiopia"]
fig, ax = plt.subplots(2,1, sharex=True, sharey=True, figsize=(8,6))
ax[0].scatter(x=ken[ken['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Height,
y=ken[ken['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Weight,
color='red', s=25)
ax[1].scatter(x=eth[eth['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Height,
y=eth[eth['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Weight,
color='blue', s=25)
ax[1].set_xlabel('Height')
ax[0].set_title('Kenya 10,000 metres')
ax[1].set_title('Ethiopia 10,000 metres')
fig.text(0.001, 0.5, 'Weight', va='center', rotation='vertical')
plt.tight_layout()
plt.show()
# Perform Pearson correlation Kenya
pearsonken = stats.pearsonr(ken[ken['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Weight,
ken[ken['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Height)
print(f'Kenya Correlation: {round(pearsonken[0],3)}, P-value: {round(pearsonken[1],3)}')
# Perform Pearson correlation Ethiopia
pearsoneth = stats.pearsonr(eth[eth['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Weight,
eth[eth['Event'].isin(["Athletics Men's 10,000 metres",
"Athletics Women's 10,000 metres"])].Height)
print(f'Ethiopia Correlation: {round(pearsoneth[0],3)}, P-value: {round(pearsoneth[1],3)}')
The Kenyan dataset gives a higher value than that obtained for the Ethiopian samples. This makes sense, as the scatterplot for the former looks much less noisy.
Now we're going to have another look at the same example, drawn from our Olympic dataset (Weight difference between Sports). Knowing the effect size of the difference between these two groups, we will work out what the odds are that our t-test will pick up this difference.
# Set parameters
effect_size = esresult
alpha = 0.05
samp_size = round(ssresult,0)
ratio = 1
# Initialize analysis & calculate power
analysis = pwr.TTestIndPower()
pwresult = analysis.solve_power(effect_size=effect_size, power=None, alpha=alpha, nobs1=samp_size, ratio=ratio)
print(f'Power: {round(pwresult,3)}')
Power curves are line plots that show how the change in variables, such as effect size and sample size, impact the power of the statistical test.
The plot_power() function can be used to create power curves. The dependent variable (x-axis) must be specified by name in the dep_var argument. Arrays of values can then be specified for the sample size (nobs), effect size (effect_size), and significance (alpha) parameters. One or multiple curves will then be plotted, showing the impact on statistical power.
# parameters for power analysis
effect_sizes = np.array([0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.95])
sample_sizes = np.array(range(2, 151))
sample_sizes0 = np.array([10, 20, 30, 40, 50, 60, 70, 100])
# calculate power curves from multiple power analyses
analysis = pwr.TTestIndPower()
fig, ax = plt.subplots(2,1, figsize=(10,7))
analysis.plot_power(dep_var='nobs', nobs=sample_sizes, effect_size=effect_sizes, alpha=0.05, ax=ax[0])
analysis.plot_power(dep_var='es', nobs=sample_sizes0, effect_size=effect_sizes, alpha=0.05, ax=ax[1])
ax[1].set_title('')
ax[0].legend(bbox_to_anchor=(1, 1.05))
ax[1].legend(bbox_to_anchor=(1, 1.05))
ax[0].margins(0.015)
ax[1].margins(0.015)
fig.text(0.001, 0.5, 'Power', va='center', rotation='vertical')
plt.tight_layout()
plt.show()
Shapiro-Wilk test:
The Shapiro-Wilk test is a way to tell whether a set of observations comes from a normal distribution. The test gives you a $W$ statistic and a p-value; small $W$ values indicate your sample is not normally distributed, and you can reject the null hypothesis that your population is normally distributed when the p-value falls below your chosen alpha.
pd.set_option('display.max_columns', None)
countrydata = pd.read_csv('https://assets.datacamp.com/production/repositories/4371/datasets/f5c1016b818f97ec200236fb161ae711944fb2cb/undata_country_profile_variables.csv')
countrydata.info()
countrydata['Unemployment (% of labour force)'].replace('...', np.nan, inplace=True)
countrydata['Unemployment (% of labour force)'] = countrydata['Unemployment (% of labour force)'].astype(float)
Now, we will look at country-level Unemployment and GDP per capita data. We will use Shapiro-Wilk tests to examine whether the distributions of values seen in these samples, as seen in the Q-Q plots below, depart significantly from the normal distribution. This test tells us how closely a given sample fits the pattern expected from a normal distribution.
from scipy.stats import probplot
import warnings
warnings.filterwarnings('ignore')
fig, ax = plt.subplots(2,2, figsize=(10,7))
sns.kdeplot(countrydata['Unemployment (% of labour force)'].dropna(), ax=ax[0][0], color='red', shade=True, legend=False)
sns.kdeplot(countrydata['GDP per capita (current US$)'].dropna(), ax=ax[0][1], color='blue', shade=True, legend=False)
probplot(countrydata['Unemployment (% of labour force)'].dropna(), plot=ax[1][0])
probplot(countrydata['GDP per capita (current US$)'].dropna(), plot=ax[1][1])
ax[0][0].set_title('Unemployment (% of labour force)')
ax[0][1].set_title('GDP per capita (current US$)')
ax[1][0].set_title('')
ax[1][1].set_title('')
plt.tight_layout()
plt.show()
# Perform Shapiro-Wilk test on Unemployment and print result
shapiroUnem = stats.shapiro(countrydata['Unemployment (% of labour force)'].dropna())
print(f'W: {round(shapiroUnem[0],3)}, P-value: {round(shapiroUnem[1],3)}')
# Perform Shapiro-Wilk test on GDP_per_cap and print result
shapiroGDP = stats.shapiro(countrydata['GDP per capita (current US$)'].dropna())
print(f'W: {round(shapiroGDP[0],3)}, P-value: {round(shapiroGDP[1],3)}')
Both distributions are non-normal.
D'Agostino's K² test checks whether a data sample has a Gaussian distribution. The test is based on transformations of the sample kurtosis and skewness, and has power only against the alternatives that the distribution is skewed and/or kurtic.
Interpretation: H0: the sample has a Gaussian distribution; H1: the sample does not have a Gaussian distribution.
# Perform D'Agostino's K^2 test on Unemployment and print result
k2_Unem, k2_Unem_p = stats.normaltest(countrydata['Unemployment (% of labour force)'].dropna())
print('stat = %.3f, p_val = %.3f' % (k2_Unem, k2_Unem_p))
if k2_Unem_p >= 0.05:
    print('\nProbably Gaussian')
else:
    print('\nProbably not Gaussian')
# Perform D'Agostino's K^2 test on GDP_per_cap and print result
k2_GDP, k2_GDP_p = stats.normaltest(countrydata['GDP per capita (current US$)'].dropna())
print('stat = %.3f, p_val = %.3f' % (k2_GDP, k2_GDP_p))
if k2_GDP_p >= 0.05:
    print('\nProbably Gaussian')
else:
    print('\nProbably not Gaussian')
The Anderson-Darling Goodness of Fit Test (AD-Test) is a measure of how well your data fits a specified distribution. It’s commonly used as a test for normality.
Assumptions: observations in the sample are independent and identically distributed (iid).
# Example of the Anderson-Darling Normality Test
data = countrydata['GDP per capita (current US$)'].dropna()
result = stats.anderson(data)
print('stat=%.3f' % (result.statistic))
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < cv:
        print('\nProbably Gaussian at the %.1f%% level' % (sl))
    else:
        print('\nProbably not Gaussian at the %.1f%% level' % (sl))
# Example of the Anderson-Darling Normality Test
data = countrydata['Unemployment (% of labour force)'].dropna()
result = stats.anderson(data)
print('stat=%.3f' % (result.statistic))
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < cv:
        print('\nProbably Gaussian at the %.1f%% level' % (sl))
    else:
        print('\nProbably not Gaussian at the %.1f%% level' % (sl))
A parametric statistical test makes assumptions about the population parameters and the distribution that the data came from. These types of tests include Student's t-tests and ANOVA tests, which assume data come from a normal distribution.
The opposite is a nonparametric test, which doesn't assume anything about the population parameters. Nonparametric tests include chi-square, Fisher's exact test and the Mann-Whitney test.
Most parametric tests have a nonparametric equivalent. For example, if you have parametric data from two independent groups, you can run a two-sample t-test to compare means. If your data don't meet the parametric assumptions, you can run a Wilcoxon rank-sum test to compare the two groups instead.
Wilcoxon rank-sum (Mann–Whitney U test):
The Wilcoxon rank-sum test is a nonparametric test of the null hypothesis that it is equally likely that a randomly selected value from one population will be less than or greater than a randomly selected value from a second population. This test can be used to investigate whether two independent samples were selected from populations having the same distribution.
Wilcoxon signed-rank:
The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used to compare two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ (i.e. it is a paired difference test). It can be used as an alternative to the paired Student's t-test when the distribution of the differences between the two samples cannot be assumed to be normally distributed. A Wilcoxon signed-rank test is a nonparametric test that can be used to determine whether two dependent samples were selected from populations having the same distribution.
Here we compare the sex ratio at birth of the United States (US_Sex_ratio) with that of the European Union (EU_Sex_ratio).
gender = pd.read_csv('sex-ratio-at-birth.csv')
gender.columns = ['Country', 'Code', 'Year', 'Sex_ratio']
gender.head()
US_Sex_ratio = gender[gender.Country == "United States"].Sex_ratio
EU_Sex_ratio = gender[gender.Country == "European Union"].Sex_ratio
sns.kdeplot(US_Sex_ratio, shade=True, color='blue')
sns.kdeplot(EU_Sex_ratio, shade=True, color='red')
plt.legend(('United States','European Union'))
plt.show()
Given that these samples are not normally distributed, a Wilcoxon rank-sum test would be appropriate. Does the sex ratio differ between countries?
# Perform the two-sample t-test
t_result = stats.ttest_ind(US_Sex_ratio, EU_Sex_ratio)
# Print t-test result
print(f'Test Statistic = {t_result[0]}, P-value = {t_result[1]}')
# Perform Wilcoxon rank-sum test
wilc = stats.ranksums(US_Sex_ratio, EU_Sex_ratio)
# Print Wilcoxon rank-sum test result
print(f'\nTest Statistic = {wilc[0]}, P-value = {wilc[1]}')
We are interested in potato yield in tons/hectare, as seen in previous exercises. For each Field, we have a value for Yield2018, before the application of a new fertilizer, and Yield2019, after the application of the new fertilizer. Does the yield differ significantly between years?
yields2018 = [60.2, 12.0, 13.8, 91.8, 50.0, 45.0, 32.0, 87.5, 60.1, 88.0, 11.3, 10.7, 12.7, 15.9, 16.0]
yields2019 = [63.2, 15.6, 14.8, 96.7, 53.0, 47.0, 31.3, 89.8, 67.8, 90.0, 14.2, 12.0, 13.9, 17.8, 20.0]
podataframe = pd.DataFrame({'Yield2018': yields2018,
'Yield2019' : yields2019})
print(podataframe)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12,5))
sns.boxplot(data=podataframe, ax=ax[0])
sns.kdeplot(podataframe.Yield2018, shade=True, color='blue', ax=ax[1])
sns.kdeplot(podataframe.Yield2019, shade=True, color='red', ax=ax[1])
ax[1].legend(('Yield2018', 'Yield2019'))
ax[1].set_xlabel('Yield')
plt.tight_layout()
plt.show()
# Perform the two-sample t-test
t_result = stats.ttest_ind(podataframe.Yield2018, podataframe.Yield2019)
print(f'T-Test Test Statistic = {t_result[0]}, P-value = {t_result[1]}')
# Perform Shapiro-Wilk test on Yield2018 and print result
shapiroYield2018 = stats.shapiro(podataframe.Yield2018)
print(f'\nYield2018 W: {round(shapiroYield2018[0],3)}, P-value: {round(shapiroYield2018[1],3)}')
# Perform Shapiro-Wilk test on Yield2019 and print result
shapiroYield2019 = stats.shapiro(podataframe.Yield2019)
print(f'\nYield2019 W: {round(shapiroYield2019[0],3)}, P-value: {round(shapiroYield2019[1],3)}')
# Perform Wilcoxon Signed-Rank test
wilcsr = stats.wilcoxon(podataframe.Yield2018, podataframe.Yield2019)
print(f'\nWilcoxon Signed-Rank Test Statistic = {wilcsr[0]}, P-value = {wilcsr[1]}')
Note that both of your Shapiro-Wilk tests gave significant results, indicating non-normal distributions. The Wilcoxon test gives a higher p-value than the t-test for the same data, due to its lower sensitivity. However, given the low sample sizes and non-normal distributions, the Wilcoxon test is the better choice for this dataset.
Once again we'll be using the Olympic dataset. Here we're going to compare the Height of athletes from both the Norwegian and Chinese Teams. Do they differ?
# Separate the heights by country
NorwayHeights = athletes[athletes['Team'] == "Norway"].Height
ChinaHeights = athletes[athletes['Team'] == "China"].Height
sns.kdeplot(NorwayHeights, shade=True, color='blue')
sns.kdeplot(ChinaHeights, shade=True, color='red')
plt.legend(('Norway', 'China'))
plt.xlabel('Height')
plt.show()
Do these two distributions look normal or not? First, we'll test whether these sets of samples follow a normal distribution, via a couple of Shapiro-Wilk tests. Then we'll choose what statistical test to use. Finally, we'll test the difference between the Norwegian and Chinese samples.
# Shapiro-Wilk tests on the heights
print(f'P-value:{stats.shapiro(NorwayHeights)[1]}')
print(f'\nP-value: {stats.shapiro(ChinaHeights)[1]}')
# Perform the Wilcoxon rank-sum test
wilc = stats.ranksums(NorwayHeights, ChinaHeights)
print(f'\nWilcoxon Rank-Sum Test Statistic = {wilc[0]}, P-value = {wilc[1]}')
The Kruskal-Wallis test by ranks is a non-parametric method for testing whether samples originate from the same distribution. It is used for comparing two or more independent samples of equal or different sample sizes. It extends the Mann–Whitney U test, which is used for comparing only two groups. The parametric equivalent of the Kruskal–Wallis test is the one-way analysis of variance (ANOVA). Since it is a non-parametric method, the Kruskal–Wallis test does not assume a normal distribution of the residuals, unlike the analogous one-way analysis of variance.
Let's have another look at some data from our sex ratio dataset. How does the sex ratio at birth vary between different countries? We're going to use the Kruskal-Wallis H test to check for the presence of significant variation in the sex ratio between countries. A Kruskal-Wallis test will allow us to see whether any differences between these groups of values are significant.
US_Sex_ratio = gender[gender.Country == "United States"].Sex_ratio
EU_Sex_ratio = gender[gender.Country == "European Union"].Sex_ratio
CA_Sex_ratio = gender[gender.Country == "Canada"].Sex_ratio
ratios = [US_Sex_ratio, EU_Sex_ratio, CA_Sex_ratio]
colors = ['r', 'b', 'g']
countries = ['US', 'EU', 'CA']
for ratio, color, country in zip(ratios, colors, countries):
    sns.kdeplot(ratio, shade=True, color=color, label=f'{country} Sex_ratio')
    shapiro = stats.shapiro(ratio)
    print(f'{country} shapiro p-val = {round(shapiro[1],3)}')
plt.show()
# Example of the Kruskal-Wallis H Test
stat, p = stats.kruskal(US_Sex_ratio, EU_Sex_ratio, CA_Sex_ratio)
print('stat=%.3f, p=%.3f' % (stat, p))
if p >= 0.05:
    print('\nProbably the same distribution')
else:
    print('\nProbably different distributions')
The Friedman test is a non-parametric statistical test. Similar to the parametric repeated measures ANOVA, it is used to detect differences in treatments across multiple test attempts. The procedure involves ranking each row (or block) together, then considering the values of ranks by columns. Applicable to complete block designs, it is thus a special case of the Durbin test.
A classic example of use: n judges each rate k different wines; are any of the k wines ranked consistently higher or lower than the others?
The Friedman test is used for one-way repeated measures analysis of variance by ranks. In its use of ranks it is similar to the Kruskal–Wallis one-way analysis of variance by ranks.
Here we are interested in potato yield in tons/hectare, as seen in previous exercises. For each Field, we have a value for Yield2018 (before the application of a new fertilizer) and for Yield2019 and Yield2020 (after the application of the new fertilizer). Does the yield differ significantly between years?
yields2018 = [60.2, 12.0, 13.8, 91.8, 50.0, 45.0, 32.0, 87.5, 60.1, 88.0, 11.3, 10.7, 12.7, 15.9, 16.0]
yields2019 = [63.2, 15.6, 14.8, 96.7, 53.0, 47.0, 31.3, 89.8, 67.8, 90.0, 14.2, 12.0, 13.9, 17.8, 20.0]
yields2020 = [65.2, 17.2, 15.5, 97.9, 55.5, 47.0, 33.5, 91.2, 68.8, 95.0, 16.1, 14.0, 14.1, 19.9, 24.0]
podataframe = pd.DataFrame({'Yield2018': yields2018,
'Yield2019' : yields2019,
'Yield2020' : yields2020})
print(podataframe)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12,5))
sns.boxplot(data=podataframe, ax=ax[0])
sns.kdeplot(podataframe.Yield2018, shade=True, color='r', ax=ax[1])
sns.kdeplot(podataframe.Yield2019, shade=True, color='b', ax=ax[1])
sns.kdeplot(podataframe.Yield2020, shade=True, color='g', ax=ax[1])
ax[1].legend(('Yield2018', 'Yield2019', 'Yield2020'))
ax[1].set_xlabel('Yield')
plt.tight_layout()
plt.show()
# Example of the Friedman Test
stat, p = stats.friedmanchisquare(podataframe.Yield2018, podataframe.Yield2019, podataframe.Yield2020)
print('stat=%.3f, p=%.3f' % (stat, p))
if p >= 0.05:
    print('\nProbably the same distribution')
else:
    print('\nProbably different distributions')
Spearman correlation:
Spearman's rank correlation coefficient is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function.
A monotonic function (or monotone function) is a function between ordered sets that preserves or reverses the given order.
The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson's correlation assesses linear relationships, Spearman's correlation assesses monotonic relationships (whether linear or not) and is more robust to outliers. If there are no repeated data values, a perfect Spearman correlation of +1
or −1
occurs when each of the variables is a perfect monotone function of the other.
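As a toy illustration (made-up data, not from our dataset): when y is a perfect monotone but non-linear function of x, Spearman's rho is exactly 1 while Pearson's r is not.
# y = x**3 is monotonic but not linear
x = np.arange(1, 11)
y = x ** 3
print(f"Pearson r: {round(stats.pearsonr(x, y)[0], 3)}")
print(f"Spearman rho: {round(stats.spearmanr(x, y)[0], 3)}")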
We're going to return to our Olympic dataset, where, as in previous exercises, we'll be looking at the correlation between Height and Weight amongst athletics competitors since 2000
. This relationship is seen below for both men (athletesM DataFrame) and women (athletesF DataFrame).
since2000 = athletes[athletes.Year >= 2000]
athletesM = since2000[since2000.Sex == 'M']
athletesF = since2000[since2000.Sex == 'F']
fig, ax = plt.subplots(1,2, sharex=True, sharey=True, figsize=(10,5))
ax[0].scatter(x=athletesF.Height, y=athletesF.Weight, color='red', s=25, alpha=0.3)
ax[0].plot(np.unique(athletesF.Height),
np.poly1d(np.polyfit(athletesF.Height, athletesF.Weight, 2))(np.unique(athletesF.Height)),
color='black')
ax[1].scatter(x=athletesM.Height, y=athletesM.Weight, color='blue', s=25, alpha=0.3)
ax[1].plot(np.unique(athletesM.Height),
np.poly1d(np.polyfit(athletesM.Height, athletesM.Weight, 1))(np.unique(athletesM.Height)),
color='black')
ax[0].set_ylabel('Weight')
ax[1].set_title('Male')
ax[0].set_title('Female')
fig.text(0.55, 0.001, 'Height', ha='center')
plt.tight_layout()
plt.show()
Note that the trend seen in each panel, represented by the line, isn't perfectly linear, particularly for the female samples. How will this affect correlation test results?
# Perform Pearson and Spearman correlations of Height and Weight for females
pearcorr = stats.pearsonr(athletesF.Height, athletesF.Weight)
print(f'Female Pearson Correlation: {round(pearcorr[0],3)}, P-value: {round(pearcorr[1],3)}')
spearcorr = stats.spearmanr(athletesF.Height, athletesF.Weight)
print(f'\nFemale Spearman Correlation: {round(spearcorr[0],3)}, P-value: {round(spearcorr[1],3)}')
# Perform Pearson and Spearman correlations of Height and Weight for males
pearcorr = stats.pearsonr(athletesM.Height, athletesM.Weight)
print(f'Male Pearson Correlation: {round(pearcorr[0],3)}, P-value: {round(pearcorr[1],3)}')
spearcorr = stats.spearmanr(athletesM.Height, athletesM.Weight)
print(f'\nMale Spearman Correlation: {round(spearcorr[0],3)}, P-value: {round(spearcorr[1],3)}')
Notice how the Spearman correlation finds a stronger correlation than the Pearson correlation in both cases. The non-parametric Spearman correlation works well for non-linear (but monotonic) relationships.
Also commonly known as "Kendall's tau ($\tau$) coefficient". Kendall's tau and Spearman's rank correlation coefficient both assess statistical associations based on the ranks of the data. Kendall rank correlation (non-parametric) is an alternative to Pearson's correlation (parametric) when the data you're working with have failed one or more assumptions of that test. It is also the best alternative to Spearman correlation (non-parametric) when your sample size is small and has many tied ranks.
Kendall rank correlation is used to test the similarities in the ordering of data when it is ranked by quantities. Other types of correlation coefficients use the observations themselves as the basis of the correlation, whereas Kendall's coefficient uses pairs of observations and determines the strength of association based on the pattern of concordance and discordance between the pairs.
Kendall's tau is usually smaller in value than Spearman's rho. The calculations are based on concordant and discordant pairs, are relatively insensitive to error, and its p-values are more accurate for small sample sizes.
Questions that Kendall rank correlation answers:
# Example of the Kendall's Rank Correlation Test
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]
stat, p = stats.kendalltau(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p >= 0.05:
    print('\nProbably independent')
else:
    print('\nProbably dependent')