# What is A/B testing?¶

• A/B Testing: Test different ideas against each other in the real world
• Choose the one that statistically performs better

Why is A/B testing important?

• No guessing
• Provides accurate answers - quickly
• Allows you to rapidly iterate on ideas and establish causal relationships

## A/B test process¶

1. Randomly assign users to two different groups
2. Expose:
• Group 1 to the current product rules
• Group 2 to a product that tests the hypothesis
3. Pick whichever performs better according to a set of KPIs (Key Performance Indicators)

### Where can A/B testing be used?¶

• impact of drugs
• incentivizing spending
• driving user growth
• ...and many more!

## Key performance indicators (KPIs)¶

A/B Tests: Measure impact of changes on KPIs

• KPIs: metrics that are important to an organization, for example:
• likelihood of a side-effect
• revenue
• conversion rate

# How to identify KPIs¶

• Experience + Domain knowledge + Exploratory data analysis

• Experience & Knowledge - What is important to a business

• Exploratory Analysis - What metrics and relationships impact these KPIs

# Imports¶

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from functools import reduce
from sklearn import preprocessing
from scipy import stats


# Control and treatment (test) groups¶

Testing two or more ideas against each other:

• Control: The current state of your product
• Treatment(s): The variant(s) that you want to test

# Our problem¶

## A/B Test - improving our app paywall¶

Question: Which paywall has a higher conversion rate?

• Current Paywall: ”I hope you enjoyed your free-trial, please consider subscribing” (control)
• Proposed Paywall: “Your free-trial has ended, don’t miss out, subscribe today!” (treatment)

### A/B testing process¶

• Randomly subset the users and show one set the control and one the treatment
• Monitor the conversion rates of each group to see which is better

### The importance of randomness¶

• Random assignment helps to...
• isolate the impact of the change made
• reduce the potential impact of confounding variables
• Using an assignment criteria may introduce confounders

#### Good problems for A/B testing¶

• Users are impacted individually
• Testing changes that can directly impact their behavior

#### Bad problems for A/B testing¶

• Cases with network effects among users
• Challenging to segment the users into groups
• Difficult to untangle the impact of the test

# Our data set¶

We are looking at data from an app. The app is very simple and has just $4$ pages:

• The first page is the home page. When you come to the site for the first time, you can only land on the home page as a first page.

• From the home page, the user can perform a search and land on the search page.

• From the search page, if the user clicks on a product, she will get to the payment page (paywall), where she is asked to provide payment information in order to subscribe.

• If she does decide to buy, she ends up on the confirmation page

Data set overview: we have $5$ files; $4$ of them contain page_visit information and $1$ of them contains user information.

• page_visit_information
• home_page_table.csv
• search_page_table.csv
• payment_page_table.csv
• payment_confirmation_table.csv
• user_information page
• user_table.csv

## Create test and control groups¶

In [2]:
user_table = pd.read_csv('user_table.csv')
length = len(user_table['user_id'])
# Randomly assign each user to the test group with probability ~0.5
k = np.random.binomial(1, 0.495, length)
user_table['group'] = np.where(k == 1, 'Test', 'Control')
user_table.head()

Out[2]:
user_id date device sex group
0 450007 2015-02-28 Desktop Female Test
1 756838 2015-01-13 Desktop Male Control
2 568983 2015-04-09 Desktop Male Test
3 190794 2015-02-18 Desktop Female Control
4 537909 2015-01-15 Desktop Male Test

## Merging¶

Merge all csv files together by user_id.

In [3]:
# Read in all csv files
home_page_table = pd.read_csv('home_page_table.csv')
search_page_table = pd.read_csv('search_page_table.csv')
payment_page_table = pd.read_csv('payment_page_table.csv')
payment_confirmation_table = pd.read_csv('payment_confirmation_table.csv')

# Compile the list of dataframes you want to merge
data_frames = [user_table, home_page_table, search_page_table, payment_page_table, payment_confirmation_table]

# Merge all dataframes in the list together on user_id
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['user_id'], how='outer'), data_frames)
df_merged.columns = ['user_id', 'date', 'device', 'sex', 'group', 'home_page', 'search_page',
                     'payment_page', 'payment_confirm']
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90400 entries, 0 to 90399
Data columns (total 9 columns):
user_id            90400 non-null int64
date               90400 non-null object
device             90400 non-null object
sex                90400 non-null object
group              90400 non-null object
home_page          90400 non-null object
search_page        45200 non-null object
payment_page       6030 non-null object
payment_confirm    452 non-null object
dtypes: int64(1), object(8)
memory usage: 6.9+ MB


## Data Preprocessing¶

We encode the $4$ page columns (home_page, search_page, payment_page, payment_confirm) as indicators: $1$ means the user reached that page and $0$ otherwise.

In [4]:
df_merged['date'] = pd.to_datetime(df_merged['date'])

trans_features = df_merged[['home_page', 'search_page', 'payment_page', 'payment_confirm']]
trans_features = trans_features.replace(np.nan, 'none', regex=True)
other_features = df_merged[['user_id', 'date', 'device', 'sex', 'group']]

# Label-encode each page column ('none' -> 0, page name -> 1)
le = preprocessing.LabelEncoder()
trans_features = trans_features.apply(lambda x: le.fit_transform(x))

df_merged = pd.concat([other_features, trans_features], axis=1)
# Every user visited the home page, so its single encoded value becomes 1
df_merged['home_page'] = df_merged['home_page'].replace(0, 1)


# Poisson distribution¶

The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.

• For instance, the number of phone calls received by a call center per hour may obey a Poisson distribution.
• The Poisson distribution is a limit of the Binomial distribution for rare events.
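That rare-event limit can be checked numerically; a quick sketch (all parameter values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10_000, 0.0005  # many trials, rare event, lambda = n * p = 5

# Draw large samples from both distributions
binom = rng.binomial(n, p, size=100_000)
pois = rng.poisson(n * p, size=100_000)

# Both samples should have means (and variances) close to lambda = 5
print(binom.mean(), pois.mean())
```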

Ideally, payment_confirm (the binary outcome of a customer subscribing) should follow a Poisson-like distribution: most customers do not subscribe, and far fewer do. Let's use numpy.random.poisson() to assign different distributions to the test and control groups.

In [5]:
test_n = len(df_merged.loc[df_merged.group == 'Test'])
cont_n = len(df_merged.loc[df_merged.group == 'Control'])
df_merged.loc[df_merged.group == 'Test', 'payment_confirm'] = np.random.poisson(0.089, test_n)
df_merged.loc[df_merged.group == 'Control', 'payment_confirm'] = np.random.poisson(0.079, cont_n)
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90400 entries, 0 to 90399
Data columns (total 9 columns):
user_id            90400 non-null int64
date               90400 non-null datetime64[ns]
device             90400 non-null object
sex                90400 non-null object
group              90400 non-null object
home_page          90400 non-null int64
search_page        90400 non-null int64
payment_page       90400 non-null int64
payment_confirm    90400 non-null int64
dtypes: datetime64[ns](1), int64(5), object(3)
memory usage: 6.9+ MB

In [6]:
df_merged.head()

Out[6]:
user_id date device sex group home_page search_page payment_page payment_confirm
0 450007 2015-02-28 Desktop Female Test 1 0 0 0
1 756838 2015-01-13 Desktop Male Control 1 0 0 0
2 568983 2015-04-09 Desktop Male Test 1 1 0 0
3 190794 2015-02-18 Desktop Female Control 1 1 0 0
4 537909 2015-01-15 Desktop Male Test 1 0 0 0

## Grouping and aggregating our combined dataset¶

In [7]:
def daily_agg(df, col):
    # Aggregate a column's daily sum and count
    out = df.groupby(by=['date'], as_index=False).agg({col: ['sum', 'count']})
    out.columns = ['date', 'sum', 'count']
    return out

daily_purchase_data = daily_agg(df_merged, 'payment_confirm')

Male = df_merged[df_merged.sex == 'Male']
Female = df_merged[df_merged.sex == 'Female']

Desktop = df_merged[df_merged.device == 'Desktop']
Mobile = df_merged[df_merged.device == 'Mobile']

Male_Desktop = Male[Male.device == 'Desktop']
Male_Mobile = Male[Male.device == 'Mobile']

Female_Desktop = Female[Female.device == 'Desktop']
Female_Mobile = Female[Female.device == 'Mobile']

Male_daily_purchase_data = daily_agg(Male, 'payment_confirm')
Female_daily_purchase_data = daily_agg(Female, 'payment_confirm')
Desktop_daily_purchase_data = daily_agg(Desktop, 'payment_confirm')
Mobile_daily_purchase_data = daily_agg(Mobile, 'payment_confirm')

daily_visitor_data = daily_agg(df_merged, 'home_page')
daily_visitor_Male = daily_agg(Male, 'home_page')
daily_visitor_Female = daily_agg(Female, 'home_page')
daily_visitor_Desktop = daily_agg(Desktop, 'home_page')
daily_visitor_Mobile = daily_agg(Mobile, 'home_page')
daily_visitor_Mobile_Female = daily_agg(Female_Mobile, 'home_page')
daily_visitor_Desktop_Female = daily_agg(Female_Desktop, 'home_page')
daily_visitor_Mobile_Male = daily_agg(Male_Mobile, 'home_page')
daily_visitor_Desktop_Male = daily_agg(Male_Desktop, 'home_page')


# EDA¶

In [8]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(13,8))

ax[0,0].plot(daily_visitor_data['date'], daily_visitor_data['count'], color='b', linestyle='-', marker='o')
ax[0,1].plot(daily_visitor_data['date'], daily_visitor_data['count'].rolling(3).std(), color='r',
linestyle='-', marker='o')

ax[1,0].plot(daily_visitor_Female['date'], daily_visitor_Female['count'], color='r', linestyle='-', marker='o',
label='Female')
ax[1,0].plot(daily_visitor_Male['date'], daily_visitor_Male['count'], color='b', linestyle='-', marker='o',
label='Male')

ax[1,1].plot(daily_visitor_Desktop['date'], daily_visitor_Desktop['count'], color='b', linestyle='-', marker='o',
label='Desktop')
ax[1,1].plot(daily_visitor_Mobile['date'], daily_visitor_Mobile['count'], color='r', linestyle='-', marker='o',
label='Mobile')

ax[1,0].set_xlabel('Date', fontsize=14)
ax[1,1].set_xlabel('Date', fontsize=14)
ax[0,1].set_ylabel('Count Std', fontsize=14)
ax[0,0].set_ylabel('Count', fontsize=14)
ax[1,1].set_ylabel('Count', fontsize=14)
ax[1,0].set_ylabel('Count', fontsize=14)
ax[1,0].legend()
ax[1,1].legend()
fig.autofmt_xdate()
plt.tight_layout()
fig.suptitle(f'Daily Visitors', fontsize=24)
plt.show()

In [9]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(13,5))
ax[0].plot(daily_purchase_data['date'], daily_purchase_data['sum'], color='b', linestyle='-', marker='o')
ax[1].plot(daily_purchase_data['date'], daily_purchase_data['count'], color='r', linestyle='-', marker='o')
ax[0].set_xlabel('Date', fontsize=14)
ax[1].set_xlabel('Date', fontsize=14)
ax[0].set_ylabel('Sum', fontsize=14)
ax[1].set_ylabel('Count', fontsize=14)
fig.autofmt_xdate()
plt.tight_layout()
fig.suptitle(f'Daily Payment Confirmations', fontsize=24)
plt.show()

In [10]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(13,5))
ax[0].plot(Male_daily_purchase_data['date'], Male_daily_purchase_data['count'], color='b', label='Male',
linestyle='-', marker='o')
ax[0].plot(Female_daily_purchase_data['date'], Female_daily_purchase_data['count'], color='r', label='Female',
linestyle='-', marker='o')
ax[1].plot(Desktop_daily_purchase_data['date'], Desktop_daily_purchase_data['count'], color='g',
label='Desktop', linestyle='-', marker='o')
ax[1].plot(Mobile_daily_purchase_data['date'], Mobile_daily_purchase_data['count'], color='y', label='Mobile',
linestyle='-', marker='o')
ax[0].set_xlabel('Date', fontsize=14)
ax[1].set_xlabel('Date', fontsize=14)
ax[0].set_ylabel('Sex Count', fontsize=14)
ax[1].set_ylabel('Device Count', fontsize=14)
ax[0].legend(bbox_to_anchor=(1.22, 1.02))
ax[1].legend(bbox_to_anchor=(1.24, 1.02))
plt.tight_layout()
fig.suptitle(f'Daily Payment Confirmations', fontsize=24)
fig.autofmt_xdate()
plt.show()

In [11]:
# Group and aggregate our combined dataset
grouped_purchase_data = df_merged.groupby(by = ['device', 'sex'])
purchase_summary = grouped_purchase_data.agg({'payment_confirm': ['sum', 'count']})

Out[11]:
payment_confirm
sum count
device sex
Desktop Female 2521 29997
Male 2530 30203
Mobile Female 1232 15078
Male 1271 15122

# Initial A/B test design¶

## Response variable¶

• The quantity used to measure the impact of your change
• Should either be a KPI or directly related to a KPI
• The easier to measure the better

### Factors & variants¶

• Factors: The type of variable you are changing
• The paywall customer greeting
• Variants: Particular changes you are testing
• Current Paywall: “I hope you enjoyed your free-trial, please consider subscribing” (control)
• Proposed Paywall: “Your free-trial has ended, don’t miss out, subscribe today!” (treatment)

# KPI: Conversion Rate¶

• Conversion Rate: Percentage of users who subscribe after the free trial

• Across all users or just a subset?

• Of users who convert within one week? One month?

## Why is conversion rate important?¶

• Strong measure of growth
• Potential early warning sign of problems
• Sensitive to changes in the overall ecosystem
• Choosing a KPI
• Stability over time
• Importance across different user groups
• Correlation with other business factors

# Conversion rate sensitivities¶

Here we're working with the conversion rate metric. Specifically, we will examine what that value becomes under different percentage lifts and how many more conversions per day each change would produce. First, we will find the average number of paywall views and purchases made per day in our observed sample.

In [12]:
# Find the mean of each field and then multiply by 1000 to scale the result
daily_purchases = daily_purchase_data['sum'].mean()
daily_paywall_views = daily_purchase_data['count'].mean()
daily_purchases = daily_purchases * 1000
daily_paywall_views = daily_paywall_views * 1000

print(f'Daily Purchases = {round(daily_purchases,2)}')
print(f'Daily Paywall Views = {round(daily_paywall_views,2)}')

Daily Purchases = 62950.0
Daily Paywall Views = 753333.33


# Test sensitivity¶

• First question: What size of impact is meaningful to detect?
• $1\%$...?
• $20\%$...?
• Smaller changes = more difficult to detect
• can be hidden by randomness
• Sensitivity: The minimum level of change we want to be able to detect in our test
• Evaluate different sensitivity values

# Sensitivity¶

Continuing with the conversion rate metric, we will now use the results from the previous section to evaluate a few potential sensitivities we could plan our experiment around, relying on the baseline conversion_rate, daily_paywall_views, and daily_purchases values calculated above.

In [13]:
# Find the conversion rate
total_subs_count = np.sum(df_merged['payment_confirm'])
total_users_count = len(df_merged['user_id'].unique())
conversion_rate = total_subs_count / total_users_count

# Find the conversion rate std
pop_std = df_merged['payment_confirm'].std()

print(f'Total number of users = {total_users_count}')
print(f'Total number of subscribers = {total_subs_count}')
print(f'Conversion rate = {conversion_rate}, std = {pop_std}')

Total number of users = 90400
Total number of subscribers = 7554
Conversion rate = 0.08356194690265487, std = 0.2892784878459718

In [14]:
small_sensitivity = 0.1

# Find the conversion rate when increased by the percentage of the sensitivity above
small_conversion_rate = conversion_rate * (1 + small_sensitivity)

# Apply the new conversion rate to find how many more users per day that translates to
small_purchasers = daily_paywall_views * small_conversion_rate

# Subtract the initial daily_purchases number from this new value to see the lift
purchaser_lift = small_purchasers - daily_purchases

print('small_conversion_rate:',small_conversion_rate)
print('small_purchasers:',small_purchasers)
print('purchaser_lift:',purchaser_lift)

small_conversion_rate: 0.09191814159292036
small_purchasers: 69245.00000000001
purchaser_lift: 6295.000000000015

In [15]:
medium_sensitivity = 0.2

# Find the conversion rate when increased by the percentage of the sensitivity above
medium_conversion_rate = conversion_rate * (1 + medium_sensitivity)

# Apply the new conversion rate to find how many more users per day that translates to
medium_purchasers = daily_paywall_views * medium_conversion_rate

# Subtract the initial daily_purchases number from this new value to see the lift
purchaser_lift = medium_purchasers - daily_purchases

print('medium_conversion_rate:',medium_conversion_rate)
print('medium_purchasers:',medium_purchasers)
print('purchaser_lift:',purchaser_lift)

medium_conversion_rate: 0.10027433628318584
medium_purchasers: 75540.0
purchaser_lift: 12590.0

In [16]:
large_sensitivity = 0.5

# Find the conversion rate lift with the sensitivity above
large_conversion_rate = conversion_rate * (1 + large_sensitivity)

# Find how many more users per day that translates to
large_purchasers = daily_paywall_views * large_conversion_rate
purchaser_lift = large_purchasers - daily_purchases

print('large_conversion_rate:',large_conversion_rate)
print('large_purchasers:',large_purchasers)
print('purchaser_lift:',purchaser_lift)

large_conversion_rate: 0.1253429203539823
large_purchasers: 94425.00000000001
purchaser_lift: 31475.000000000015


# Data variability¶

• Important to understand the variability in your data
• Does the conversion rate vary a lot among users?
• If it does not then it will be easier to detect a change

# Standard error¶

Here, we will explore how to calculate the standard error of a conversion rate, step by step. For a conversion rate $p$ estimated from $n$ observations, the variance of the estimate is $p(1-p)/n$ and the standard error is its square root.

In [17]:
# Find the number of paywall views
n = df_merged['payment_confirm'].count()

# Calculate the quantity "v"
v = conversion_rate * (1 - conversion_rate)

# Calculate the variance and standard error of the estimate
var = v / n
se = var**0.5

print('Variance:', var)
print('Standard Error:', se)

Variance: 8.471166806691676e-07
Standard Error: 0.0009203894179471903


# Calculating the sample size of our test¶

## Null hypothesis¶

• Hypothesis that control & treatment have the same impact on the response
• Updated paywall does not improve conversion rate
• Any observed difference is due to randomness
• Rejecting the Null Hypothesis
• Determine there is a difference between the treatment and control
• Statistically significant result

## Types of error & confidence level¶

• Confidence Level: Probability of not making a Type I error
• Type I Error: False positive, claiming something has happened when it has not
• Type II Error: False negative, claiming something has not happened when it has
• The higher this value, the larger the test sample needed
• Common values: $0.90$ & $0.95$

### Statistical Power:¶

Statistical Power: Probability of finding a statistically significant result when the Null Hypothesis is false

• Sample size increases = Power increases
• Confidence level increases = Power decreases
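Both relationships can be verified directly with statsmodels; a small sketch (the effect size, sample sizes, and alpha values are illustrative):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power grows with sample size (effect size and alpha held fixed)
powers = [analysis.solve_power(effect_size=0.2, nobs1=n, alpha=0.05)
          for n in (100, 400, 1600)]
print([round(p, 3) for p in powers])

# Raising the confidence level (lowering alpha) reduces power
strict = analysis.solve_power(effect_size=0.2, nobs1=400, alpha=0.01)
print(round(strict, 3))
```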

### Connecting the Different Components¶

• Estimate our needed sample size from:
• needed level of sensitivity
• our desired test power & confidence level

# Calculating the sample size and effect Size¶

To reach statistical significance, our sample size must be large enough. To determine how many users we need in the test and control groups under various circumstances, we will use statsmodels' solve_power() function, leaving nobs1 as None to get the needed sample size for our experiment.

Effect Size: The quantified magnitude of a result present in the population. Effect size is calculated using a specific statistical measure, such as Pearson’s correlation coefficient for the relationship between variables or Cohen’s d for the difference between groups.

In [18]:
from statsmodels.stats import power as pwr

# Calculate conversion rate mean and std
purchase_mean = df_merged.payment_confirm.mean()
purchase_std = df_merged.payment_confirm.std()

# Setting the parameters and we want to increase the purchase_mean to 0.1 in this experiment
effect_size = (0.1 - purchase_mean)/purchase_std
power = 0.8
alpha = 0.05

# Calculate ratio
sizes = [cont_n,test_n]
ratio = max(sizes)/min(sizes)

# Initialize analysis and calculate sample size
analysis = pwr.TTestIndPower()
ssresult = analysis.solve_power(effect_size=effect_size, power=power, alpha=alpha, nobs1=None, ratio=ratio)

print(f'Sample Size: {int(ssresult)}')

Sample Size: 4776


### Effect Size¶

Knowing the needed sample size, we calculate the minimum detectable effect size.

In [19]:
# Set parameters for entire dataset
alpha = 0.05
power = 0.8
samp_size = int(ssresult)

# Initialize analysis & calculate effect size
analysis = pwr.TTestIndPower()
esresult = analysis.solve_power(effect_size=None,
                                power=power,
                                nobs1=samp_size,
                                ratio=ratio,
                                alpha=alpha)

print(f'Minimum detectable effect size: {round(esresult,2)}')

Minimum detectable effect size: 0.06


### Statistical Power¶

Knowing the effect size and the needed sample size, we calculate the statistical power.

In [20]:
# Set parameters
effect_size = esresult
alpha = 0.05

# Initialize analysis & calculate power
analysis = pwr.TTestIndPower()
pwresult = analysis.solve_power(effect_size=effect_size, power=None, alpha=alpha, nobs1=samp_size, ratio=ratio)

print(f'Power: {round(pwresult,3)}')

Power: 0.8


# Analyzing the A/B test results¶

## Confirming our test results¶

We will confirm that everything ran correctly for an A/B test. The checks we will perform will allow us to confidently report any results we uncover.

In [21]:
# Find the unique users in each group
results = df_merged.groupby('group').agg({'user_id': pd.Series.nunique})

# Find the overall number of unique users using "len" and "unique"
unique_users = len(df_merged.user_id.unique())

# Find the percentage in each group
results = results / unique_users * 100
print('Percentage of users in each group:','\n', results)

Percentage of users in each group:
user_id
group
Control  50.899336
Test     49.100664

In [22]:
# Find the unique users in each group, by device and gender
results = df_merged.groupby(by=['group', 'device', 'sex']).agg({'user_id': pd.Series.nunique})

# Find the overall number of unique users using "len" and "unique"
unique_users = len(df_merged.user_id.unique())

# Find the percentage in each group
results = results / unique_users * 100
print('Percentage of users in each group:','\n', results)

Percentage of users in each group:
user_id
group   device  sex
Control Desktop Female  16.853982
Male    17.161504
Mobile  Female   8.337389
Male     8.546460
Test    Desktop Female  16.328540
Male    16.248894
Mobile  Female   8.341814
Male     8.181416


# Is the result statistically significant?¶

• Statistical Significance: Are the conversion rates different enough?
• If yes then we reject the null hypothesis
• Conclude that the paywalls have different effects
• If no then it may just be randomness

## p-values¶

• probability if the Null Hypothesis is true...
• of observing a value as or more extreme than the one we observed
• Low p-values
• represent potentially significant results
• the observation is unlikely to have happened due to randomness

### Interpreting p-values¶

• Controversial concept in some ways

#### Revisiting statistical significance¶

• Distribution of the expected difference between control and test groups if the Null Hypothesis is true
• Red line: The observed difference in conversion rates from our test
• p-value: Probability of being as or more extreme than the red line on either side of the distribution
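As a quick illustration of that definition (the observed test statistic here is made up):

```python
from scipy import stats

z = 2.1  # hypothetical observed test statistic

# Probability of landing as far or farther out in either tail
# of the standard normal distribution
p_value = 2 * stats.norm.sf(abs(z))
print(round(p_value, 4))
```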

# Student's t-test¶

The t-test tells you how significant the differences between groups are; in other words, it lets you know whether those differences (measured in means) could have happened by random chance.

Two basic types:

• One-sample: Mean of population different from a given value?
• Two-sample: Two population means equal?
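A minimal sketch of both variants on simulated data (the distributions and sample sizes are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 1000)  # sample centered at 0
b = rng.normal(0.3, 1.0, 1000)  # sample centered at 0.3

# One-sample: is the mean of `a` different from a given value (0)?
t1, p1 = stats.ttest_1samp(a, 0.0)

# Two-sample: are the means of `a` and `b` equal?
t2, p2 = stats.ttest_ind(a, b)
print(p1, p2)
```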

## Checking for statistical significance I¶

Now that we have an intuitive understanding of statistical significance and p-values, we will apply it to our test result data.

Here we calculate the size of the test and control groups and calculate their respective conversion rates.

In [23]:
test = df_merged[df_merged.group == 'Test']
control = df_merged[df_merged.group == 'Control']

test_size = len(test['user_id'])
cont_size = len(control['user_id'])

cont_conv = control.payment_confirm.mean()
test_conv = test.payment_confirm.mean()

cont_conv_std = control.payment_confirm.std()
test_conv_std = test.payment_confirm.std()

print('Control Group Size:', cont_size)
print('Test Group Size:', test_size)

print(f'\nControl group conversion rate = {cont_conv}, std = {cont_conv_std}')
print(f'Test group conversion rate = {test_conv}, std = {test_conv_std}')

Control Group Size: 46013
Test Group Size: 44387

Control group conversion rate = 0.07780409884163171, std = 0.2792274166940439
Test group conversion rate = 0.08953071845360128, std = 0.29922792721917507


## Checking for statistical significance II¶

How can we be sure this experiment is successful and that the difference didn't arise from other factors?

To answer this question, we need to check whether the uptick in the test group is statistically significant. The scipy library lets us check this programmatically with the stats.ttest_ind() function:

In [24]:
test_results = df_merged[df_merged.group == 'Test']['payment_confirm']
control_results = df_merged[df_merged.group == 'Control']['payment_confirm']

test_result = stats.ttest_ind(test_results, control_results)

statistic = test_result[0]
p_value = test_result[1]

print('statistic = ', statistic)
print('p_value = ', p_value)

# Check for statistical significance
if p_value >= 0.05:
    print("Not Significant")
else:
    print("Significant Result")

statistic =  6.094350845714621
p_value =  1.1032472003129373e-09
Significant Result


# Sample Statistics versus Population¶

We will construct samples by drawing points at random from the full dataset (population). We will compute the mean and standard deviation of each sample to test whether it is representative of the population. Our goal is to see whether the sample statistics are the same as, or very close to, the population statistics.

In [25]:
subset_convs, test_sub_convs, cont_sub_convs = [], [], []
subset_convs_std, test_sub_convs_std, cont_sub_convs_std = [], [], []

for i in range(1000):
    subset = df_merged.sample(n=int(ssresult))

    test_sub = subset[subset.group == 'Test']
    control_sub = subset[subset.group == 'Control']

    subset_conv = subset.payment_confirm.mean()
    test_sub_conv = test_sub.payment_confirm.mean()
    control_sub_conv = control_sub.payment_confirm.mean()

    subset_conv_std = subset.payment_confirm.std()
    test_sub_conv_std = test_sub.payment_confirm.std()
    control_sub_conv_std = control_sub.payment_confirm.std()

    subset_convs.append(subset_conv)
    test_sub_convs.append(test_sub_conv)
    cont_sub_convs.append(control_sub_conv)

    subset_convs_std.append(subset_conv_std)
    test_sub_convs_std.append(test_sub_conv_std)
    cont_sub_convs_std.append(control_sub_conv_std)


## Visualizing Variation of the samples¶

In [26]:
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(13,5))

ax[0].hist(subset_convs, bins=50, color='r', alpha=0.5, rwidth=0.75, label='Sample')
ax[1].hist(test_sub_convs, bins=50, color='b', alpha=0.5, rwidth=0.75, label='Test Sample')
ax[2].hist(cont_sub_convs, bins=50, color='g', alpha=0.5, rwidth=0.75, label='Control Sample')
ax[0].set_ylabel('Density', fontsize=14)
ax[0].set_title(f'Population sample mean = {round(np.mean(subset_convs),4)}, std = {round(np.mean(subset_convs_std),4)}', fontsize=12)
ax[1].set_title(f'Test sample mean = {round(np.mean(test_sub_convs),4)}, std = {round(np.mean(test_sub_convs_std),4)}', fontsize=12)
ax[2].set_title(f'Control sample mean = {round(np.mean(cont_sub_convs),4)}, std = {round(np.mean(cont_sub_convs_std),4)}', fontsize=12)
ax[0].legend()
ax[1].legend()
ax[2].legend()
plt.tight_layout()
fig.text(0.5, 0.001, 'Conversion Rate', ha='center', fontsize=14)
fig.suptitle('1k random samples of conversion rates', fontsize=24)
plt.show()

print(f'Population: Conversion rate = {round(conversion_rate,4)}, Sample Conversion rate = {round(np.mean(subset_convs),4)}')
print(f'Control group: Population conversion rate = {round(cont_conv,4)}, Sample Conversion rate = {round(np.mean(cont_sub_convs),4)}')
print(f'Test group: Population conversion rate = {round(test_conv,4)}, Sample Conversion rate = {round(np.mean(test_sub_convs),4)}')

print(f'\nPopulation: Conversion std = {round(pop_std,4)}, Sample Conversion std = {round(np.mean(subset_convs_std),4)}')
print(f'Control group: Population conversion std = {round(cont_conv_std,4)}, Sample Conversion std = {round(np.mean(cont_sub_convs_std),4)}')
print(f'Test group: Population conversion std = {round(test_conv_std,4)}, Sample Conversion std = {round(np.mean(test_sub_convs_std),4)}')

Population: Conversion rate = 0.0836, Sample Conversion rate = 0.0837
Control group: Population conversion rate = 0.0778, Sample Conversion rate = 0.0778
Test group: Population conversion rate = 0.0895, Sample Conversion rate = 0.0899

Population: Conversion std = 0.2893, Sample Conversion std = 0.2895
Control group: Population conversion std = 0.2792, Sample Conversion std = 0.2792
Test group: Population conversion std = 0.2992, Sample Conversion std = 0.2994


# What is a confidence interval¶

• Range of values for our estimation rather than single number
• Provides context for our estimation process
• Series of repeated experiments...
• the calculated intervals will contain the true parameter X% of the time
• The true conversion rate is a fixed quantity, our estimation and the interval are variable
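For intuition, a standard normal-approximation interval for a single conversion rate can be sketched as follows (the counts are hypothetical):

```python
from scipy import stats

# Hypothetical sample: 820 conversions out of 10,000 users
p_hat, n = 0.082, 10_000
se = (p_hat * (1 - p_hat) / n) ** 0.5

z = stats.norm.ppf(0.975)  # two-sided 95% interval
ci = (p_hat - z * se, p_hat + z * se)
print(ci)
```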

## Calculating confidence intervals¶

We will calculate the confidence intervals for the A/B test results.

In [27]:
def get_ci(value, cl, sd):
    loc = stats.norm.ppf(1 - cl/2)
    rng_val = stats.norm.cdf(loc - value/sd)

    lwr_bnd = value - rng_val
    upr_bnd = value + rng_val

    return (lwr_bnd, upr_bnd)

In [28]:
# Calculate the mean of our lift distribution
lift_mean = test_conv - cont_conv

# Calculate variance and standard deviation
lift_variance = (1 - test_conv) * test_conv / test_size + (1 - cont_conv) * cont_conv / cont_size
lift_sd = lift_variance**0.5

# Find the confidence intervals with cl = 0.95
confidence_interval = get_ci(lift_mean, 0.95, lift_sd)
print('confidence_interval = ', confidence_interval)

confidence_interval =  (0.011726619463972564, 0.011726619759966568)


# Plotting the distribution¶

Here, we will visualize the test and control conversion rates as distributions. Additionally, viewing the data in this way can give a sense of the variability inherent in our estimation.

In [29]:
# Compute the variance
cont_var = (cont_conv * (1 - cont_conv)) / cont_size
test_var = (test_conv * (1 - test_conv)) / test_size

# Compute the standard deviations
control_sd = cont_var**0.5
test_sd = test_var**0.5

# Create the range of x values
control_line = np.linspace(cont_conv - 3 * control_sd, cont_conv + 3 * control_sd, 100)
test_line = np.linspace(test_conv - 3 * test_sd, test_conv + 3 * test_sd, 100)

# Plot the distributions
plt.plot(control_line, stats.norm.pdf(control_line, cont_conv, control_sd), label='Control')
plt.plot(test_line, stats.norm.pdf(test_line, test_conv, test_sd), label='Test')
plt.legend()
plt.show()


# Plotting the difference distribution¶

Now let's plot the difference distribution of our results, that is, the distribution of our lift.

In [30]:
# Find the lift mean and standard deviation
sizes = [test_conv, cont_conv]
lift_mean = max(sizes) - min(sizes)
lift_sd = (test_var + cont_var) ** 0.5

# Generate the range of x-values
lift_line = np.linspace(lift_mean - 3 * lift_sd, lift_mean + 3 * lift_sd, 100)

# Find the confidence intervals with cl = 0.95
confidence_interval = get_ci(lift_mean, 0.95, lift_sd)

# Plot the lift distribution
plt.plot(lift_line, stats.norm.pdf(lift_line, lift_mean, lift_sd))