The PyMC3 workflow¶

PyMC3 aims for intuitive and readable, yet powerful syntax that reflects how statisticians describe models. The modeling process generally follows these five steps:

Encode a probability model by defining the following:
1. The prior distributions that quantify knowledge and uncertainty about latent variables
2. The likelihood function that conditions the parameters on observed data

Analyze the posterior using one of the options:
1. Obtain a point estimate using MAP inference
2. Sample from the posterior using MCMC methods

Approximate the posterior using variational Bayes.
Check the model using various diagnostic tools.
Generate predictions.

The resulting model can be used for inference to gain detailed insights into parameter values as well as to predict outcomes for new data points.

Logistic regression with PyMC3¶

Logistic regression estimates a linear relationship between a set of features and a binary outcome, mediated by a sigmoid function to ensure the model produces probabilities. The frequentist approach resulted in point estimates for the parameters that measure the influence of each feature on the probability that a data point belongs to the positive class, with confidence intervals based on assumptions about the parameter distribution.

In contrast, Bayesian logistic regression estimates the posterior distribution over the parameters itself. The posterior allows for more robust estimates of what is called a Bayesian credible interval for each parameter with the benefit of more transparency about the model’s uncertainty.

Imports¶

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
from pathlib import Path
import pickle
from collections import OrderedDict
import pandas as pd
import numpy as np
from scipy import stats

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_curve, roc_auc_score, confusion_matrix, accuracy_score, f1_score, 
                             precision_recall_curve) 
from mlxtend.plotting import plot_confusion_matrix

import theano
import pymc3 as pm
from pymc3.variational.callbacks import CheckParametersConvergence
import statsmodels.formula.api as smf

import arviz as az
import matplotlib.pyplot as plt
import matplotlib.cm as cm

import seaborn as sns
from IPython.display import HTML

data_path = Path('data')
fig_path = Path('figures')
model_path = Path('models')
for p in [data_path, fig_path, model_path]:
    if not p.exists():
        p.mkdir()

The Data¶

The goal of this dataset is to create a binary classification model that predicts whether or not over $11,000$ customers will subscribe to a term deposit after a marketing campaign the bank has performed, based on $17$ predictor variables. The target variable is given as deposit and takes on a value of $1$ if the customer has subscribed and $0$ otherwise.

This is a slightly imbalanced class problem because there are more customers that did not subscribe the term deposit than did.

Load from Disk¶

data = pd.read_csv('bank.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 columns):
age          11162 non-null int64
job          11162 non-null object
marital      11162 non-null object
education    11162 non-null object
default      11162 non-null object
balance      11162 non-null int64
housing      11162 non-null object
loan         11162 non-null object
contact      11162 non-null object
day          11162 non-null int64
month        11162 non-null object
duration     11162 non-null int64
campaign     11162 non-null int64
pdays        11162 non-null int64
previous     11162 non-null int64
poutcome     11162 non-null object
deposit      11162 non-null object
dtypes: int64(7), object(10)
memory usage: 1.4+ MB

data.head()

Data Preprocessing¶

categorical_features = data.select_dtypes(include=[np.object])

numeric_features = data.select_dtypes(include=[np.number])

cols = numeric_features.columns
numeric_features.loc[:, cols] = preprocessing.scale(numeric_features.loc[:, cols])

le = preprocessing.LabelEncoder()

categorical_features = data[categorical_features.columns].apply(lambda x: le.fit_transform(x))

data = pd.concat([categorical_features, numeric_features], axis=1)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 columns):
job          11162 non-null int64
marital      11162 non-null int64
education    11162 non-null int64
default      11162 non-null int64
housing      11162 non-null int64
loan         11162 non-null int64
contact      11162 non-null int64
month        11162 non-null int64
poutcome     11162 non-null int64
deposit      11162 non-null int64
age          11162 non-null float64
balance      11162 non-null float64
day          11162 non-null float64
duration     11162 non-null float64
campaign     11162 non-null float64
pdays        11162 non-null float64
previous     11162 non-null float64
dtypes: float64(7), int64(10)
memory usage: 1.4 MB

data.head()

Quick EDA¶

total = len(data)
plt.figure(figsize=(7,5))
g = sns.countplot(x='deposit', data=data)
g.set_ylabel('Count', fontsize=14)
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x()+p.get_width()/2.,
            height + 1.5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=14, fontweight='bold')
plt.margins(y=0.1)
plt.show()

plt.figure(figsize=(13, 13))
corr = data.corr() 
mask = np.tri(*corr.shape).T 
sns.heatmap(corr.abs(), mask=mask, annot=True)
b, t = plt.ylim() 
b += 0.5 
t -= 0.5 
plt.ylim(b, t) 
plt.show()

n_fts = len(data.columns)
colors = cm.rainbow(np.linspace(0, 1, n_fts))

data.drop('deposit',axis=1).corrwith(data.deposit).sort_values(ascending=True).plot(kind='barh', 
                                                                                     color=colors, figsize=(12, 6))
plt.title('Correlation to Target (deposit)')
plt.show()

print('\n',data.drop('deposit',axis=1).corrwith(data.deposit).sort_values(ascending=False))

 duration     0.451919
pdays        0.151593
previous     0.139867
education    0.095948
balance      0.081129
marital      0.067610
job          0.063395
age          0.034901
month       -0.037121
default     -0.040680
day         -0.056326
loan        -0.110580
poutcome    -0.122369
campaign    -0.128081
housing     -0.203888
contact     -0.249847
dtype: float64

data['housing'] = -data['housing']
data['contact'] = -data['contact']
data['campaign'] = -data['campaign']
data['poutcome'] = -data['poutcome']
data['loan'] = -data['loan']

corr = data.corr() 

cor_target = corr["deposit"]

relevant_features = cor_target[cor_target > 0.08]
relevant_features.sort_values(ascending=False)

deposit      1.000000
duration     0.451919
contact      0.249847
housing      0.203888
pdays        0.151593
previous     0.139867
campaign     0.128081
poutcome     0.122369
loan         0.110580
education    0.095948
balance      0.081129
Name: deposit, dtype: float64

sns.pairplot(data[relevant_features.index]);

Models¶

simple_model = 'deposit ~ duration + pdays + previous'
full_model = 'deposit ~ duration + pdays + previous + education + balance +  loan + poutcome + campaign +  housing + contact'

MAP Inference¶

A probabilistic program consists of observed and unobserved random variables (RVs). We define the observed RVs via likelihood distributions and unobserved RVs via prior distributions. PyMC3 includes numerous probability distributions for this purpose.

The PyMC3 library makes it very straightforward to perform approximate Bayesian inference for logistic regression. Logistic regression models the probability that individual $i$ subscribes to a deposit based on $k$ features.

Logistic Regression:

$$ p(y_i = 1 \mid \beta) = \sigma \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}) $$

Where $\sigma$ is the logistic function: $$ \sigma(t) = \frac{1}{1 + e^{-1}} $$

Manual Model Specification¶

We will use the context manager with to define a manual_logistic_model that we can refer to later as a probabilistic model:

The random variables for the unobserved parameters for intercept and $3$ features are expressed using uninformative priors that assume normal distributions with mean $0$ and standard deviation of $100$.

The likelihood combines the parameters with the data according to the specification of the logistic regression.

The outcome is modeled as a Bernoulli RV with success probability given by the likelihood.

with pm.Model() as manual_logistic_model:
    # random variables for coefficients with
    # uninformative priors for each parameter

    intercept = pm.Normal('intercept', 0, sd=100)
    beta_1 = pm.Normal('beta_1', 0, sd=100)
    beta_2 = pm.Normal('beta_2', 0, sd=100)
    beta_3 = pm.Normal('beta_3', 0, sd=100)

    # Transform random variables into vector of probabilities p(y_i=1)
    # according to logistic regression model specification.
    likelihood = pm.invlogit(intercept + beta_1 * data.duration + beta_2 * data.pdays + beta_3 * data.previous)

    # Bernoulli random vector with probability of success
    # given by sigmoid function and actual data as observed
    pm.Bernoulli(name='logit', p=likelihood, observed=data.deposit)

manual_logistic_model.model

The command pm.model_to_graphviz(manual_logistic_model) produces the plate notation displayed below.

It shows the unobserved parameters as light and the observed elements as dark circles. The rectangle indicates the number of repetitions of the observed model element implied by the data included in the model definition.

pm.model_to_graphviz(manual_logistic_model)

MAP Estimate¶

Maximum a posteriori probability (MAP) estimation leverages that the evidence is a constant factor that scales the posterior to meet the requirements for a probability distribution. Since the evidence does not depend on $θ$, the posterior distribution is proportional to the product of the likelihood and the prior. Hence, MAP estimation chooses the value of $θ$ that maximizes the posterior given the observed data and the prior belief, that is, the mode of the posterior.

$$ \theta_{MAP} = \operatorname*{arg\,max}_\theta P(X \mid \theta) P(\theta) $$

We obtain point MAP estimates for the three parameters using the just defined model’s .find_MAP() method:

with manual_logistic_model:
    # compute maximum a-posteriori estimate
    # for logistic regression weights
    manual_map_estimate = pm.find_MAP()

logp = -7,759, ||grad|| = 2,784: 100%|██████████| 11/11 [00:00<00:00, 450.08it/s]

def print_map(result):
    return pd.Series({k: np.asscalar(v) for k, v in result.items()})

print_map(manual_map_estimate)

intercept    0.051369
beta_1       1.561395
beta_2       0.260651
beta_3       0.306804
dtype: float64

GLM Model¶

PyMC3 includes numerous common models so that we can usually leave the manual specification for custom applications.

The following code defines the same logistic regression as a member of the Generalized Linear Models (GLM) family using the formula format inspired by the statistical language R and ported to python by the patsy library:

with pm.Model() as logistic_model:
    pm.glm.GLM.from_formula(simple_model,
                            data,
                            family=pm.glm.families.Binomial())

pm.model_to_graphviz(logistic_model)

PyMC3 solves the optimization problem of finding the posterior point with the highest density using the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm but offers several alternatives provided by the scipy library. The result is virtually identically to the corresponding statsmodels estimate:

model = smf.logit(formula=simple_model, data=data[['deposit', 'duration', 'pdays', 'previous']])
result = model.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.538611
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                deposit   No. Observations:                11162
Model:                          Logit   Df Residuals:                    11158
Method:                           MLE   Df Model:                            3
Date:                Thu, 07 May 2020   Pseudo R-squ.:                  0.2214
Time:                        17:55:19   Log-Likelihood:                -6012.0
converged:                       True   LL-Null:                       -7721.6
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0514      0.023      2.192      0.028       0.005       0.097
duration       1.5614      0.037     42.443      0.000       1.489       1.633
pdays          0.2607      0.027      9.816      0.000       0.209       0.313
previous       0.3068      0.032      9.585      0.000       0.244       0.370
==============================================================================

print_map(manual_map_estimate)

intercept    0.051369
beta_1       1.561395
beta_2       0.260651
beta_3       0.306804
dtype: float64

result.params

Intercept    0.051370
duration     1.561394
pdays        0.260652
previous     0.306804
dtype: float64

Markov Chain Monte Carlo¶

Markov chains are stochastic models that describe sequences of possible events. Each event comes from a set of outcomes, and each outcome determines which outcome occurs next, according to a fixed set of probabilities. An important feature of Markov chains is that they are memoryless: everything that you would possibly need to predict the next event is available in the current state, and no new information comes from knowing historical events.

Monte Carlo methods rely on repeated random sampling to approximate results that may be deterministic, but that does not permit an analytic, exact solution.

Many algorithms apply the Monte Carlo method to a Markov Chain, and generally proceed as follows:

Start at the current position.
Draw a new position from a proposal distribution.
Evaluate the probability of the new position in light of data and prior distributions:
1. If sufficiently likely, move to the new position
2. Otherwise, remain at the current position
Repeat from step 1.
After a given number of iterations, return all accepted positions.

Define the Model¶

We will use a slightly more complicated model to illustrate Markov chain Monte Carlo inference:

full_model = 'deposit ~ duration + pdays + previous + education + balance + loan + poutcome + campaign + housing + contact'

with pm.Model() as logistic_model:
    pm.glm.GLM.from_formula(formula=full_model,
                            data=data,
                            family=pm.glm.families.Binomial())

logistic_model.basic_RVs

[Intercept,
 duration,
 pdays,
 previous,
 education,
 balance,
 loan,
 poutcome,
 campaign,
 housing,
 contact,
 y]

Hamiltonian Monte Carlo – going NUTS¶

By default, PyMC3 automatically selects the most efficient sampler and initializes the sampling process for efficient convergence. For a continuous model, PyMC3 chooses the NUTS sampler. It also runs variational inference via ADVI to find good starting parameters for the sampler. One among several alternatives is to use the MAP estimate.

Hamiltonian Monte Carlo (HMC) is a hybrid method that leverages the first-order derivative information of the gradient of the likelihood to propose new states for exploration and overcome some of the challenges of MCMC. In addition, it incorporates momentum to efficiently jump around the posterior. As a result, it converges faster to a high-dimensional target distribution than simpler random-walk Metropolis or Gibbs sampling.

To see what the convergence looks like, we first draw $1,000$ samples after tuning the sampler for $1,000$ iterations that will be discarded. The sampling process can be parallelized for multiple chains using the cores argument (except when using GPU).

with logistic_model:
    trace = pm.sample(tune=1000,
                         draws=1000,
                         chains=4,
                         init = 'adapt_diag',
                         cores=3)

Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 3 jobs)
NUTS: [contact, housing, campaign, poutcome, loan, balance, education, previous, pdays, duration, Intercept]
Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [01:23<00:00, 95.79draws/s]

Inspect Trace¶

pm.plot_trace(trace);

pm.trace_to_dataframe(trace).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 11 columns):
Intercept    4000 non-null float64
duration     4000 non-null float64
pdays        4000 non-null float64
previous     4000 non-null float64
education    4000 non-null float64
balance      4000 non-null float64
loan         4000 non-null float64
poutcome     4000 non-null float64
campaign     4000 non-null float64
housing      4000 non-null float64
contact      4000 non-null float64
dtypes: float64(11)
memory usage: 343.9 KB

with open(model_path / 'logistic_model_mh.pkl', 'wb') as buff:
    pickle.dump({'model': logistic_model, 'trace': trace}, buff)

PyMC3 produces various summary statistics for a sampler. These are available as individual functions in the stats module, or by providing a trace to the pm.summary() function:

pm.summary(trace)

draws = 100

trace_df = pm.trace_to_dataframe(trace).assign(
    chain=lambda x: x.index // draws)

trace_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 12 columns):
Intercept    4000 non-null float64
duration     4000 non-null float64
pdays        4000 non-null float64
previous     4000 non-null float64
education    4000 non-null float64
balance      4000 non-null float64
loan         4000 non-null float64
poutcome     4000 non-null float64
campaign     4000 non-null float64
housing      4000 non-null float64
contact      4000 non-null float64
chain        4000 non-null int64
dtypes: float64(11), int64(1)
memory usage: 375.1 KB

Persist Results¶

with open(model_path / 'logistic_model_nuts.pkl', 'wb') as buff:
    pickle.dump({'model': logistic_model,
                 'trace': trace}, buff)

with open(model_path / 'logistic_model_nuts.pkl', 'rb') as buff:
    data0 = pickle.load(buff)  

logistic_model, trace_NUTS = data0['model'], data0['trace']

Combine Traces¶

draws = 10000

df = pm.trace_to_dataframe(trace_NUTS).iloc[200:].reset_index(
    drop=True).assign(chain=lambda x: x.index // draws)

trace_df = pd.concat([trace_df.assign(samples=100),
                      df.assign(samples=10000)])
trace_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7800 entries, 0 to 3799
Data columns (total 13 columns):
Intercept    7800 non-null float64
duration     7800 non-null float64
pdays        7800 non-null float64
previous     7800 non-null float64
education    7800 non-null float64
balance      7800 non-null float64
loan         7800 non-null float64
poutcome     7800 non-null float64
campaign     7800 non-null float64
housing      7800 non-null float64
contact      7800 non-null float64
chain        7800 non-null int64
samples      7800 non-null int64
dtypes: float64(11), int64(2)
memory usage: 853.1 KB

Visualize both traces¶

trace_df_long = pd.melt(trace_df, id_vars=['samples', 'chain'])
trace_df_long.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85800 entries, 0 to 85799
Data columns (total 4 columns):
samples     85800 non-null int64
chain       85800 non-null int64
variable    85800 non-null object
value       85800 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 2.6+ MB

Convergence¶

We can visualize the samples over time and their distributions to check the quality of the results. The following charts show the posterior distributions after an initial $100$ and an additional $10,000$ samples, respectively, and illustrate how convergence implies that multiple chains identify the same distribution. The pm.trace_plot() function shows the evolution of the samples as well.

g = sns.FacetGrid(trace_df_long, col='variable', row='samples',
                  hue='chain', sharex='col', sharey=False)
g = g.map(sns.distplot, 'value', hist=False, rug=False)

Estimate Odds Ratio¶

An odds ratio (OR) is a measure of association between a certain property $A$ and a second property $B$ in a population. Specifically, it tells you how the presence or absence of property $A$ has an effect on the presence or absence of property $B$. The OR is also used to figure out if a particular exposure (like eating processed meat) is a risk factor for a particular outcome (such as colon cancer), and to compare the various risk factors for that outcome. As long as you have two properties you think are linked, you can calculate the odds.

You have two choices for the formula: $$(a \div c) \div (b \div d)$$

or, equivalently: $$(a \times d) \div (b \times c)$$

$OR > 1$ indicates increased occurrence of event
$OR < 1$ indicates decreased occurrence of event (protective exposure)

b = trace['duration']
lb, ub = np.percentile(b, 2.5), np.percentile(b, 97.5)
lb, ub = np.exp(lb), np.exp(ub)
print(f'P({lb:.3f} < Odds Ratio < {ub:.3f}) = 0.95')

P(5.323 < Odds Ratio < 6.214) = 0.95

We can interpret something along those lines: "With probability $0.95$ the odds ratio is greater than $5.310$ and less than $6.233$, so the duration effect takes place because a person with a longer duration has at least $5.310$ higher probability to subscribe to a term deposit than a person with a lower duration, while holding all the other variables constant.

Computing Credible Intervals¶

We can compute the credible intervals, the Bayesian counterpart of confidence intervals, as percentiles of the trace. The resulting boundaries reflect our confidence about the range of the parameter value for a given probability threshold, as opposed to the number of times the parameter will be within this range for a large number of trials.

fig, ax = plt.subplots(figsize=(8, 4))
sns.distplot(np.exp(b), axlabel='Odds Ratio', ax=ax)
ax.set_title(f'Credible Interval: P({lb:.3f} < Odds Ratio < {ub:.3f}) = 0.95')
ax.axvspan(lb, ub, alpha=0.5, color='gray');

Variational Inference¶

Variational Inference (VI) is a machine learning method that approximates probability densities through optimization. In the Bayesian context, it approximates the posterior distribution as follows:

Select a parametrized family of probability distributions
Find the member of this family closest to the target, as measured by Kullback-Leibler divergence

Compared to MCMC, Variational Bayes tends to converge faster and scales to large data better. While MCMC approximates the posterior with samples from the chain that will eventually converge arbitrarily close to the target, variational algorithms approximate the posterior with the result of the optimization, which is not guaranteed to coincide with the target.

Variational Inference is better suited for large datasets and to quickly explore many models.In contrast, MCMC will deliver more accurate results on smaller datasets or when time and computational resources pose fewer constraints.

Run Automatic Differentation Variational Inference (ADVI)¶

The interface for variational inference is very similar to the MCMC implementation. We just use the fit() instead of the sample() function, with the option to include an early stopping CheckParametersConvergence callback if the distribution-fitting process converged up to a given tolerance:

with logistic_model:
    callback = CheckParametersConvergence(diff='absolute')
    approx = pm.fit(n=100000, callbacks=[callback])

Average Loss = 5,365.7: 100%|██████████| 100000/100000 [02:08<00:00, 777.98it/s]
Finished [100%]: Average Loss = 5,365.7

Persist Result¶

with open(model_path / 'logistic_model_advi.pkl', 'wb') as buff:
    pickle.dump({'model': logistic_model,
                 'approx': approx}, buff)

Sample from approximated distribution¶

We can draw samples from the approximated distribution to obtain a trace object as above for the MCMC sampler:

trace_advi = approx.sample(2000)

pm.summary(trace_advi)

arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)

# pm.summary(trace_advi).to_csv(model_path / 'trace_advi.csv')

Let us visualize the covariance structure of the model

az.plot_pair(trace_NUTS, figsize=(10, 10));

Let us visualize the covariance structure of the ADVI model

az.plot_pair(trace_advi, figsize=(10, 10));

Clearly, ADVI does not capture (as expected) the interactions between variables because of the mean field approximation, and so it underestimates the overall variance.

Model Diagnostics¶

Bayesian model diagnostics includes validating that the sampling process has converged and consistently samples from high-probability areas of the posterior, and confirming that the model represents the data well.

For high-dimensional models with many variables, it becomes cumbersome to inspect numerous traces. When using NUTS, the energy plot helps to assess problems of convergence. It summarizes how efficiently the random process explores the posterior. The plot shows the energy and the energy transition matrix that should be well matched as in the below example.

Energy Plot¶

When using NUTS, the energy plot helps to assess problems of convergence. It summarizes how efficiently the random process explores the posterior. The plot shows the energy and the energy transition matrix, which should be well-matched.

pm.energyplot(trace_NUTS);

Forest Plot¶

A forest plot, also known as a blobbogram, is a graphical display of estimated results from a number of scientific studies addressing the same question, along with the overall results. It was developed for use in medical research as a means of graphically representing a meta-analysis of the results of randomized controlled trials.

az.plot_forest([trace_advi, trace_NUTS]);

Posterior Plot¶

pm.plot_posterior(trace_NUTS);

Posterior Predictive Checks¶

PPCs are very useful for examining how well a model fits the data. They do so by generating data from the model using parameters from draws from the posterior. We use the function pm.sample_ppc for this purpose and obtain $n$ samples for each observation (the GLM module automatically names the outcome $y$):

ppc = pm.sample_ppc(trace_NUTS, samples=500, model=logistic_model)

100%|██████████| 500/500 [00:00<00:00, 561.36it/s]

ppc['y'].shape

(500, 11162)

Check AUC Score¶

y_score = np.mean(ppc['y'], axis=0)

pred_scores = dict(y_true=data.deposit,y_score=y_score)
roc_auc_score(**pred_scores)

0.8665637154908409

Prediction¶

Follows PyMC3 docs

Predictions use theano’s shared variables to replace the training data with test data before running posterior predictive checks. To facilitate visualization, we create the train and test datasets, and convert the former to a shared variable. Note that we need to use numpy arrays and provide a list of column labels:

Train-test split¶

X = data.drop('deposit', axis=1)
y = data.deposit
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42)
labels = X_train.columns

Create shared theano variable¶

X_shared = theano.shared(X_train.values)

Define logistic model¶

with pm.Model() as logistic_model_pred:
    pm.glm.GLM(x=X_shared, labels=labels,
               y=y_train, family=pm.glm.families.Binomial())

Run NUTS sampler¶

with logistic_model_pred:
    pred_trace = pm.sample(draws=1000, 
                           tune=1000,
                           chains=2,
                           cores=2,
                           init='adapt_diag')

Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [previous, pdays, campaign, duration, day, balance, age, poutcome, month, contact, loan, housing, default, education, marital, job, Intercept]
Sampling 2 chains, 1 divergences: 100%|██████████| 4000/4000 [01:07<00:00, 59.48draws/s]

Replace shared variable with test set¶

We then run the sampler as before, and apply the pm.sample_ppc function to the resulting trace after replacing the train with test data:

X_shared.set_value(X_test)

ppc = pm.sample_ppc(pred_trace,
                    model=logistic_model_pred,
                    samples=100)

100%|██████████| 100/100 [00:00<00:00, 365.49it/s]

AUC Score¶

y_score = np.mean(ppc['y'], axis=0)
roc_auc_score(y_score=np.mean(ppc['y'], axis=0), 
              y_true=y_test)

0.8632915421477958

pred_scores = dict(y_true=y_test, y_score=y_score)
cols = ['False Positive Rate', 'True Positive Rate', 'threshold']
roc = pd.DataFrame(dict(zip(cols, roc_curve(**pred_scores))))

precision_recall_curve¶

precision, recall, ts = precision_recall_curve(y_true=y_test, probas_pred=y_score)
pr_curve = pd.DataFrame({'Precision': precision, 'Recall': recall})

f1_score¶

f1 = pd.Series({t: f1_score(y_true=y_test, y_pred=y_score>t) for t in ts})
best_threshold = f1.idxmax()

Plot results¶

fig, axes = plt.subplots(ncols=3, figsize=(13, 5))

ax = sns.scatterplot(x='False Positive Rate', y='True Positive Rate', data=roc, s=50, legend=False, ax=axes[0])
axes[0].plot('False Positive Rate', 'True Positive Rate', data=roc, lw=1, color='k')
axes[0].plot(np.linspace(0,1,100), np.linspace(0,1,100), color='k', ls='--', lw=1)
axes[0].fill_between(y1=roc['True Positive Rate'], x=roc['False Positive Rate'], alpha=.3, color='red')
axes[0].set_title('Receiver Operating Characteristic')


sns.scatterplot(x='Recall', y='Precision', data=pr_curve, ax=axes[1])
axes[1].set_ylim(0,1)
axes[1].set_title('Precision-Recall Curve')


f1.plot(ax=axes[2], title='F1 Scores', ylim=(0,1))
axes[2].set_xlabel('Threshold')
axes[2].axvline(best_threshold, lw=1, ls='--', color='k')
axes[2].text(text=f'Max F1 @ {best_threshold:.2f}', x=.60, y=.95, s=5)
fig.suptitle(f'roc_auc_score = {round(roc_auc_score(**pred_scores),2)}', fontsize=24)
fig.tight_layout()
plt.subplots_adjust(top=.8)
plt.show();

	job	marital	education	housing	contact	month	poutcome	deposit	age	balance	day	duration	campaign	pdays	previous
0	0	1	1	1	2	8	3	1	1.491505	0.252525	-1.265746	1.930226	-0.554168	-0.481184	-0.36326
1	0	1	1	0	2	8	3	1	1.239676	-0.459974	-1.265746	3.154612	-0.554168	-0.481184	-0.36326
2	9	1	1	1	2	8	3	1	-0.019470	-0.080160	-1.265746	2.929901	-0.554168	-0.481184	-0.36326
3	7	1	1	1	2	8	3	1	1.155733	0.293762	-1.265746	0.596366	-0.554168	-0.481184	-0.36326
4	0	1	2	0	2	8	3	1	1.071790	-0.416876	-1.265746	0.867171	-0.186785	-0.481184	-0.36326

	mean	sd	hpd_3%	hpd_97%	mcse_mean	mcse_sd	ess_mean	ess_sd	ess_bulk	ess_tail	r_hat
Intercept	0.156	0.116	-0.056	0.376	0.003	0.002	1891.0	1891.0	1906.0	2212.0	1.0
duration	1.748	0.040	1.673	1.822	0.001	0.000	4568.0	4568.0	4592.0	3454.0	1.0
pdays	0.354	0.041	0.282	0.439	0.001	0.001	2357.0	2330.0	2361.0	2793.0	1.0
previous	0.285	0.034	0.223	0.351	0.001	0.000	4348.0	4348.0	4328.0	3031.0	1.0
education	0.219	0.033	0.152	0.276	0.001	0.000	3818.0	3818.0	3825.0	2841.0	1.0
balance	0.122	0.027	0.073	0.172	0.000	0.000	5275.0	4906.0	5312.0	3038.0	1.0
loan	0.766	0.079	0.625	0.916	0.001	0.001	4739.0	4684.0	4741.0	3131.0	1.0
poutcome	-0.172	0.042	-0.248	-0.092	0.001	0.001	1909.0	1880.0	1914.0	2212.0	1.0
campaign	0.395	0.035	0.327	0.461	0.001	0.000	4779.0	4768.0	4766.0	2748.0	1.0
housing	1.066	0.051	0.970	1.163	0.001	0.001	4692.0	4681.0	4693.0	3010.0	1.0
contact	0.637	0.036	0.569	0.702	0.000	0.000	5148.0	5148.0	5145.0	3171.0	1.0

	mean	sd	hpd_3%	hpd_97%	mcse_mean	mcse_sd	ess_mean	ess_sd	ess_bulk	ess_tail	r_hat
Intercept	0.150	0.027	0.097	0.200	0.001	0.000	2043.0	2033.0	2044.0	2003.0	NaN
duration	1.749	0.040	1.674	1.823	0.001	0.001	2059.0	2058.0	2062.0	1956.0	NaN
pdays	0.354	0.027	0.306	0.407	0.001	0.000	2036.0	2026.0	2038.0	1921.0	NaN
previous	0.287	0.029	0.235	0.344	0.001	0.000	2047.0	2041.0	2046.0	1829.0	NaN
education	0.218	0.018	0.183	0.250	0.000	0.000	2064.0	2064.0	2066.0	1820.0	NaN
balance	0.124	0.028	0.073	0.177	0.001	0.000	2267.0	2267.0	2267.0	2004.0	NaN
loan	0.763	0.081	0.616	0.915	0.002	0.001	1952.0	1942.0	1949.0	1821.0	NaN
poutcome	-0.172	0.010	-0.191	-0.153	0.000	0.000	1968.0	1968.0	1962.0	1959.0	NaN
campaign	0.402	0.038	0.333	0.473	0.001	0.001	1871.0	1839.0	1853.0	1926.0	NaN
housing	1.070	0.041	0.988	1.144	0.001	0.001	1959.0	1949.0	1971.0	2004.0	NaN
contact	0.639	0.034	0.577	0.703	0.001	0.001	1989.0	1984.0	1984.0	1845.0	NaN

	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome	deposit
0	59	admin.	married	secondary	no	2343	yes	no	unknown	5	may	1042	1	-1	unknown	yes
1	56	admin.	married	secondary	no	45	no	no	unknown	5	may	1467	1	-1	unknown	yes
2	41	technician	married	secondary	no	1270	yes	no	unknown	5	may	1389	1	-1	unknown	yes
3	55	services	married	secondary	no	2476	yes	no	unknown	5	may	579	1	-1	unknown	yes
4	54	admin.	married	tertiary	no	184	no	no	unknown	5	may	673	2	-1	unknown	yes