The PyMC3 workflow

PyMC3 aims for intuitive and readable, yet powerful syntax that reflects how statisticians describe models. The modeling process generally follows these five steps:

  1. Encode a probability model by defining the following:
    1. The prior distributions that quantify knowledge and uncertainty about latent variables
    2. The likelihood function that conditions the parameters on observed data
  1. Analyze the posterior using one of the options:
    1. Obtain a point estimate using MAP inference
    2. Sample from the posterior using MCMC methods
  1. Approximate the posterior using variational Bayes.
  2. Check the model using various diagnostic tools.
  3. Generate predictions.

The resulting model can be used for inference to gain detailed insights into parameter values as well as to predict outcomes for new data points.

Logistic regression with PyMC3

Logistic regression estimates a linear relationship between a set of features and a binary outcome, mediated by a sigmoid function to ensure the model produces probabilities. The frequentist approach resulted in point estimates for the parameters that measure the influence of each feature on the probability that a data point belongs to the positive class, with confidence intervals based on assumptions about the parameter distribution.

In contrast, Bayesian logistic regression estimates the posterior distribution over the parameters itself. The posterior allows for more robust estimates of what is called a Bayesian credible interval for each parameter with the benefit of more transparency about the model’s uncertainty.

Imports

In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
%matplotlib inline
from pathlib import Path
import pickle
from collections import OrderedDict
import pandas as pd
import numpy as np
from scipy import stats

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_curve, roc_auc_score, confusion_matrix, accuracy_score, f1_score, 
                             precision_recall_curve) 
from mlxtend.plotting import plot_confusion_matrix

import theano
import pymc3 as pm
from pymc3.variational.callbacks import CheckParametersConvergence
import statsmodels.formula.api as smf

import arviz as az
import matplotlib.pyplot as plt
import matplotlib.cm as cm

import seaborn as sns
from IPython.display import HTML
In [3]:
data_path = Path('data')
fig_path = Path('figures')
model_path = Path('models')
for p in [data_path, fig_path, model_path]:
    if not p.exists():
        p.mkdir()

The Data

The goal of this dataset is to create a binary classification model that predicts whether or not over $11,000$ customers will subscribe to a term deposit after a marketing campaign the bank has performed, based on $17$ predictor variables. The target variable is given as deposit and takes on a value of $1$ if the customer has subscribed and $0$ otherwise.

This is a slightly imbalanced class problem because there are more customers that did not subscribe the term deposit than did.

Load from Disk

In [4]:
data = pd.read_csv('bank.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 columns):
age          11162 non-null int64
job          11162 non-null object
marital      11162 non-null object
education    11162 non-null object
default      11162 non-null object
balance      11162 non-null int64
housing      11162 non-null object
loan         11162 non-null object
contact      11162 non-null object
day          11162 non-null int64
month        11162 non-null object
duration     11162 non-null int64
campaign     11162 non-null int64
pdays        11162 non-null int64
previous     11162 non-null int64
poutcome     11162 non-null object
deposit      11162 non-null object
dtypes: int64(7), object(10)
memory usage: 1.4+ MB
In [5]:
data.head()
Out[5]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 59 admin. married secondary no 2343 yes no unknown 5 may 1042 1 -1 0 unknown yes
1 56 admin. married secondary no 45 no no unknown 5 may 1467 1 -1 0 unknown yes
2 41 technician married secondary no 1270 yes no unknown 5 may 1389 1 -1 0 unknown yes
3 55 services married secondary no 2476 yes no unknown 5 may 579 1 -1 0 unknown yes
4 54 admin. married tertiary no 184 no no unknown 5 may 673 2 -1 0 unknown yes

Data Preprocessing

In [6]:
categorical_features = data.select_dtypes(include=[np.object])

numeric_features = data.select_dtypes(include=[np.number])

cols = numeric_features.columns
numeric_features.loc[:, cols] = preprocessing.scale(numeric_features.loc[:, cols])

le = preprocessing.LabelEncoder()

categorical_features = data[categorical_features.columns].apply(lambda x: le.fit_transform(x))

data = pd.concat([categorical_features, numeric_features], axis=1)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 columns):
job          11162 non-null int64
marital      11162 non-null int64
education    11162 non-null int64
default      11162 non-null int64
housing      11162 non-null int64
loan         11162 non-null int64
contact      11162 non-null int64
month        11162 non-null int64
poutcome     11162 non-null int64
deposit      11162 non-null int64
age          11162 non-null float64
balance      11162 non-null float64
day          11162 non-null float64
duration     11162 non-null float64
campaign     11162 non-null float64
pdays        11162 non-null float64
previous     11162 non-null float64
dtypes: float64(7), int64(10)
memory usage: 1.4 MB
In [7]:
data.head()
Out[7]:
job marital education default housing loan contact month poutcome deposit age balance day duration campaign pdays previous
0 0 1 1 0 1 0 2 8 3 1 1.491505 0.252525 -1.265746 1.930226 -0.554168 -0.481184 -0.36326
1 0 1 1 0 0 0 2 8 3 1 1.239676 -0.459974 -1.265746 3.154612 -0.554168 -0.481184 -0.36326
2 9 1 1 0 1 0 2 8 3 1 -0.019470 -0.080160 -1.265746 2.929901 -0.554168 -0.481184 -0.36326
3 7 1 1 0 1 0 2 8 3 1 1.155733 0.293762 -1.265746 0.596366 -0.554168 -0.481184 -0.36326
4 0 1 2 0 0 0 2 8 3 1 1.071790 -0.416876 -1.265746 0.867171 -0.186785 -0.481184 -0.36326

Quick EDA

In [8]:
total = len(data)
plt.figure(figsize=(7,5))
g = sns.countplot(x='deposit', data=data)
g.set_ylabel('Count', fontsize=14)
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x()+p.get_width()/2.,
            height + 1.5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=14, fontweight='bold')
plt.margins(y=0.1)
plt.show()
In [9]:
plt.figure(figsize=(13, 13))
corr = data.corr() 
mask = np.tri(*corr.shape).T 
sns.heatmap(corr.abs(), mask=mask, annot=True)
b, t = plt.ylim() 
b += 0.5 
t -= 0.5 
plt.ylim(b, t) 
plt.show()
In [10]:
n_fts = len(data.columns)
colors = cm.rainbow(np.linspace(0, 1, n_fts))

data.drop('deposit',axis=1).corrwith(data.deposit).sort_values(ascending=True).plot(kind='barh', 
                                                                                     color=colors, figsize=(12, 6))
plt.title('Correlation to Target (deposit)')
plt.show()

print('\n',data.drop('deposit',axis=1).corrwith(data.deposit).sort_values(ascending=False))
 duration     0.451919
pdays        0.151593
previous     0.139867
education    0.095948
balance      0.081129
marital      0.067610
job          0.063395
age          0.034901
month       -0.037121
default     -0.040680
day         -0.056326
loan        -0.110580
poutcome    -0.122369
campaign    -0.128081
housing     -0.203888
contact     -0.249847
dtype: float64
In [11]:
data['housing'] = -data['housing']
data['contact'] = -data['contact']
data['campaign'] = -data['campaign']
data['poutcome'] = -data['poutcome']
data['loan'] = -data['loan']

corr = data.corr() 

cor_target = corr["deposit"]

relevant_features = cor_target[cor_target > 0.08]
relevant_features.sort_values(ascending=False)
Out[11]:
deposit      1.000000
duration     0.451919
contact      0.249847
housing      0.203888
pdays        0.151593
previous     0.139867
campaign     0.128081
poutcome     0.122369
loan         0.110580
education    0.095948
balance      0.081129
Name: deposit, dtype: float64
In [12]:
sns.pairplot(data[relevant_features.index]);

Models

In [13]:
simple_model = 'deposit ~ duration + pdays + previous'
full_model = 'deposit ~ duration + pdays + previous + education + balance +  loan + poutcome + campaign +  housing + contact'

MAP Inference

A probabilistic program consists of observed and unobserved random variables (RVs). We define the observed RVs via likelihood distributions and unobserved RVs via prior distributions. PyMC3 includes numerous probability distributions for this purpose.

The PyMC3 library makes it very straightforward to perform approximate Bayesian inference for logistic regression. Logistic regression models the probability that individual $i$ subscribes to a deposit based on $k$ features.

Logistic Regression:

$$ p(y_i = 1 \mid \beta) = \sigma \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}) $$

Where $\sigma$ is the logistic function: $$ \sigma(t) = \frac{1}{1 + e^{-1}} $$

Manual Model Specification

We will use the context manager with to define a manual_logistic_model that we can refer to later as a probabilistic model:

  1. The random variables for the unobserved parameters for intercept and $3$ features are expressed using uninformative priors that assume normal distributions with mean $0$ and standard deviation of $100$.
  1. The likelihood combines the parameters with the data according to the specification of the logistic regression.
  1. The outcome is modeled as a Bernoulli RV with success probability given by the likelihood.
In [14]:
with pm.Model() as manual_logistic_model:
    # random variables for coefficients with
    # uninformative priors for each parameter

    intercept = pm.Normal('intercept', 0, sd=100)
    beta_1 = pm.Normal('beta_1', 0, sd=100)
    beta_2 = pm.Normal('beta_2', 0, sd=100)
    beta_3 = pm.Normal('beta_3', 0, sd=100)

    # Transform random variables into vector of probabilities p(y_i=1)
    # according to logistic regression model specification.
    likelihood = pm.invlogit(intercept + beta_1 * data.duration + beta_2 * data.pdays + beta_3 * data.previous)

    # Bernoulli random vector with probability of success
    # given by sigmoid function and actual data as observed
    pm.Bernoulli(name='logit', p=likelihood, observed=data.deposit)
In [15]:
manual_logistic_model.model
Out[15]:
$$ \begin{array}{rcl} \text{intercept} &\sim & \text{Normal}(\mathit{mu}=0.0,~\mathit{sigma}=100.0)\\\text{beta_1} &\sim & \text{Normal}(\mathit{mu}=0.0,~\mathit{sigma}=100.0)\\\text{beta_2} &\sim & \text{Normal}(\mathit{mu}=0.0,~\mathit{sigma}=100.0)\\\text{beta_3} &\sim & \text{Normal}(\mathit{mu}=0.0,~\mathit{sigma}=100.0)\\\text{logit} &\sim & \text{Bernoulli}(\mathit{p}=f(f(f(),~f(f(),~f(f(f(f(f(f(\text{intercept}),~f(f(\text{beta_1}),~array)),~f(f(\text{beta_2}),~array)),~f(f(\text{beta_3}),~array)))))),~f())) \end{array} $$

The command pm.model_to_graphviz(manual_logistic_model) produces the plate notation displayed below.

It shows the unobserved parameters as light and the observed elements as dark circles. The rectangle indicates the number of repetitions of the observed model element implied by the data included in the model definition.

In [16]:
pm.model_to_graphviz(manual_logistic_model)
Out[16]:
%3 cluster11,162 11,162 beta_3 beta_3 ~ Normal logit logit ~ Bernoulli beta_3->logit beta_1 beta_1 ~ Normal beta_1->logit beta_2 beta_2 ~ Normal beta_2->logit intercept intercept ~ Normal intercept->logit

MAP Estimate

Maximum a posteriori probability (MAP) estimation leverages that the evidence is a constant factor that scales the posterior to meet the requirements for a probability distribution. Since the evidence does not depend on $θ$, the posterior distribution is proportional to the product of the likelihood and the prior. Hence, MAP estimation chooses the value of $θ$ that maximizes the posterior given the observed data and the prior belief, that is, the mode of the posterior.

$$ \theta_{MAP} = \operatorname*{arg\,max}_\theta P(X \mid \theta) P(\theta) $$

We obtain point MAP estimates for the three parameters using the just defined model’s .find_MAP() method:

In [17]:
with manual_logistic_model:
    # compute maximum a-posteriori estimate
    # for logistic regression weights
    manual_map_estimate = pm.find_MAP()
logp = -7,759, ||grad|| = 2,784: 100%|██████████| 11/11 [00:00<00:00, 450.08it/s]
In [18]:
def print_map(result):
    return pd.Series({k: np.asscalar(v) for k, v in result.items()})
In [19]:
print_map(manual_map_estimate)
Out[19]:
intercept    0.051369
beta_1       1.561395
beta_2       0.260651
beta_3       0.306804
dtype: float64

GLM Model

PyMC3 includes numerous common models so that we can usually leave the manual specification for custom applications.

The following code defines the same logistic regression as a member of the Generalized Linear Models (GLM) family using the formula format inspired by the statistical language R and ported to python by the patsy library:

In [20]:
with pm.Model() as logistic_model:
    pm.glm.GLM.from_formula(simple_model,
                            data,
                            family=pm.glm.families.Binomial())
In [21]:
pm.model_to_graphviz(logistic_model)
Out[21]:
%3 cluster11,162 11,162 pdays pdays ~ Normal y y ~ Binomial pdays->y duration duration ~ Normal duration->y Intercept Intercept ~ Flat Intercept->y previous previous ~ Normal previous->y

PyMC3 solves the optimization problem of finding the posterior point with the highest density using the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm but offers several alternatives provided by the scipy library. The result is virtually identically to the corresponding statsmodels estimate:

In [22]:
model = smf.logit(formula=simple_model, data=data[['deposit', 'duration', 'pdays', 'previous']])
result = model.fit()
print(result.summary())
Optimization terminated successfully.
         Current function value: 0.538611
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                deposit   No. Observations:                11162
Model:                          Logit   Df Residuals:                    11158
Method:                           MLE   Df Model:                            3
Date:                Thu, 07 May 2020   Pseudo R-squ.:                  0.2214
Time:                        17:55:19   Log-Likelihood:                -6012.0
converged:                       True   LL-Null:                       -7721.6
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0514      0.023      2.192      0.028       0.005       0.097
duration       1.5614      0.037     42.443      0.000       1.489       1.633
pdays          0.2607      0.027      9.816      0.000       0.209       0.313
previous       0.3068      0.032      9.585      0.000       0.244       0.370
==============================================================================
In [23]:
print_map(manual_map_estimate)
Out[23]:
intercept    0.051369
beta_1       1.561395
beta_2       0.260651
beta_3       0.306804
dtype: float64
In [24]:
result.params 
Out[24]:
Intercept    0.051370
duration     1.561394
pdays        0.260652
previous     0.306804
dtype: float64

Markov Chain Monte Carlo

Markov chains are stochastic models that describe sequences of possible events. Each event comes from a set of outcomes, and each outcome determines which outcome occurs next, according to a fixed set of probabilities. An important feature of Markov chains is that they are memoryless: everything that you would possibly need to predict the next event is available in the current state, and no new information comes from knowing historical events.

Monte Carlo methods rely on repeated random sampling to approximate results that may be deterministic, but that does not permit an analytic, exact solution.

Many algorithms apply the Monte Carlo method to a Markov Chain, and generally proceed as follows:

  1. Start at the current position.
  2. Draw a new position from a proposal distribution.
  3. Evaluate the probability of the new position in light of data and prior distributions:
    1. If sufficiently likely, move to the new position
    2. Otherwise, remain at the current position
  4. Repeat from step 1.
  5. After a given number of iterations, return all accepted positions.

Define the Model

We will use a slightly more complicated model to illustrate Markov chain Monte Carlo inference:

  • full_model = 'deposit ~ duration + pdays + previous + education + balance + loan + poutcome + campaign + housing + contact'
In [25]:
with pm.Model() as logistic_model:
    pm.glm.GLM.from_formula(formula=full_model,
                            data=data,
                            family=pm.glm.families.Binomial())
In [26]:
logistic_model.basic_RVs
Out[26]:
[Intercept,
 duration,
 pdays,
 previous,
 education,
 balance,
 loan,
 poutcome,
 campaign,
 housing,
 contact,
 y]

Hamiltonian Monte Carlo – going NUTS

By default, PyMC3 automatically selects the most efficient sampler and initializes the sampling process for efficient convergence. For a continuous model, PyMC3 chooses the NUTS sampler. It also runs variational inference via ADVI to find good starting parameters for the sampler. One among several alternatives is to use the MAP estimate.

Hamiltonian Monte Carlo (HMC) is a hybrid method that leverages the first-order derivative information of the gradient of the likelihood to propose new states for exploration and overcome some of the challenges of MCMC. In addition, it incorporates momentum to efficiently jump around the posterior. As a result, it converges faster to a high-dimensional target distribution than simpler random-walk Metropolis or Gibbs sampling.

To see what the convergence looks like, we first draw $1,000$ samples after tuning the sampler for $1,000$ iterations that will be discarded. The sampling process can be parallelized for multiple chains using the cores argument (except when using GPU).

In [27]:
with logistic_model:
    trace = pm.sample(tune=1000,
                         draws=1000,
                         chains=4,
                         init = 'adapt_diag',
                         cores=3)
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 3 jobs)
NUTS: [contact, housing, campaign, poutcome, loan, balance, education, previous, pdays, duration, Intercept]
Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [01:23<00:00, 95.79draws/s] 

Inspect Trace

In [28]:
pm.plot_trace(trace);
In [29]:
pm.trace_to_dataframe(trace).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 11 columns):
Intercept    4000 non-null float64
duration     4000 non-null float64
pdays        4000 non-null float64
previous     4000 non-null float64
education    4000 non-null float64
balance      4000 non-null float64
loan         4000 non-null float64
poutcome     4000 non-null float64
campaign     4000 non-null float64
housing      4000 non-null float64
contact      4000 non-null float64
dtypes: float64(11)
memory usage: 343.9 KB
In [30]:
with open(model_path / 'logistic_model_mh.pkl', 'wb') as buff:
    pickle.dump({'model': logistic_model, 'trace': trace}, buff)

PyMC3 produces various summary statistics for a sampler. These are available as individual functions in the stats module, or by providing a trace to the pm.summary() function:

In [31]:
pm.summary(trace)
Out[31]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
Intercept 0.156 0.116 -0.056 0.376 0.003 0.002 1891.0 1891.0 1906.0 2212.0 1.0
duration 1.748 0.040 1.673 1.822 0.001 0.000 4568.0 4568.0 4592.0 3454.0 1.0
pdays 0.354 0.041 0.282 0.439 0.001 0.001 2357.0 2330.0 2361.0 2793.0 1.0
previous 0.285 0.034 0.223 0.351 0.001 0.000 4348.0 4348.0 4328.0 3031.0 1.0
education 0.219 0.033 0.152 0.276 0.001 0.000 3818.0 3818.0 3825.0 2841.0 1.0
balance 0.122 0.027 0.073 0.172 0.000 0.000 5275.0 4906.0 5312.0 3038.0 1.0
loan 0.766 0.079 0.625 0.916 0.001 0.001 4739.0 4684.0 4741.0 3131.0 1.0
poutcome -0.172 0.042 -0.248 -0.092 0.001 0.001 1909.0 1880.0 1914.0 2212.0 1.0
campaign 0.395 0.035 0.327 0.461 0.001 0.000 4779.0 4768.0 4766.0 2748.0 1.0
housing 1.066 0.051 0.970 1.163 0.001 0.001 4692.0 4681.0 4693.0 3010.0 1.0
contact 0.637 0.036 0.569 0.702 0.000 0.000 5148.0 5148.0 5145.0 3171.0 1.0
In [32]:
draws = 100

trace_df = pm.trace_to_dataframe(trace).assign(
    chain=lambda x: x.index // draws)

trace_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 12 columns):
Intercept    4000 non-null float64
duration     4000 non-null float64
pdays        4000 non-null float64
previous     4000 non-null float64
education    4000 non-null float64
balance      4000 non-null float64
loan         4000 non-null float64
poutcome     4000 non-null float64
campaign     4000 non-null float64
housing      4000 non-null float64
contact      4000 non-null float64
chain        4000 non-null int64
dtypes: float64(11), int64(1)
memory usage: 375.1 KB

Persist Results

In [33]:
with open(model_path / 'logistic_model_nuts.pkl', 'wb') as buff:
    pickle.dump({'model': logistic_model,
                 'trace': trace}, buff)
In [34]:
with open(model_path / 'logistic_model_nuts.pkl', 'rb') as buff:
    data0 = pickle.load(buff)  

logistic_model, trace_NUTS = data0['model'], data0['trace']

Combine Traces

In [35]:
draws = 10000

df = pm.trace_to_dataframe(trace_NUTS).iloc[200:].reset_index(
    drop=True).assign(chain=lambda x: x.index // draws)

trace_df = pd.concat([trace_df.assign(samples=100),
                      df.assign(samples=10000)])
trace_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7800 entries, 0 to 3799
Data columns (total 13 columns):
Intercept    7800 non-null float64
duration     7800 non-null float64
pdays        7800 non-null float64
previous     7800 non-null float64
education    7800 non-null float64
balance      7800 non-null float64
loan         7800 non-null float64
poutcome     7800 non-null float64
campaign     7800 non-null float64
housing      7800 non-null float64
contact      7800 non-null float64
chain        7800 non-null int64
samples      7800 non-null int64
dtypes: float64(11), int64(2)
memory usage: 853.1 KB

Visualize both traces

In [36]:
trace_df_long = pd.melt(trace_df, id_vars=['samples', 'chain'])
trace_df_long.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85800 entries, 0 to 85799
Data columns (total 4 columns):
samples     85800 non-null int64
chain       85800 non-null int64
variable    85800 non-null object
value       85800 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 2.6+ MB

Convergence

We can visualize the samples over time and their distributions to check the quality of the results. The following charts show the posterior distributions after an initial $100$ and an additional $10,000$ samples, respectively, and illustrate how convergence implies that multiple chains identify the same distribution. The pm.trace_plot() function shows the evolution of the samples as well.

In [37]:
g = sns.FacetGrid(trace_df_long, col='variable', row='samples',
                  hue='chain', sharex='col', sharey=False)
g = g.map(sns.distplot, 'value', hist=False, rug=False)

Estimate Odds Ratio

An odds ratio (OR) is a measure of association between a certain property $A$ and a second property $B$ in a population. Specifically, it tells you how the presence or absence of property $A$ has an effect on the presence or absence of property $B$. The OR is also used to figure out if a particular exposure (like eating processed meat) is a risk factor for a particular outcome (such as colon cancer), and to compare the various risk factors for that outcome. As long as you have two properties you think are linked, you can calculate the odds.

You have two choices for the formula: $$(a \div c) \div (b \div d)$$

or, equivalently: $$(a \times d) \div (b \times c)$$

  • $OR > 1$ indicates increased occurrence of event
  • $OR < 1$ indicates decreased occurrence of event (protective exposure)
In [38]:
b = trace['duration']
lb, ub = np.percentile(b, 2.5), np.percentile(b, 97.5)
lb, ub = np.exp(lb), np.exp(ub)
print(f'P({lb:.3f} < Odds Ratio < {ub:.3f}) = 0.95')
P(5.323 < Odds Ratio < 6.214) = 0.95

We can interpret something along those lines: "With probability $0.95$ the odds ratio is greater than $5.310$ and less than $6.233$, so the duration effect takes place because a person with a longer duration has at least $5.310$ higher probability to subscribe to a term deposit than a person with a lower duration, while holding all the other variables constant.

Computing Credible Intervals

We can compute the credible intervals, the Bayesian counterpart of confidence intervals, as percentiles of the trace. The resulting boundaries reflect our confidence about the range of the parameter value for a given probability threshold, as opposed to the number of times the parameter will be within this range for a large number of trials.

In [39]:
fig, ax = plt.subplots(figsize=(8, 4))
sns.distplot(np.exp(b), axlabel='Odds Ratio', ax=ax)
ax.set_title(f'Credible Interval: P({lb:.3f} < Odds Ratio < {ub:.3f}) = 0.95')
ax.axvspan(lb, ub, alpha=0.5, color='gray');

Variational Inference

Variational Inference (VI) is a machine learning method that approximates probability densities through optimization. In the Bayesian context, it approximates the posterior distribution as follows:

  1. Select a parametrized family of probability distributions
  2. Find the member of this family closest to the target, as measured by Kullback-Leibler divergence

Compared to MCMC, Variational Bayes tends to converge faster and scales to large data better. While MCMC approximates the posterior with samples from the chain that will eventually converge arbitrarily close to the target, variational algorithms approximate the posterior with the result of the optimization, which is not guaranteed to coincide with the target.

Variational Inference is better suited for large datasets and to quickly explore many models.In contrast, MCMC will deliver more accurate results on smaller datasets or when time and computational resources pose fewer constraints.

Run Automatic Differentation Variational Inference (ADVI)

The interface for variational inference is very similar to the MCMC implementation. We just use the fit() instead of the sample() function, with the option to include an early stopping CheckParametersConvergence callback if the distribution-fitting process converged up to a given tolerance:

In [40]:
with logistic_model:
    callback = CheckParametersConvergence(diff='absolute')
    approx = pm.fit(n=100000, callbacks=[callback])
Average Loss = 5,365.7: 100%|██████████| 100000/100000 [02:08<00:00, 777.98it/s]
Finished [100%]: Average Loss = 5,365.7

Persist Result

In [41]:
with open(model_path / 'logistic_model_advi.pkl', 'wb') as buff:
    pickle.dump({'model': logistic_model,
                 'approx': approx}, buff)

Sample from approximated distribution

We can draw samples from the approximated distribution to obtain a trace object as above for the MCMC sampler:

In [42]:
trace_advi = approx.sample(2000)
In [43]:
pm.summary(trace_advi)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 2000), minimum_shape: (chains=2, draws=4)
Out[43]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
Intercept 0.150 0.027 0.097 0.200 0.001 0.000 2043.0 2033.0 2044.0 2003.0 NaN
duration 1.749 0.040 1.674 1.823 0.001 0.001 2059.0 2058.0 2062.0 1956.0 NaN
pdays 0.354 0.027 0.306 0.407 0.001 0.000 2036.0 2026.0 2038.0 1921.0 NaN
previous 0.287 0.029 0.235 0.344 0.001 0.000 2047.0 2041.0 2046.0 1829.0 NaN
education 0.218 0.018 0.183 0.250 0.000 0.000 2064.0 2064.0 2066.0 1820.0 NaN
balance 0.124 0.028 0.073 0.177 0.001 0.000 2267.0 2267.0 2267.0 2004.0 NaN
loan 0.763 0.081 0.616 0.915 0.002 0.001 1952.0 1942.0 1949.0 1821.0 NaN
poutcome -0.172 0.010 -0.191 -0.153 0.000 0.000 1968.0 1968.0 1962.0 1959.0 NaN
campaign 0.402 0.038 0.333 0.473 0.001 0.001 1871.0 1839.0 1853.0 1926.0 NaN
housing 1.070 0.041 0.988 1.144 0.001 0.001 1959.0 1949.0 1971.0 2004.0 NaN
contact 0.639 0.034 0.577 0.703 0.001 0.001 1989.0 1984.0 1984.0 1845.0 NaN
In [44]:
# pm.summary(trace_advi).to_csv(model_path / 'trace_advi.csv')

Let us visualize the covariance structure of the model

In [45]:
az.plot_pair(trace_NUTS, figsize=(10, 10));

Let us visualize the covariance structure of the ADVI model

In [46]:
az.plot_pair(trace_advi, figsize=(10, 10));

Clearly, ADVI does not capture (as expected) the interactions between variables because of the mean field approximation, and so it underestimates the overall variance.

Model Diagnostics

Bayesian model diagnostics includes validating that the sampling process has converged and consistently samples from high-probability areas of the posterior, and confirming that the model represents the data well.

For high-dimensional models with many variables, it becomes cumbersome to inspect numerous traces. When using NUTS, the energy plot helps to assess problems of convergence. It summarizes how efficiently the random process explores the posterior. The plot shows the energy and the energy transition matrix that should be well matched as in the below example.

Energy Plot

When using NUTS, the energy plot helps to assess problems of convergence. It summarizes how efficiently the random process explores the posterior. The plot shows the energy and the energy transition matrix, which should be well-matched.

In [47]:
pm.energyplot(trace_NUTS);

Forest Plot

A forest plot, also known as a blobbogram, is a graphical display of estimated results from a number of scientific studies addressing the same question, along with the overall results. It was developed for use in medical research as a means of graphically representing a meta-analysis of the results of randomized controlled trials.

In [48]:
az.plot_forest([trace_advi, trace_NUTS]);

Posterior Plot

In [49]:
pm.plot_posterior(trace_NUTS);

Posterior Predictive Checks

PPCs are very useful for examining how well a model fits the data. They do so by generating data from the model using parameters from draws from the posterior. We use the function pm.sample_ppc for this purpose and obtain $n$ samples for each observation (the GLM module automatically names the outcome $y$):

In [50]:
ppc = pm.sample_ppc(trace_NUTS, samples=500, model=logistic_model)
100%|██████████| 500/500 [00:00<00:00, 561.36it/s]
In [51]:
ppc['y'].shape
Out[51]:
(500, 11162)

Check AUC Score

In [52]:
y_score = np.mean(ppc['y'], axis=0)
In [53]:
pred_scores = dict(y_true=data.deposit,y_score=y_score)
roc_auc_score(**pred_scores)
Out[53]:
0.8665637154908409

Prediction

Follows PyMC3 docs

Predictions use theano’s shared variables to replace the training data with test data before running posterior predictive checks. To facilitate visualization, we create the train and test datasets, and convert the former to a shared variable. Note that we need to use numpy arrays and provide a list of column labels:

Train-test split

In [54]:
X = data.drop('deposit', axis=1)
y = data.deposit
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42)
labels = X_train.columns

Create shared theano variable

In [55]:
X_shared = theano.shared(X_train.values)

Define logistic model

In [56]:
with pm.Model() as logistic_model_pred:
    pm.glm.GLM(x=X_shared, labels=labels,
               y=y_train, family=pm.glm.families.Binomial())

Run NUTS sampler

In [57]:
with logistic_model_pred:
    pred_trace = pm.sample(draws=1000, 
                           tune=1000,
                           chains=2,
                           cores=2,
                           init='adapt_diag')
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [previous, pdays, campaign, duration, day, balance, age, poutcome, month, contact, loan, housing, default, education, marital, job, Intercept]
Sampling 2 chains, 1 divergences: 100%|██████████| 4000/4000 [01:07<00:00, 59.48draws/s]

Replace shared variable with test set

We then run the sampler as before, and apply the pm.sample_ppc function to the resulting trace after replacing the train with test data:

In [58]:
X_shared.set_value(X_test)
In [59]:
ppc = pm.sample_ppc(pred_trace,
                    model=logistic_model_pred,
                    samples=100)
100%|██████████| 100/100 [00:00<00:00, 365.49it/s]

AUC Score

In [60]:
y_score = np.mean(ppc['y'], axis=0)
roc_auc_score(y_score=np.mean(ppc['y'], axis=0), 
              y_true=y_test)
Out[60]:
0.8632915421477958
In [61]:
pred_scores = dict(y_true=y_test, y_score=y_score)
cols = ['False Positive Rate', 'True Positive Rate', 'threshold']
roc = pd.DataFrame(dict(zip(cols, roc_curve(**pred_scores))))

precision_recall_curve

In [62]:
precision, recall, ts = precision_recall_curve(y_true=y_test, probas_pred=y_score)
pr_curve = pd.DataFrame({'Precision': precision, 'Recall': recall})

f1_score

In [63]:
f1 = pd.Series({t: f1_score(y_true=y_test, y_pred=y_score>t) for t in ts})
best_threshold = f1.idxmax()

Plot results

In [64]:
fig, axes = plt.subplots(ncols=3, figsize=(13, 5))

ax = sns.scatterplot(x='False Positive Rate', y='True Positive Rate', data=roc, s=50, legend=False, ax=axes[0])
axes[0].plot('False Positive Rate', 'True Positive Rate', data=roc, lw=1, color='k')
axes[0].plot(np.linspace(0,1,100), np.linspace(0,1,100), color='k', ls='--', lw=1)
axes[0].fill_between(y1=roc['True Positive Rate'], x=roc['False Positive Rate'], alpha=.3, color='red')
axes[0].set_title('Receiver Operating Characteristic')


sns.scatterplot(x='Recall', y='Precision', data=pr_curve, ax=axes[1])
axes[1].set_ylim(0,1)
axes[1].set_title('Precision-Recall Curve')


f1.plot(ax=axes[2], title='F1 Scores', ylim=(0,1))
axes[2].set_xlabel('Threshold')
axes[2].axvline(best_threshold, lw=1, ls='--', color='k')
axes[2].text(text=f'Max F1 @ {best_threshold:.2f}', x=.60, y=.95, s=5)
fig.suptitle(f'roc_auc_score = {round(roc_auc_score(**pred_scores),2)}', fontsize=24)
fig.tight_layout()
plt.subplots_adjust(top=.8)
plt.show();
In [ ]: