Non-negative matrix factorization

  • NMF = "non-negative matrix factorization"
  • Dimension reduction technique
  • NMF models are interpretable (unlike PCA)
  • Easy to interpret and easy to explain!
  • However, all sample features must be non-negative (>= 0)
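As a quick check (a minimal sketch, not from the original notes; the array values are made up), scikit-learn's NMF raises an error when the input contains negative entries:

import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1.0, 2.0],
              [3.0, -0.5]])  # one negative entry

try:
    NMF(n_components=1).fit(X)
except ValueError as err:
    print(err)  # scikit-learn rejects negative input to NMF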

Using scikit-learn NMF

  • Follows fit() / transform() pattern
  • Must specify the number of components, e.g. NMF(n_components=2)
  • Works with NumPy arrays and with csr_matrix
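For example (a hedged sketch with a made-up matrix), fitting NMF on a scipy csr_matrix works exactly like fitting on a dense NumPy array:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

# Sparse version of a small non-negative matrix
sparse_samples = csr_matrix(np.array([[1.0, 0.0, 2.0],
                                      [0.0, 3.0, 1.0],
                                      [2.0, 1.0, 0.0]]))

model = NMF(n_components=2)
features = model.fit_transform(sparse_samples)
print(features.shape)  # (3, 2): one row of feature values per sample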

NMF components

  • NMF has components
  • ... just like PCA has principal components
  • Dimension of components = dimension of samples
  • Entries are non-negative
In [1]:
from sklearn.decomposition import NMF
from sklearn import datasets

iris = datasets.load_iris()
samples = iris.data  # (150, 4) array of iris measurements, all non-negative

model = NMF(n_components=2)

model.fit(samples)

nmf_features = model.transform(samples)

print(model.components_)
[[5.75540836 2.3158674  5.26835446 1.88979127]
 [3.38412142 2.46082384 0.63366195 0.        ]]

NMF features

  • NMF feature values are non-negative
  • Can be used to reconstruct the samples
  • ... combine feature values with components
In [2]:
print(nmf_features[:5])
[[0.10703803 1.32370915]
 [0.1368093  1.16981601]
 [0.10323804 1.20941715]
 [0.14316969 1.12028074]
 [0.09897292 1.33139455]]

Sample reconstruction

  • Multiply components by feature values, and add up
  • Can also be expressed as a product of matrices
  • This is the "Matrix Factorization" in "NMF"

NMF fits only to non-negative data

  • Word frequencies in each document
  • Images encoded as arrays
  • Audio spectrograms
  • Purchase histories on e-commerce sites
  • ... and many more!
In [3]:
print(samples[0, :])

# Multiply components by feature values and add up (a dot product)
print(nmf_features[0, :].dot(model.components_))
[5.1 3.5 1.4 0.2]
[5.09564007 3.50530092 1.40269842 0.20227954]
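The same reconstruction can be done for all samples at once as a single matrix product, which is the "Matrix Factorization" the name refers to (a short sketch reusing the variables above):

# Product of matrices: features (150 x 2) times components (2 x 4)
reconstruction = nmf_features.dot(model.components_)
print(reconstruction.shape)  # (150, 4), the same shape as samples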

NMF applied to Wikipedia articles

We will use the tf-idf word-frequency array of Wikipedia articles, given as a csr_matrix called articles. Here, we fit the model and transform the articles.

Data set available here

In [4]:
import pandas as pd 

df = pd.read_csv("data/wikipedia_articles_to_cluster.csv", encoding="ISO-8859-1")
titles = df.title

df.head()
Out[4]:
               title                                      article_text
0           HTTP 404  HTTP 404 The 404 or Not Found error message ...
1     Alexa Internet  Alexa Internet Alexa Internet, Inc. is a Cal...
2  Internet Explorer  Internet Explorer 10 Internet Explorer 10 (I...
3        HTTP cookie  HTTP cookie A cookie, also known as an HTTP ...
4      Google Search  Google Search Appliance The Google Search Ap...
In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer() 

# Apply fit_transform to document: csr_mat
articles = tfidf.fit_transform(df.article_text.values)
In [6]:
# Create an NMF instance: model
model = NMF(n_components=6)

# Fit the model to articles
model.fit(articles)

# Transform the articles: nmf_features
nmf_features = model.transform(articles)

# Print the NMF features
print(nmf_features[:5])
[[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 5.31732919e-01]
 [6.58310175e-03 1.11303440e-02 0.00000000e+00 0.00000000e+00
  6.15221814e-03 4.52432486e-01]
 [1.14575589e-02 5.57988908e-03 1.33252344e-02 6.65732948e-03
  7.14760081e-02 3.82118650e-01]
 [2.99281640e-02 2.54408933e-04 3.46480062e-02 4.20322794e-02
  2.84557923e-02 4.62687902e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 4.58904832e-01]]

NMF features of the Wikipedia articles

In [7]:
# Create a pandas DataFrame: df
df = pd.DataFrame(nmf_features, index=titles)

# Print the row for 'Anne Hathaway'
print(df.loc['Anne Hathaway'])

# Print the row for 'Denzel Washington'
print(df.loc['Denzel Washington'])
0    0.009781
1    0.463169
2    0.008657
3    0.023566
4    0.022404
5    0.021303
Name: Anne Hathaway, dtype: float64
0    0.167866
1    0.200954
2    0.007136
3    0.053830
4    0.153784
5    0.001440
Name: Denzel Washington, dtype: float64

When investigating the features, notice that for both actors NMF feature 1 has the highest value. This means that both articles are reconstructed mainly from component 1 (the second row of model.components_).
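This can be confirmed programmatically with idxmax (a brief sketch using the DataFrame defined above):

# Index of the largest NMF feature for each actor
print(df.loc['Anne Hathaway'].idxmax())      # 1
print(df.loc['Denzel Washington'].idxmax())  # 1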

In [8]:
# Create a DataFrame: components_df
components_df = pd.DataFrame(model.components_)

# Print the shape of the DataFrame
print(components_df.shape)

# Select row 3: component
component = components_df.iloc[3]

# Print result of nlargest
print(component.nlargest())
(6, 14187)
12772    0.555671
9000     0.431916
1010     0.331921
6592     0.320126
12917    0.260449
Name: 3, dtype: float64
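The entries of a component are indexed by column position in the tf-idf matrix, so mapping them back to the vocabulary makes the topic readable. A hedged sketch (get_feature_names_out assumes scikit-learn >= 1.0; older versions use get_feature_names):

# Label the component columns with the words learned by the vectorizer
words = tfidf.get_feature_names_out()
components_df = pd.DataFrame(model.components_, columns=words)

# The heaviest words in component 3 name its topic
print(components_df.iloc[3].nlargest())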

NMF learns interpretable parts / Topic modeling

Here we work with grayscale images: no colors, only shades of gray, encoded as pixel brightness values between 0 and 1.

In [312]:
import numpy as np
import matplotlib.pyplot as plt

def show_as_image(vector):
    """
    Given a 1d vector representing an image, display that image in 
    black and white.  If there are negative values, then use red for 
    that pixel.
    """
    bitmap = vector.reshape((13, 8))  # reshape to the 13x8 pixel grid
    bitmap /= np.abs(vector).max()  # normalise to the range [-1, 1]
    bitmap = bitmap[:, :, np.newaxis]
    # Red channel carries the magnitude; green and blue carry only the
    # non-negative part, so negative pixels show up as red
    rgb_layers = [np.abs(bitmap)] + [bitmap.clip(0)] * 2
    rgb_bitmap = np.concatenate(rgb_layers, axis=-1)
    plt.imshow(rgb_bitmap, cmap='gray', interpolation='nearest')
    plt.xticks([])
    plt.yticks([])
    plt.colorbar()
    plt.show()
In [313]:
digit_7 = np.array([0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
                    0.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,
                    0.,  0.,  0.,  0.,  0.,  0.,  1.,  0., 
                    0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
                    0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
                    0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  
                    0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  
                    0.,  0.,  0.,  0.,  0.,  0.,  1.,  0., 
                    0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  
                    0.,  0.,  0.,  0.,  0.,  0.,  1.,  0., 
                    0.,  0.,  0.,  0.,  0.,  0.,  1.,  0., 
                    0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
                    0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

show_as_image(digit_7)

NMF learns the parts of images

Now we will use NMF to decompose the digits dataset.

In [316]:
samples = np.loadtxt('data/lcd-digits.csv', delimiter=',')

# Create an NMF model: model
model = NMF(n_components=7)

# Apply fit_transform to samples: features
features = model.fit_transform(samples)

# Call show_as_image on each component
for component in model.components_:
    show_as_image(component)

# Assign the 0th row of features: digit_features
digit_features = features[0,:]
In [315]:
# Print digit_features
print(digit_features)
[4.76823559e-01 0.00000000e+00 0.00000000e+00 5.90605054e-01
 4.81559442e-01 0.00000000e+00 7.37557191e-16]
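To see the parts combine, we can rebuild the 0th sample as a weighted sum of the learned components and display it (a short sketch using the variables above):

# Multiply feature values by components and add up, then show the result
reconstructed = digit_features.dot(model.components_)
show_as_image(reconstructed)  # should resemble the original digit samples[0]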

PCA doesn't learn parts

Unlike NMF, PCA doesn't learn the parts of things. Its components do not correspond to topics (in the case of documents) or to parts of images (when trained on images). We verify this by inspecting the components of a PCA model fit to the same dataset.

In [317]:
# Import PCA
from sklearn.decomposition import PCA

# Create a PCA instance: model
model = PCA(n_components=7)

# Apply fit_transform to samples: features
features = model.fit_transform(samples)

# Call show_as_image on each component
for component in model.components_:
    show_as_image(component)

NMF learns interpretable parts

Finding similar articles

  • Engineer at a large online newspaper
  • Task: recommend articles similar to article being read by customers
  • Similar articles should have similar topics

Strategy

  • Apply NMF to the word-frequency array
  • NMF feature values describe the topics
  • ... so similar documents have similar NMF feature values
  • Compare NMF feature values?

  • Cosine similarity measures how similar two documents are, irrespective of their size. Mathematically, it is the cosine of the angle between the two document vectors in a multi-dimensional space. It is advantageous because even if two similar documents are far apart by Euclidean distance (due to document size), they may still be oriented close together. The smaller the angle, the higher the cosine similarity.

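In code, the cosine similarity of two vectors is just their dot product divided by the product of their lengths (a minimal sketch with made-up vectors):

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their norms
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction, twice the length

print(cosine_similarity(a, b))  # 1.0: same orientation despite different sizes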

Which articles are similar to 'Cristiano Ronaldo'?

We will find the articles most similar to the article about the footballer Cristiano Ronaldo, using the NMF features and cosine similarity.

Building recommender systems using NMF

  • Finding similar articles
  • Recommend articles similar to the one being read
  • Strategy: apply NMF to the word-frequency array
  • NMF feature values describe the topics (similar articles have similar NMF feature values)
In [318]:
from sklearn.preprocessing import normalize

nmf = NMF(n_components=6)

# Apply NMF to the word-frequency array
nmf_features = nmf.fit_transform(articles)

# Normalize the NMF features: norm_features
norm_features = normalize(nmf_features)

# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=titles)

# Select the row corresponding to 'Cristiano Ronaldo': article
article = df.loc['Cristiano Ronaldo']

# Compute the dot products: similarities
similarities = df.dot(article)

# Display those with the largest cosine similarity
print(similarities.nlargest())
title
Cristiano Ronaldo     1.00000
Radamel Falcao        1.00000
 Zlatan Ibrahimović    1.00000
Franck Ribéry         1.00000
Neymar                0.99918
dtype: float64