DeepCTR: Hello World



In this post, we look at a "hello world" example of the DeepCTR library. DeepCTR is a project that introduces classic CTR (Click-Through Rate) prediction models and implements popular networks designed for the CTR prediction task.

What's more, it provides a large number of experiments on open datasets that serve as benchmarks. It also provides ready-to-use implementations of many SOTA (State of the Art) CTR prediction models from the literature, which makes it very convenient to use.


Import required libraries


import pandas as pd
import numpy as np
import pandas_profiling
import random
import tensorflow as tf

from sklearn.preprocessing import LabelEncoder
from deepctr.models import DeepFM
from deepctr.feature_column import SparseFeat, DenseFeat, get_feature_names

# For reproducible experiments
random.seed(200)
np.random.seed(200)
tf.random.set_seed(200)  # seed TensorFlow too, so weight initialization is reproducible
%load_ext line_profiler


Frappe dataset

We use the Frappe dataset, a small dataset suitable for quick experimentation and for testing different models. It has been used for context-aware app recommendation and contains 96,203 app usage logs of users under different contexts. The eight context variables are all categorical, including weather, city, daytime, and so on.
data = pd.read_csv('datasets/Frappe/Mobile_Frappe/frappe/frappe.csv', sep='\t')
data['target'] = 1

num_users = len(data["user"].unique())
items = data["item"].unique()
num_items = len(items)
print(f'distinct users: {num_users}')
print(f'distinct items: {num_items}')
print(f'sparsity: {len(data)/(num_users*num_items)}')

sparse_features = [
    'user',
    'item',
    'daytime',
    'weekday',
    'isweekend',
    'homework',
    'cost',
    'weather',
    'country',
    'city'
]

# Shuffle
data = data.sample(frac=1,random_state=0)
# pandas_profiling.ProfileReport(data)
data.loc[23225]
Output:

distinct users: 957
distinct items: 4082
sparsity: 0.024626555814783357


Data preparation

Here we follow the benchmark setting from the DeepCTR documentation page. After one-hot encoding of the features, we obtain 5,382 features. As all logs should be considered positive samples when making CTR predictions, we construct two negative instances for each log by randomly replacing the item variable with another item. The data is randomly split into training (70%), validation (20%), and test (10%) sets before constructing the negative instances.

def get_gt(user):
    """ Get ground truth of user """
    return data[data['user']==user]['item'].values

##### 70%, 20%, 10% split
num_train, num_val = int(len(data)*0.7), int(len(data)*0.2)
num_test = len(data) - num_train - num_val
print(num_train, num_val, num_test)

train_data = data.iloc[0:num_train]
val_data = data.iloc[num_train:num_train+num_val]
test_data = data.iloc[num_train+num_val:]

def neg_sampling(d, n):
    """ Construct n negative samples for each positive sample in d """
    neg_samples = []
    
    for _, row in d.iterrows():
        user = row['user']
        gt = get_gt(user)
        # Rejection sampling: redraw until all n sampled items are
        # true negatives for this user
        num_sampled = 0
        while num_sampled != n:
            sampled_items = np.random.choice(items, size=n, replace=False)
            num_sampled = sum(x not in gt for x in sampled_items)
        for i in sampled_items:
            neg_ex = row.copy()
            neg_ex['item'] = i
            neg_ex['target'] = 0
            neg_samples.append(neg_ex)
    
    neg_df = pd.concat(neg_samples, axis=1).transpose()
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    d = pd.concat([d, neg_df])
    d = d.sample(frac=1, random_state=0)
    
    return d

num_neg = 2
train_data = neg_sampling(train_data, num_neg)
print('train neg sampling is finished')
val_data = neg_sampling(val_data, num_neg)
print('val neg sampling is finished')
test_data = neg_sampling(test_data, num_neg)
test_data
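Note that get_gt rescans the entire dataframe on every call, which makes the sampling loop above quite slow. For larger datasets, precomputing each user's positive item set once avoids this. Below is an optional sketch (user_items and get_gt_fast are illustrative names, not part of the original code):

# Precompute each user's positive item set once for O(1) membership checks
user_items = data.groupby('user')['item'].apply(set).to_dict()

def get_gt_fast(user):
    """ Drop-in replacement for get_gt backed by a dict of sets """
    return user_items[user]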
Encoding categorical values

# Fit each encoder on the full data so train/val/test share one mapping
for f in sparse_features:
    print(f)
    lbe = LabelEncoder()
    data[f] = lbe.fit_transform(data[f])
    train_data[f] = lbe.transform(train_data[f])
    val_data[f] = lbe.transform(val_data[f])
    test_data[f] = lbe.transform(test_data[f])
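As a quick sanity check against the 5,382 one-hot features mentioned in the benchmark setting, we can sum the vocabulary sizes of the sparse features (a sketch; the exact total depends on the dataset version):

# Total one-hot dimensionality = sum of vocabulary sizes over the sparse features
total_features = sum(data[f].nunique() for f in sparse_features)
print(f'total one-hot features: {total_features}')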
Prepare the model inputs in the format DeepCTR expects, for the training, validation, and test data.

fixlen_feature_columns = [
    SparseFeat(feat, vocabulary_size=data[feat].max()+1, embedding_dim=4) \
        for feat in sparse_features 
]
feature_names = get_feature_names(fixlen_feature_columns)
print(feature_names)

train_model_input = {
    name:train_data[name].astype('float32').values \
    for name in feature_names
}
val_model_input = {
    name:val_data[name].astype('float32').values \
    for name in feature_names
}
test_model_input = {
    name:test_data[name].astype('float32').values \
    for name in feature_names
}


Using DeepFM without an Early Stopping strategy

Here we use the DeepFM (Deep Factorization Machines) model for training on the dataset prepared above. We don't use any early stopping here, and train for 40 epochs.

# First argument is linear_feature_columns; empty here, so all features go to the FM/DNN part
model = DeepFM([], fixlen_feature_columns, task='binary')

model.compile('adam',
    'binary_crossentropy',
    metrics='binary_crossentropy')

# Early stopping
earlystopping_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

# Best model checkpoint
checkpoint_filepath = '/tmp/checkpoint'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_loss',
    mode='min',
    save_best_only=True)


history = model.fit(train_model_input,
    train_data['target'].astype('float32').values,
    batch_size=256,
    epochs=40,
    verbose=2,
    validation_data=(val_model_input, 
                     val_data['target'].astype('float32').values),
#     callbacks=[
#         earlystopping_callback, 
#         model_checkpoint_callback
#     ]
)
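Before checking the test set, it is worth plotting the loss curves from the returned history object, where the overfitting pattern described below becomes visible. A minimal sketch, assuming matplotlib is available (it is not imported above):

import matplotlib.pyplot as plt

# Training vs. validation loss over the 40 epochs
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.xlabel('epoch')
plt.ylabel('binary cross-entropy')
plt.legend()
plt.show()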

Check the performance on the test set


from sklearn.metrics import log_loss, roc_auc_score, accuracy_score

pred_ans = model.predict(test_model_input, batch_size=256)

# Note: the eps argument of log_loss is deprecated in newer scikit-learn versions
print('log loss', log_loss(test_data['target'].astype('float32').values, pred_ans, eps=1e-7))
print('auc', roc_auc_score(test_data['target'].astype('float32').values, pred_ans))
During the 40 epochs, the validation log loss first decreases and then starts increasing again, a sign of overfitting, and we end up with the output below. Output:

log loss 0.2621498030364543
auc 0.9733840688051586



Using DeepFM with an Early Stopping strategy


model = DeepFM([], fixlen_feature_columns, task='binary')

model.compile('adam',
    'binary_crossentropy',
    metrics='binary_crossentropy')

# Early stopping
earlystopping_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

# Best model checkpoint
checkpoint_filepath = '/tmp/checkpoint'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_loss',
    mode='min',
    save_best_only=True)


history = model.fit(train_model_input,
    train_data['target'].astype('float32').values,
    batch_size=256,
    epochs=40,
    verbose=2,
    validation_data=(val_model_input, 
                     val_data['target'].astype('float32').values),
     callbacks=[
         earlystopping_callback, 
         model_checkpoint_callback
     ]
)
This time we apply an early stopping strategy with a patience of 5 epochs. That is, if the validation loss does not decrease for 5 consecutive epochs, training is stopped early, before reaching 40 epochs. Output:

log loss 0.19507425489108865
auc 0.9737500710457143
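One caveat: with these settings, EarlyStopping leaves the model holding the weights from the last epoch it ran, not the best one. To evaluate the best checkpointed model, restore the saved weights before predicting; a minimal sketch (alternatively, pass restore_best_weights=True to the EarlyStopping callback):

# Restore the best weights saved by the ModelCheckpoint callback
model.load_weights(checkpoint_filepath)

pred_ans = model.predict(test_model_input, batch_size=256)
print('log loss', log_loss(test_data['target'].astype('float32').values, pred_ans))
print('auc', roc_auc_score(test_data['target'].astype('float32').values, pred_ans))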
We can observe that both the log loss and the AUC improve with the early stopping strategy, compared to training without it.
