In [None]:
# Introduction

#### ADVERTISING SYSTEM OVERVIEW
The overall scenario of the display advertising system is illustrated below. 
![](https://media.arxiv-vanity.com/render-output/2954884/images/omni/sys4.png)
When a user visits the e-commerce site, the advertising system:

i) Checks the user's historical behavior data.

ii) Generates candidate ads with the matching module.

iii) Predicts the click probability of each candidate ad with the ranking module and selects the ads most likely to attract a click.

iv) Logs the user's reactions to the displayed ads.

This forms a closed loop in which user behavior data is both consumed and generated.

Capturing users' interests by mining this rich historical behavior data is crucial for building the click-through rate (CTR) prediction model of an online advertising system in the e-commerce industry.

There are two key observations on user behavior data: 

**i) Diversity:** Users are interested in different kinds of goods when visiting an e-commerce site. For example, a young mother may be interested in T-shirts, leather handbags, shoes, earrings, children’s coats, etc. at the same time.

**ii) Local Activation:** Whether a user clicks an item depends on only part of their historical behavior. For example, a swimmer will click a recommended pair of goggles mostly because of her recent purchase of a bathing suit, not because of the books on last week’s shopping list.

Before we dive deeper into this subject, let us go over some common terminology.

**CPC (Cost-Per-Click):** In CPC advertising systems like the one at Alibaba, advertisements are ranked by **eCPM (effective Cost Per Mille)**, which is the product of the bid price and the **CTR (Click-Through Rate)**.
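
To make the ranking rule concrete, here is a minimal sketch of ordering candidate ads by eCPM. The `Ad` type, the bid values and the predicted CTRs are made-up illustrations, not taken from any real system; the factor of 1000 only expresses the "per mille" scaling and does not change the order.

```python
from typing import List, NamedTuple


class Ad(NamedTuple):
    ad_id: str
    bid: float    # bid price per click (hypothetical currency units)
    pctr: float   # predicted click-through rate in [0, 1]


def ecpm(ad: Ad) -> float:
    """Expected revenue per 1000 impressions: bid per click x pCTR x 1000."""
    return ad.bid * ad.pctr * 1000.0


def rank_by_ecpm(ads: List[Ad]) -> List[Ad]:
    """Sort candidate ads from highest to lowest eCPM."""
    return sorted(ads, key=ecpm, reverse=True)


candidates = [
    Ad("a", bid=2.0, pctr=0.010),
    Ad("b", bid=0.5, pctr=0.050),
    Ad("c", bid=5.0, pctr=0.002),
]
ranked = rank_by_ecpm(candidates)
print([ad.ad_id for ad in ranked])
```

Note that the ad with the lowest bid can still rank first when its predicted CTR is high enough, which is why the accuracy of the CTR model matters so much.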

The performance of the CTR prediction model therefore has a direct impact on overall revenue and plays a crucial role in advertising systems.

Most traditional CTR models fail to capture the structure of behavioral data.

Deep learning methods are now widely used in CTR prediction models because of their success. They usually first apply an embedding layer to the input, mapping the original large-scale sparse id features to distributed representations, and then add fully connected layers, i.e. an **MLP (Multi-Layer Perceptron)**, to automatically learn the nonlinear relations among features. MLPs remove much of the feature engineering work, which is time- and effort-consuming in industrial applications, and have become a popular model structure for CTR prediction. However, in fields with rich internet-scale user behavior data, such as online advertising and recommendation systems in the e-commerce industry, these MLP models often fail to understand and exploit the specific structures of behavior data, leaving room for further improvement.

In this notebook, a recently proposed model called **Deep Interest Network (DIN)** is introduced and implemented. The model was developed and deployed in Alibaba's display advertising system.

**DIN** represents users’ diverse interests with an interest distribution and uses an attention-like network structure to locally activate the interests related to the candidate ad, which its authors report to significantly outperform traditional models. Training such industrial deep networks with large-scale sparse inputs is prone to overfitting, which is handled with a newly proposed adaptive regularization technique.

The design is inspired by the attention mechanism used in machine translation models: behaviors with higher relevance to the candidate ad get higher attention scores and dominate the prediction. Experiments on Alibaba’s production CTR prediction datasets show that DIN significantly outperforms MLPs under the **GAUC (Group weighted AUC)** metric. Let us look at the GAUC metric in detail.

Area under the receiver operating characteristic curve (AUC) is a commonly used metric in CTR prediction. In practice, a generalization of AUC named GAUC is used: a weighted average of the AUC computed within each user's subset of samples. The weight can be the number of impressions or clicks. An impression-based GAUC is calculated as follows:

$$\mathrm{GAUC} = \frac{\sum_{i=1}^{n} w_i \cdot \mathrm{AUC}_i}{\sum_{i=1}^{n} w_i}$$

where $w_i$ is the weight (here the number of impressions) of user $i$, $\mathrm{AUC}_i$ is the AUC over that user's samples, and $n$ is the number of users.

GAUC has proven in practice to be more indicative in display advertising settings, where the CTR model is applied to rank candidate ads for each user and model performance is mainly measured by how good that ranking is, that is, by a user-specific AUC. GAUC therefore removes the impact of user bias and measures model performance over all users more accurately. After years of use in production systems, the GAUC metric has been found to be more stable and reliable than AUC.
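
The impression-weighted GAUC described above can be sketched in plain Python as follows. This is an illustrative implementation, not the one used in the DIN paper: the per-user AUC is computed by simple pair counting, and skipping users whose samples contain only one class (where AUC is undefined) is an assumption of this sketch.

```python
from collections import defaultdict


def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gauc(user_ids, labels, scores):
    """Impression-weighted average of the per-user AUC."""
    groups = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        groups[u][0].append(y)
        groups[u][1].append(s)
    weighted_sum = 0.0
    weight_total = 0.0
    for ys, ss in groups.values():
        user_auc = auc(ys, ss)
        if user_auc is None:
            continue  # assumption: skip users with all-click or no-click samples
        w = len(ys)  # weight = number of impressions of this user
        weighted_sum += w * user_auc
        weight_total += w
    return weighted_sum / weight_total if weight_total else float("nan")


# Toy data: user 1 is ranked perfectly, user 2 half right, user 3 has no clicks
users = [1, 1, 2, 2, 2, 3, 3]
labels = [1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.1, 0.8, 0.3, 0.2, 0.5, 0.4]
print(gauc(users, labels, scores))
```

The pair-counting AUC is quadratic in the number of samples per user, which is fine for a sketch; on the full dataset a rank-based AUC such as `roc_auc_score` applied per user group would be the more practical choice.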

Training such industrial deep networks with large-scale sparse inputs easily runs into overfitting, which causes model performance to drop rapidly. The DIN authors address this with an efficient adaptive regularization technique.

Let us explore the Deep Interest Network in more detail by looking at its architecture.

### DIN MODEL ARCHITECTURE
![](https://media.arxiv-vanity.com/render-output/2954884/images/omni/model_arch.png)

#### BASE MODEL
The base model consists of two steps:

i) Map each sparse id feature into an embedding vector space.

ii) Apply MLPs to fit the output.

Note that the input contains user behavior sequence ids, whose length can vary. Thus a pooling layer (e.g. a sum operation) is added to summarize the sequence into a fixed-size vector. As illustrated in the left part of the model architecture, the base model works well in practice and, according to the DIN authors, serves the main traffic of Alibaba's online display advertising system.

However, looking closely at the pooling operation, we find that much information is lost: it destroys the inner structure of the user behavior data. This observation motivates a better model.

#### DEEP INTEREST NETWORK (DIN) DESIGN
In the display advertising scenario, we want the model to truly reveal the relationship between the candidate ad and the user's interests based on their historical behaviors.

As discussed above, behavior data has two structures: diversity and local activation.

The diversity of behavior data reflects users’ various interests. A user's click on an ad often originates from just part of that user’s interests. In the NMT (neural machine translation) task, it is assumed that each word in a sentence has a different importance at each decoding step. An attention network can be viewed as a specially designed pooling layer that learns to assign an attention score to each word in the sentence, which in other words follows the diversity structure of the data.

Note: Directly applying the NMT attention layer is unsuitable for this application, because the embedding vector of user interest should vary with the candidate ad, that is, it should follow the local activation structure. Let’s check what happens if the local activation structure is not followed.

Suppose we have the distributed representations of users ($V_u$) and ads ($V_a$). For a given user, $V_u$ is a fixed point in the embedding space, and the same holds for the ad embedding $V_a$.

Let us assume that we use the inner product to calculate the relevance between user and ad:

$$F(U, A) = V_u \cdot V_a$$

If both $F(U,A)$ and $F(U,B)$ are high, user $U$ is relevant to both ads A and B. Because the inner product is linear in the ad vector, any point on the line segment between $V_a$ and $V_b$ then also gets a high relevance score.

This imposes a hard constraint on learning the distributed representations of both users and ads. One may increase the embedding dimensionality to satisfy the constraint, which may work but causes a huge increase in model parameters.

To avoid this huge increase in model parameters, DIN is designed around the two structures of the data, as illustrated on the right side of the model architecture diagram above.

Mathematically, the embedding vector $V_u$ of user $U$ becomes a function of the embedding vector $V_a$ of ad $A$, i.e.

$$V_u = f(V_a) = \sum_{i=1}^{N} w_i \cdot V_i = \sum_{i=1}^{N} g(V_i, V_a) \cdot V_i$$

where

$V_i$ = embedding of behavior id $i$, such as good_id, shop_id, etc.

$V_u$ = weighted sum of the embeddings of all behavior ids.

$w_i$ = the attention score that behavior id $i$ contributes to the overall user interest embedding vector $V_u$ with respect to the candidate ad $A$.

$g$ = the activation unit, with $w_i = g(V_i, V_a)$. In the implementation, $w_i$ is the output of the activation unit (denoted by function $g$) given inputs $V_i$ and $V_a$. PReLU is a commonly used activation function and was the initial choice. However, with large-scale sparse input ids, training such an industrial-scale network still faces many challenges. To further improve the convergence rate and performance of the model, a novel data-dependent activation function named "Dice" is used.
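
As a rough illustration of what "data-dependent" means for Dice, the sketch below implements the formula from the DIN paper, $f(s) = p(s)\,s + (1 - p(s))\,\alpha s$ with $p(s) = \sigma\big((s - E[s]) / \sqrt{Var[s] + \epsilon}\big)$, in NumPy. It is a simplification under stated assumptions: the statistics are taken over the current mini-batch only, and `alpha` is a fixed constant here, whereas in the paper it is a learned parameter.

```python
import numpy as np


def dice(s, alpha=0.25, eps=1e-8):
    """Dice activation over a mini-batch `s` of shape (batch, units).

    The gate p depends on the batch mean and variance of the inputs,
    so the "kink" of the activation moves with the data distribution.
    """
    mean = s.mean(axis=0, keepdims=True)
    var = s.var(axis=0, keepdims=True)
    p = 1.0 / (1.0 + np.exp(-(s - mean) / np.sqrt(var + eps)))
    # p close to 1 passes the input through; p close to 0 scales it by alpha
    return p * s + (1.0 - p) * alpha * s


batch = np.array([[-2.0], [0.0], [2.0]])
out = dice(batch)
print(out.ravel())
```

When the batch mean is 0 and the gate is replaced by a hard step, this reduces to PReLU, which is why Dice can be seen as a smoothed, data-adaptive variant of it.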

In all, DIN designs the activation unit to follow the local activation structure and the weighted sum pooling to follow the diversity structure.
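
The difference between plain sum pooling and DIN's candidate-dependent weighted sum can be sketched with NumPy. The 2-dimensional embeddings and the `relevance` function are hypothetical: in DIN the activation unit $g$ is a small learned network, while here a clipped dot product stands in for it purely for illustration.

```python
import numpy as np


def sum_pooling(behaviors):
    """Base model: one fixed user vector, whatever the candidate ad is."""
    return behaviors.sum(axis=0)


def relevance(v_i, v_a):
    """Stand-in for the learned activation unit g(V_i, V_a) (assumption)."""
    return max(0.0, float(np.dot(v_i, v_a)))


def din_pooling(behaviors, v_a, g=relevance):
    """V_u = sum_i g(V_i, V_a) * V_i, so V_u changes with the candidate ad."""
    weights = np.array([g(v_i, v_a) for v_i in behaviors])
    v_u = weights @ behaviors
    return v_u, weights


# Toy behavior embeddings: two swimming-related items and one book
behaviors = np.array([
    [1.0, 0.0],   # e.g. bathing suit
    [0.0, 1.0],   # e.g. book
    [1.0, 0.0],   # e.g. swimming cap
])
ad_goggles = np.array([1.0, 0.0])
ad_novel = np.array([0.0, 1.0])

v_u_goggles, w_goggles = din_pooling(behaviors, ad_goggles)
v_u_novel, w_novel = din_pooling(behaviors, ad_novel)
print(sum_pooling(behaviors), v_u_goggles, v_u_novel)
```

With sum pooling the user vector is the same for both ads, whereas the weighted sum activates only the swimming-related behaviors for the goggles ad and only the book for the novel ad.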

DIN is implemented on a multi-GPU distributed training platform named **X-Deep Learning (XDL)**, which supports model parallelism and data parallelism. Thanks to the high performance and flexibility of the XDL platform, the authors report accelerating training about 10 times and optimizing hyperparameters automatically with high tuning efficiency.
![](https://media.arxiv-vanity.com/render-output/2954884/images/omni/XDL.png)
XDL is designed to solve the challenges of training industrial-scale deep learning networks with large-scale sparse inputs and tens of billions of parameters. Most of the deep networks published so far are constructed in two steps:

i) Employ the embedding technique to cast the original sparse input into low-dimensional dense vectors.

ii) Bridge with networks such as MLPs, RNNs, CNNs, etc.

Most of the parameters are concentrated in the first (embedding) step, which needs to be distributed over multiple machines. The second (network) step can be handled within a single machine. Accordingly, the XDL platform is architected in a bridge manner, as shown above, and is composed of three main kinds of components:

**a. Distributed Embedding Layer:** A model-parallelism module in which the parameters of the embedding layer are distributed over multiple GPUs. The embedding layer works as a predefined network unit that provides forward and backward modes.

**b. Local Backend:** A standalone module that handles the local network training. It reuses open-source deep learning frameworks such as TensorFlow. With a unified data exchange interface and abstraction, it is easy to integrate and switch between different frameworks.

**c. Communication Component:** The base module, which helps parallelize both the embedding layer and the backend.

Below is a visualization of the embeddings of goods in the DIN model. The shape of each point represents the category of the goods, and its color corresponds to the predicted CTR.
![](https://media.arxiv-vanity.com/render-output/2954884/images/omni/TDdiagram.png)

The illustration below shows the local activation property of the DIN model. Behaviors highly relevant to the candidate ad get high attention intensity.
![](https://media.arxiv-vanity.com/render-output/2954884/images/omni/attention2.png)

That is enough theory. Let us move on to a practical implementation: building a click prediction model for display ads with the DeepCTR library.

Let us install the library and move forward with the implementation. I have chosen a public ad display/click dataset from **Taobao.com**, available at https://tianchi.aliyun.com/dataset/dataDetail?dataId=56&userId=1

### Dataset details:

**raw_sample.csv**

We randomly sampled 1140000 users from the website of Taobao for 8 days of ad display / click logs (26 million records) to form the original sample skeleton. Field description is as follows:

(1) user: User ID(int);

(2) time_stamp: time stamp(Bigint, 1494032110 stands for 2017-05-06 08:55:10);

(3) adgroup_id: adgroup ID(int);

(4) pid: scenario;

(5) nonclk: 1 for not clicked, 0 for clicked;

(6) clk: 1 for clicked, 0 for not clicked;

We used 7 days of samples as training samples (20170506-20170512) and the last day's samples as test samples (20170513).

**ad_feature.csv**

This data set covers the basic information of all ads in raw_sample. Field description is as follows:

(1) adgroup_id: Ad ID(int);

(2) cate_id: category ID;

(3) campaign_id: campaign ID;

(4) brand: brand ID;

(5) customer_id: Advertiser ID (the column is named `customer` in the CSV);

(6) price: price of the item;

Each ad ID corresponds to one item; an item belongs to one category and one brand.

**user_profile.csv**

This data set covers the basic information of 1060000 users in raw_sample. Field description is as follows:

(1) userid: user ID;

(2) cms_segid: Micro group ID;

(3) cms_group_id: cms_group_id;

(4) final_gender_code: gender, 1 for male, 2 for female

(5) age_level: age level

(6) pvalue_level: Consumption grade, 1: low, 2: mid, 3: high

(7) shopping_level: Shopping depth, 1: shallow user, 2: moderate user, 3: deep user

(8) occupation: whether the user is a college student, 1: yes, 0: no

(9) new_user_class_level: City level

In [None]:
!pip install --no-warn-conflicts -q deepctr==0.7.4

In [None]:
import warnings

import numpy as np
import pandas as pd
import pandas_profiling
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras import backend as K
from tensorflow.keras import callbacks, utils
from tensorflow.keras.callbacks import ModelCheckpoint, LearningRateScheduler, Callback
from tensorflow.keras.layers import Activation
from tensorflow.keras.losses import binary_crossentropy
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.utils import get_custom_objects
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder, OrdinalEncoder
from deepctr.models import DIN, DeepFM, DIEN, DSIN, xDeepFM
from deepctr.inputs import SparseFeat, VarLenSparseFeat, DenseFeat, get_feature_names

warnings.simplefilter('ignore')

### Load Taobao Dataset

In [None]:
raw_sample_df = pd.read_csv('../input/ad-displayclick-data-on-taobaocom/raw_sample.csv')
ad_feature_df = pd.read_csv('../input/ad-displayclick-data-on-taobaocom/ad_feature.csv')
user_profile_df=pd.read_csv('../input/ad-displayclick-data-on-taobaocom/user_profile.csv')

### Optimise dataset:
Because of the size of the dataset, CPU and RAM utilisation hit their limits while processing the data, which crashed and restarted the notebook. To avoid this problem, I reduce RAM usage by more than 75% by downcasting column types, as shown below.

In [None]:
test_size_mb = raw_sample_df.memory_usage().sum() / 1024 / 1024
test_size_mb1 = ad_feature_df.memory_usage().sum() / 1024 / 1024
test_size_mb2 = user_profile_df.memory_usage().sum() / 1024 / 1024
print("raw_sample_df memory size: %.2f MB" % test_size_mb)
print("ad_feature_df memory size: %.2f MB" % test_size_mb1)
print("user_profile_df memory size: %.2f MB" % test_size_mb2)

#### We're going to be calculating memory usage a lot, so we'll create a function named mem_usage() to save us some time!

In [None]:
def mem_usage(pandas_obj):
    if isinstance(pandas_obj,pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else: # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2 # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

Let us start with the raw_sample_df dataframe: we look at its current memory utilisation and how much memory each column type consumes, and then optimise those columns as shown below.

In [None]:
raw_sample_df.info(memory_usage='deep')

In [None]:
optimized_gl = raw_sample_df.copy()

gl_int = raw_sample_df.select_dtypes(include=['int'])
converted_int = gl_int.apply(pd.to_numeric,downcast='unsigned')
optimized_gl[converted_int.columns] = converted_int


gl_obj = raw_sample_df.select_dtypes(include=['object']).copy()
converted_obj = pd.DataFrame()
for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:,col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:,col] = gl_obj[col]
optimized_gl[converted_obj.columns] = converted_obj
print("Original raw sample dataframe:{0}".format(mem_usage(raw_sample_df)))
print("Memory Optimised raw sample dataframe:{0}".format(mem_usage(optimized_gl)))

In [None]:
raw_sample_df = optimized_gl.copy()
raw_sample_df_new = raw_sample_df.rename(columns = {"user": "userid"})

In [None]:
ad_feature_df.info(memory_usage='deep')

In [None]:
optimized_g2 = ad_feature_df.copy()

g2_int = ad_feature_df.select_dtypes(include=['int'])
converted_int = g2_int.apply(pd.to_numeric,downcast='unsigned')
optimized_g2[converted_int.columns] = converted_int

g2_float = ad_feature_df.select_dtypes(include=['float'])
converted_float = g2_float.apply(pd.to_numeric,downcast='float')
optimized_g2[converted_float.columns] = converted_float

print("Original Ad Feature dataframe:{0}".format(mem_usage(ad_feature_df)))
print("Memory Optimised Ad Feature dataframe:{0}".format(mem_usage(optimized_g2)))

In [None]:
user_profile_df.info(memory_usage='deep')

In [None]:
optimized_g3 = user_profile_df.copy()

g3_int = user_profile_df.select_dtypes(include=['int'])
converted_int = g3_int.apply(pd.to_numeric,downcast='unsigned')
optimized_g3[converted_int.columns] = converted_int

g3_float = user_profile_df.select_dtypes(include=['float'])
converted_float = g3_float.apply(pd.to_numeric,downcast='float')
optimized_g3[converted_float.columns] = converted_float

print("Original User Feature dataframe:{0}".format(mem_usage(user_profile_df)))
print("Memory Optimised User Feature dataframe:{0}".format(mem_usage(optimized_g3)))

Now that all the dataframes are optimised, it is time to merge them into a single final dataset for our model.

In [None]:
df1 = raw_sample_df_new.merge(optimized_g3, on="userid")
final_df = df1.merge(optimized_g2, on="adgroup_id")
final_df.head()
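
One thing worth checking at this step: `DataFrame.merge` performs an inner join by default, and the dataset description lists more sampled users (1140000) than user profiles (1060000), so rows without a matching profile are silently dropped. The toy frames below are made up to illustrate how a left join with `indicator=True` can be used to count such rows; they are not the real data.

```python
import pandas as pd

# Toy stand-ins for raw_sample_df_new and the user profile table
samples = pd.DataFrame({"userid": [1, 2, 3], "adgroup_id": [10, 11, 12], "clk": [0, 1, 0]})
profiles = pd.DataFrame({"userid": [1, 2], "age_level": [3, 5]})

# Default inner join: user 3 has no profile and disappears
inner = samples.merge(profiles, on="userid")

# Left join keeps every sample and marks where the profile was missing
left = samples.merge(profiles, on="userid", how="left", indicator=True)
missing_profile = int((left["_merge"] == "left_only").sum())

print(len(samples), len(inner), missing_profile)
```

Whether to keep or drop such rows is a modelling decision; the point of the sketch is only to make the row loss visible before training.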

Ideally the dataset should contain history columns for the DIN model, but it does not. As a workaround, I have copied two columns as history columns to stand in for the users' historical behavior.

In [None]:
final_df['hist_cate_id'] = final_df['cate_id']
final_df['hist_adgroup_id'] = final_df['adgroup_id']
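
Copying the current columns means each "history" is just the candidate ad itself. As a hedged alternative, the sketch below builds a real per-user history from earlier rows of the same user, sorted by time. The helper `build_history` is hypothetical, it treats every earlier impression as a behavior (an assumption, since the notebook does not load a separate behavior log), and it pads with 0, which assumes that 0 is not a valid id after encoding.

```python
import numpy as np
import pandas as pd


def build_history(df, user_col, time_col, item_col, maxlen):
    """For each row, collect up to `maxlen` earlier item ids of the same user.

    Returns the time-sorted frame and an array of shape (rows, maxlen),
    zero-padded on the right.
    """
    df = df.sort_values([user_col, time_col]).reset_index(drop=True)
    items = df[item_col].to_numpy()
    hist = np.zeros((len(df), maxlen), dtype=np.int64)
    # .indices maps each user to the positional row numbers of that user
    for rows in df.groupby(user_col).indices.values():
        for pos, row in enumerate(rows):
            prev = items[rows[max(0, pos - maxlen):pos]]
            hist[row, :len(prev)] = prev
    return df, hist


toy = pd.DataFrame({
    "userid": [1, 1, 2, 1],
    "time_stamp": [3, 1, 5, 2],
    "cate_id": [30, 10, 50, 20],
})
sorted_toy, hist_cate = build_history(toy, "userid", "time_stamp", "cate_id", maxlen=2)
print(hist_cate)
```

An array like `hist_cate` could then be fed to a sequence feature with a matching `maxlen`. The Python loop is meant for readability and would need vectorising for 26 million rows.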

Now let us choose the sparse, dense and sequence features required for the DIN model, as shown below.

In [None]:
sparse_features = [feat for feat in final_df.columns if feat not in ['time_stamp','pid', 'nonclk','brand',
       'cms_segid', 'cms_group_id', 'age_level',
       'pvalue_level', 'shopping_level', 'occupation', 'new_user_class_level ',
        'campaign_id', 'customer', 'price', 'hist_cate_id','hist_adgroup_id','clk']]
sparse_features

In [None]:
dense_features = [feat for feat in final_df.columns if feat not in ['userid', 'time_stamp', 'adgroup_id', 'pid', 'nonclk', 'clk',
       'cms_segid', 'cms_group_id', 'final_gender_code', 'occupation', 'new_user_class_level ',
       'cate_id', 'campaign_id', 'shopping_level','customer', 'brand','hist_cate_id','hist_adgroup_id']]
dense_features

In [None]:
sequence_features = [feat for feat in final_df.columns if feat not in ['userid', 'time_stamp', 'adgroup_id', 'pid', 'nonclk', 'clk',
       'cms_segid', 'cms_group_id', 'final_gender_code', 'age_level',
       'pvalue_level', 'shopping_level', 'occupation', 'new_user_class_level ',
       'cate_id', 'campaign_id', 'customer', 'brand', 'shopping_level','price']]
sequence_features

In [None]:
behavior_feature_list = [feat for feat in final_df.columns if feat in ['adgroup_id', 'cate_id']]

In [None]:
final_df[sparse_features] = final_df[sparse_features].fillna('-1', )
final_df[sequence_features] = final_df[sequence_features].fillna('-1', )
final_df[dense_features] = final_df[dense_features].fillna(0, )
target = ['clk']

#### 1. Perform a simple transformation on dense features

In [None]:
mms = MinMaxScaler(feature_range=(0, 1))
final_df[dense_features] = mms.fit_transform(final_df[dense_features])

#### 2. Set the hashing space for each sparse field, and record the dense feature field names

In [None]:
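sparse_columns = [SparseFeat(feat, vocabulary_size=2000000, embedding_dim=8) for feat in sparse_features]
dense_columns = [DenseFeat(feat, 1) for feat in dense_features]
# The history columns are declared as variable-length sequences of length 1
sequence_columns = [VarLenSparseFeat(SparseFeat(feat, vocabulary_size=2000000, embedding_dim=8), maxlen=1) for feat in sequence_features]
fixlen_feature_columns = sparse_columns + dense_columns + sequence_columns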
linear_feature_columns = fixlen_feature_columns
dnn_feature_columns = fixlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns, )
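
The cell above gives every sparse field the same fixed `vocabulary_size` of 2000000 and feeds the raw ids directly. A common alternative, sketched here on a made-up frame, is to map each id column to contiguous integers first and size each embedding table by the column's cardinality. The helper `encode_ids` is hypothetical, and reserving index 0 for padding or missing values is an assumption of this sketch.

```python
import pandas as pd


def encode_ids(df, columns):
    """Map each id column to contiguous integers 1..K, keeping 0 free for padding."""
    encoded = df.copy()
    vocab_sizes = {}
    for col in columns:
        codes, uniques = pd.factorize(encoded[col])
        encoded[col] = codes + 1          # missing values (code -1) become 0
        vocab_sizes[col] = len(uniques) + 1
    return encoded, vocab_sizes


toy = pd.DataFrame({"adgroup_id": [100, 200, 100, 300], "cate_id": [7, 7, 9, 9]})
encoded, vocab_sizes = encode_ids(toy, ["adgroup_id", "cate_id"])
print(encoded)
print(vocab_sizes)
```

The resulting sizes could be passed as `vocabulary_size` when declaring each `SparseFeat`, which keeps the embedding tables only as large as the data requires.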


#### 3. Generate input data for the model

In [None]:
train, test = train_test_split(final_df, test_size=0.2)
train_model_input = {name:train[name] for name in feature_names if name != 'clk'}
test_model_input = {name:test[name] for name in feature_names if name != 'clk'}
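
The dataset description mentions training on the first seven days and testing on the last day, whereas the cell above splits at random. A minimal sketch of a time-based split follows; the cutoff is derived from the example in the field description (1494032110 stands for 2017-05-06 08:55:10), and `time_split` is a hypothetical helper shown on a toy frame.

```python
import pandas as pd

DAY = 86400  # seconds per day

# 1494032110 is 2017-05-06 08:55:10, so that day starts 8h 55m 10s earlier
first_day_start = 1494032110 - (8 * 3600 + 55 * 60 + 10)
# The test day (2017-05-13) starts seven days after the first day
test_day_start = first_day_start + 7 * DAY


def time_split(df, cutoff, time_col="time_stamp"):
    """Rows before the cutoff go to train, rows at or after it go to test."""
    train = df[df[time_col] < cutoff]
    test = df[df[time_col] >= cutoff]
    return train, test


toy = pd.DataFrame({
    "time_stamp": [1494032110, test_day_start - 1, test_day_start, test_day_start + 3600],
    "clk": [0, 1, 0, 1],
})
train_part, test_part = time_split(toy, test_day_start)
print(test_day_start, len(train_part), len(test_part))
```

Splitting by time avoids evaluating on impressions that happened before some of the training impressions, which a random split cannot guarantee.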

#### 4. Define model, train, predict and evaluate

In [None]:
# Disable eager execution (TF1-style graph mode) before building the model
tf.compat.v1.disable_eager_execution()
# xDeepFM is the model actually trained below; the DIN and DIEN calls are kept commented out for reference
# model = DIN(linear_feature_columns, behavior_feature_list, task='binary')
model = xDeepFM(linear_feature_columns, dnn_feature_columns)
# model = DIEN(linear_feature_columns, behavior_feature_list)

In [None]:
import seaborn

In [None]:
final_df.columns

In [None]:

import matplotlib.pyplot as plt
plt.hist(user_profile_df['age_level'])
plt.title('Age Distribution')

In [None]:
user_profile_df['age_level'].head(5000).unique()

In [None]:
behavior_feature_list

In [None]:
def Mixed_loss(a=0.0):
    """Return a loss mixing focal loss and binary cross-entropy.

    `a` is the weight of the focal term; a=0 (the default) reduces to plain
    binary cross-entropy, which is the loss this notebook trains with.
    """
    def binary_focal_loss_fixed(y_true, y_pred):
        """Focal loss; y_true has shape (None, 1), y_pred is the output after sigmoid."""
        alpha = tf.constant(0.8, dtype=tf.float32)
        gamma = tf.constant(0.6, dtype=tf.float32)
        y_true = tf.cast(y_true, tf.float32)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        # Clip so that log(p_t) and pow(1 - p_t, gamma) stay finite
        p_t = K.clip(p_t, K.epsilon(), 1.0 - K.epsilon())
        focal_loss = -alpha_t * K.pow(1.0 - p_t, gamma) * K.log(p_t)
        # Average over the last axis so the shape matches binary_crossentropy
        return K.mean(focal_loss, axis=-1)

    def mixed_loss(y_true, y_pred):
        return a * binary_focal_loss_fixed(y_true, y_pred) + (1 - a) * binary_crossentropy(y_true, y_pred)
    return mixed_loss


In [None]:
from sklearn.metrics import precision_score

In [None]:
# The optimizer is selected by name; binary cross-entropy is the training loss
model.compile("adam", binary_crossentropy, metrics=['AUC', 'accuracy'])
history = model.fit(train_model_input, train[target].values, batch_size=4096, epochs=1, verbose=1, validation_split=0.2, shuffle=True)

In [None]:
pred_ans = model.predict(test_model_input, batch_size=256)

print("test LogLoss", round(log_loss(test[target].values, pred_ans), 2))
print("test AUC", roc_auc_score(test[target].values, pred_ans))

## Conclusion:
In this notebook, we focused on the CTR prediction task for display advertising in the e-commerce industry, using the Taobao dataset as an example, which contains internet-scale user behavior data covering 8 days. We summarized the two key structures of the data, diversity and local activation, and described DIN (Deep Interest Network), a model designed to exploit these structures better. According to its authors, DIN brings more interpretability and achieves better GAUC (Group weighted AUC) than popular MLP models. We also touched on the overfitting problem in training such industrial deep networks, which the authors address with an adaptive regularization technique, and on the data-dependent activation function "Dice", which improves convergence. Note that the model actually trained in the code above is xDeepFM from DeepCTR; the DIN call is left commented out.

Upcoming versions will bring more insights on the DIN model implementation.

### I hope you got a good overview of the DIN model implementation. I would greatly appreciate your comments, and if you liked this kernel, please encourage me with an upvote. Thank you :)
