[CTR Prediction] The Wide and Deep Learning Model (Translation + TensorFlow Source Code Analysis)

https://github.com/Shicoder/Deep_Rec/tree/master/Deep_Rank

ABSTRACT

Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations is effective and interpretable, while generalization requires more feature engineering effort. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank. In this paper, we present Wide & Deep learning—jointly trained wide linear models and deep neural networks—to combine the benefits of memorization and generalization for recommender systems. We productionized and evaluated the system on Google Play, a commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models. We have also open sourced our implementation in TensorFlow.

INTRODUCTION

A recommender system can be viewed as a search ranking system, where the input query is a set of user and contextual information, and the output is a ranked list of items. Given a query, the recommendation task is to find the relevant items in a database and then rank the items based on certain objectives, such as clicks or purchases.
One challenge in recommender systems, similar to the general search ranking problem, is to achieve both memorization and generalization. Memorization can be loosely defined as learning the frequent co-occurrence of items or features and exploiting the correlation available in the historical data. Generalization, on the other hand, is based on transitivity of correlation and explores new feature combinations that have never or rarely occurred in the past. Recommendations based on memorization are usually more topical and directly relevant to the items on which users have already performed actions. Compared with memorization, generalization tends to improve the diversity of the recommended items. In this paper, we focus on the apps recommendation problem for the Google Play store, but the approach should apply to generic recommender systems.

For massive-scale online recommendation and ranking systems in an industrial setting, generalized linear models such as logistic regression are widely used because they are simple, scalable and interpretable. The models are often trained on binarized sparse features with one-hot encoding. E.g., the binary feature "user_installed_app=netflix" has value 1 if the user installed Netflix. Memorization can be achieved effectively using cross-product transformations over sparse features, such as AND(user_installed_app=netflix, impression_app=pandora), whose value is 1 if the user installed Netflix and then is later shown Pandora. This explains how the co-occurrence of a feature pair correlates with the target label. Generalization can be added by using features that are less granular, such as AND(user_installed_category=video, impression_category=music), but manual feature engineering is often required. One limitation of cross-product transformations is that they do not generalize to query-item feature pairs that have not appeared in the training data.

Embedding-based models, such as factorization machines[5] or deep neural networks, can generalize to previously unseen query-item feature pairs by learning a low-dimensional dense embedding vector for each query and item feature, with less burden of feature engineering. However, it is difficult to learn effective low-dimensional representations for queries and items when the underlying query-item matrix is sparse and high-rank, such as users with specific preferences or niche items with a narrow appeal. In such cases, there should be no interactions between most query-item pairs, but dense embeddings will lead to nonzero predictions for all query-item pairs, and thus can over-generalize and make less relevant recommendations. On the other hand, linear models with cross-product feature transformations can memorize these "exception rules" with far fewer parameters.

In this paper, we present the Wide & Deep learning framework to achieve both memorization and generalization in one model, by jointly training a linear model component and a neural network component as shown in Figure 1.
The main contributions of the paper include:
• The Wide & Deep learning framework for jointly training feed-forward neural networks with embeddings and linear model with feature transformations for generic recommender systems with sparse inputs.
• The implementation and evaluation of the Wide & Deep recommender system productionized on Google Play, a mobile app store with over one billion active users and over one million apps.
• We have open-sourced our implementation along with a high-level API in TensorFlow.
While the idea is simple, we show that the Wide & Deep framework significantly improves the app acquisition rate on the mobile app store, while satisfying the training and serving speed requirements.

1. Jointly training a deep neural network that uses embeddings together with a linear model that uses cross-product features.
2. Productionizing and evaluating the Wide & Deep system on Google Play.
3. Open-sourcing the implementation in TensorFlow.

RECOMMENDER SYSTEM OVERVIEW

An overview of the app recommender system is shown in Figure 2. A query, which can include various user and contextual features, is generated when a user visits the app store. The recommender system returns a list of apps (also referred to as impressions) on which users can perform certain actions such as clicks or purchases. These user actions, along with the queries and impressions, are recorded in the logs as the training data for the learner. Since there are over a million apps in the database, it is intractable to exhaustively score every app for every query within the serving latency requirements (often O(10) milliseconds). Therefore, the first step upon receiving a query is retrieval. The retrieval system returns a short list of items that best match the query using various signals, usually a combination of machine-learned models and human-defined rules. After reducing the candidate pool, the ranking system ranks all items by their scores. The scores are usually P(y|x), the probability of a user action label y given the features x, including user features (e.g., country, language, demographics), contextual features (e.g., device, hour of the day, day of the week), and impression features (e.g., app age, historical statistics of an app). In this paper, we focus on the ranking model using the Wide & Deep learning framework.

query: the set of user and contextual features generated when a user visits the app store.

P(y|x): the features x include user features (country, language, ...), contextual features (device, hour of the day, ...), and impression features (historical statistics of an app, ...). In this paper, we focus on applying the Wide & Deep model to the ranking system.

WIDE & DEEP LEARNING

The Wide Component

The wide component is a generalized linear model of the form y = w^T x + b, as illustrated in Figure 1 (left). y is the prediction, x = [x_1, x_2, \ldots, x_d] is a vector of d features, w = [w_1, w_2, \ldots, w_d] are the model parameters and b is the bias. The feature set includes raw input features and transformed features. One of the most important transformations is the cross-product transformation, which is defined as:

\phi_k(x) = \prod_{i=1}^{d} x_i^{c_{ki}}, \quad c_{ki} \in \{0, 1\}    (1)

where c_{ki} is a boolean variable that is 1 if the i-th feature is part of the k-th transformation \phi_k, and 0 otherwise. For binary features, a cross-product transformation (e.g., "AND(gender=female, language=en)") is 1 if and only if the constituent features ("gender=female" and "language=en") are all 1, and 0 otherwise. This captures the interactions between the binary features, and adds nonlinearity to the generalized linear model.
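To make the cross-product transformation concrete, here is a minimal Python sketch over binary features (the function and feature names are illustrative, not from the paper's code):

# Minimal sketch of a cross-product transformation phi_k: the result is 1
# iff every feature named in the cross is 1 (x_i ** c_ki with c_ki = 1).
def cross_product(x, cross):
    """x: dict mapping feature name -> 0/1; cross: feature names in phi_k."""
    result = 1
    for name in cross:
        result *= x.get(name, 0)
    return result

x = {"gender=female": 1, "language=en": 1}
assert cross_product(x, {"gender=female", "language=en"}) == 1
assert cross_product(x, {"gender=female", "language=fr"}) == 0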

The Deep Component

The deep component is a feed-forward neural network, as shown in Figure 1 (right). For categorical features, the original inputs are feature strings such as "language=en". These sparse, high-dimensional categorical features are first converted into low-dimensional dense real-valued vectors, i.e., embedding vectors. The embeddings are initialized randomly and then updated via backpropagation during training. Once converted, these low-dimensional embedding vectors are fed into the hidden layers of the neural network, where each hidden layer performs the following computation:

a^{(l+1)} = f(W^{(l)} a^{(l)} + b^{(l)})    (2)

where l is the layer number, f is the activation function (often ReLU), and a^{(l)}, b^{(l)}, and W^{(l)} are the activations, bias, and weights at the l-th layer.
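As a minimal numpy sketch of this per-layer computation of equation (2) (shapes and names are illustrative):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# One hidden layer step: a_{l+1} = f(W_l @ a_l + b_l), with f = ReLU here.
def hidden_layer(a, W, b):
    return relu(W @ a + b)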

Joint Training of Wide & Deep Model

P(Y=1|x) = \sigma(w_{wide}^{T}[x, \phi(x)] + w_{deep}^{T} a^{(l_f)} + b)    (3)

where Y is the binary class label, \sigma(\cdot) is the sigmoid function, \phi(x) are the cross-product transformations of the original features x, w_{wide} is the vector of all wide model weights, w_{deep} are the weights applied to the final hidden activations a^{(l_f)}, and b is the bias term.
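A numpy sketch of equation (3), assuming the wide input [x, phi(x)] and the final hidden activation a_lf have already been computed (all names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Wide logit + deep logit + bias, squashed by the sigmoid.
def wide_deep_predict(w_wide, x_and_cross, w_deep, a_lf, b):
    return sigmoid(w_wide @ x_and_cross + w_deep @ a_lf + b)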

Data Generation

In this stage, user and app impression data within a period of time are used to generate training data. Each example corresponds to one impression. The label is app acquisition: 1 if the impressed app was installed, and 0 otherwise. Vocabularies, which are tables mapping categorical feature strings to integer IDs, are also generated in this stage. The system computes the ID space for all the string features that occurred more than a minimum number of times. Continuous real-valued features are normalized to [0, 1] by mapping a feature value x to its cumulative distribution function P(X ≤ x), divided into n_q quantiles. The normalized value is (i - 1)/(n_q - 1) for values in the i-th quantile. Quantile boundaries are computed during data generation.

The app recommendation pipeline consists of three stages: data generation, model training, and model serving, as shown in Figure 3.

Vocabularies are tables that map categorical feature strings to integer IDs. The system computes an ID space for all string features that occur more than a configured minimum number of times. Continuous real-valued features are normalized to [0, 1] by mapping a feature value x to its cumulative distribution P(X ≤ x) and then discretizing into n_q quantiles. The quantile boundaries are also computed during this stage.
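A sketch of this normalization, assuming numpy and illustrative helper names:

import numpy as np

def quantile_boundaries(train_values, nq):
    # n_q + 1 boundaries, computed once during data generation.
    return np.quantile(train_values, np.linspace(0.0, 1.0, nq + 1))

def quantile_normalize(x, boundaries):
    # A value in the i-th quantile (i = 1..n_q) maps to (i - 1) / (n_q - 1).
    nq = len(boundaries) - 1
    i = np.clip(np.searchsorted(boundaries, x, side="right"), 1, nq)
    return (i - 1) / (nq - 1)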

Model Training

The model structure we used in the experiment is shown in Figure 4. During training, our input layer takes in training data and vocabularies and generates sparse and dense features together with a label. The wide component consists of the cross-product transformation of user installed apps and impression apps. For the deep part of the model, a 32-dimensional embedding vector is learned for each categorical feature. We concatenate all the embeddings together with the dense features, resulting in a dense vector of approximately 1200 dimensions. The concatenated vector is then fed into 3 ReLU layers, and finally the logistic output unit. The Wide & Deep models are trained on over 500 billion examples. Every time a new set of training data arrives, the model needs to be re-trained. However, retraining from scratch every time is computationally expensive and delays the time from data arrival to serving an updated model.

To tackle this challenge, we implemented a warm-starting system which initializes a new model with the embeddings and the linear model weights from the previous model. Before loading the models into the model servers, a dry run of the model is done to make sure that it does not cause problems in serving live traffic. We empirically validate the model quality against the previous model as a sanity check.

Wide & Deep is trained on over 500 billion examples. Every time a new batch of training data arrives, the model needs to be retrained, but retraining from scratch each time is costly and slow. To overcome this, we implemented a warm-starting system that initializes the new model's weights with the embeddings and linear model weights from the previous model.
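The paper's warm-starting system is internal to Google, but the public estimator API exposes a similar mechanism through tf.estimator.WarmStartSettings; a hedged sketch (the checkpoint paths and column lists here are hypothetical, not the paper's production setup):

import tensorflow as tf

# Initialize all matching variables (embeddings, linear weights, ...) from
# the previous model's checkpoint instead of training from scratch.
ws = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from="/tmp/previous_model",
    vars_to_warm_start=".*")
m = tf.estimator.DNNLinearCombinedClassifier(
    model_dir="/tmp/new_model",
    linear_feature_columns=crossed_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50],
    warm_start_from=ws)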

Model Serving

Once the model is trained and verified, we load it into the model servers. For each request, the servers receive a set of app candidates from the app retrieval system and user features to score each app. Then, the apps are ranked from the highest scores to the lowest, and we show the apps to the users in this order. The scores are calculated by running a forward inference pass over the Wide & Deep model. In order to serve each request on the order of 10 ms, we optimized the performance using multithreading parallelism by running smaller batches in parallel, instead of scoring all candidate apps in a single batch inference step.

Feature Engineering

The feature_column module provides these built-in functions:

• crossed_column: constructs cross-product features.
• numeric_column: handles real-valued features.
• bucketized_column: discretizes continuous features into buckets.
• categorical_column_with_hash_bucket: hashes categorical values into bins.
• categorical_column_with_vocabulary_file: takes the full set of categorical values from a file.
• categorical_column_with_vocabulary_list: takes the full set of categorical values from an in-memory list.
• categorical_column_with_identity: uses the feature value itself as the integer ID.
• weighted_categorical_column: attaches a weight to each categorical value.
• indicator_column: one-hot encodes a categorical column.

1. For categorical features with a small number of possible values, the demo uses tf.feature_column.categorical_column_with_vocabulary_list() to map the feature from strings to integers. For example, the gender feature takes the values "Female" or "Male" in the raw dataset, so we can map it with:

gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])


categorical_column_with_vocabulary_list() also takes a num_oov_buckets argument, where OOV stands for out-of-vocabulary: values that do not appear in the defined vocabulary are assigned to one of the OOV buckets. Under the hood, this method is essentially a hash table mapping strings to ints.
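For example, with one OOV bucket any unseen gender string maps to an extra ID rather than being dropped (a small sketch based on the demo column above):

gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"], num_oov_buckets=1)
# IDs 0 and 1 cover the vocabulary; all unseen values map to ID 2.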

2. For categorical features whose set of values is unknown, or which have very many values, use tf.feature_column.categorical_column_with_hash_bucket(). The idea is the same as categorical_column_with_vocabulary_list, but since we do not know the possible values we cannot define a vocabulary; instead, each value is hashed directly into a bucket, assigning every possible value an integer ID. For example:

occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)


Internally, TensorFlow converts non-string values to strings and then hashes them (simplified excerpt from the feature column implementation):

# Non-string values are first converted to strings, then hashed into
# one of hash_bucket_size buckets.
if self.dtype == dtypes.string:
    sparse_values = input_tensor.values
else:
    sparse_values = string_ops.as_string(input_tensor.values)
sparse_id_values = string_ops.string_to_hash_bucket_fast(
    sparse_values, self.hash_bucket_size, name='lookup')

In effect:

output_id = Hash(input_feature_string) % bucket_size

3. Continuous features are wrapped with tf.feature_column.numeric_column():

# Continuous base columns.
age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")



4. For continuous features with skewed distributions, discretize them with tf.feature_column.bucketized_column():

age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])


5. Cross-product features are built with tf.feature_column.crossed_column():

tf.feature_column.crossed_column(
    ["education", "occupation"], hash_bucket_size=1000)

The TensorFlow documentation illustrates the crossing with two sparse inputs:

SparseTensor referred by first key:
shape = [2, 2]
{
    [0, 0]: "a"
    [1, 0]: "b"
    [1, 1]: "c"
}
SparseTensor referred by second key:
shape = [2, 1]
{
    [0, 0]: "d"
    [1, 0]: "e"
}
then the crossed feature will look like:
shape = [2, 2]
{
    [0, 0]: Hash64("d", Hash64("a")) % hash_bucket_size
    [1, 0]: Hash64("e", Hash64("b")) % hash_bucket_size
    [1, 1]: Hash64("e", Hash64("c")) % hash_bucket_size
}


6. Indicator features. A DNN cannot consume a sparse categorical column directly: the transformations above all map strings to integer IDs, and a raw ID value cannot be fed into the network as-is. The linear model can consume these sparse IDs directly (via the weighted-sum lookup shown later), but for the deep part the IDs must first be densified, which is what indicator_column does through one-hot encoding.
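A brief usage sketch (occupation is the hashed column defined earlier):

# Wrap the categorical column so the DNN receives a multi-hot vector.
occupation_onehot = tf.feature_column.indicator_column(occupation)

Internally, the indicator column one-hot encodes the IDs: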

one_hot_id_tensor = array_ops.one_hot(
    dense_id_tensor,
    depth=self._variable_shape[-1],
    on_value=1.0,
    off_value=0.0)

# Multi-valued features are reduced to a single multi-hot vector.
return math_ops.reduce_sum(one_hot_id_tensor, axis=[-2])


Embedding_column

For higher-cardinality categorical columns feeding the deep part, embedding_column learns a dense low-dimensional vector per ID instead of a one-hot encoding:

tf.feature_column.embedding_column(native_country, dimension=8)


Internally this constructs an _EmbeddingColumn:

return _EmbeddingColumn(
    categorical_column=categorical_column,
    dimension=dimension,
    combiner=combiner,
    initializer=initializer,
    ckpt_to_load_from=ckpt_to_load_from,
    tensor_name_in_ckpt=tensor_name_in_ckpt,
    max_norm=max_norm,
    trainable=trainable)

The embedding weight matrix has shape (num_buckets, dimension):

embedding_weights = variable_scope.get_variable(
    name='embedding_weights',
    shape=(self.categorical_column._num_buckets, self.dimension),
    dtype=dtypes.float32,
    initializer=self.initializer,
    trainable=self.trainable and trainable,
    collections=weight_collections)


For what the underlying tf.nn.embedding_lookup does, see:
https://stackoverflow.com/questions/34870614/what-does-tf-nn-embedding-lookup-function-do

Model Construction

m = tf.estimator.DNNLinearCombinedClassifier(
    model_dir=model_dir,
    linear_feature_columns=crossed_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50])
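Training and evaluation then go through the standard estimator interface (train_input_fn and eval_input_fn are hypothetical input functions returning (features, labels) tensors):

# Fit the joint model, then report held-out metrics.
m.train(input_fn=train_input_fn, steps=1000)
results = m.evaluate(input_fn=eval_input_fn)
print(results["accuracy"])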


DNNLinearCombinedClassifier subclasses the generic Estimator:

class DNNLinearCombinedClassifier(estimator.Estimator)


Its model function is _dnn_linear_combined_model_fn, which builds the linear and DNN towers and combines their logits.


Building the DNN

The DNN input layer is assembled from the deep feature columns:

net = feature_column_lib.input_layer()


The Linear Component

for column in sorted(feature_columns, key=lambda x: x.name):
    with variable_scope.variable_scope(None, default_name=column.name):
        ordered_columns.append(column)
        if isinstance(column, _CategoricalColumn):
            weighted_sums.append(_create_categorical_column_weighted_sum(
                column, builder, units, sparse_combiner, weight_collections,
                trainable))
        else:
            weighted_sums.append(_create_dense_column_weighted_sum(
                column, builder, units, weight_collections, trainable))


For categorical columns, _create_categorical_column_weighted_sum() allocates one weight per bucket and computes the weighted sum through a sparse lookup:

weight = variable_scope.get_variable(
    name='weights',
    shape=(column._num_buckets, units),  # pylint: disable=protected-access
    initializer=init_ops.zeros_initializer(),
    trainable=trainable,
    collections=weight_collections)
return _safe_embedding_lookup_sparse(
    weight,
    id_tensor,
    sparse_weights=weight_tensor,
    combiner=sparse_combiner,
    name='weighted_sum')
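This lookup is simply a linear model's one-hot matmul done sparsely; a small numpy check with illustrative sizes:

import numpy as np

# Selecting row `idx` of the weight matrix equals one_hot(idx) @ W, so the
# sparse lookup computes the same weighted sum without the dense matmul.
W = np.random.randn(1000, 1)   # (num_buckets, units)
idx = 42
one_hot = np.eye(1000)[idx]
assert np.allclose(W[idx], one_hot @ W)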


Dense columns go through _create_dense_column_weighted_sum(); the per-column weighted sums are then added together with a bias term:

predictions_no_bias = math_ops.add_n(
    weighted_sums, name='weighted_sum_no_bias')
bias = variable_scope.get_variable(
    'bias_weights',
    shape=[units],
    initializer=init_ops.zeros_initializer(),
    trainable=trainable,
    collections=weight_collections)
return math_ops.add(
    predictions_no_bias, bias, name='weighted_sum')


Combining the wide and deep logits

if dnn_logits is not None and linear_logits is not None:
    logits = dnn_logits + linear_logits


Backpropagation of the loss is just as direct: each component gets its own optimizer.

def _train_op_fn(loss):
    """Returns the op to optimize the loss."""
    ...
    if dnn_logits is not None:
        train_ops.append(
            dnn_optimizer.minimize(
                loss, ...))
    if linear_logits is not None:
        train_ops.append(
            linear_optimizer.minimize(
                loss, ...))
