CN114511152A - Training method and device of prediction model - Google Patents


Info

Publication number
CN114511152A
Authority
CN
China
Prior art keywords: network, prediction, preset, training, expert
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210146356.0A
Other languages
Chinese (zh)
Inventor
何俊佑
梅桂宝
邢凤
杨小锐
丁卓冶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202210146356.0A
Publication of CN114511152A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G06N20/20: Ensemble learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/02: Knowledge representation; Symbolic representation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10: Services


Abstract

The embodiments of the disclosure disclose a method and an apparatus for training a prediction model. One embodiment of the method comprises: acquiring a training sample set composed of training sample subsets that each correspond to one of a number of preset scenes; acquiring an initial model composed of prediction models corresponding to the preset scenes, where all prediction models share the same embedding network and the same shared expert network, and the prediction model for each preset scene further includes a private expert network and a prediction network specific to that scene; the embedding network generates feature vectors corresponding to the attribute data, the private expert network extracts features specific to the attribute data under its preset scene, the shared expert network extracts features of the attribute data common to all preset scenes, and the prediction network generates a predicted data index value from the outputs of the private and shared expert networks; and training the initial model with the training sample set and a loss function. This implementation helps improve the training of prediction models across multiple scenes.

Description

Training method and device of prediction model
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a method and a device for training a prediction model.
Background
Prediction is an important component of how various network service platforms provide services to users. Accurate prediction enables personalized services for the user and thus improves the user experience. With the development of machine learning, many prediction methods based on deep learning have appeared.
As network platforms develop and expand, many of them have a very large number of scenarios that require a prediction model. For example, some e-commerce platforms have many medium- and long-tail scenarios in which prediction models are needed to refine the recommendation system and enhance the user experience.
Existing prediction models are usually trained in one of two ways. The first is to collect data separately for each scene to be predicted and train a corresponding prediction model, so that the trained model serves only that scene. The second is to mix the data from multiple scenes and train a single prediction model that serves all of them.
Disclosure of Invention
The embodiment of the disclosure provides a method and a device for training a prediction model.
In a first aspect, an embodiment of the present disclosure provides a method for training a prediction model, the method including: acquiring a training sample set composed of training sample subsets that each correspond to a preset scene, where each training sample includes sample attribute data and a sample data index value under its preset scene; acquiring an initial model composed of prediction models corresponding to the respective preset scenes, where all prediction models share the same embedding network and the same shared expert network, and the prediction model for each preset scene further includes a private expert network and a prediction network specific to that scene; the embedding network is used for generating a feature vector corresponding to the attribute data, the private expert network for extracting features specific to the attribute data under its preset scene, the shared expert network for extracting features of the attribute data common to all preset scenes, and the prediction network for generating a predicted data index value from the outputs of the private and shared expert networks; and training the initial model with the training sample set and a loss function, where the loss function is determined according to the sum of the loss values respectively corresponding to the prediction models.
In a second aspect, an embodiment of the present disclosure provides a data index prediction method, including: acquiring attribute data in a preset scene; inputting the attribute data into a prediction model corresponding to a preset scene to obtain a data index value, wherein the prediction model is obtained by pre-training by using the method described in the first aspect; and executing preset operation corresponding to the data index value.
In a third aspect, an embodiment of the present disclosure provides an apparatus for training a prediction model, the apparatus including: a first acquisition unit configured to acquire a training sample set composed of training sample subsets corresponding to respective preset scenes, where each training sample includes sample attribute data and a sample data index value under its preset scene; a second acquisition unit configured to acquire an initial model composed of prediction models corresponding to the respective preset scenes, where all prediction models share the same embedding network and the same shared expert network, the prediction model for each preset scene further includes a private expert network and a prediction network specific to that scene, the embedding network generates feature vectors corresponding to the attribute data, the private expert network extracts features specific to the attribute data under its preset scene, the shared expert network extracts features of the attribute data under every preset scene, and the prediction network generates a predicted data index value from the outputs of the private and shared expert networks; and a training unit configured to train the initial model with the training sample set and a loss function, where the loss function is determined according to the sum of the loss values respectively corresponding to the prediction models.
In a fourth aspect, an embodiment of the present disclosure provides a data index prediction apparatus, including: a third acquiring unit configured to acquire attribute data in a preset scene; the prediction unit is configured to input the attribute data into a prediction model corresponding to a preset scene to obtain a data index value, wherein the prediction model is obtained by pre-training by using the method described in the first aspect; and the execution unit is configured to execute preset operation corresponding to the data index value.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; storage means for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a sixth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, which computer program, when executed by a processor, implements the method as described in any of the implementations of the first aspect.
According to the training method and apparatus for a prediction model provided by the embodiments of the disclosure, a private expert network and a shared expert network are set for the prediction model of each scene to extract, respectively, the features specific to the attribute data under that scene and the general features of the attribute data across all scenes, and the data index value under each scene is predicted by fusing the outputs of the private and shared expert networks. The prediction model of each scene can therefore be trained by fully exploiting both the particularities of each scene and the commonalities among scenes, which improves the accuracy of the prediction results of the prediction model under each scene and optimizes the various services that application scenes provide on the basis of the prediction models.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a method of training a predictive model according to the present disclosure;
FIG. 3 is a network architecture diagram of an initial model suitable for use in implementing embodiments of the present disclosure;
FIG. 4 is a flow diagram of one embodiment of a data index prediction method according to the present disclosure;
FIG. 5 is a schematic diagram of an application scenario of a data index prediction method according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an embodiment of a predictive model training apparatus according to the present disclosure;
FIG. 7 is a schematic block diagram of one embodiment of a data index prediction apparatus according to the present disclosure;
FIG. 8 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary architecture 100 of an embodiment of a training method of a predictive model or a training apparatus of a predictive model to which the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 interact with a server 105 via a network 104 to receive or send messages or the like. Various client applications may be installed on the terminal devices 101, 102, 103. For example, browser-like applications, search-like applications, instant messaging tools, social platform software, deep learning-like applications, neural network model training-like applications, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server that provides various services, such as a backend server that provides service support for client applications installed on the terminal devices 101, 102, 103. The server 105 may receive the data sent by the terminal devices 101, 102, 103 and determine a training sample set from the data sent by the terminal devices 101, 102, 103, and then the server 105 may train the initial model from the training sample set.
It should be noted that the training method of the predictive model provided by the embodiment of the present disclosure is generally performed by the server 105, and accordingly, the training device of the predictive model is generally disposed in the server 105.
It should also be noted that a neural network model class tool or application may also be installed in the terminal devices 101, 102, 103, and the terminal devices 101, 102, 103 may train the initial model with the training sample set based on the neural network model class tool or application. In this case, the terminal devices 101, 102, and 103 may execute the training method of the prediction model, and accordingly, the training device of the prediction model may be provided in the terminal devices 101, 102, and 103. At this point, the exemplary system architecture 100 may not have the server 105 and the network 104.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method of training a predictive model according to the present disclosure is shown. The training method of the prediction model comprises the following steps:
step 201, a training sample set composed of training sample subsets corresponding to each preset scene is obtained.
In this embodiment, the executing entity (such as the server 105 shown in fig. 1) of the training method of the prediction model may obtain the training sample set from a local or other data platform or the like. The training sample set may be composed of training sample subsets corresponding to respective preset scenes. I.e. each preset scenario may correspond to a subset of training samples.
The training samples in the training sample subset corresponding to each preset scene may include sample attribute data and sample data index values in the preset scene. Each preset scene may be a scene predicted by various applications, and may be specifically set according to an actual application scene.
For example, each preset scenario may be various information push scenarios in a certain application. As an example, for a shopping application, each preset scenario may include an information pushing scenario of an item detail page, an information pushing scenario of an order detail page, an information pushing scenario of a to-be-paid page, and the like.
The attribute data in each preset scene may refer to various attribute data for predicting a data index value. The data index may be various types of data indexes, and may be specifically set according to an actual application requirement. For example, data metrics include, but are not limited to: click rate, conversion rate, retention rate, and the like. The attribute data may be set according to a specific corresponding scene. For example, in an information push scenario, the attribute data may include user attributes and attributes of information to be pushed, among others.
Step 202, obtaining an initial model formed by prediction models corresponding to each preset scene.
In this embodiment, the executing agent may obtain the initial model from a local or other data source or the like. The initial model may be composed of prediction models corresponding to respective preset scenes. I.e. one prediction model for each preset scene.
Each predictive model may include an Embedding (Embedding) network, a private expert network, a shared expert network, and a predictive network. The embedded network is used for generating a feature vector for representing the attribute data according to the input attribute data. Each predictive model may include the same embedded network to enable feature coding of the attribute data. The embedded network may be constructed from various embedded layers that are present.
For example, in the item push scenario, the attribute data input into the embedded network includes a user attribute and an item attribute, and the embedded network may generate a user feature vector for characterizing the user attribute and an item feature vector for characterizing the item attribute. In the actual application process, various attribute features can be spliced to form a final feature representation.
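As a concrete illustration of this embedding step, the sketch below looks up a per-attribute embedding and splices (concatenates) the results into the final feature representation. It is plain NumPy; the table sizes, IDs, and embedding width are invented for illustration and are not part of the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary sizes and embedding width (illustrative only).
EMB_DIM = 4
user_table = rng.normal(size=(100, EMB_DIM))   # one row per user id
item_table = rng.normal(size=(500, EMB_DIM))   # one row per item id

def embed(user_id: int, item_id: int) -> np.ndarray:
    """Look up each attribute's embedding and splice them into the
    final feature vector fed to the expert networks."""
    user_vec = user_table[user_id]
    item_vec = item_table[item_id]
    return np.concatenate([user_vec, item_vec])

x = embed(user_id=7, item_id=42)
assert x.shape == (2 * EMB_DIM,)
```

In a trained model these tables would be learned parameters; here they are random so the sketch stays self-contained.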
The private expert network included in each prediction model can be used for extracting the characteristic features of the attribute data under the preset scene corresponding to the prediction model. The shared expert network may be used to extract general features of the attribute data in each of the preset scenarios. The private expert network and the shared expert network may be constructed based on various existing feature extraction networks. For example, the private expert network and the shared expert network may be various fully connected networks.
Each prediction model may include inputs to the prediction network that may be outputs of the corresponding private expert network and shared expert network, and outputs of the prediction network may be generated data indicator values. The predictive network may be constructed from existing networks of various predictive data indicators. For example, the predictive network may be a variety of classification or regression based convolutional neural networks.
As an example, the feed-forward process of the prediction network corresponding to preset scene k can be expressed as follows:

h_k^(l) = σ(W_k^(l) · h_k^(l-1) + b_k^(l))

where h_k^(l) represents the output of layer l of the prediction network, σ denotes the activation function, and W_k^(l) and b_k^(l) are the network parameters of the prediction network, namely the weights and biases respectively.

Correspondingly, as an example, the data index value output by the prediction network may be expressed as follows:

p_k = Sigmoid(W_k^(L) · h_k^(L-1) + b_k^(L))

where p_k represents the predicted data index value, Sigmoid is an activation function, and L represents the number of network layers of the prediction network.
It should be noted that, in general, the network structures of the private expert networks included in the prediction models may be the same or similar. Similarly, the network structures of the prediction networks included in the prediction models may be the same or similar.
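A minimal sketch of such a prediction tower, assuming fully connected layers with ReLU activations in the hidden layers and a Sigmoid on the output. The layer sizes are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prediction_tower(x, weights, biases):
    """Feed-forward pass h^(l) = sigma(W^(l) h^(l-1) + b^(l)) through the
    hidden layers, with Sigmoid on the final layer to emit the data
    index value p_k."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return sigmoid(weights[-1] @ h + biases[-1])

rng = np.random.default_rng(0)
dims = [8, 16, 1]  # illustrative layer sizes: input, hidden, output
Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
bs = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]

p_k = prediction_tower(rng.normal(size=8), Ws, bs)
assert 0.0 < p_k[0] < 1.0  # Sigmoid keeps the index value in (0, 1)
```

The Sigmoid output matches index values such as click rate or conversion rate, which are naturally bounded in (0, 1).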
Step 203, training the initial model by using the training sample set and the loss function.
In this embodiment, the initial model may be trained with the preset sample set and a preset loss function to obtain a trained initial model, from which the trained prediction models corresponding to the respective preset scenes can be obtained.
Specifically, by using a machine learning method, sample attribute data included in training samples in a training sample set is used as input of an initial model, a sample data index value corresponding to each preset scene is used as expected output of a prediction network corresponding to the preset scene, and network parameters of the initial model are adjusted based on methods such as gradient descent and back propagation according to a preset loss function, so as to complete training of the initial model.
The loss function can be determined according to the sum of the loss values respectively corresponding to the prediction models, and can be flexibly set according to actual application requirements. The loss value for each prediction model may represent the difference between the data indicator value output by that prediction model and the corresponding sample data indicator value. At this point, the initial model may be continuously trained by minimizing the loss function. As an example, the corresponding loss value for each prediction model may be calculated based on cross-entropy loss.
In an actual application process, the trained initial model may be deployed to serve each preset scenario. Specifically, the network parameters of the prediction model corresponding to each preset scenario may be activated in each preset scenario to complete the prediction.
In some optional implementations of this embodiment, the prediction model corresponding to each preset scenario may further include a gate network corresponding to the preset scenario. The gate network may be configured to merge outputs of the private expert networks corresponding to the preset scene, and the output of the gate network may be used as an input of the prediction network corresponding to the preset scene. The gate network may be configured according to the particular feature fusion method employed.
At this time, the private expert network may include a plurality of feature extraction networks to respectively extract features of the attribute data in the preset scene. The gate network may fuse the outputs of these feature extraction networks to extract information useful for predicting data index values.
For example, the gate network may assign different weights to the outputs of the respective feature extraction networks, and input the weighted sum of the outputs of the respective feature extraction networks as a fusion result to the prediction network.
As an example, the gate network may perform linear transformation on the input, and then input into the Softmax layer for processing, which is specifically expressed as follows:
g_k(X) = softmax(W_gk · X)

where g_k(X) represents the output of the gate network, W_gk represents the network parameters of the gate network, X represents the input of the gate network, and k denotes the preset scene k.
The characteristics of the attribute data in each preset scene can be extracted by using the gate network, so that the accuracy of the prediction result of the prediction network in each preset scene is improved.
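The gate fusion described above can be sketched as follows: a softmax over a linear transform of the input yields one weight per private expert, and the expert outputs are combined by their weighted sum. The expert count and dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def gate_fuse(x, W_g, expert_outputs):
    """g_k(X) = softmax(W_gk X); fuse the private expert outputs
    as their g-weighted sum (vector-granularity fusion)."""
    g = softmax(W_g @ x)     # one scalar weight per expert
    return sum(w, ) if False else sum(w * e for w, e in zip(g, expert_outputs))

rng = np.random.default_rng(0)
x = rng.normal(size=8)                             # embedding output
experts = [rng.normal(size=16) for _ in range(3)]  # m_k = 3 private experts
W_g = rng.normal(size=(3, 8))                      # rows: one per expert

fused = gate_fuse(x, W_g, experts)
assert fused.shape == (16,)
```

Because the weights come from a softmax, they are non-negative and sum to one, so the fusion is a convex combination of the expert outputs.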
In some optional implementations of this embodiment, the prediction model corresponding to each preset scenario may further include a distillation network corresponding to the preset scenario. Wherein, the distillation network can be used for fusing the output of the shared expert network, and the output of the distillation network can be used as the input of the prediction network. The distillation network may be configured according to the particular feature fusion method employed.
At this time, the shared expert network may include a plurality of feature extraction networks that each extract common features of the attribute data across all preset scenes. The distillation network may fuse the outputs of these feature extraction networks to extract information useful for the prediction result of each prediction network.
For example, the distillation network may assign different weights to respective elements in the feature vector output from each feature extraction network to update the feature vector, and then input the updated feature vectors corresponding to the respective feature extraction networks as a fusion result to the prediction network by adding the updated feature vectors by elements.
As an example, the distillation network may be set based on an activation function, specifically expressed as follows:
d_k(X) = σ_2(W_k^(2) · σ_1(W_k^(1) · X))

where d_k(X) represents the output of the distillation network, X represents the input of the distillation network, W_k^(1) and W_k^(2) represent the network parameters of the distillation network, σ_1 represents the activation function ReLU, σ_2 represents the activation function Sigmoid, and k denotes the preset scene k.
With the distillation network, the general features of the attribute data that are useful under each scene can be extracted from the shared expert network independently, which alleviates interference among the multiple expert networks and prevents the distillation of the shared expert network from being disturbed by the private expert networks; learning these cross-scene general features also helps improve the accuracy of the prediction results of the prediction network.
In some optional implementation manners of this embodiment, when the prediction model corresponding to each preset scene includes a gate network and a distillation network corresponding to the preset scene, an output of the gate network corresponding to the preset scene and an output of the distillation network corresponding to the preset scene may be fused, and an obtained fusion result may be used as an input of the prediction network corresponding to the preset scene.
The specific fusion method can be flexibly set according to the actual application requirements. For example, the outputs of the gate networks and the outputs of the corresponding distillation networks may be added element-wise to achieve fusion of the two.
As an example, the fusion of the outputs of the gate network and the distillation network can be represented by the following equation:
f_k(X) = Σ_{j=1..m_k} g_k(X)_j · E_j^k(X) + Σ_{j=1..m_s} d_k(X) ⊙ E_j^s(X)

where f_k(X) represents the fusion result, i.e. the input of the prediction network; X represents the input of the gate network and of the distillation network; ⊙ denotes the Hadamard (element-wise) product; k denotes the preset scene k; j is an index; m_k represents the number of feature extraction networks included in the private expert network; m_s represents the number of feature extraction networks included in the shared expert network; and E_j^k(X) and E_j^s(X) represent the outputs of the screened private and shared feature extraction networks respectively.
In this way, the gate network performs vector-granularity distillation of the private expert network outputs while the distillation network performs element-granularity distillation of the shared expert network outputs; fusing the two yields an accurate feature representation of the factors that influence the prediction of the data index value under each preset scene, assisting the subsequent prediction network.
Referring now to fig. 3, fig. 3 is a schematic diagram of a network architecture suitable for use as an initial model for implementing embodiments of the present disclosure. As shown in fig. 3, the initial model may include an embedded network common to the preset scenes to vector-encode the input attribute data, and a shared expert network common to extract general features of the attribute data in the preset scenes. For each preset scene, the prediction model corresponding to the preset scene may further include a private expert network, a gate network, a distillation network, and a prediction network corresponding to the preset scene. In addition, the initial model can further comprise a knowledge interaction module to realize mutual guidance and learning among the prediction models corresponding to different preset scenes.
Specifically, the output of the embedded network may be used as the input of the private expert network and of the shared expert network, respectively. The output of the private expert network is used as the input of the gate network, the output of the shared expert network is used as the input of the distillation network, and the fusion result of the outputs of the gate network and the distillation network is used as the input of the prediction network. The prediction network finally outputs the predicted data index value. The network parameters of the private expert networks, gate networks, distillation networks and prediction networks corresponding to different preset scenes are generally different.
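A minimal sketch of this data flow, assuming hypothetical container names for the per-scene and shared components (they are not part of the disclosure):

```python
def forward(x, model, scene):
    """Data flow of the fig. 3 architecture for one preset scene.

    Hypothetical layout (not from the disclosure):
    model["embed"]                  -- embedded network shared by all scenes
    model["shared_experts"]         -- feature extractors shared by all scenes
    model[scene]["private_experts"] -- feature extractors private to `scene`
    model[scene]["gate"]            -- fuses the private-expert outputs
    model[scene]["distill"]         -- fuses the shared-expert outputs
    model[scene]["predict"]         -- maps fused features to the index value
    """
    e = model["embed"](x)                                   # common embedding
    private = [f(e) for f in model[scene]["private_experts"]]
    shared = [f(e) for f in model["shared_experts"]]
    # The fusion of the gate-network and distillation-network outputs
    # is the input of the prediction network.
    fused = model[scene]["gate"](private) + model[scene]["distill"](shared)
    return model[scene]["predict"](fused)
```

Only `embed` and `shared_experts` are shared across scenes; everything looked up under `model[scene]` is scene-specific, matching the statement that those network parameters generally differ between preset scenes.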
In addition, in the training process of the initial model, various methods (such as transfer learning and the like) of knowledge interaction can be utilized to enable the prediction models corresponding to the preset scenes to mutually guide and learn so as to promote the training of the prediction models.
In some alternative implementations of this embodiment, the loss function may be determined according to a sum of the loss values respectively corresponding to the prediction models and a sum of the migration losses. The migration loss may represent a sum of migration loss values corresponding to the prediction models. The migration loss value of each prediction model may represent a sum of loss values corresponding to the teacher networks when the prediction model is used as the student network and the other prediction models are used as the teacher network.
As an example, the loss function may be expressed as follows (CE denotes the cross-entropy):

L = L_d + α·L_kt

L_d = Σ_k Σ_{i=1}^{N_k} CE(y_i^(k), ŷ_i^(k))

L_kt = Σ_p Σ_{k≠p} Σ_{i=1}^{N_p} CE(GB(p(x_i)), q(x_i))
where L represents the loss function; L_d represents the sum of the cross-entropy losses of the prediction models; p(x) denotes a teacher network and q(x) a student network; x_i represents sample attribute data in the preset scene corresponding to the teacher network, used as the input of both the teacher network and the student network; L_kt represents the sum of the cross-entropy losses based on student-teacher knowledge migration; α represents the strength of knowledge migration and can be preset according to the actual application requirements; GB denotes the gradient block operation, which avoids deterioration of the teacher network; N_p represents the number of samples corresponding to preset scene p; N_k represents the number of samples corresponding to scene k; ŷ_i^(k) represents the predicted data index value (such as a predicted click-through rate) of the i-th sample in scene k; and y_i^(k) represents the sample data index value in the training sample.
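A minimal pure-Python sketch of this loss, under an assumed data layout (the dictionary structure and function names are illustrative); since no autograd is involved here, the gradient block on the teacher output is only indicated in a comment:

```python
import math

def bce(y_true, y_pred, eps=1e-7):
    # Binary cross-entropy for one sample (e.g. click / no click).
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))

def total_loss(labels, preds, teacher_preds, alpha):
    """L = L_d + alpha * L_kt, with a hypothetical data layout.

    labels[k][i]           -- sample data index value of sample i in scene k
    preds[k][i]            -- scene k's model's prediction on its own samples
    teacher_preds[p][k][i] -- prediction of student model k on the samples
                              of teacher scene p
    """
    # L_d: sum of the cross-entropy losses of the prediction models.
    l_d = sum(bce(y, q) for k in labels for y, q in zip(labels[k], preds[k]))
    # L_kt: each scene acts in turn as teacher for every other scene.
    l_kt = 0.0
    for p in preds:
        for k in teacher_preds.get(p, {}):
            for t, s in zip(preds[p], teacher_preds[p][k]):
                # With autograd, t would be gradient-blocked (GB), i.e.
                # treated as a constant soft target for the student.
                l_kt += bce(t, s)
    return l_d + alpha * l_kt
```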
It should be noted that the above loss calculation merely takes transfer learning as an example; in the actual application process, various knowledge interaction methods may be adopted as required to optimize the training of each prediction model.
In some cases, little data is available in some scenes (e.g., in some medium- and long-tail scenes), so the prediction networks corresponding to these scenes cannot be trained sufficiently, which affects the training effect. Meanwhile, the knowledge learned in different scenes may have a promoting effect on one another. Therefore, based on the idea of knowledge interaction (e.g., drawing on knowledge distillation), during training the scenes with more training data can guide the scenes with less training data, so that each prediction network is trained sufficiently and the final training effect is optimized.
The training method for the prediction models of multiple preset scenes provided by the embodiments of the present disclosure is based on a basic multi-task learning framework in which every task output is the same task (i.e. the prediction task). A common embedded network at the bottom layer performs feature representation; a private expert network and a shared expert network respectively extract the data features unique to each preset scene and the data features common to all preset scenes; and the fused outputs of the private expert network and the shared expert network serve as the input of the prediction network, so as to improve the accuracy of the final prediction result. Moreover, knowledge migration is utilized in the training process, so that the training processes of the prediction models of the preset scenes can guide one another, further improving the training effect of the prediction models.
Compared with existing training methods for prediction models, the training method provided by the present disclosure does not need to train a separate prediction model for each preset scene using only the training data of that scene, which saves training cost and subsequent maintenance cost and alleviates the problem of insufficient training data for a single preset scene. Meanwhile, it avoids the information interference between scenes caused by training one universal prediction model on the mixed training data of all preset scenes, thereby improving the prediction accuracy of the data index value in each preset scene.
With further reference to FIG. 4, a flow 400 of one embodiment of a data index prediction method according to the present disclosure is shown. The process 400 of the data index prediction method includes the following steps:
step 401, obtaining attribute data in a preset scene.
In this embodiment, the preset scene may be any of various scenes to which the predicted data index applies. For example, the preset scene may be a medium- or long-tail scene involving information pushing in a shopping application. The attribute data may be various attribute data in the preset scene. For example, the attribute data may include user attribute data, attribute data of the information to be pushed, and other attribute data (e.g., environment attribute data, device attribute data, etc.).
The execution subject of the data index prediction method may obtain attribute data in a preset scenario from a local or other data source. It should be noted that the execution subject of the training method of the prediction model described in the embodiment of fig. 2 may be the same as or different from the execution subject of the data index prediction method of the present embodiment.
Step 402, inputting the attribute data into a prediction model corresponding to a preset scene to obtain a data index value.
In this embodiment, a pre-trained prediction model may be used to obtain a data index value corresponding to a preset scene according to the input attribute data. The data index value may correspond to various data indexes, and is specifically set according to an actual application scenario and an application requirement. The predictive model may be pre-trained using the method described above in the embodiment of fig. 2.
And step 403, executing a preset operation corresponding to the data index value.
In this embodiment, after obtaining the predicted data index value, a preset operation corresponding to the data index value may be performed. The preset operation may be various operations in a preset scene, and the corresponding relationship between the data index value and the preset operation may be preset according to actual application requirements. For example, the preset operation includes a push operation, a stop of the push operation, and the like.
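Steps 401-403 can be sketched as follows; the callables, the threshold and the push/stop mapping are illustrative assumptions, since the disclosure leaves the concrete preset operations to the application:

```python
def run_prediction_step(scene, get_attributes, models, actions, threshold=0.5):
    """Sketch of the fig. 4 flow: fetch attribute data, predict the data
    index value with the scene's model, then run the matching preset
    operation. All names here are hypothetical.
    """
    attrs = get_attributes(scene)        # step 401: obtain attribute data
    value = models[scene](attrs)         # step 402: predict the index value
    action = "push" if value >= threshold else "stop_push"
    return actions[action](value)        # step 403: execute preset operation
```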
With continued reference to fig. 5, fig. 5 illustrates an exemplary application scenario 500 of the data index prediction method according to the present embodiment. In the application scenario of fig. 5, for each piece of item information in the candidate pushed item information set 501, the prediction model 502 for the payment scene may determine, according to the user attributes corresponding to the user terminal 508 and attribute data such as the item attributes corresponding to the item information, the click-through rate 504 of the user corresponding to the user terminal 508 on the item indicated by the item information. Then, according to the click-through rates corresponding to the respective items, several pieces of item information with higher click-through rates may be selected from the candidate pushed item information set 501 to form a pushed item information set 506.
Similarly, for each piece of item information in the candidate pushed item information set 501, the prediction model 503 for the order detail scene may determine, according to the user attributes corresponding to the user terminal 508 and attribute data such as the item attributes corresponding to the item information, the click-through rate 505 of the user corresponding to the user terminal 508 on the item indicated by the item information. Then, according to the click-through rates corresponding to the respective items, several pieces of item information with higher click-through rates are selected from the candidate pushed item information set 501 to form a pushed item information set 507.
Further, when the user terminal 508 displays the to-be-paid page 5081, the item information in the pushed item information set 506 may be pushed to that page; and when the user terminal 508 displays the order detail page 5082, the item information in the pushed item information set 507 may be pushed to that page.
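The selection of pushed item information by click-through rate, as in the fig. 5 scenario, can be sketched as follows (the names are illustrative):

```python
def select_push_items(candidates, predict_ctr, top_n):
    """Rank candidate item information by predicted click-through rate
    and keep the top_n entries to form the pushed item information set."""
    scored = [(predict_ctr(item), item) for item in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_n]]
```

The same helper would be called once per scene, each time with that scene's prediction model supplying `predict_ctr`.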
The method provided by the above embodiment of the present disclosure uses the pre-trained prediction models in each scene to predict the data index values, and performs corresponding operations based on the predicted data index values, which is helpful for optimizing the service quality provided in each application scene.
With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a training apparatus for a prediction model, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 6, the training apparatus 600 of the prediction model provided in this embodiment includes a first obtaining unit 601, a second obtaining unit 602, and a training unit 603. The first obtaining unit 601 is configured to obtain a training sample set composed of training sample subsets corresponding to respective preset scenes, where the training samples include sample attribute data and sample prediction results in the preset scenes; the second obtaining unit 602 is configured to obtain an initial model composed of prediction models respectively corresponding to preset scenes, each prediction model including an identical embedded network and a shared expert network, the prediction model corresponding to each preset scene further including a private expert network and a prediction network corresponding to the preset scene, the embedded network being configured to generate feature vectors corresponding to attribute data, the private expert network being configured to extract features corresponding to the attribute data in the preset scenes, the shared expert network being configured to extract features of the attribute data in each preset scene, and the prediction network being configured to generate a prediction result according to outputs of the private expert network and the shared expert network; the training unit 603 is configured to train the initial model using a training sample set and a loss function, wherein the loss function is determined according to a sum of loss values respectively corresponding to the prediction models.
In the present embodiment, the training apparatus 600 for the prediction model includes: the detailed processing of the first obtaining unit 601, the second obtaining unit 602, and the training unit 603 and the technical effects thereof can refer to the related descriptions of step 201, step 202, and step 203 in the corresponding embodiment of fig. 2, which are not repeated herein.
In some optional implementation manners of this embodiment, the prediction model corresponding to each preset scenario further includes a gate network corresponding to the preset scenario, where the gate network is used to merge outputs of corresponding private expert networks, and the output of the gate network is an input of the corresponding prediction network.
In some optional implementation manners of this embodiment, the prediction model corresponding to each preset scenario further includes a distillation network corresponding to the preset scenario, where the distillation network is used to merge outputs of the shared expert network, and the output of the distillation network is an input of the corresponding prediction network.
In some optional implementations of the present embodiment, a fusion result of an output of the gate network corresponding to each preset scenario and an output of the distillation network corresponding to each preset scenario is used as an input of the prediction network corresponding to each preset scenario.
In some optional implementations of this embodiment, the loss function is determined according to a sum of the sum and the migration loss, the migration loss represents a sum of migration loss values corresponding to the prediction models, respectively, and the migration loss value of each prediction model represents a sum of loss values corresponding to the teacher networks when the prediction model is used as the student network and other prediction models are used as the teacher network.
In the apparatus provided by the above embodiment of the present disclosure, the first obtaining unit obtains a training sample set composed of training sample subsets corresponding to the respective preset scenes, where each training sample includes sample attribute data and a sample prediction result in the preset scene; the second obtaining unit obtains an initial model composed of prediction models respectively corresponding to the preset scenes, where each prediction model includes the same embedded network and shared expert network, the prediction model corresponding to each preset scene further includes a private expert network and a prediction network corresponding to the preset scene, the embedded network is used for generating feature vectors corresponding to attribute data, the private expert network is used for extracting the features of the attribute data specific to the preset scene, the shared expert network is used for extracting the features of the attribute data in all the preset scenes, and the prediction network is used for generating a prediction result according to the outputs of the private expert network and the shared expert network; and the training unit trains the initial model using the training sample set and a loss function, where the loss function is determined according to the sum of the loss values respectively corresponding to the prediction models. In this way, the characteristics of each scene and the commonality among scenes can be fully utilized to train the prediction model of each scene, improving the accuracy of the prediction results of the prediction models in the respective scenes and optimizing the various services provided in the application scenes based on the prediction models.
With further reference to fig. 7, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of a data index prediction apparatus, which corresponds to the embodiment of the method shown in fig. 4, and which can be applied in various electronic devices.
As shown in fig. 7, the data index prediction apparatus 700 provided by the present embodiment includes a third acquisition unit 701, a prediction unit 702, and an execution unit 703. Wherein, the third obtaining unit 701 is configured to obtain attribute data in a preset scene; the prediction unit 702 is configured to input the attribute data into a prediction model corresponding to a preset scene, and obtain a data index value, where the prediction model is obtained by pre-training using a method as described in the embodiment of fig. 2; the execution unit 703 is configured to perform a preset operation corresponding to the data index value.
In the present embodiment, in the data index prediction apparatus 700, the specific processing of the third obtaining unit 701, the prediction unit 702 and the execution unit 703 and the technical effects thereof may refer to the related descriptions of step 401, step 402 and step 403 in the embodiment corresponding to fig. 4, which are not described herein again.
According to the device provided by the embodiment of the disclosure, the attribute data in the preset scene is acquired through the third acquisition unit; the prediction unit inputs the attribute data into a prediction model corresponding to a preset scene to obtain a data index value, wherein the prediction model is obtained by pre-training by using a method described in the embodiment of fig. 2; the execution unit executes preset operation corresponding to the data index value, and is beneficial to optimizing the service quality provided under each application scene.
Referring now to FIG. 8, a block diagram of an electronic device (e.g., the server of FIG. 1) 800 suitable for use in implementing embodiments of the present disclosure is shown. The server shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 8, an electronic device 800 may include a processing means (e.g., central processing unit, graphics processor, etc.) 801 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM803, various programs and data necessary for the operation of the electronic apparatus 800 are also stored. The processing apparatus 801, the ROM 802, and the RAM803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 illustrates an electronic device 800 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 8 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a training sample set consisting of training sample subsets corresponding to all preset scenes respectively, wherein the training samples comprise sample attribute data and sample prediction results under the preset scenes; the method comprises the steps that an initial model formed by prediction models corresponding to preset scenes is obtained, each prediction model comprises an embedded network and a shared expert network which are the same, the prediction model corresponding to each preset scene further comprises a private expert network and a prediction network corresponding to the preset scene, the embedded network is used for generating a feature vector corresponding to attribute data, the private expert network is used for extracting features of the attribute data corresponding to the preset scenes, the shared expert network is used for extracting features of the attribute data under each preset scene, and the prediction network is used for generating a prediction result according to the output of the private expert network and the shared expert network; and training the initial model by utilizing a training sample set and a loss function, wherein the loss function is determined according to the sum of the loss values respectively corresponding to the prediction models.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes a first acquisition unit, a second acquisition unit, and a training unit. For example, the first obtaining unit may be further described as a unit that obtains a training sample set composed of training sample subsets corresponding to respective preset scenes, where the training sample includes sample attribute data and sample data index values in the preset scenes.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept as defined above. For example, the above features and (but not limited to) technical features with similar functions disclosed in the embodiments of the present disclosure are mutually replaced to form the technical solution.

Claims (10)

1. A method of training a predictive model, comprising:
acquiring a training sample set consisting of training sample subsets corresponding to all preset scenes respectively, wherein the training samples comprise sample attribute data and sample data index values under the preset scenes;
the method comprises the steps that an initial model formed by prediction models corresponding to preset scenes is obtained, each prediction model comprises an embedded network and a shared expert network which are the same, the prediction model corresponding to each preset scene further comprises a private expert network and a prediction network corresponding to the preset scene, the embedded network is used for generating a feature vector corresponding to attribute data, the private expert network is used for extracting features corresponding to the attribute data under the preset scenes, the shared expert network is used for extracting the features of the attribute data under the preset scenes, and the prediction network is used for generating prediction data index values according to the output of the private expert network and the shared expert network;
and training the initial model by utilizing the training sample set and the loss function, wherein the loss function is determined according to the sum of the loss values respectively corresponding to the prediction models.
2. The method according to claim 1, wherein the prediction model corresponding to each preset scenario further comprises a gate network corresponding to the preset scenario, the gate network is used for fusing the output of the corresponding private expert network, and the output of the gate network is the input of the corresponding prediction network.
3. The method of claim 2, wherein the prediction model corresponding to each predetermined scenario further comprises a distillation network corresponding to the predetermined scenario, the distillation network is used for fusing outputs of the shared expert network, and the output of the distillation network is an input of the corresponding prediction network.
4. The method of claim 3, wherein the fusion result of the output of the corresponding gate network and the output of the corresponding distillation network for each preset scene is used as the input of the corresponding prediction network.
5. The method of any one of claims 1 to 4, wherein the loss function is determined from the sum of the sum and the migration loss, the migration loss representing a sum of the values of the migration loss corresponding to the respective predictive models, the migration loss value of each predictive model representing a sum of the values of the loss corresponding to the respective teacher network when the predictive model is used as the student network and the other predictive models are used as the teacher network.
6. A data index prediction method comprises the following steps:
acquiring attribute data in a preset scene;
inputting the attribute data into a prediction model corresponding to the preset scene to obtain a data index value, wherein the prediction model is obtained by pre-training by using the method according to any one of claims 1 to 5;
and executing the preset operation corresponding to the data index value.
7. An apparatus for training a predictive model, comprising:
the device comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is configured to acquire a training sample set consisting of training sample subsets corresponding to preset scenes respectively, and the training samples comprise sample attribute data and sample data index values under the preset scenes;
the second acquisition unit is configured to acquire an initial model formed by prediction models corresponding to preset scenes respectively, each prediction model comprises an embedded network and a shared expert network which are the same, the prediction model corresponding to each preset scene also comprises a private expert network and a prediction network corresponding to the preset scene, the embedded network is used for generating a feature vector corresponding to attribute data, the private expert network is used for extracting features corresponding to the attribute data under the preset scenes, the shared expert network is used for extracting features of the attribute data under each preset scene, and the prediction network is used for generating a prediction data index value according to the outputs of the private expert network and the shared expert network;
and the training unit is configured to train the initial model by using the training sample set and a loss function, wherein the loss function is determined according to the sum of the loss values respectively corresponding to the prediction models.
8. A data index prediction apparatus, comprising:
a third acquisition unit configured to acquire attribute data in a preset scene;
a prediction unit configured to input the attribute data into the prediction model corresponding to the preset scene to obtain a data index value, wherein the prediction model is pre-trained using the method of any one of claims 1 to 5; and
an execution unit configured to execute a preset operation corresponding to the data index value.
9. An electronic device, comprising:
one or more processors; and
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 6.
10. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 6.
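The architecture recited in claims 5 and 7 — a shared embedding network and shared expert network, a private expert network and prediction tower per preset scene, a task loss summed over scenes, and a migration (distillation) loss in which each scene's model plays student to every other scene's model as teacher — can be illustrated with a minimal numpy sketch. All network shapes, the one-hidden-layer MLPs, and the use of mean squared error for both the task loss and the migration loss are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(params, x):
    # One hidden layer with ReLU: a stand-in for each sub-network in the claims.
    W1, b1, W2, b2 = params
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def init(in_dim, hid, out_dim):
    return (rng.normal(scale=0.1, size=(in_dim, hid)), np.zeros(hid),
            rng.normal(scale=0.1, size=(hid, out_dim)), np.zeros(out_dim))

N_SCENES, D_ATTR, D_EMB, D_FEAT = 3, 8, 16, 4          # assumed sizes

embed   = init(D_ATTR, 32, D_EMB)                      # shared embedding network
shared  = init(D_EMB, 32, D_FEAT)                      # shared expert network
private = [init(D_EMB, 32, D_FEAT) for _ in range(N_SCENES)]  # per-scene experts
towers  = [init(2 * D_FEAT, 32, 1) for _ in range(N_SCENES)]  # per-scene prediction networks

def forward(scene, attrs):
    # Prediction network consumes the concatenated private- and shared-expert outputs.
    e = mlp(embed, attrs)
    feats = np.concatenate([mlp(private[scene], e), mlp(shared, e)], axis=-1)
    return mlp(towers[scene], feats)

# One small batch of (sample attribute data, sample data index value) per scene.
xs = [rng.normal(size=(5, D_ATTR)) for _ in range(N_SCENES)]
ys = [rng.normal(size=(5, 1)) for _ in range(N_SCENES)]
preds = [forward(s, xs[s]) for s in range(N_SCENES)]

mse = lambda a, b: float(np.mean((a - b) ** 2))
task_loss = sum(mse(preds[s], ys[s]) for s in range(N_SCENES))

# Migration loss: scene s as student, every other scene t as teacher,
# compared on the student's own inputs.
migration = sum(mse(forward(s, xs[s]), forward(t, xs[s]))
                for s in range(N_SCENES) for t in range(N_SCENES) if t != s)

total_loss = task_loss + migration
```

In a real implementation the teacher outputs would be detached from the gradient (stop-gradient) so that only the student branch is updated by the migration term, and the whole model would be trained jointly on the union of the per-scene training sample subsets.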
CN202210146356.0A 2022-02-17 2022-02-17 Training method and device of prediction model Pending CN114511152A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210146356.0A CN114511152A (en) 2022-02-17 2022-02-17 Training method and device of prediction model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210146356.0A CN114511152A (en) 2022-02-17 2022-02-17 Training method and device of prediction model

Publications (1)

Publication Number Publication Date
CN114511152A true CN114511152A (en) 2022-05-17

Family

ID=81552031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210146356.0A Pending CN114511152A (en) 2022-02-17 2022-02-17 Training method and device of prediction model

Country Status (1)

Country Link
CN (1) CN114511152A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114897162A (en) * 2022-05-18 2022-08-12 Oppo广东移动通信有限公司 Training method, selection method and device of object selection model and electronic equipment

Similar Documents

Publication Publication Date Title
KR102342604B1 (en) Method and apparatus for generating neural network
JP7208952B2 (en) Method and apparatus for generating interaction models
JP2021096813A (en) Method and apparatus for processing data
CN111090756B (en) Artificial intelligence-based multi-target recommendation model training method and device
CN110688528B (en) Method, apparatus, electronic device, and medium for generating classification information of video
CN111523640B (en) Training method and device for neural network model
CN108090218B (en) Dialog system generation method and device based on deep reinforcement learning
CN112115257A (en) Method and apparatus for generating information evaluation model
CN111598253A (en) Training machine learning models using teacher annealing
US20230119229A1 (en) Augmenting neural networks
CN111340220A (en) Method and apparatus for training a predictive model
CN108182472A (en) For generating the method and apparatus of information
CN111783810A (en) Method and apparatus for determining attribute information of user
CN117290477A (en) Generating type building knowledge question-answering method based on secondary retrieval enhancement
CN114511152A (en) Training method and device of prediction model
CN111026849B (en) Data processing method and device
CN111090740B (en) Knowledge graph generation method for dialogue system
CN110991661A (en) Method and apparatus for generating a model
CN111767290B (en) Method and apparatus for updating user portraits
CN113255819A (en) Method and apparatus for identifying information
CN112308942A (en) Method and apparatus for generating image
CN110110894A (en) Construction method, device, medium, the electronic equipment of Economic Forecasting Mathematical Model
CN111522887B (en) Method and device for outputting information
CN113010784B (en) Method, apparatus, electronic device and medium for generating prediction information
CN111259659B (en) Information processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination