WO2023221928A1 - A recommendation method, training method and device - Google Patents

A recommendation method, training method and device

Info

Publication number: WO2023221928A1
Authority: WIPO (PCT)
Prior art keywords: tower, network, expert, feature extraction, task
Application number: PCT/CN2023/094227
Other languages: English (en), French (fr)
Inventors: 贾庆林, 刘晓帆, 李璟洁, 唐睿明, 董振华
Original Assignee: Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023221928A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a recommendation method, training method and device.
  • the parameters of the machine learning model are trained through optimization methods such as gradient descent. After the model parameters converge, the model can be used to complete the prediction of unknown data.
  • the input data includes user characteristics, item characteristics, context characteristics, etc.
  • the output is a recommendation list generated for the user.
  • Multi-task learning may cause negative transfer, that is, information sharing between tasks will affect the performance of the network. Therefore, a more flexible parameter sharing mechanism is needed.
  • data sparseness makes the conversion rate prediction model prone to overfitting. Therefore, how to obtain more accurate prediction results has become an urgent problem to be solved.
  • This application provides a recommendation method, training method and device, in which multiple tower expert networks are set up in the recommendation model, thereby avoiding the overfitting problem caused by data sparsity and improving the output stability of the model.
  • this application provides a recommendation method, which includes: obtaining input data, where the input data includes user information; and then using the input data as the input of a recommendation model to output recommendation information for the user;
  • the recommendation model is used to perform multiple tasks for recommending users.
  • the recommendation model includes a shared feature extraction network, multiple tower expert networks corresponding to each task, and a task-specific feature extraction network corresponding to each task.
  • the output end of the shared feature extraction network is connected to the input end of each tower expert network, and the input ends of multiple tower expert networks corresponding to each task are also connected to the output end of the task-specific feature extraction network corresponding to each task;
  • the parameters of multiple tower expert networks are different.
  • the shared feature extraction network is used to extract shared features from the input data.
  • the shared features are shared by the tower expert networks corresponding to multiple tasks.
  • the task-specific feature extraction network is used to extract tower-expert shared features from the input data.
  • the tower-expert shared features are shared by the multiple tower expert networks corresponding to a single task.
  • the multiple tower expert networks are used to perform their corresponding tasks based on the features extracted by the corresponding task-specific feature extraction network and the shared feature extraction network, and the outputs of the multiple tower expert networks corresponding to the multiple tasks are weighted and fused to obtain the recommendation information.
  • multiple tower expert networks with different parameters are set up for each task, thereby improving the output accuracy of the recommendation model through the output results of the multiple tower expert networks. Even when the data is sparse, more stable output results can be obtained through the output results of the multiple tower expert structures.
  • the recommendation model also includes a tower feature extraction network corresponding to multiple tower expert networks.
  • the tower feature extraction network is used to extract, from the input data, features relevant to the tasks performed by the corresponding tower expert network, and the parameters of the tower feature extraction networks corresponding to different tower expert networks are different.
  • the input of each tower expert network also includes features extracted by the corresponding tower feature extraction network.
  • a separate feature extraction network is set up for each tower, so that the required features can be extracted for each tower expert network in a targeted manner, thereby further improving the accuracy of the output results of the recommendation model.
  • the recommendation model also includes multiple gating networks; each tower expert network corresponds to a gating network, and the gating network is used to fuse the outputs of the corresponding task-specific feature extraction network, shared feature extraction network and tower feature extraction network as the input of the corresponding tower expert network.
  • the weights of the various features input to each tower expert network are controlled through the gating network, so that the required features can be adaptively extracted for different tower expert networks and the output accuracy of each tower expert network can be improved.
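  • To make the architecture concrete, the following is a minimal PyTorch sketch of such a model. All names (MultiTowerModel, TowerGate, mlp), layer sizes, the number of tasks and towers, and the use of plain MLPs as the feature extraction networks are illustrative assumptions, not taken from the publication.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    # Stand-in for any of the feature extraction networks described above.
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

class TowerGate(nn.Module):
    """Gating network: a fully connected layer plus softmax that weights the
    shared, task-specific and tower-specific features for one tower expert."""
    def __init__(self, in_dim, num_sources=3):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_sources)

    def forward(self, x, features):
        # features: [shared, task_specific, tower_specific], each (B, D)
        w = torch.softmax(self.fc(x), dim=-1)          # (B, 3) fusion weights
        stacked = torch.stack(features, dim=1)         # (B, 3, D)
        return (w.unsqueeze(-1) * stacked).sum(dim=1)  # weighted fusion, (B, D)

class MultiTowerModel(nn.Module):
    def __init__(self, in_dim=64, feat_dim=32, num_tasks=2, towers_per_task=3):
        super().__init__()
        self.shared = mlp(in_dim, feat_dim)  # shared feature extraction network
        self.task_specific = nn.ModuleList(  # one task-specific extractor per task
            [mlp(in_dim, feat_dim) for _ in range(num_tasks)])
        self.tower_feats = nn.ModuleList([   # one tower feature extractor per tower
            nn.ModuleList([mlp(in_dim, feat_dim) for _ in range(towers_per_task)])
            for _ in range(num_tasks)])
        self.gates = nn.ModuleList([         # one gating network per tower
            nn.ModuleList([TowerGate(in_dim) for _ in range(towers_per_task)])
            for _ in range(num_tasks)])
        self.towers = nn.ModuleList([        # tower expert networks, parameters differ
            nn.ModuleList([nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())
                           for _ in range(towers_per_task)])
            for _ in range(num_tasks)])

    def forward(self, x):
        shared = self.shared(x)
        task_outputs, tower_outputs = [], []
        for t in range(len(self.task_specific)):
            task_feat = self.task_specific[t](x)
            outs = [self.towers[t][k](self.gates[t][k](
                        x, [shared, task_feat, self.tower_feats[t][k](x)]))
                    for k in range(len(self.towers[t]))]
            tower_outputs.append(outs)
            # weighted fusion of one task's towers (uniform weights assumed here)
            task_outputs.append(torch.stack(outs, dim=0).mean(dim=0))
        return task_outputs, tower_outputs
```

  • In this sketch the towers of one task are fused with uniform weights; the weighted fusion described in the publication could equally use learned fusion weights.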
  • the above method, before obtaining the input data, further includes: iteratively training an initial model to obtain the recommendation model, where the structure of the initial model is the same as that of the recommendation model;
  • in each iteration, the training sample is used as the input of the initial model to produce a first output result; a first loss value between the label of the training sample and the first output result is obtained; multiple second output results output by the multiple tower expert networks corresponding to the tasks are obtained; multiple second loss values between the first output result and the multiple second output results are obtained; and the initial model is updated according to the first loss value and the second loss values to obtain the initial model after the current iteration.
  • the loss value between the overall output result of the recommendation model and the output result of each sub-network can be used as a constraint to update each tower expert network, so that the output of each sub-network is closer to the overall output of the recommendation model, which improves the convergence speed and enables efficient model training.
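  • The following is a hedged sketch of this training objective: a first loss between the fused model outputs and the labels, plus second losses pulling each tower expert's output toward the overall fused output. The use of binary cross-entropy, MSE for the second losses, detaching the fused output, and the weight alpha are all assumptions for illustration; it consumes the outputs of the hypothetical MultiTowerModel sketched above.

```python
import torch
import torch.nn.functional as F

def training_loss(task_outputs, tower_outputs, labels, alpha=0.1):
    """task_outputs: list of (B, 1) fused per-task predictions.
    tower_outputs: list (per task) of lists of (B, 1) tower predictions.
    labels: (B, num_tasks) float tensor of 0/1 labels."""
    # First loss: fused prediction of each task vs. the sample labels.
    first_loss = sum(
        F.binary_cross_entropy(out.squeeze(-1), labels[:, t])
        for t, out in enumerate(task_outputs))
    # Second losses: each tower's output vs. the (detached) fused output,
    # used as a constraint so every tower tracks the overall prediction.
    second_loss = sum(
        F.mse_loss(tower, task_outputs[t].detach())
        for t, towers in enumerate(tower_outputs) for tower in towers)
    return first_loss + alpha * second_loss
```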
  • multiple tasks include predicting click-through rates and predicting conversion information.
  • the click-through rate is the probability that a user clicks on the target object.
  • the conversion information includes the conversion rate or conversion duration.
  • the conversion rate is the probability that the user performs a conversion operation on the target object after clicking on it.
  • the conversion duration includes the length of time the user stays on the target object after clicking on the target object and performing a conversion operation on the target object.
  • the recommendation model provided by this application can be used to perform multiple tasks, such as predicting click-through rates and conversion information, thereby accurately predicting suitable recommendation objects for users and improving user experience.
  • this application provides a training method, including: obtaining a training set, which includes multiple samples and the labels corresponding to each sample; and using the training set as the input of an initial model to iteratively train the initial model to obtain a recommendation model;
  • the recommendation model is used to perform multiple tasks for recommending users.
  • the recommendation model includes a shared feature extraction network, multiple tower expert networks corresponding to each task, and a task-specific feature extraction network corresponding to each task.
  • the output end of the shared feature extraction network is connected to the input end of each tower expert network, and the input ends of multiple tower expert networks corresponding to each task are also connected to the output end of the task-specific feature extraction network corresponding to each task;
  • in each iteration, the samples in the training set are used as the input of the model obtained in the previous iteration; the first loss value between the first output result of the model obtained in the previous iteration and the label of the input sample is obtained; the second loss value between the second output result of each tower expert network and the first output result is obtained; and the model obtained in the previous iteration is updated according to the second loss value and the first loss value to obtain the model of the current iteration.
  • the loss value between the output result of the model and the output result of each tower expert is calculated, and this loss value is used as a constraint to update each tower expert, so that the output result of each tower expert is closer to the overall output result of the model, which can speed up the convergence of the model and achieve efficient training.
  • the recommendation model also includes a tower feature extraction network corresponding to multiple tower expert networks.
  • the input end of each tower expert network is also connected to the output end of the corresponding tower feature extraction network.
  • the tower feature extraction network is used to extract features from the input data that are related to the tasks performed by the corresponding tower expert network, and the parameters of the tower feature extraction networks corresponding to different tower expert networks are different.
  • a separate feature extraction network is set up for each tower, so that the required features can be extracted for each tower expert network in a targeted manner, thereby further improving the accuracy of the output results of the recommendation model.
  • the recommendation model also includes multiple gating networks; each tower expert network corresponds to a gating network, and the gating network is used to fuse the outputs of the corresponding task-specific feature extraction network, shared feature extraction network and tower feature extraction network as the input of the corresponding tower expert network.
  • the weights of the various features input to each tower expert network are controlled through the gating network, so that the required features can be adaptively extracted for different tower expert networks and the output accuracy of each tower expert network can be improved.
  • multiple tasks include predicting click-through rates and predicting conversion information.
  • the click-through rate is the probability that a user clicks on the target object.
  • the conversion information includes the conversion rate or conversion duration.
  • the conversion rate is the probability that the user performs a conversion operation on the target object after clicking on it.
  • the conversion duration includes the length of time the user stays on the target object after clicking on the target object and performing a conversion operation on the target object.
  • the recommendation model provided by this application can be used to perform multiple tasks, such as predicting click-through rates and conversion information, thereby accurately predicting suitable recommendation objects for users and improving user experience.
  • this application provides a recommendation device, including:
  • the acquisition module is used to obtain input data, which includes user information
  • the recommendation module is used to use the input data as the input of the recommendation model and output recommendation information for users;
  • the recommendation model is used to perform multiple tasks for recommending users.
  • the recommendation model includes a shared feature extraction network, multiple tower expert networks corresponding to each task, and a task-specific feature extraction network corresponding to each task.
  • the output end of the shared feature extraction network is connected to the input end of each tower expert network, and the input ends of multiple tower expert networks corresponding to each task are also connected to the output end of the task-specific feature extraction network corresponding to each task;
  • the parameters of multiple tower expert networks are different.
  • the shared feature extraction network is used to extract shared features from the input data.
  • the shared features are shared by the tower expert networks corresponding to multiple tasks.
  • the task-specific feature extraction network is used to extract tower-expert shared features from the input data.
  • the tower-expert shared features are shared by the multiple tower expert networks corresponding to a single task.
  • the multiple tower expert networks are used to perform their corresponding tasks based on the features extracted by the corresponding task-specific feature extraction network and the shared feature extraction network, and the outputs of the multiple tower expert networks corresponding to the multiple tasks are weighted and fused to obtain the recommendation information.
  • the recommendation model also includes tower feature extraction networks corresponding to the multiple tower expert networks.
  • the tower feature extraction network is used to extract features from the input data related to the tasks performed by the corresponding tower expert network; the parameters of the tower feature extraction networks corresponding to different tower expert networks are different, and the input of each tower expert network also includes the features extracted by the corresponding tower feature extraction network.
  • the recommendation model also includes multiple gating networks; each tower expert network corresponds to a gating network, and the gating network is used to fuse the outputs of the corresponding task-specific feature extraction network, shared feature extraction network and tower feature extraction network as the input of the corresponding tower expert network.
  • the device further includes: a training module, which is also used to iteratively train the initial model to obtain the recommendation model, where the structure of the initial model is the same as that of the recommendation model;
  • in each iteration, the training sample is used as the input of the initial model to produce a first output result; a first loss value between the label of the training sample and the first output result is obtained; multiple second output results output by the multiple tower expert networks corresponding to the tasks are obtained; multiple second loss values between the first output result and the multiple second output results are obtained; and the initial model is updated according to the first loss value and the second loss values to obtain the initial model after the current iteration.
  • multiple tasks include predicting click-through rates and predicting conversion information.
  • the click-through rate is the probability that a user clicks on the target object.
  • the conversion information includes the conversion rate or conversion duration.
  • the conversion rate is the probability that the user performs a conversion operation on the target object after clicking on it.
  • the conversion duration includes the length of time the user stays on the target object after clicking on the target object and performing a conversion operation on the target object.
  • this application provides a training device, including:
  • the acquisition module is used to obtain the training set, which includes multiple samples and the labels corresponding to each sample;
  • the training module is used to use the training set as the input of the initial model to iteratively train the initial model to obtain the recommendation model;
  • the recommendation model is used to perform multiple tasks for recommending users.
  • the recommendation model includes a shared feature extraction network, multiple tower expert networks corresponding to each task, and a task-specific feature extraction network corresponding to each task.
  • the output end of the shared feature extraction network is connected to the input end of each tower expert network, and the input ends of the multiple tower expert networks corresponding to each task are also connected to the output end of the task-specific feature extraction network corresponding to that task. In each iteration, the samples in the training set are used as the input of the model obtained in the previous iteration, the first loss value between the first output result of the model obtained in the previous iteration and the label of the input sample is obtained, the second loss value between the second output result of each tower expert network and the first output result is obtained, and the model obtained in the previous iteration is updated according to the second loss value and the first loss value to obtain the model of the current iteration.
  • the recommendation model also includes a tower feature extraction network corresponding to multiple tower expert networks.
  • the tower feature extraction network is used to extract, from the input data, features relevant to the tasks performed by the corresponding tower expert network, and the parameters of the tower feature extraction networks corresponding to different tower expert networks are different.
  • the recommendation model also includes multiple gating networks; each tower expert network corresponds to a gating network, and the gating network is used to fuse the outputs of the corresponding task-specific feature extraction network, shared feature extraction network and tower feature extraction network as the input of the corresponding tower expert network.
  • multiple tasks include predicting click-through rates and predicting conversion information.
  • the click-through rate is the probability that a user clicks on the target object.
  • the conversion information includes the conversion rate or conversion duration.
  • the conversion rate is the probability that the user performs a conversion operation on the target object after clicking on it.
  • the conversion duration includes the length of time the user stays after clicking on the target object and performing a conversion operation on the target object.
  • this application provides a recommendation model that is used to perform multiple tasks for recommending to users.
  • the recommendation model includes a shared feature extraction network, multiple tower expert networks corresponding to each task, and a task-specific feature extraction network corresponding to each task. The output end of the shared feature extraction network is connected to the input end of each tower expert network, and the input ends of the multiple tower expert networks corresponding to each task are also connected to the output end of the task-specific feature extraction network corresponding to that task.
  • the parameters of the multiple tower expert networks are different; the shared feature extraction network is used to extract features from the input data, and the task-specific feature extraction network is used to extract features from the input data that are related to each task.
  • the multiple tower expert networks are used to perform their corresponding tasks based on the features extracted by the task-specific feature extraction network and the shared feature extraction network.
  • the outputs of the multiple tower expert networks corresponding to the multiple tasks are weighted and fused to obtain the recommendation information.
  • multiple tower expert networks with different parameters are set up for each task, thereby improving the output accuracy of the recommendation model through the output results of the multiple tower expert networks. Even in the case of sparse data, more stable output results can be obtained through the output results of multiple tower expert structures.
  • the recommendation model also includes a tower feature extraction network corresponding to multiple tower expert networks.
  • the tower feature extraction network is used to extract, from the input data, features relevant to the tasks performed by the corresponding tower expert network, and the parameters of the tower feature extraction networks corresponding to different tower expert networks are different.
  • a separate feature extraction network is set up for each tower, so that the required features can be extracted for each tower expert network in a targeted manner, thereby further improving the accuracy of the output results of the recommendation model.
  • the recommendation model also includes multiple gating networks; each tower expert network corresponds to a gating network, and the gating network is used to fuse the outputs of the corresponding task-specific feature extraction network, shared feature extraction network and tower feature extraction network as the input of the corresponding tower expert network.
  • the weights of the various features input to each tower expert network are controlled through the gating network, so that the required features can be adaptively extracted for different tower expert networks and the output accuracy of each tower expert network can be improved.
  • multiple tasks include predicting click-through rates and predicting conversion information.
  • the click-through rate is the probability that a user clicks on the target object.
  • the conversion information includes the conversion rate or conversion duration.
  • the conversion rate is the probability that the user performs a conversion operation on the target object after clicking on it.
  • the conversion duration includes the length of time the user stays on the target object after clicking on the target object and performing a conversion operation on the target object.
  • the recommendation model provided by this application can be used to perform multiple tasks, such as predicting click-through rates and conversion information, thereby accurately predicting suitable recommendation objects for users and improving user experience.
  • embodiments of the present application provide a recommendation device, including: a processor and a memory, where the processor and the memory are interconnected through lines, and the processor calls the program code in the memory to execute processing-related functions in the recommendation method shown in any one of the above first aspects.
  • embodiments of the present application provide a training device, including: a processor and a memory, where the processor and the memory are interconnected through lines, and the processor calls the program code in the memory to execute processing-related functions in the training method shown in any one of the above second aspects.
  • embodiments of the present application provide an electronic device, including: a processor and a memory, where the processor and the memory are interconnected through lines, and the processor calls the program code in the memory to execute processing-related functions in the recommendation method shown in any one of the above first aspects.
  • embodiments of the present application provide a recommendation device.
  • the recommendation device may also be called a digital processing chip or chip.
  • the chip includes a processing unit and a communication interface.
  • the processing unit obtains program instructions through the communication interface.
  • the program instructions are executed by the processing unit, and the processing unit is configured to perform processing-related functions in the above first aspect or any optional implementation of the first aspect.
  • embodiments of the present application provide a training device.
  • the training device may also be called a digital processing chip or chip.
  • the chip includes a processing unit and a communication interface.
  • the processing unit obtains program instructions through the communication interface.
  • the program instructions are executed by the processing unit, and the processing unit is configured to perform processing-related functions in the above second aspect or any optional implementation of the second aspect.
  • embodiments of the present application provide a computer-readable storage medium that includes instructions that, when run on a computer, cause the computer to execute the method in any optional implementation of the first aspect or the second aspect.
  • embodiments of the present application provide a computer program product containing instructions that, when run on a computer, cause the computer to execute the method in any optional implementation of the first aspect or the second aspect.
  • Figure 1 is a schematic diagram of an artificial intelligence subject framework applied in this application
  • FIG. 2 is a schematic diagram of a system architecture provided by this application.
  • FIG. 3 is a schematic diagram of another system architecture provided by this application.
  • Figure 4 is a schematic diagram of an application scenario provided by this application.
  • Figure 5 is a schematic structural diagram of a recommendation model provided by this application.
  • Figure 6 is a schematic structural diagram of another recommendation model provided by this application.
  • FIG. 7 is a schematic structural diagram of a gating network provided by this application.
  • Figure 8 is a schematic flow chart of a training method provided by this application.
  • Figure 9 is a schematic flow chart of another training method provided by this application.
  • Figure 10 is a schematic flow chart of a recommendation method provided by this application.
  • Figure 11 is a schematic diagram of another application scenario provided by this application.
  • Figure 12 is a schematic diagram of another application scenario provided by this application.
  • Figure 13 is a schematic structural diagram of a recommendation device provided by this application.
  • Figure 14 is a schematic structural diagram of a training device provided by this application.
  • Figure 15 is a schematic structural diagram of another recommendation device provided by this application.
  • Figure 16 is a schematic structural diagram of another training device provided by this application.
  • Figure 17 is a schematic structural diagram of a chip provided by this application.
  • Artificial intelligence (AI)
  • AI is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and produce a new class of intelligent machines that can respond in a manner similar to human intelligence.
  • Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
  • Figure 1 shows a structural schematic diagram of the artificial intelligence main framework.
  • the above artificial intelligence theme framework is elaborated below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has gone through the condensation process of "data-information-knowledge-wisdom".
  • the "IT value chain” reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of human intelligence and information (providing and processing technology implementation) to the systematic industrial ecological process.
  • Infrastructure provides computing power support for artificial intelligence systems, enables communication with the external world, and supports it through basic platforms.
  • computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA, etc.);
  • the basic platform includes distributed computing frameworks, networks and other related platform guarantees and support, which can include cloud storage and computing, interconnection networks, etc.
  • sensors communicate with the outside world to obtain data, which are provided to smart chips in the distributed computing system provided by the basic platform for calculation.
  • The data at the layer above the infrastructure is used to represent the data sources of the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formal information to perform machine thinking and problem solving based on reasoning control strategies. Typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of further data processing, such as algorithms or a general system, for example translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. Their application fields mainly include: intelligent terminals, intelligent transportation, smart healthcare, autonomous driving, smart cities, etc.
  • the embodiments of this application involve related applications of neural networks.
  • the relevant terms and concepts of neural networks that may be involved in the embodiments of this application are first introduced below.
  • the neural network can be composed of neural units.
  • the neural unit can refer to an operation unit that takes x_s as input, and the output of the operation unit can be:

$$f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

  • where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to perform nonlinear transformation on the features obtained in the neural network and convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
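  • For instance, the commonly used sigmoid activation can be written as:

$$f(x) = \frac{1}{1 + e^{-x}}$$

  • which squashes any real-valued input into the interval (0, 1).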
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • Deep neural network (DNN), also known as multi-layer neural network, can be understood as a neural network with multiple hidden layers.
  • DNN is divided according to the positions of the different layers: the layers of a DNN can be divided into three categories: input layer, hidden layer, and output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in between are hidden layers.
  • the layers are fully connected, that is to say, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • Although a DNN looks very complicated, the work of each layer is actually not complicated. Simply put, it is the following linear relationship expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called coefficients), and $\alpha()$ is the activation function.
  • Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is also large.
  • These parameters are defined in a DNN as follows, taking the coefficient $W$ as an example: in a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as $W_{24}^{3}$, where the superscript 3 represents the layer in which the coefficient $W$ is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.
  • In summary, the coefficient from the k-th neuron in layer $L-1$ to the j-th neuron in layer $L$ is defined as $W_{jk}^{L}$.
  • the input layer has no W parameter.
  • more hidden layers make the network more capable of describing complex situations in the real world. Theoretically, a model with more parameters has higher complexity and greater "capacity", which means it can complete more complex learning tasks.
  • Training a deep neural network is the process of learning the weight matrix. The ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (a weight matrix formed by the vectors W of many layers).
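  • As a one-line illustration of this learning process (a standard form, not quoted from the publication), one gradient-descent update of a layer's weight matrix $W$ with learning rate $\eta$ and loss $L$ is:

$$W \leftarrow W - \eta\,\frac{\partial L}{\partial W}$$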
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor consisting of a convolutional layer and a subsampling layer, which can be regarded as a filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • a neuron can be connected to only some of the neighboring layer neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as extracting features in a way that is independent of location.
  • the convolution kernel can be formalized as a matrix of random size. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the loss function can usually include mean squared error, cross entropy, logarithmic, exponential and other loss functions. For example, the mean squared error can be used as the loss function, defined as $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. The specific loss function can be selected according to the actual application scenario.
  • the convolutional neural network can use the error back propagation (BP) algorithm to correct the size of the parameters in the initial network model during the training process, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, forward propagation of the input signal until the output will produce an error loss, and backward propagation of the error loss information is used to update the parameters in the initial model, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain optimal model parameters, such as weight matrices.
  • in the pre-training stage or the noise processing stage, the BP algorithm can be used to train the model and obtain the trained model.
  • Stochastic gradient: the number of samples in machine learning is very large, so the loss function is calculated each time from randomly sampled data, and the corresponding gradient is called a stochastic gradient.
  • Embedding refers to the feature representation of the sample, usually the penultimate layer of the neural network.
  • Automatic machine learning refers to designing a series of advanced control systems to operate machine learning models so that the models can automatically learn appropriate parameters and configurations without manual intervention.
  • automatic machine learning mainly includes network architecture search and global parameter setting. Network architecture search is used to allow the computer to generate, based on the data, the neural network architecture best suited to the problem; it has the characteristics of high training complexity and large performance improvement.
  • (10) Recommendation system uses machine learning algorithms to analyze and learn based on the user's historical click behavior data, and then predicts the user's new requests and returns a personalized item recommendation list.
  • CTR: click-through rate.
  • Conversion rate refers to the probability that a user will convert a clicked display item in a specific environment. For example, if the user clicks on the icon of an APP, conversion refers to downloading, installation, registration and other behaviors.
  • Transfer learning: using existing knowledge to assist in learning new knowledge.
  • the core is to find the similarity between existing knowledge and new knowledge.
  • Multi-task learning: putting multiple related tasks together and learning multiple tasks at the same time.
  • Ensemble learning: ensemble learning methods use multiple learning algorithms to obtain better prediction performance than using any individual learning algorithm alone.
  • Model convergence: after multiple rounds of iteration of the model, the error between the model's predicted value and the actual value is less than a preset small value.
  • Generalizability refers to the adaptability of the machine learning system to fresh samples.
  • the purpose of machine learning is to learn the rules hidden behind the data. For data outside the learning set with the same rules, the trained network can also give appropriate output. This ability is called generalization ability.
  • Robustness refers to the ability of a machine learning system to handle errors during execution, and the ability of the algorithm to continue to run normally when encountering abnormal input, operations or other exceptions.
  • the recommendation method provided by the embodiment of this application can be executed on the server and can also be executed on the terminal device.
  • the terminal device may be a mobile phone with an image processing function, a tablet personal computer (TPC), a media player, a smart TV, a laptop computer (LC), a personal digital assistant (PDA), a personal computer (PC), a camera, a camcorder, a smart watch, a wearable device (WD) or a self-driving vehicle, etc.; the embodiments of this application are not limited to this.
  • this embodiment of the present application provides a system architecture 200.
  • data collection device 260 may be used to collect training data.
  • the training data is stored in the database 230.
  • the training device 220 trains to obtain the target model/rule 201 based on the training data maintained in the database 230.
  • the training device 220 processes the multi-frame sample images and outputs corresponding prediction labels, calculates the loss between the prediction labels and the original labels of the samples, and updates the network based on the loss until the prediction label is close to the original label of the sample or the difference between the prediction label and the original label is less than a threshold, thereby completing the training of the target model/rule 201.
  • the target model/rule 201 in the embodiment of this application may specifically be a neural network.
  • the training data maintained in the database 230 may not necessarily be collected by the data collection device 260, but may also be received from other devices.
  • the training device 220 does not necessarily perform training of the target model/rules 201 based entirely on the training data maintained by the database 230. It may also obtain training data from the cloud or other places for model training.
  • the above description should not be regarded as a limitation on the embodiments of this application.
  • the target model/rules 201 trained according to the training device 220 can be applied to different systems or devices, such as the execution device 210 shown in Figure 2.
  • the execution device 210 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop, an augmented reality (AR)/virtual reality (VR) device, a vehicle terminal, a TV, etc.; it can also be a server, a cloud, etc.
  • the execution device 210 is configured with a transceiver 212, which may include an input/output (I/O) interface or other wireless or wired communication interfaces for data interaction with external devices. Taking the I/O interface as an example, the user can input data to the I/O interface through the client device 240.
  • When the execution device 210 preprocesses input data, or when the calculation module of the execution device 210 performs calculation and other related processing, the execution device 210 can call data, code, etc. in the data storage system 250 for corresponding processing, and the data, instructions, etc. obtained by the corresponding processing can also be stored in the data storage system 250.
  • the I/O interface 212 returns the processing results to the client device 240, thereby providing them to the user.
  • the training device 220 can generate corresponding target models/rules 201 based on different training data for different goals or different tasks, and the corresponding target models/rules 201 can be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired results.
  • the user can manually enter the input data, which can be operated through the interface provided by the transceiver 212.
  • the client device 240 can automatically send the input data to the transceiver 212. If automatically sending the input data requires the user's authorization, the user can set corresponding permissions in the client device 240. The user can view the results output by the execution device 210 on the client device 240, and the specific presentation form may be display, sound, action, etc.
  • the client device 240 can also be used as a data collection end to collect the input data of the input transceiver 212 and the output result of the output transceiver 212 as new sample data, and store them in the database 230.
  • alternatively, the transceiver 212 can directly store the input data input to the transceiver 212 and the corresponding output results as new sample data into the database 230, as shown in the figure.
  • Figure 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 250 is an external memory relative to the execution device 210. In other cases, the data storage system 250 can also be placed in the execution device 210.
  • a target model/rule 201 is obtained through training based on the training device 220.
  • the target model/rule 201 may be a recommendation model in the present application.
  • the system architecture for the application of the neural network training method provided by this application can be shown in Figure 3.
  • the server cluster 310 is implemented by one or more servers, and optionally cooperates with other computing devices, such as data storage, routers, load balancers and other devices.
  • the server cluster 310 can use the data in the data storage system 250 or call the program code in the data storage system 250 to implement the steps of the neural network training method provided by this application.
  • Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, game console, etc.
  • Each user's local device can interact with the server cluster 310 through a communication network of any communication mechanism/standard.
  • the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • the communication network may include a wireless network, a wired network, or a combination of a wireless network and a wired network, etc.
  • the wireless network includes but is not limited to: fifth-generation mobile communication technology (5th-Generation, 5G) system, long term evolution (LTE) system, global system for mobile communication (GSM) or code division Multiple access (code division multiple access, CDMA) network, wideband code division multiple access (WCDMA) network, wireless fidelity (wireless fidelity, WiFi), Bluetooth (bluetooth), Zigbee protocol (Zigbee), Any one or a combination of radio frequency identification technology (radio frequency identification, RFID), long range (Lora) wireless communication, and near field communication (NFC).
  • the wired network may include an optical fiber communication network or a network composed of coaxial cables.
  • one or more aspects of the execution device 210 may be implemented by each local device, for example, the local device 301 may provide local data or feedback calculation results to the execution device 210 .
  • the execution device 210 can also be implemented by local devices.
  • the local device 301 implements the functions of the execution device 210 and provides services for its own users, or provides services for users of the local device 302 .
  • a machine learning system can include a personalized recommendation system, which can train the parameters of a machine learning model through optimization methods such as gradient descent based on input data and labels. After the model parameters converge, the model can be used to complete the prediction of unknown data.
  • the input data includes user characteristics, item characteristics and context characteristics. How to predict personalized recommendation lists based on user preferences has an important impact on improving the user experience of the recommendation system and platform revenue.
  • the recommendation process can be shown in Figure 4, which can be divided into a training part and an online inference part.
  • the training set includes input data and corresponding labels.
  • the training set can include data such as apps that users have clicked on and apps that were clicked and converted.
  • The training set is input into the initial model, the parameters of the machine learning model are trained through optimization methods such as gradient descent, and the recommendation model is obtained.
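  • A hedged sketch of this training loop, reusing the hypothetical MultiTowerModel and training_loss from the earlier sketches (train_loader is an assumed DataLoader yielding feature batches and float 0/1 labels):

```python
import torch

model = MultiTowerModel(in_dim=64, num_tasks=2, towers_per_task=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # gradient-descent style optimizer

for x, labels in train_loader:                 # x: (B, 64), labels: (B, 2)
    task_outputs, tower_outputs = model(x)     # forward pass
    loss = training_loss(task_outputs, tower_outputs, labels)
    optimizer.zero_grad()
    loss.backward()                            # backpropagation
    optimizer.step()                           # parameter update
```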
  • the recommendation model can be deployed on the recommendation platform, such as in a server or terminal.
  • the server is used as an example.
  • the server can be used to output a recommendation list for the user.
  • in the APP recommendation scenario, the home page of the user's terminal displays icons of APPs recommended for the user, or, after the user clicks on an APP, icons of related recommended APPs can be displayed.
  • Conversion rate estimation faces the following two challenges: Sample Selection Bias: Conversion rate estimation model training is performed on the post-click sample space, while prediction needs to be performed on the exposure sample space. Data sparsity: The positive sample label of the conversion rate prediction model is conversion, and the negative sample label is click. The number of positive samples is greatly reduced compared to the click rate prediction model.
  • Some strategies can alleviate these two problems, such as sampling unclicked samples from the exposure set as negative examples to alleviate sample selection bias, oversampling converted samples to alleviate data sparseness, etc.
  • however, none of these methods substantially solves either of the above problems. Click and conversion are two strongly related consecutive behaviors; multi-task learning models these two tasks at the same time so that training and prediction can be performed over the entire sample space, and it is a mainstream solution in the industry.
  • most of the existing multi-task learning methods use hard parameter sharing mechanisms. Multi-task learning may bring about negative transfer, that is, information sharing between tasks will affect the performance of the network.
  • data sparseness makes the conversion rate prediction model prone to overfitting.
  • Entire space multi-task model (ESMM).
  • ESMM adopts a parameter sharing structure of shared Embedding.
  • the CTR task and the CVR task use the same features and feature embedding.
  • ESMM uses the supervision information of CTCVR and CTR to train the network and learn CVR implicitly.
  • ESMM only shares information in the embedding layer, and the CVR task still faces the problem of data sparseness and hard parameter sharing, which is not conducive to parameter learning.
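  • For reference, the full-space factorization that ESMM exploits (standard in the ESMM literature; y denotes click and z denotes conversion) is:

$$\underbrace{p(y=1, z=1 \mid x)}_{pCTCVR} \;=\; \underbrace{p(y=1 \mid x)}_{pCTR} \;\times\; \underbrace{p(z=1 \mid y=1, x)}_{pCVR}$$

  • so that pCVR is learned implicitly as pCTCVR/pCTR over the exposure sample space.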
  • the multi-task learning model (AITM) based on the automatic information transfer framework is a multi-task model used for conversion rate estimation with sequence dependencies.
  • AITM outputs the input features to multiple tower networks through task-shared Embedding.
  • the AIT module uses the vector output by the current task tower and the information from the previous task to learn how to fuse information between tasks.
  • the attention mechanism is used to automatically assign weights to the transferred information and the original information.
  • the transferred information is learned through functions, which can be a simple fully connected layer to learn what information should be transferred between two adjacent tasks.
  • AITM constrains the output of the probability to satisfy sequence dependence as much as possible by applying a calibrator in the loss function, that is, the probability of the task output at the end of the sequence should be smaller than the probability of the task output at the front of the sequence.
  • AITM only shares information in the embedding layer, and the CVR task still faces the problem of data sparseness and hard parameter sharing, which is not conducive to parameter learning.
  • although the sequence relationship calibrator can play a regularizing role, pCVR is the value obtained by pCTCVR/pCTR, and the division makes the prediction results unstable.
  • a novel multi-task learning model for personalized recommendations (PLE).
  • the underlying network of PLE mainly consists of shared expert networks (shared experts) and task-specific expert networks (task-specific experts), and the upper layer is composed of multi-task tower networks.
  • the input of each multi-task tower network is weighted and controlled by a gating network.
  • the input of each sub-task's gating network includes two parts: the task-specific expert network of that task and the shared expert network of the shared part; the input feature vector serves as the selector of the gating network.
  • the CTR and CVR tasks are related; that is, only clicked samples may be converted.
  • this application provides a multi-task learning framework based on hierarchical hybrid experts.
  • by setting a hybrid expert structure at both the bottom layer (feature representation layer) and the tower structure (feature interaction layer) where the task is located, hierarchical learning of feature representation and feature interaction is achieved, which can make full use of the correlation between tasks to help the conversion rate estimation task achieve better recommendation results.
  • the recommendation methods and neural network training methods provided by this application are described in detail below.
  • the recommendation model can be used to perform multiple tasks (P tasks are taken as an example in Figure 5).
  • the multiple tasks are related tasks recommended for users.
  • Each task corresponds to multiple tower expert networks, and each task also corresponds to one or more task-specific feature extraction networks; the multiple tasks correspond to one or more shared feature extraction networks.
  • that is, the recommendation model may include multiple tower expert networks corresponding to each task, one or more task-specific feature extraction networks corresponding to each task, and one or more shared feature extraction networks, in which the parameters of each tower expert network are different.
  • the parameters can include internal parameters of each operation in the tower expert network, such as the internal parameters of the convolution kernel.
  • the parameters can also include the weight parameters of the output of each operation, etc.
  • the shared feature extraction network can be used to extract features from the input data.
  • the features output by it are called shared features and serve as a common input for each tower expert network; that is, the shared features are shared by the tower expert networks corresponding to the multiple tasks.
  • the task-specific feature extraction network can be used to extract features for the corresponding task.
  • its output features are called tower-expert shared features and serve as the shared input of the multiple tower expert networks of the corresponding single task; that is, the tower-expert shared features are shared by the multiple tower expert networks corresponding to a single task.
  • the recommendation model can be used to perform multiple tasks.
  • the multiple tasks can be tasks related to recommending users.
  • the multiple tasks can be associated or not.
  • the multiple tasks may include predicting click-through rates, predicting conversion information, etc.
  • the conversion information may include information such as conversion rate or conversion duration.
  • the conversion rate refers to the probability that the user further converts the object after clicking on it, and the conversion duration is the length of time the user stays on the object after clicking on it and further converting it.
  • the tower expert network is used to perform the corresponding task. For example, if the task includes click-through rate prediction, the tower expert network can be used to predict the click-through rate based on the input features to obtain the click-through rate prediction result; if the task includes conversion rate prediction, the tower expert network can be used to predict the conversion rate based on the input features to obtain the conversion rate prediction result; if the task is target recognition, the tower expert network can be used to perform target recognition based on the input features to obtain information about the identified target, etc., which can be adjusted according to the actual application scenario.
  • one or more tower feature extraction networks can be set up for each tower expert network.
  • The input end of each tower expert network is also connected to the output end of the corresponding tower feature extraction network.
  • The one or more tower feature extraction networks are used to extract the features required by each tower, and the parameters of the tower feature extraction networks corresponding to different towers are different, so that features can be extracted adaptively for each tower, improving the accuracy of the final output of the recommendation model.
  • the features output by the tower feature extraction network are referred to as tower-specific features below.
  • a gating network can also be set at the input end of each tower expert network.
  • The gating network is used to fuse the outputs of the task-specific feature extraction network, the shared feature extraction network, and the tower feature extraction network to obtain the input of each tower expert network.
  • the features extracted by each feature extraction network can be input to each tower expert network in an appropriate proportion, so that the tower expert network can output accurate task output results.
  • the input data can be converted into a feature representation through the Embedding layer, such as into a feature vector, so that the subsequent feature extraction network can extract the required features from the feature representation.
  • the structure of the gating network can be shown in Figure 7.
  • the gated network may include a fully connected layer and a softmax layer.
  • The task-specific features extracted by the task-specific feature extraction network include the multi-dimensional features required by the tower expert network. The feature representation of the input data can therefore be used as the input of the fully connected layer, which helps the tower expert network extract the required features, and the softmax layer maps the output of the fully connected layer into weight values that subsequent networks can use. The tower-specific features, shared features and task-specific features are then weighted and fused according to the weight values output by the softmax layer to obtain the fused feature, which serves as the input of the tower expert network.
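  • As an illustration only, the following is a minimal sketch of such a gating network (assuming PyTorch; the module name, dimensions and the fixed choice of three feature sources are hypothetical, not prescribed by this application):

      import torch
      import torch.nn as nn

      class TowerGate(nn.Module):
          # Maps the input feature representation to softmax weights over the
          # three feature sources (tower-specific, task-specific, shared) and
          # returns their weighted fusion as the tower expert network's input.
          def __init__(self, input_dim, num_sources=3):
              super().__init__()
              self.fc = nn.Linear(input_dim, num_sources)  # fully connected layer

          def forward(self, x0, source_feats):
              # x0: (batch, input_dim) feature representation of the input data
              # source_feats: list of (batch, d) tensors, one per feature source
              weights = torch.softmax(self.fc(x0), dim=-1)         # (batch, 3)
              stacked = torch.stack(source_feats, dim=1)           # (batch, 3, d)
              return (weights.unsqueeze(-1) * stacked).sum(dim=1)  # (batch, d)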
  • In the recommendation model provided by this application, multiple tower expert networks are set up for each task, and the parameters of each tower expert network are different.
  • multiple output results of multiple tower expert networks can be used to improve the output stability of the model and avoid overfitting. Even if there is a problem of data sparseness, the recommendation model provided by this application can stably output recommendation results and improve user experience.
  • The method provided by this application can be divided into two parts, namely the training part and the online inference part, which are introduced separately below.
  • Figure 8 is a schematic flow chart of a training method provided by this application, as described below.
  • the training set may be the collected historical input data of one or more users, or data received from other servers or clients. It can be understood that the training set may include multiple samples and labels corresponding to each sample.
  • the data types in the training set are related to the tasks performed by the recommendation model.
  • the data required for training different tasks may be different, and the details can be adjusted according to the actual application scenario.
  • The training set can include information about apps that a large number of users have clicked, such as the app name, application type and application style, as well as further operations after clicking on the app, such as downloading, installation, registration and other conversion operations.
  • The training set can include information about the music clicked by a large number of users, such as the music type and singer information, as well as further operations after clicking on the music, such as playback, downloading and other conversion operations.
  • the initial model may be a constructed model, or an existing model structure may be used as the initial model.
  • the structure of the initial model can be referred to the aforementioned Figure 5 or Figure 6.
  • the recommendation model can be used to perform multiple tasks.
  • the multiple tasks are related tasks for recommending to users.
  • Each task corresponds to multiple tower expert networks, each task also corresponds to one or more task-specific feature extraction networks, and the multiple tasks correspond to one or more shared feature extraction networks.
  • That is, the recommendation model can include multiple tower expert networks corresponding to each task, one or more task-specific feature extraction networks corresponding to each task, and one or more shared feature extraction networks, where the parameters of each tower expert network are different. The parameters can include the internal parameters of each operation in the tower expert network, such as the parameters within the convolution kernel or the parameters within the pooling operation, and can also include the weight parameters of the output of each operation, etc.
  • the recommendation model can be used to perform multiple tasks, and the multiple tasks can be tasks related to recommending users.
  • the multiple tasks can be associated or not associated.
  • the multiple tasks may include predicting click-through rates, predicting conversion information, etc.
  • the conversion information may include information such as conversion rate or conversion duration.
  • The conversion rate refers to the probability that the user further converts the object after clicking on it, and the conversion duration is the length of time the user stays on the object after clicking on it and further converting it.
  • the samples in the training set are used as the input of the initial model obtained in the previous iteration, and the first loss value between the first output result of the initial model obtained in the previous iteration and the label of the input sample is obtained.
  • The second loss value between the second output result of each tower expert network and the first output result is also obtained, and the model obtained in the previous iteration is updated using the first loss value and the second loss values corresponding to the tower expert networks to obtain the model of the current iteration.
  • In other words, the loss value between the output result of each tower expert network and the overall output result of the model can be used as a constraint to update each tower expert, so that the output of each tower expert is closer to the overall output result of the model, making the output of the model more accurate, speeding up the convergence of model training, and achieving efficient training.
  • the tasks performed by the recommendation model may include CVR prediction and CTR prediction.
  • Taking CVR prediction as an example, the update process of the network corresponding to the CVR prediction task can be shown in Figure 9.
  • M towers predict the conversion probability respectively, and the M prediction results are weighted and fused to obtain an ensemble result; this result is then used to calibrate the predicted value of each tower.
  • A loss function combining cross-entropy and KL distance can be used to constrain the update of the tower expert network, so that the output of each tower expert network is closer to the output of the recommendation model.
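  • As a hedged reconstruction based only on the surrounding description (a cross-entropy term against the label plus a KL term pulling each tower expert toward the fused ensemble prediction), the constraint may take a form such as:

      L_{CVR} = \mathrm{CE}\left(y,\ \hat{y}\right) + \lambda \sum_{m=1}^{M} \mathrm{KL}\left(\hat{y}\ \middle\|\ \hat{y}_m\right), \qquad \hat{y} = \sum_{m=1}^{M} w_m\, \hat{y}_m

  • Here y is the conversion label, \hat{y}_m is the prediction of the m-th tower expert, \hat{y} is the weighted ensemble prediction used for calibration, and \lambda is a hypothetical balancing hyperparameter.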
  • The training data is exposure data; the label of the CTCVR task, p(conversion & click = 1 | x), indicates that a sample was both clicked and converted, so the CVR task can be modeled implicitly.
  • the expert ensemble learning results are used to calibrate the prediction value of each expert.
  • The results of multi-expert ensemble learning are more robust, which can in turn calibrate the prediction value of each expert, preventing the experts' learning results from diverging too much while improving the stability of model convergence and the generalization performance of the model.
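  • Purely as an illustrative sketch of this calibration idea (assuming PyTorch; the function, the model interface and the lam weighting are hypothetical), one training iteration could look like:

      import torch
      import torch.nn.functional as F

      def training_step(model, optimizer, x, labels, lam=0.1):
          # model(x) is assumed to return the per-tower conversion probabilities
          # and the fused (ensemble) probability for one task.
          tower_probs, fused_prob = model(x)
          # First loss: cross-entropy between the fused prediction and the label.
          first_loss = F.binary_cross_entropy(fused_prob, labels)
          # Second losses: KL distance of each tower from the detached ensemble,
          # so that tower outputs are pulled toward the overall model output.
          t = fused_prob.detach().clamp(1e-7, 1 - 1e-7)
          second_loss = 0.0
          for p in tower_probs:
              p = p.clamp(1e-7, 1 - 1e-7)
              # KL divergence between two Bernoulli distributions
              kl = t * (t / p).log() + (1 - t) * ((1 - t) / (1 - p)).log()
              second_loss = second_loss + kl.mean()
          loss = first_loss + lam * second_loss
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()
          return loss.item()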
  • the recommendation model also includes a tower feature extraction network that corresponds to multiple tower expert networks.
  • The tower feature extraction network is used to extract, from the input data, features related to the task performed by the corresponding tower expert network, and the tower feature extraction network parameters corresponding to the multiple tower expert networks are different.
  • The tower feature extraction network corresponding to each tower expert network can also be updated during training, so that the output of each tower expert network is closer to the overall output of the model, making the output of the model more accurate, speeding up the convergence of model training, and achieving efficient training.
  • Figure 10 is a schematic flow chart of a recommendation method provided by this application.
  • the input data may be collected user information, or input data received from the client, etc.
  • the input data may include user information.
  • the user information may specifically include user identity information, positioning information, user input data or user-generated historical information, etc.
  • The user identity information includes, for example, the user's name, identification and other information indicating the user's identity.
  • the positioning information can include the coordinates of the user's own location, which can be obtained by the user using the client for positioning;
  • The user input data can include data input by the user, such as the user opening the application market or music software, or clicking on an app or a music icon; historical information generated by the user includes, for example, information about the apps the user has clicked or downloaded and the music played or downloaded.
  • the input data may also include information about the object to be recommended, such as the type of the object to be recommended, a candidate list, and other information.
  • the input data may include the types of objects that need to be recommended for the user.
  • the types of objects recommended for the user include apps, music, and other type information.
  • the input data may directly include an alternative list of objects recommended for the user.
  • the candidate list may include information about multiple apps, so that the recommendation model can be used to subsequently select apps recommended for users from the multiple apps.
  • the candidate list may include information about multiple songs, so that the recommendation model can be used to subsequently select songs recommended for the user from the multiple pieces of music.
  • the information of the object to be recommended can be sent by the client to the server, or can be generated by the server based on locally saved data.
  • the server can pre-set a database of objects that need to be recommended to users.
  • The objects in the database can be used as a candidate list, or the server can set a corresponding type of object to be recommended for each user; after receiving the input data, the information of the object to be recommended can be obtained from the locally saved data based on the user's identity information.
  • the user can directly operate on the client to obtain input data.
  • The client can generate input data based on the user's input operation and send it to the server. If the recommendation method provided by this application is deployed on the server, the user can directly perform input operations through an input device connected to the server, such as clicking on an app or opening music playback software, and the server generates the input data from the data produced through the input device; alternatively, the user can use the client to establish a connection with the server, the user performs input operations on the client, the client transmits the data generated by the user to the server, and the server thereby obtains the input data.
  • In different scenarios, the input data generated are also different.
  • the input data may include data generated by the user's operation of opening the app store.
  • The input data may include the user's identification information, such as the user's name or unique identification number, or include the user's historical data, such as information about apps clicked in the past, for example the number, type, name or identification number of the apps clicked.
  • the input data may include data generated by the user's click on the next song.
  • The input data may include user information, such as the user's name and identification number, as well as historical playback data, such as information about the last music played, for example the name of the music, the name of the singer, and the music style.
  • the input data can be used as the input of the recommendation model to output recommendation information for the user.
  • the types of recommended information may be different in different scenarios.
  • the recommendation information may include information about apps recommended for users, such as app icons, download links, and other information.
  • the recommendation information may include information about music recommended for the user, such as music title, singer, playback entry, and other information.
  • the recommendation model can be used to perform multiple tasks.
  • the multiple tasks can be tasks related to recommending users.
  • the multiple tasks can be associated or not.
  • the multiple tasks may include predicting click-through rates, predicting conversion information, etc.
  • the conversion information may include information such as conversion rate or conversion duration.
  • The conversion rate is the probability that the user will further convert the object after clicking on it; the conversion duration is the length of time the user stays after clicking on the object and further converting it, such as the length of time the user plays videos or music.
  • In the app recommendation scenario, the click-through rate and conversion rate of each app in the candidate list can be output, so that when generating the recommendation list, apps whose click-through rates and conversion rates are higher than a certain value can be filtered out of the candidate list as recommended apps.
  • In the music recommendation scenario, the conversion duration is the length of time the user plays a piece of music after clicking on it, and the music with the highest click-through rate and conversion duration can be filtered from the candidate list as the music recommended to the user.
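  • As a toy illustration only (the field names and thresholds below are hypothetical, not taken from this application), filtering and ranking a candidate list by the predicted scores could look like:

      def rank_candidates(candidates, ctr_min=0.1, cvr_min=0.05, top_k=10):
          # candidates: list of dicts with keys 'item', 'pctr', 'pcvr'
          eligible = [c for c in candidates
                      if c['pctr'] >= ctr_min and c['pcvr'] >= cvr_min]
          # rank by the joint probability of click and conversion
          eligible.sort(key=lambda c: c['pctr'] * c['pcvr'], reverse=True)
          return [c['item'] for c in eligible[:top_k]]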
  • the shared feature extraction network in the recommendation model can extract shared features from the input data.
  • The shared features can include features required by each tower expert network in the recommendation model to perform its task, and the shared features can be input to the tower expert networks.
  • the task-specific feature extraction network in the recommendation model is used to extract task-specific features related to the task from the input data and serve as input to the tower expert network corresponding to the task.
  • The task-specific features are features extracted for the corresponding task, thereby implementing task-specific feature extraction.
  • After the tower expert network receives and fuses the task-specific features and shared features, it can perform the corresponding task based on them to obtain the output result of the tower expert network.
  • the recommendation model can include multiple tower expert networks, and by fusing the output results of multiple tower expert networks, user-specific recommendation information can be obtained.
  • In the multi-task recommendation model, multiple tower expert networks are set up for each task, and the output stability of the recommendation model is improved through the outputs of the multiple tower expert networks, which can avoid the overfitting problem caused by data sparseness and thereby improve the output accuracy of the recommendation model.
  • the model provided in this application is a multi-task model.
  • The efficient parameter sharing scheme enables the two associated tasks of the multi-task model, click-through rate estimation and conversion rate estimation, to assist each other, achieving better results than a single-task model and thus directly benefiting platform revenue and user experience. It can not only reduce the number of models deployed online and reduce model maintenance costs, but also more effectively mine the information contained in related tasks to achieve better recommendation results.
  • Each tower expert network can correspond one-to-one to a tower feature extraction network, and the tower feature extraction network can be used to perform feature extraction for the corresponding tower expert network to obtain tower-specific features, which are input into the corresponding tower expert network. The input to each tower expert network can therefore include the shared features extracted by the shared feature extraction network, the task-specific features extracted by the task-specific feature extraction network, and the tower-specific features extracted by the tower feature extraction network, so that each tower expert network can use a variety of features to perform its corresponding task and output more accurate results. In this implementation, a separate feature extraction network is set up for each tower expert network, so that the features required by each tower can be extracted more accurately and each tower can use more accurate features to produce more accurate output results.
  • When the input data is input to the recommendation model, it is first converted by the Embedding layer into a feature representation (also called a feature vector) that the feature extraction networks can recognize. It is then input to the tower feature extraction network corresponding to each tower, the task-specific feature extraction network corresponding to each task, and the shared feature extraction network, so that features can be extracted for each tower and each task. After these feature extraction networks extract their features, the extracted features can be input to the gating network corresponding to each tower expert network, and the features extracted by each feature extraction network are fused through the gating network.
  • The features required by different tower expert networks may be different, and the gating network can adaptively fuse the features extracted by each feature extraction network, for example fusing them with different weights, so as to obtain the features required by each tower expert network and input them to that network.
  • the input layer passes in data features, takes out the corresponding embedding vector expressions from the embedding table through sparsely encoded IDs, and finally concatenates the embedding vector expressions of all input features in order to form a feature vector.
  • Each feature extraction network receives the feature vector, and for each task the outputs of the associated tower experts are fused to give the prediction for that task.
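  • As an illustrative sketch of this input layer (assuming PyTorch; the vocabulary sizes and embedding dimension are hypothetical), the lookup and concatenation could be written as:

      import torch
      import torch.nn as nn

      class InputLayer(nn.Module):
          # Looks up an embedding vector for each sparsely encoded ID feature
          # and concatenates them in order to form the feature vector x0.
          def __init__(self, vocab_sizes, emb_dim=16):
              super().__init__()
              self.tables = nn.ModuleList(nn.Embedding(v, emb_dim) for v in vocab_sizes)

          def forward(self, ids):
              # ids: (batch, num_fields) integer-encoded feature IDs
              embs = [table(ids[:, i]) for i, table in enumerate(self.tables)]
              return torch.cat(embs, dim=-1)  # (batch, num_fields * emb_dim)

      # hypothetical usage with three ID fields (user, item, context)
      x0 = InputLayer(vocab_sizes=[1000, 5000, 100])(torch.tensor([[3, 42, 7]]))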
  • the feature vector x 0 of the input layer is input to the underlying feature representation layer.
  • the feature representation layer is composed of various feature extraction networks.
  • The feature representation layer includes a shared expert layer (Shared Expert), that is, the aforementioned shared feature extraction network; a task-specific expert layer (CTR/CVR Task-Expert), that is, the aforementioned task-specific feature extraction network; and a tower-specific expert layer (CTR/CVR Tower-Specific Expert), that is, the aforementioned tower feature extraction network.
  • Each expert is composed of multiple sub-networks. The number of subnetworks, the dimensionality of the subnetworks and the network structure are all hyperparameters.
  • Each task of the feature interaction layer contains several Tower Expert networks.
  • the input of each Tower Expert network is weighted and controlled by the Gate Control network.
  • The input of each Tower Expert network of each task is the weighted output of its gating network; the input of the gating network includes three parts: the output of the tower-specific expert layer under this tower, the output of the task-specific expert layer under this task, and the output of the shared expert layer.
  • the feature vector x 0 serves as the selector (Selector) of the gating network.
  • the structure of the gated network can be a fully connected network or other deep networks.
  • the feature vector x 0 is used as a selector to obtain the weights of different subnetworks, and thus the weighted sum of the gated network under different tower experts for different tasks can be obtained.
  • the tower experts of each feature interaction layer will perform a weighted summation of the expert layer outputs of the tower-specific expert layer under this tower, the task-specific expert layer under this task, and the shared expert layer based on the input feature vector x 0 . Therefore, each tower expert network of each task obtains a unique feature representation, and then through the tower expert network of each subtask, the output of the corresponding subtask tower expert network is obtained.
  • the predicted value for each task is a weighted aggregation of the outputs of multiple tower expert networks contained in that subtask.
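  • In symbols (the notation below is chosen here for illustration rather than taken from this application): for the k-th tower of task t, with the feature vector x_0 as the selector, the tower input and the task prediction can be written as:

      h_{t,k} = \sum_{i \in \{\text{tower},\, \text{task},\, \text{shared}\}} g_{t,k}^{(i)}(x_0)\, E_i(x_0), \qquad \hat{y}_t = \sigma\left( \sum_{k=1}^{K_t} w_{t,k}\, T_{t,k}\left(h_{t,k}\right) \right)

  • where E_i(x_0) are the outputs of the three expert layers, g_{t,k}(x_0) are the softmax weights produced by the gating network of that tower, T_{t,k} is the k-th tower expert network of task t, and w_{t,k} are the aggregation weights.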
  • The expert network in this application can use a variety of structures, such as any deep network, for example a Squeeze-and-Excitation network or an Attention network.
  • hierarchical expert structures are set up on the bottom layer (feature representation layer) and the task tower network layer (feature interaction layer) respectively.
  • this application designs a multi-expert parameter sharing mechanism in the feature representation layer.
  • The feature representation layer includes three types of expert networks: shared experts, task-specific experts and tower-specific experts. Shared experts share common knowledge between tasks, task-specific experts extract the knowledge required by their task, and tower-specific experts independently learn knowledge that serves a single tower structure; each expert performs its own duties and extracts information efficiently. At the same scale, a single network cannot effectively learn the common expressions between tasks, but after dividing it into multiple sub-networks, each sub-network can always learn some relevant and unique expressions in a certain task. Therefore, for a single prediction task, this application sets up multiple tower experts in the tower network to further learn feature interactions from different angles, improving the learning ability and generalization ability of the model and ultimately achieving better prediction accuracy than traditional methods.
  • a more flexible parameter sharing mechanism is provided in the feature representation layer.
  • The parameters of a tower-specific expert only serve that tower expert, the parameters of a task-specific expert are shared only among the tower experts of the same task, and the parameters of a shared expert are shared by all tower experts, which allows information to be extracted and represented efficiently.
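  • To make the hierarchy concrete, the following compact sketch (assuming PyTorch; the layer sizes, expert counts, aggregation scheme and all names are hypothetical choices, not prescribed by this application) wires the three expert types, the per-tower gates and the tower expert networks together for two tasks such as CTR and CVR:

      import torch
      import torch.nn as nn

      def mlp(d_in, d_out):
          return nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())

      class HierarchicalMoE(nn.Module):
          def __init__(self, d_in, d_expert=64, num_tasks=2, towers=3):
              super().__init__()
              self.num_tasks, self.towers = num_tasks, towers
              # feature representation layer: three kinds of experts
              self.shared = mlp(d_in, d_expert)
              self.task_experts = nn.ModuleList(
                  mlp(d_in, d_expert) for _ in range(num_tasks))
              self.tower_experts = nn.ModuleList(
                  nn.ModuleList(mlp(d_in, d_expert) for _ in range(towers))
                  for _ in range(num_tasks))
              # one gate per tower; x0 acts as the selector over 3 sources
              self.gates = nn.ModuleList(
                  nn.ModuleList(nn.Linear(d_in, 3) for _ in range(towers))
                  for _ in range(num_tasks))
              # feature interaction layer: one tower expert network per tower
              self.tower_nets = nn.ModuleList(
                  nn.ModuleList(nn.Sequential(mlp(d_expert, 32), nn.Linear(32, 1))
                                for _ in range(towers))
                  for _ in range(num_tasks))
              # learnable aggregation weights over the towers of each task
              self.agg = nn.Parameter(torch.ones(num_tasks, towers))

          def forward(self, x0):
              shared = self.shared(x0)
              preds = []
              for t in range(self.num_tasks):
                  task_feat = self.task_experts[t](x0)
                  logits = []
                  for k in range(self.towers):
                      tower_feat = self.tower_experts[t][k](x0)
                      w = torch.softmax(self.gates[t][k](x0), dim=-1)  # (batch, 3)
                      fused = (w[:, 0:1] * tower_feat + w[:, 1:2] * task_feat
                               + w[:, 2:3] * shared)
                      logits.append(self.tower_nets[t][k](fused))      # (batch, 1)
                  agg_w = torch.softmax(self.agg[t], dim=-1)           # (towers,)
                  score = (torch.cat(logits, dim=-1) * agg_w).sum(-1)
                  preds.append(torch.sigmoid(score))                   # (batch,)
              return preds  # e.g. [p_ctr, p_cvr]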
  • the ranking of display ads requires click-through rate prediction and conversion rate prediction.
  • the inputs include user characteristics, product characteristics, and contextual characteristics.
  • A multi-task method is used to jointly model the click-through rate prediction and conversion rate prediction tasks.
  • This application improves the ability of the multi-task learning model to share information through the hierarchical hybrid expert module, so that it can more accurately estimate the click-through rate and conversion rate and give more accurate recommendations.
  • the feature vector is extracted by the three types of experts mentioned above, and the information representation is obtained from the three types of experts through the gated network, and then input into the tower network layer (feature interaction layer) to learn feature interaction;
  • The output results of the tower network experts are weighted and aggregated and passed through the activation function to finally obtain the predicted value of the task.
  • The display advertising sorting scenario is a typical machine learning application scenario. Its main structure is shown in Figure 11, including display advertising, offline logs, offline training, online inference, and online sorting.
  • the basic operating logic of the display advertising recommendation system is: users perform a series of behaviors in the front-end display list, such as browsing, clicking, commenting, downloading, etc., and generate behavioral data, which is stored in the log.
  • The recommendation system uses data including user behavior logs for offline model training, generates a prediction model after training converges, and deploys the model in an online serving environment. Based on the user's request, item characteristics and contextual information, it gives a click-through rate estimation score P ctr and a conversion rate estimation score P cvr; the online sorting module then combines these two scores with business logic to sort the candidate ads and displays the final recommendation list to the user, and the user's feedback on the recommendation results in turn forms new user data.
  • the icon of the recommended app can be displayed in the display interface of the user's terminal, so that the user can further click or download the recommended app, so that the user can quickly find the required app. Improve user experience.
  • the online sorting stage requires the score P ctr for click rate estimation and the score P cvr for conversion rate estimation.
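  • As a simplified, hypothetical illustration of how the online sorting module might combine the two scores with business logic (the expected-value formula below is a common choice, not necessarily the one used here):

      def sort_ads(ads):
          # ads: list of dicts with hypothetical keys 'ad_id', 'bid', 'p_ctr', 'p_cvr'
          # rank by expected value per impression: bid * pCTR * pCVR
          return sorted(ads, key=lambda a: a['bid'] * a['p_ctr'] * a['p_cvr'],
                        reverse=True)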
  • Multi-task learning can alleviate the impact of sample selection bias and data sparsity on model performance.
  • An efficient parameter sharing scheme can assist each other in the click rate estimation and conversion rate estimation of the two related tasks of the multi-task model, achieving better results than the single-task model, thus directly affecting platform revenue and user experience.
  • a good multi-task learning solution can not only reduce the number of models deployed online and reduce model maintenance costs, but also more effectively mine the information contained in related tasks to achieve better recommendation results.
  • The offline evaluation index is AUC (Area Under Curve, the area under the ROC curve).
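  • For reference, the AUC of a set of offline predictions can be computed with a standard library call (scikit-learn here, as one possible choice; the numbers are toy values):

      from sklearn.metrics import roc_auc_score

      labels = [0, 0, 1, 1]           # toy ground-truth click/conversion labels
      scores = [0.1, 0.4, 0.35, 0.8]  # toy model predictions
      print(roc_auc_score(labels, scores))  # 0.75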
  • This application improves the parameter sharing mechanism by setting up a hierarchical hybrid expert module. Compared with commonly used solutions, this application proposes that multiple tower expert structures should also be set up in the feature interaction layer. Based on the idea of ensemble learning, a single network at the same scale cannot effectively learn the common expressions between tasks, but after dividing it into multiple sub-networks, each sub-network can always learn some relevant and unique expressions in a certain task. Therefore, for a separate prediction task, this application sets up multiple tower experts in the task tower network to further learn feature interactions from different angles and improve the learning ability of the model.
  • this application proposes tower-specific experts.
  • the input of each tower expert layer is controlled by the gate control network.
  • The gate control network accepts, as input for weighted summation, the feature representations learned by tower-specific experts (whose parameters only serve the tower), task-specific experts (whose parameters are shared only among tower experts of the same task), and shared experts (whose parameters are shared by all tower experts). This enables tower experts in the feature interaction layer to learn feature interactions that include personalized information unique to the tower, information shared within the same task, and more generalized information across all tasks.
  • This flexible parameter sharing mechanism can efficiently extract information representation so that the multi-task learning solution proposed in this application can fully share the associated information between tasks.
  • using the loss value between the overall output of the model and the output of a single tower as a constraint to update the model can make the model converge faster, reduce the possibility of learning bias, and improve the performance of the model.
  • Referring to Figure 13, this application provides a schematic structural diagram of a recommendation device for performing the steps in Figures 10-12, as described below.
  • the recommendation device includes:
  • the acquisition module 1301 is used to acquire input data, which includes user information;
  • the recommendation module 1302 is used to use the input data as the input of the recommendation model and output recommendation information for the user;
  • the recommendation model is used to perform multiple tasks for recommending users.
  • the recommendation model includes a shared feature extraction network, multiple tower expert networks corresponding to each task, and a task-specific feature extraction network corresponding to each task.
  • the output end of the shared feature extraction network is connected to the input end of each tower expert network, and the input ends of multiple tower expert networks corresponding to each task are also connected to the output end of the task-specific feature extraction network corresponding to each task;
  • the parameters of multiple tower expert networks are different.
  • the shared feature extraction network is used to extract shared features from the input data.
  • the shared features are shared by the tower expert networks corresponding to multiple tasks.
  • The task-specific feature extraction network is used to extract tower-expert shared features from the input data; the tower-expert shared features are shared by the multiple tower expert networks corresponding to a single task. The multiple tower expert networks are used to execute the corresponding tasks based on the features extracted by the task-specific feature extraction networks and the shared feature extraction network, and the outputs of the multiple tower expert networks corresponding to the multiple tasks are weighted and fused to obtain the recommendation information.
  • the recommendation model also includes a tower feature extraction network corresponding to multiple tower expert networks.
  • The tower feature extraction network is used to extract, from the input data, features relevant to the tasks performed by the corresponding tower expert networks, and the tower feature extraction network parameters corresponding to the multiple tower expert networks are different.
  • The recommendation model also includes multiple gating networks; each tower expert network corresponds to one gating network, and the gating network is used to fuse the outputs of the corresponding task-specific feature extraction network, shared feature extraction network and tower feature extraction network, with the fusion result serving as the input of the corresponding tower expert network.
  • the recommendation device further includes:
  • the training module 1303 is also used to iteratively train the initial model to obtain the recommended model.
  • the structure of the initial model is the same as the recommended model;
  • In one iteration of training the initial model: the training sample is used as the input of the initial model and a first output result is produced; the first loss value between the label of the training sample and the first output result is obtained; multiple second output results output by the tower expert networks corresponding to the multiple tasks are obtained; multiple second loss values between the first output result and the multiple second output results are obtained; and the initial model is updated according to the first loss value and the second loss values to obtain the initial model after the current iteration.
  • multiple tasks include predicting click-through rates and predicting conversion information.
  • the click-through rate is the probability that a user clicks on the target object.
  • the conversion information includes the conversion rate or conversion duration.
  • The conversion rate is the probability that the user performs a conversion operation on the target object after clicking on it.
  • the conversion duration includes the length of time the user stays on the target object after clicking on the target object and performing a conversion operation on the target object.
  • Referring to Figure 14, the training device may include:
  • the acquisition module 1401 is used to acquire a training set, which includes multiple samples and labels corresponding to each sample;
  • the training module 1402 is used to use the training set as the input of the initial model to iteratively train the initial model to obtain the recommended model;
  • the recommendation model is used to perform multiple tasks for recommending users.
  • the recommendation model includes a shared feature extraction network, multiple tower expert networks corresponding to each task, and a task-specific feature extraction network corresponding to each task.
  • The output end of the shared feature extraction network is connected to the input end of each tower expert network, and the input ends of the multiple tower expert networks corresponding to each task are also connected to the output end of the task-specific feature extraction network corresponding to that task. In each iteration, the samples in the training set are used as the input of the initial model obtained in the previous iteration, the first loss value between the first output result of that model and the label of the input sample is obtained, the second loss value between the second output result of each tower expert network and the first output result is obtained, and the model obtained in the previous iteration is updated based on the second loss values and the first loss value to obtain the model of the current iteration.
  • the recommendation model also includes a tower feature extraction network corresponding to multiple tower expert networks.
  • the input end of each tower expert network is also connected to the output end of the corresponding tower feature extraction network.
  • The tower feature extraction network is used to extract features from the input data that are related to the tasks performed by the corresponding tower expert networks, and the tower feature extraction network parameters corresponding to the multiple tower expert networks are different.
  • The recommendation model also includes multiple gating networks; each tower expert network corresponds to one gating network, and the gating network is used to fuse the outputs of the corresponding task-specific feature extraction network, shared feature extraction network and tower feature extraction network, with the fusion result serving as the input of the corresponding tower expert network.
  • the multiple tasks include predicting click-through rates and predicting conversion information.
  • The click-through rate is the probability that the user clicks on the target object.
  • the conversion information includes conversion rate or conversion duration.
  • the conversion rate is the probability of the user performing a conversion operation on the target object after clicking on the target object.
  • The conversion duration includes the length of time the user stays on the target object after clicking on it and performing a conversion operation on it.
  • Figure 15 is a schematic structural diagram of another recommended device provided by this application, as described below.
  • the recommendation device may include a processor 1501 and a memory 1502.
  • The processor 1501 and the memory 1502 are interconnected through lines, and the memory 1502 stores program instructions and data corresponding to the steps in the aforementioned Figures 10-12.
  • the processor 1501 is configured to execute the method steps performed by the recommendation device shown in any of the embodiments shown in FIGS. 10 to 12 .
  • the recommendation device may also include a transceiver 1503 for receiving or sending data.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • The computer-readable storage medium stores a program which, when run on a computer, causes the computer to execute the steps of the method described in the embodiments shown in Figures 10-12.
  • Optionally, the aforementioned recommendation device shown in Figure 15 may be a chip.
  • Figure 16 is a schematic structural diagram of another training device provided by this application, as described below.
  • the training device may include a processor 1601 and a memory 1602.
  • the processor 1601 and the memory 1602 are interconnected through lines.
  • the memory 1602 stores program instructions and data.
  • the memory 1602 stores program instructions and data corresponding to the steps in FIGS. 8-9.
  • the processor 1601 is configured to execute the method steps performed by the training device shown in FIGS. 8 and 9 .
  • the training device may also include a transceiver 1603 for receiving or sending data.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • The computer-readable storage medium stores a program which, when run on a computer, causes the computer to execute the steps of the method described in the embodiments shown in Figures 8-9.
  • Optionally, the aforementioned training device shown in Figure 16 may be a chip.
  • the embodiment of the present application also provides a recommendation device.
  • the recommendation device may also be called a digital processing chip or chip.
  • the chip includes a processing unit and a communication interface.
  • The processing unit obtains program instructions through the communication interface, the program instructions are executed by the processing unit, and the processing unit is used to execute the aforementioned method steps of Figures 10-12.
  • the embodiment of the present application also provides a training device.
  • The training device may also be called a digital processing chip or chip.
  • the chip includes a processing unit and a communication interface.
  • The processing unit obtains program instructions through the communication interface, the program instructions are executed by the processing unit, and the processing unit is used to perform the aforementioned method steps of Figures 8-9.
  • An embodiment of the present application also provides a digital processing chip.
  • The digital processing chip integrates circuits and one or more interfaces for realizing the functions of the above-mentioned processor 1501 and/or processor 1601.
  • the digital processing chip can complete the method steps of any one or more of the foregoing embodiments.
  • When the digital processing chip does not have integrated memory, it can be connected to an external memory through a communication interface.
  • the digital processing chip implements the actions performed by the recommendation device or the training device in the above embodiment according to the program code stored in the external memory.
  • An embodiment of the present application also provides a computer program product that, when run on a computer, causes the computer to perform the steps of the method described in the embodiments shown in FIGS. 8 to 12 .
  • the recommendation device or training device provided by the embodiment of the present application may be a chip.
  • the chip includes: a processing unit and a communication unit.
  • the processing unit may be, for example, a processor.
  • The communication unit may be, for example, an input/output interface, a pin, a circuit, etc.
  • the processing unit can execute computer execution instructions stored in the storage unit, so that the chip in the server executes the method steps described in the embodiments shown in FIGS. 8-12.
  • the storage unit is a storage unit within the chip, such as a register, cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as Read-only memory (ROM) or other types of static storage devices that can store static information and instructions, random access memory (random access memory, RAM), etc.
  • The aforementioned processing unit or processor may be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, etc.
  • a general-purpose processor may be a microprocessor or any conventional processor, etc.
  • Figure 17 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the chip can be represented as a neural network processor NPU 170.
  • The NPU 170 serves as a co-processor mounted on the main CPU (Host CPU), and tasks are allocated by the Host CPU.
  • the core part of the NPU is the arithmetic circuit 1703.
  • the arithmetic circuit 1703 is controlled by the controller 1704 to extract the matrix data in the memory and perform multiplication operations.
  • the computing circuit 1703 internally includes multiple processing engines (PEs).
  • In some implementations, the arithmetic circuit 1703 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • In some implementations, the arithmetic circuit 1703 is a general-purpose matrix processor.
  • the arithmetic circuit obtains the corresponding data of matrix B from the weight memory 1702 and caches it on each PE in the arithmetic circuit.
  • the operation circuit takes matrix A data and matrix B from the input memory 1701 to perform matrix operations, and the partial result or final result of the matrix is stored in an accumulator (accumulator) 1708 .
  • the unified memory 1706 is used to store input data and output data.
  • The weight data is transferred directly to the weight memory 1702 through the direct memory access controller (DMAC) 1705.
  • Input data is also transferred to unified memory 1706 via DMAC.
  • Bus interface unit (bus interface unit, BIU) 1710 is used for interaction between the AXI bus and DMAC and instruction fetch buffer (IFB) 1709.
  • Specifically, the bus interface unit 1710 is used by the instruction fetch buffer 1709 to obtain instructions from the external memory, and is also used by the storage unit access controller 1705 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1706 or the weight data to the weight memory 1702 or the input data to the input memory 1701 .
  • The vector calculation unit 1707 includes multiple arithmetic processing units and, if necessary, performs further processing on the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operations, logarithmic operations, size comparison, etc. It is mainly used for non-convolutional/fully connected layer calculations in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
  • vector calculation unit 1707 can store the processed output vectors to unified memory 1706 .
  • the vector calculation unit 1707 can apply a linear function and/or a nonlinear function to the output of the operation circuit 1703, such as linear interpolation on the feature plane extracted by the convolution layer, or a vector of accumulated values, to generate an activation value.
  • vector calculation unit 1707 generates normalized values, pixel-wise summed values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 1703, such as for use in a subsequent layer in a neural network.
  • the instruction fetch buffer 1709 connected to the controller 1704 is used to store instructions used by the controller 1704;
  • the unified memory 1706, the input memory 1701, the weight memory 1702 and the fetch memory 1709 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • The operation of each layer in the recurrent neural network can be performed by the operation circuit 1703 or the vector calculation unit 1707.
  • the processor mentioned in any of the above places may be a general central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the program execution of the methods in Figures 8 to 12.
  • the device embodiments described above are only illustrative.
  • The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they can be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
  • The present application can be implemented by software plus the necessary general-purpose hardware; it can of course also be implemented by dedicated hardware, including dedicated integrated circuits, dedicated CPUs, dedicated memories, dedicated components, etc. In general, any function performed by a computer program can be easily implemented with corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits or dedicated circuits. However, for this application, a software implementation is in most cases the better choice. Based on this understanding, the part of the technical solution of the present application that is essential, or that contributes to the prior art, can be embodied in the form of a software product.
  • The computer software product is stored in a readable storage medium, such as a floppy disk, USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk, and includes a number of instructions that cause a computer device (which can be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of this application.
  • the computer program product includes one or more computer instructions.
  • The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
  • the computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transferred from a website, computer, server, or data center Transmission to another website, computer, server or data center by wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.) means.
  • The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center integrated with one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, solid state disk (SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本申请提供人工智能领域的一种推荐方法、训练方法以及装置,用于在推荐模型中设置多个塔专家网络,从而避免数据稀疏带来的过拟合问题,提高模型的输出稳定性。该方法包括:获取输入数据;随后,将输入数据作为推荐模型的输入,输出推荐信息,该推荐模型为多任务模型,推荐模型中包括共享特征提取网络、每个任务分别对应的多个塔专家网络以及任务专有特征提取网络,共享特征提取网络用于从输入数据中提取共享特征,任务专有特征提取网络用于从输入数据中提取单个任务的塔专家共享特征,塔专家网络用于基于任务专有特征提取网络以及共享特征提取网络提取到的特征执行对应任务,多个任务分别对应的多个塔专家网络的输出经加权融合后得到推荐信息。

Description

一种推荐方法、训练方法以及装置
本申请要求于2022年05月17日提交中国专利局、申请号为“202210536912.5”、申请名称为“一种推荐方法、训练方法以及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域,尤其涉及一种推荐方法、训练方法以及装置。
背景技术
机器学习系统中,基于输入数据和标签,通过梯度下降等优化方法训练机器学习模型的参数,当模型参数收敛之后,可利用该模型来完成未知数据的预测。以个性化推荐系统中的点击率预测为例,其输入数据包括用户特征、物品特征和上下文特征等,输出为用户产生的推荐列表。
如何根据用户的偏好,预测出个性化的推荐列表,对提升推荐系统的用户体验和平台收入有着重要的影响。在传统互联网效果广告按点击出价模式中,只有点击率预估,广告主的真实诉求没有得到表达,越来越多的广告主开始关注深度转化行为的效果,即需要对广告的转化率进行预估。
然而现有的多任务学习技术方案大多采用硬参数共享机制,多任务学习可能带来负迁移(negative transfer)现象,即任务之间的信息共享会影响网络的表现。因此,需要更加灵活的参数共享机制。此外,数据稀疏会使得转化率预估模型容易过拟合。因此,如何得到更准确的预估结果,成为亟待解决的问题。
发明内容
本申请提供一种推荐方法、训练方法以及装置,用于在推荐模型中设置多个塔专家网络,从而避免数据稀疏带来的过拟合问题,提高模型的输出稳定性。
有鉴于此,第一方面,本申请提供一种推荐方法,包括:获取输入数据,输入数据包括用户的信息;随后,将输入数据作为推荐模型的输入,输出针对用户的推荐信息;
其中,推荐模型用于执行针对用户进行推荐的多个任务,推荐模型中包括共享特征提取网络、每个任务分别对应的多个塔专家网络以及每个任务分别对应的任务专有特征提取网络,共享特征提取网络的输出端连接每个塔专家网络的输入端,每个任务分别对应的多个塔专家网络的输入端还连接每个任务分别对应的任务专有特征提取网络的输出端;该多个塔专家网络的参数不相同,共享特征提取网络用于从输入数据中提取共享特征,该共享特征用于多个任务对应的塔专家网络共用,任务专有特征提取网络用于从输入数据中提取塔专家共享特征,该塔专家共享特征用于单个任务对应的多个塔专家网络所共用,多个塔专家网络用于基于对应任务的任务专有特征提取网络以及共享特征提取网络提取到的特征执行该对应任务,多个任务分别对应的多个塔专家网络的输出经加权融合后得到推荐信息。
因此,本申请实施方式中,针对每个任务设置了参数不完全相同的多个塔专家网络,从而通过多个塔专家网络的输出结果,来提高推荐模型的输出准确性。即使在数据稀疏的 情况下,也可以通过多个塔专家结构的输出结果,得到更稳定的输出结果。
在一种可能的实施方式中,推荐模型还包括与多个塔专家网络一一对应的塔特征提取网络,塔特征提取网络用于从输入数据中提取与对应的塔专家网络执行的任务相关的特征,且多个塔专家网络对应的塔特征提取网络参数不相同,每个塔专家网络的输入还包括对应的塔特征提取网络提取到的特征。
因此,本申请实施方式中,为每个塔设置单独的特征提取网络,从而可以针对性地为每个塔专家网络提取到所需的特征,从而进一步提高推荐模型的输出结果的准确性。
在一种可能的实施方式中,推荐模型还包括多个门控网络,每个塔专家网络对应一个门控网络,门控网络用于融合对应的任务专有特征提取网络、共享特征提取网络以及塔特征提取网络的输出,并将融合结果作为对应的塔专家网络的输入。
因此,本申请实施方式中,通过门控网络来控制输入至塔专家网络的各种特征的权重,从而可以适应性地针对不同塔专家网络提取所需的特征,提高各个塔专家网络的输出准确性。
在一种可能的实施方式中,在获取输入数据之前,上述方法还包括:对初始模型进行迭代训练,得到推荐模型,初始模型的结构与推荐模型相同;
其中,在对初始模型的其中一次迭代训练过程中:将训练样本作为初始模型的输入,输出第一输出结果;获取训练样本的标签和第一输出结果之间的第一损失值;获取多个任务对应的塔专家网络输出的多个第二输出结果;获取第一输出与多个第二输出结果之间的多个第二损失值;根据第一损失值和第二损失值更新初始模型,得到当前次迭代后的初始模型。
因此,本申请实施方式中,在训练过程中,可以利用推荐模型的整体输出结果与子网络的输出结果之间的损失值作为约束,对每个塔专家网络进行更新,使各个子网络的输出与推荐模型的整体输出更接近,提高模型收敛速度,可以高效地实现模型训练。
在一种可能的实施方式中,多个任务包括预测点击率和预测转化信息,点击率为用户点击目标对象的概率,转化信息包括转化率或者转化时长,转化率为用户点击目标对象后对目标对象进行转化操作的概率,转化时长包括用户点击目标对象后对目标对象进行转化操作后停留的时长。
因此,本申请提供的推荐模型可以用于执行多个任务,如预测点击率以及转化信息,从而可以准确地为用户预测出适应的推荐对象,提高用户体验。
第二方面,本申请提供一种训练方法,包括:获取训练集,训练集中包括多个样本以及每个样本对应的标签;将训练集作为初始模型的输入对初始模型进行迭代训练,得到推荐模型;
其中,推荐模型用于执行针对用户进行推荐的多个任务,推荐模型中包括共享特征提取网络、每个任务分别对应的多个塔专家网络以及每个任务分别对应的任务专有特征提取网络,共享特征提取网络的输出端连接每个塔专家网络的输入端,每个任务分别对应的多个塔专家网络的输入端还连接每个任务分别对应的任务专有特征提取网络的输出端;
其中,在每次迭代过程中,将训练集中的样本作为上一次迭代得到的初始模型的输入, 获取上一次迭代得到的模型的第一输出结果与输入样本的标签之间的第一损失值,获取每个塔专家网络的第二输出结果与第一输出结果之间的第二损失值,根据第二损失值和第一损失值对上一次迭代得到的模型进行更新,得到当前次迭代的模型。
因此,本申请实施方式中,在更新整体模型时,计算模型的输出结果和各个塔专家的输出结果之间的损失值,将该损失值作为约束对各个塔专家进行更新,从而约束各个塔专家的输出结果与模型的整体输出结果更接近,可以加快模型收敛,实现模型的高效训练。
在一种可能的实施方式中,推荐模型还包括与多个塔专家网络一一对应的塔特征提取网络,每个塔专家网络的输入端还连接对应的塔特征提取网络的输出端,塔特征提取网络用于从输入数据中提取与对应的塔专家网络执行的任务相关的特征,且多个塔专家网络对应的塔特征提取网络参数不相同。
因此,本申请实施方式中,为每个塔设置单独的特征提取网络,从而可以针对性地为每个塔专家网络提取到所需的特征,从而进一步提高推荐模型的输出结果的准确性。
在一种可能的实施方式中,推荐模型还包括多个门控网络,每个塔专家网络对应一个门控网络,门控网络用于融合对应的任务专有特征提取网络、共享特征提取网络以及塔特征提取网络的输出,并将融合结果作为对应的塔专家网络的输入。
因此,本申请实施方式中,通过门控网络来控制输入至塔专家网络的各种特征的权重,从而可以适应性地针对不同塔专家网络提取所需的特征,提高各个塔专家网络的输出准确性。
在一种可能的实施方式中,多个任务包括预测点击率和预测转化信息,点击率为用户点击目标对象的概率,转化信息包括转化率或者转化时长,转化率为用户点击目标对象后对目标对象进行转化操作的概率,转化时长包括用户点击目标对象后对目标对象进行转化操作后停留的时长。
因此,本申请提供的推荐模型可以用于执行多个任务,如预测点击率以及转化信息,从而可以准确地为用户预测出适应的推荐对象,提高用户体验。
第三方面,本申请提供一种推荐装置,包括:
获取模块,用于获取输入数据,输入数据包括用户的信息;
推荐模块,用于将输入数据作为推荐模型的输入,输出针对用户的推荐信息;
其中,推荐模型用于执行针对用户进行推荐的多个任务,推荐模型中包括共享特征提取网络、每个任务分别对应的多个塔专家网络以及每个任务分别对应的任务专有特征提取网络,共享特征提取网络的输出端连接每个塔专家网络的输入端,每个任务分别对应的多个塔专家网络的输入端还连接每个任务分别对应的任务专有特征提取网络的输出端;该多个塔专家网络的参数不相同,共享特征提取网络用于从输入数据中提取共享特征,该共享特征用于多个任务对应的塔专家网络共用,任务专有特征提取网络用于从输入数据中提取塔专家共享特征,该塔专家共享特征用于单个任务对应的多个塔专家网络所共用,多个塔专家网络用于基于对应任务的任务专有特征提取网络以及共享特征提取网络提取到的特征执行对应任务,多个任务分别对应的多个塔专家网络的输出经加权融合后得到推荐信息。
在一种可能的实施方式中,推荐模型还包括与多个塔专家网络一一对应的塔特征提取 网络,塔特征提取网络用于从输入数据中提取与对应的塔专家网络执行的任务相关的特征,且多个塔专家网络对应的塔特征提取网络参数不相同,每个塔专家网络的输入还包括对应的塔特征提取网络提取到的特征。
在一种可能的实施方式中,推荐模型还包括多个门控网络,每个塔专家网络对应一个门控网络,门控网络用于融合对应的任务专有特征提取网络、共享特征提取网络以及塔特征提取网络的输出,并将融合结果作为对应的塔专家网络的输入。
在一种可能的实施方式中,装置还包括:训练模块,还用于对初始模型进行迭代训练,得到推荐模型,初始模型的结构与推荐模型相同;
其中,在对初始模型的其中一次迭代训练过程中:将训练样本作为初始模型的输入,输出第一输出结果;获取训练样本的标签和第一输出结果之间的第一损失值;获取多个任务对应的塔专家网络输出的多个第二输出结果;获取第一输出与多个第二输出结果之间的多个第二损失值;根据第一损失值和第二损失值更新初始模型,得到当前次迭代后的初始模型。
在一种可能的实施方式中,多个任务包括预测点击率和预测转化信息,点击率为用户点击目标对象的概率,转化信息包括转化率或者转化时长,转化率为用户点击目标对象后对目标对象进行转化操作的概率,转化时长包括用户点击目标对象后对目标对象进行转化操作后停留的时长。
第四方面,本申请提供一种训练装置,包括:
获取模块,用于获取训练集,训练集中包括多个样本以及每个样本对应的标签;
训练模块,用于将训练集作为初始模型的输入对初始模型进行迭代训练,得到推荐模型;
其中,推荐模型用于执行针对用户进行推荐的多个任务,推荐模型中包括共享特征提取网络、每个任务分别对应的多个塔专家网络以及每个任务分别对应的任务专有特征提取网络,共享特征提取网络的输出端连接每个塔专家网络的输入端,每个任务分别对应的多个塔专家网络的输入端还连接每个任务分别对应的任务专有特征提取网络的输出端;其中,在每次迭代过程中,将训练集中的样本作为上一次迭代得到的初始模型的输入,获取上一次迭代得到的模型的第一输出结果与输入样本的标签之间的第一损失值,获取每个塔专家网络的第二输出结果与第一输出结果之间的第二损失值,根据第二损失值和第一损失值对上一次迭代得到的模型进行更新,得到当前次迭代的模型。
在一种可能的实施方式中,推荐模型还包括与多个塔专家网络一一对应的塔特征提取网络,塔特征提取网络用于从输入数据中提取与对应的塔专家网络执行的任务相关的特征,且多个塔专家网络对应的塔特征提取网络参数不相同。
在一种可能的实施方式中,推荐模型还包括多个门控网络,每个塔专家网络对应一个门控网络,门控网络用于融合对应的任务专有特征提取网络、共享特征提取网络以及塔特征提取网络的输出,并将融合结果作为对应的塔专家网络的输入。
在一种可能的实施方式中,多个任务包括预测点击率和预测转化信息,点击率为用户点击目标对象的概率,转化信息包括转化率或者转化时长,转化率为用户点击目标对象后 对目标对象进行转化操作的概率,转化时长包括用户点击目标对象后对目标对象进行转化操作后停留的时长。
In a fifth aspect, this application provides a recommendation model. The recommendation model is used to perform multiple tasks of making recommendations for a user. The recommendation model includes a shared feature extraction network, multiple tower expert networks corresponding to each task, and a task-specific feature extraction network corresponding to each task; the output of the shared feature extraction network is connected to the input of each tower expert network, and the inputs of the multiple tower expert networks corresponding to each task are further connected to the output of the task-specific feature extraction network corresponding to that task; the multiple tower expert networks have different parameters; the shared feature extraction network is used to extract features from the input data; the task-specific feature extraction network is used to extract, from the input data, features related to each task; the multiple tower expert networks are used to perform the corresponding task based on the features extracted by the task-specific feature extraction network and the shared feature extraction network; and the outputs of the multiple tower expert networks corresponding to the multiple tasks are weighted and fused to obtain recommendation information.
Therefore, in the implementations of this application, multiple tower expert networks with not entirely identical parameters are provided for each task, so that the output results of the multiple tower expert networks improve the output accuracy of the recommendation model. Even when data is sparse, a more stable output can be obtained from the output results of the multiple tower expert structures.
In a possible implementation, the recommendation model further includes tower feature extraction networks in one-to-one correspondence with the multiple tower expert networks; the tower feature extraction network is used to extract, from the input data, features related to the task performed by the corresponding tower expert network; and the tower feature extraction networks corresponding to the multiple tower expert networks have different parameters.
Therefore, in the implementations of this application, a separate feature extraction network is provided for each tower, so the features needed by each tower expert network can be extracted in a targeted manner, further improving the accuracy of the recommendation model's output.
In a possible implementation, the recommendation model further includes multiple gating networks, with each tower expert network corresponding to one gating network; the gating network is used to fuse the outputs of the corresponding task-specific feature extraction network, shared feature extraction network, and tower feature extraction network, and to use the fusion result as the input of the corresponding tower expert network.
Therefore, in the implementations of this application, the gating network controls the weights of the various features fed into the tower expert network, so the features needed by different tower expert networks can be extracted adaptively, improving the output accuracy of each tower expert network.
In a possible implementation, the multiple tasks include click-through rate prediction and conversion information prediction; the click-through rate is the probability that the user clicks a target object; the conversion information includes a conversion rate or a conversion duration; the conversion rate is the probability that the user performs a conversion operation on the target object after clicking it; and the conversion duration includes the time the user dwells after clicking the target object and performing a conversion operation on it.
Therefore, the recommendation model provided in this application can perform multiple tasks, such as predicting the click-through rate and conversion information, and can thus accurately predict suitable recommendation objects for the user, improving user experience.
In a sixth aspect, an embodiment of this application provides a recommendation apparatus, including a processor and a memory, where the processor and the memory are interconnected by a line, and the processor invokes program code in the memory to perform the processing-related functions of the recommendation method according to any one of the foregoing first aspect.
In a seventh aspect, an embodiment of this application provides a training apparatus, including a processor and a memory, where the processor and the memory are interconnected by a line, and the processor invokes program code in the memory to perform the processing-related functions of the training method according to any one of the foregoing second aspect.
In an eighth aspect, an embodiment of this application provides an electronic device, including a processor and a memory, where the processor and the memory are interconnected by a line, and the processor invokes program code in the memory to perform the processing-related functions of the recommendation method according to any one of the foregoing first aspect.
In a ninth aspect, an embodiment of this application provides a recommendation apparatus, which may also be called a digital processing chip or chip; the chip includes a processing unit and a communication interface, the processing unit obtains program instructions through the communication interface, the program instructions are executed by the processing unit, and the processing unit is configured to perform the processing-related functions in the foregoing first aspect or any optional implementation of the first aspect.
In a tenth aspect, an embodiment of this application provides a training apparatus, which may also be called a digital processing chip or chip; the chip includes a processing unit and a communication interface, the processing unit obtains program instructions through the communication interface, the program instructions are executed by the processing unit, and the processing unit is configured to perform the processing-related functions in the foregoing second aspect or any optional implementation of the second aspect.
In an eleventh aspect, an embodiment of this application provides a computer-readable storage medium, including instructions that, when run on a computer, cause the computer to perform the method in any optional implementation of the foregoing first aspect or second aspect.
In a twelfth aspect, an embodiment of this application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method in any optional implementation of the foregoing first aspect or second aspect.
Brief description of drawings
FIG. 1 is a schematic diagram of an artificial intelligence main framework to which this application is applied;
FIG. 2 is a schematic diagram of a system architecture according to this application;
FIG. 3 is a schematic diagram of another system architecture according to this application;
FIG. 4 is a schematic diagram of an application scenario according to this application;
FIG. 5 is a schematic structural diagram of a recommendation model according to this application;
FIG. 6 is a schematic structural diagram of another recommendation model according to this application;
FIG. 7 is a schematic structural diagram of a gating network according to this application;
FIG. 8 is a schematic flowchart of a training method according to this application;
FIG. 9 is a schematic flowchart of another training method according to this application;
FIG. 10 is a schematic flowchart of a recommendation method according to this application;
FIG. 11 is a schematic diagram of another application scenario according to this application;
FIG. 12 is a schematic diagram of another application scenario according to this application;
FIG. 13 is a schematic structural diagram of a recommendation apparatus according to this application;
FIG. 14 is a schematic structural diagram of a training apparatus according to this application;
FIG. 15 is a schematic structural diagram of another recommendation apparatus according to this application;
FIG. 16 is a schematic structural diagram of another training apparatus according to this application;
FIG. 17 is a schematic structural diagram of a chip according to this application.
Description of embodiments
The following describes the technical solutions in the embodiments of this application with reference to the accompanying drawings in the embodiments of this application. Apparently, the described embodiments are merely some rather than all of the embodiments of this application. All other embodiments obtained by a person skilled in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
The recommendation method provided in this application can be applied to artificial intelligence (AI) scenarios. AI is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Research in the artificial intelligence field includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, and the like.
First, the overall workflow of an artificial intelligence system is described. Refer to FIG. 1, which shows a schematic structural diagram of an artificial intelligence main framework. The framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to processing, for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a refinement process of "data—information—knowledge—wisdom". The "IT value chain", ranging from the underlying infrastructure of artificial intelligence and information (technologies for providing and processing information) to the industrial ecology of the system, reflects the value that artificial intelligence brings to the information technology industry.
(1) Infrastructure
The infrastructure provides computing capability support for the artificial intelligence system, enables communication with the outside world, and provides support through a base platform. Communication with the outside is performed through sensors; computing capability is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the base platform includes related platform assurance and support such as distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to obtain data, and the data is provided to intelligent chips in the distributed computing system provided by the base platform for computation.
(2) Data
Data at the layer above the infrastructure indicates the data sources of the artificial intelligence field. The data involves graphics, images, speech, and text, as well as Internet-of-Things data of conventional devices, including business data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, and the like.
Machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, performing machine thinking and problem solving using formalized information according to a reasoning control strategy; typical functions are search and matching.
Decision-making refers to the process of making decisions after intelligent information is reasoned about, and usually provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the data processing mentioned above, some general capabilities can further be formed based on the results of the data processing, for example, an algorithm or a general system, such as translation, text analysis, computer vision processing, speech recognition, and image recognition.
(5) Intelligent products and industry applications
Intelligent products and industry applications refer to products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The application fields mainly include intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, and the like.
The embodiments of this application involve applications related to neural networks. For a better understanding of the solutions in the embodiments of this application, related terms and concepts of neural networks that may be involved in the embodiments are first introduced below.
(1) Neural network
A neural network may be composed of neural units. A neural unit may be an operation unit that takes $x_s$ as input, and the output of the operation unit may be:
$h_{W,b}(x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$
where s = 1, 2, ..., n, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit. f is the activation function of the neural unit, used to perform a nonlinear transformation on the features obtained in the neural network and convert the input signal in the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many of the above single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of the local receptive field; the local receptive field may be a region composed of several neural units.
(2) Deep neural network
A deep neural network (DNN), also called a multi-layer neural network, can be understood as a neural network with multiple hidden layers. Dividing the DNN by the positions of its layers, the neural network inside a DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected, i.e., any neuron in the i-th layer is necessarily connected to any neuron in the (i+1)-th layer.
Although the DNN looks complex, the work of each layer is actually not complex; simply put, it is the following linear relational expression: $\vec{y}=\alpha(W\vec{x}+\vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, W is the weight matrix (also called coefficients), and $\alpha()$ is the activation function. Each layer simply performs this simple operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since the DNN has many layers, the numbers of coefficients W and offset vectors $\vec{b}$ are also large. These parameters are defined in the DNN as follows, taking the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W_{24}^{3}$; the superscript 3 represents the layer of the coefficient W, and the subscript corresponds to the output third-layer index 2 and the input second-layer index 4.
In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as $W_{jk}^{L}$.
Note that the input layer has no W parameters. In a deep neural network, more hidden layers make the network better able to portray complex situations in the real world. Theoretically, a model with more parameters has higher complexity and greater "capacity", meaning it can complete more complex learning tasks. Training a deep neural network is the process of learning the weight matrices, and its ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
(3) Convolutional neural network
A convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network contains a feature extractor composed of convolutional layers and sub-sampling layers; the feature extractor can be regarded as a filter. A convolutional layer is a neuron layer in the convolutional neural network that performs convolution on the input signal. In a convolutional layer of a convolutional neural network, a neuron may be connected to only some neurons of neighboring layers. A convolutional layer usually contains several feature planes, and each feature plane may be composed of some rectangularly arranged neural units. Neural units of the same feature plane share weights, and the shared weights here are the convolution kernel. Sharing weights can be understood as the way features are extracted being independent of position. A convolution kernel can be initialized in the form of a matrix of random size, and during the training of the convolutional neural network the convolution kernel can obtain reasonable weights through learning. In addition, a direct benefit of weight sharing is reducing the connections between the layers of the convolutional neural network while also reducing the risk of overfitting.
(4) Loss function
In the process of training a deep neural network, because it is hoped that the output of the deep neural network is as close as possible to the value actually desired to be predicted, the weight vector of each layer of the neural network can be updated by comparing the current network's predicted value with the truly desired target value and then according to the difference between the two (certainly, there is usually an initialization process before the first update, i.e., parameters are preconfigured for each layer of the deep neural network). For example, if the network's predicted value is high, the weight vectors are adjusted to make it predict lower; adjustments continue until the deep neural network can predict the truly desired target value or a value very close to it. Therefore, "how to compare the difference between the predicted value and the target value" needs to be predefined; this is the loss function or objective function, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible. The loss function may usually include loss functions such as mean squared error, cross-entropy, logarithmic, and exponential. For example, mean squared error can be used as the loss function, defined as $\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}$; the specific loss function can be chosen according to the actual application scenario.
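To make the loss functions just named concrete, the following minimal sketch computes mean squared error and cross-entropy on made-up predictions and targets; all values are purely illustrative.

```python
# Illustrative computation of two common loss functions; values are made up.
import math

preds = [0.9, 0.2, 0.7]
targets = [1.0, 0.0, 1.0]

mse = sum((y - p) ** 2 for p, y in zip(preds, targets)) / len(preds)
bce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
           for p, y in zip(preds, targets)) / len(preds)
print(f"MSE: {mse:.4f}, cross-entropy: {bce:.4f}")
```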
(5) Back propagation algorithm
A convolutional neural network can use the error back propagation (BP) algorithm to correct the magnitudes of the parameters in the initial network model during training, so that the model's reconstruction error loss becomes smaller and smaller. Specifically, forward propagation of the input signal to the output produces an error loss, and the parameters in the initial model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss, aiming to obtain optimal model parameters, for example, the weight matrices.
In the implementations of this application, the BP algorithm can be used to train the model in either the pre-training stage or the noise processing stage, to obtain the trained model.
(6) Gradient: the derivative vector of the loss function with respect to the parameters.
(7) Stochastic gradient: in machine learning the number of samples is very large, so the loss function computed each time is computed on randomly sampled data, and the corresponding gradient is called a stochastic gradient.
(8) Embedding: a feature representation of a sample, generally the second-to-last layer of a neural network.
(9) Automated machine learning (AutoML): designing a series of advanced control systems to operate machine learning models, so that the models can automatically learn appropriate parameters and configurations without human intervention. In learning models based on deep neural networks, automated machine learning mainly includes network architecture search and global parameter setting. Network architecture search is used to let a computer generate, from data, the neural network architecture best suited to the problem; it features high training complexity and large performance gains.
(10) Recommender system: based on the user's historical click behavior data, a recommender system performs analysis and learning with machine learning algorithms, then makes predictions on the user's new requests and returns a personalized item recommendation list.
(11) Click-through rate (Click Through Rate, CTR): the probability that a user clicks a displayed item in a specific environment.
(12) Conversion rate (Post-click conversion rate, CVR): the probability that a user converts on a clicked displayed item in a specific environment; for example, if the user clicked an APP's icon, conversion refers to behaviors such as downloading, installing, and registering.
(13) Transfer learning: using existing knowledge to assist in learning new knowledge; the core is finding the similarity between existing knowledge and new knowledge.
(14) Multi-task learning: putting multiple related tasks together and learning multiple tasks simultaneously.
(15) Ensemble learning: ensemble learning methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any single learning algorithm alone.
(16) Model convergence: after multiple iterations, the error between the model's predicted values and the actual values is smaller than a preset small value.
(17) Generalization: the ability of a machine learning system to adapt to fresh samples. The purpose of machine learning is to learn the regularities hidden behind the data; for data outside the learning set that obeys the same regularities, the trained network can also give suitable outputs, and this ability is called generalization ability.
(18) Robustness: the ability of a machine learning system to handle errors during execution, and of an algorithm to keep running normally when encountering anomalies in input, computation, and the like.
The recommendation method provided in the embodiments of this application can be executed on a server and can also be executed on a terminal device. The terminal device may be a mobile phone with an image processing function, a tablet personal computer (TPC), a media player, a smart television, a laptop computer (LC), a personal digital assistant (PDA), a personal computer (PC), a camera, a video camera, a smart watch, a wearable device (WD), an autonomous vehicle, or the like, which is not limited in the embodiments of this application.
The following describes the system architecture provided in the embodiments of this application.
Referring to FIG. 2, an embodiment of this application provides a system architecture 200. As shown in the system architecture 200, a data collection device 260 may be used to collect training data. After the data collection device 260 collects the training data, the training data is stored in a database 230, and a training device 220 obtains a target model/rule 201 through training based on the training data maintained in the database 230.
The following describes how the training device 220 obtains the target model/rule 201 based on the training data. Exemplarily, the training device 220 processes the input samples and outputs corresponding predicted labels, computes the loss between the predicted labels and the samples' original labels, and updates the network based on the loss until the predicted labels are close to the original labels or the difference between the predicted labels and the original labels is below a threshold, thereby completing the training of the target model/rule 201. For details, refer to the training method described later.
The target model/rule 201 in this embodiment of this application may specifically be a neural network. It should be noted that, in actual applications, the training data maintained in the database 230 is not necessarily all collected by the data collection device 260 and may also be received from other devices. It should further be noted that the training device 220 does not necessarily train the target model/rule 201 entirely based on the training data maintained in the database 230, and may also obtain training data from the cloud or elsewhere for model training; the above description shall not be construed as a limitation on the embodiments of this application.
The target model/rule 201 obtained through training by the training device 220 may be applied to different systems or devices, for example, to the execution device 210 shown in FIG. 2. The execution device 210 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, a vehicle-mounted terminal, or a television, and may also be a server, a cloud, or the like. In FIG. 2, the execution device 210 is configured with a transceiver 212; the transceiver may include an input/output (I/O) interface or another wireless or wired communication interface for data interaction with external devices. Taking the I/O interface as an example, a user may input data to the I/O interface through a client device 240.
When the execution device 210 preprocesses the input data, or when the computation module 212 of the execution device 210 performs computation or other related processing, the execution device 210 may invoke data, code, and the like in a data storage system 250 for the corresponding processing, and may also store the data, instructions, and the like obtained through the corresponding processing into the data storage system 250.
Finally, the I/O interface 212 returns the processing result to the client device 240, thereby providing it to the user.
It is worth noting that the training device 220 may generate, for different targets or different tasks, corresponding target models/rules 201 based on different training data; the corresponding target models/rules 201 can then be used to achieve the above targets or complete the above tasks, thereby providing the user with the desired results.
In the case shown in FIG. 2, the user may manually give input data, and this manual giving may be operated through an interface provided by the transceiver 212. In another case, the client device 240 may automatically send input data to the transceiver 212; if the client device 240 is required to obtain the user's authorization to automatically send input data, the user may set corresponding permissions in the client device 240. The user may view, on the client device 240, the result output by the execution device 210; the specific presentation form may be display, sound, action, or another specific manner. The client device 240 may also serve as a data collection end that collects, as new sample data, the input data fed into the transceiver 212 and the output results of the transceiver 212 as shown in the figure, and stores them in the database 230. Certainly, collection may also bypass the client device 240, with the transceiver 212 directly storing, as new sample data in the database 230, the input data fed into the transceiver 212 and the output results of the transceiver 212 as shown in the figure.
It is worth noting that FIG. 2 is merely a schematic diagram of a system architecture provided in an embodiment of this application, and the positional relationships among the devices, components, modules, and the like shown in the figure constitute no limitation. For example, in FIG. 2, the data storage system 250 is an external memory relative to the execution device 210; in other cases, the data storage system 250 may also be placed in the execution device 210.
As shown in FIG. 2, the target model/rule 201 is obtained through training by the training device 220; in this embodiment of this application, the target model/rule 201 may be the recommendation model in this application.
Exemplarily, a system architecture for applying the neural network training method provided in this application may be as shown in FIG. 3. In the system architecture 300, a server cluster 310 is implemented by one or more servers, optionally cooperating with other computing devices, for example, data storage, routers, and load balancers. The server cluster 310 may use data in the data storage system 250, or invoke program code in the data storage system 250, to implement the steps of the neural network training method provided in this application.
Users may operate respective user devices (for example, a local device 301 and a local device 302) to interact with the server cluster 310. Each local device may represent any computing device, for example, a personal computer, a computer workstation, a smartphone, a tablet, a smart camera, a smart car or another type of cellular phone, a media consumption device, a wearable device, a set-top box, or a game console.
Each user's local device may interact with the server cluster 310 through a communication network of any communication mechanism/communication standard. The communication network may be a wide area network, a local area network, a point-to-point connection, or any combination thereof. Specifically, the communication network may include a wireless network, a wired network, or a combination of a wireless network and a wired network. The wireless network includes but is not limited to any one or a combination of: a fifth-generation (5G) mobile communication system, a long term evolution (LTE) system, a global system for mobile communication (GSM) or code division multiple access (CDMA) network, a wideband code division multiple access (WCDMA) network, wireless fidelity (WiFi), Bluetooth, Zigbee, radio frequency identification (RFID), long range (Lora) wireless communication, and near field communication (NFC). The wired network may include an optical fiber communication network, a network composed of coaxial cables, or the like.
In another implementation, one or more aspects of the execution device 210 may be implemented by each local device; for example, the local device 301 may provide local data for, or feed computation results back to, the execution device 210.
It should be noted that all functions of the execution device 210 may also be implemented by a local device. For example, the local device 301 implements the functions of the execution device 210 and provides services for its own user, or provides services for a user of the local device 302.
Generally, a machine learning system may include a personalized recommender system, which can train the parameters of a machine learning model based on input data and labels through optimization methods such as gradient descent; after the model parameters converge, the model can be used to make predictions on unknown data. Taking click-through rate prediction in a personalized recommender system as an example, its input data includes user features, item features, and context features. How to predict a personalized recommendation list based on the user's preferences has an important impact on improving the recommender system's user experience and the platform's revenue.
For example, the recommendation flow may be as shown in FIG. 4 and can be divided into a training part and an online inference part. In the training part, the training set includes input data and corresponding labels; for instance, in an APP recommendation scenario, the training set may include data such as APPs the user has clicked and APPs the user has clicked and converted on. The training set is fed into the initial model, and the parameters of the machine learning model are trained through optimization methods such as gradient descent to obtain the recommendation model. In the online inference part, the recommendation model can be deployed on the recommendation platform, for example, on a server or a terminal. Taking a server as an example, the server can output a recommendation list for the user; for instance, in the APP recommendation scenario, icons of APPs recommended to the user are displayed on the home page of the user's terminal, or after the user clicks an APP, icons of recommended APPs related to it can be displayed.
In some scenarios, such as the traditional pay-per-click online display advertising model, only the click-through rate is estimated, and advertisers' real needs are not expressed; more and more advertisers are paying attention to the effect of deep conversion behaviors, i.e., the conversion rate of advertisements needs to be estimated. After the platform exposes an item to a user, the user will choose to click it if interested. In this process, the recommender system needs to predict the user's click probability, i.e., whether the user clicks. In fact, however, the user may perform further operations on the item after clicking. If the recommended item is a mobile application, the user may download, install, or register the application; these behaviors are collectively called conversion. For example, advertisers pay different fees for different promotion effects, so the accuracy of conversion rate prediction has a great impact on the advertising platform.
Conversion rate estimation faces the following two challenges. Sample selection bias: the conversion rate estimation model is trained on the post-click sample space, while prediction needs to be performed on the impression sample space. Data sparsity: the positive-sample label of the conversion rate estimation model is conversion and the negative-sample label is click, so the number of positive samples is greatly reduced compared with the click-through rate estimation model.
Some strategies can alleviate these two problems, for example, sampling non-clicked samples from the impression set as negative examples to alleviate sample selection bias, or oversampling conversion samples to alleviate data sparsity. However, none of these approaches substantially solves either problem. Click and conversion are themselves two strongly correlated consecutive behaviors; modeling the two tasks jointly with multi-task learning, so that training and prediction can be performed over the entire space, is the mainstream industry solution. However, most existing multi-task learning approaches adopt a hard parameter sharing mechanism, and multi-task learning may bring about negative transfer, i.e., information sharing between tasks can hurt network performance. In addition, data sparsity makes the conversion rate estimation model prone to overfitting.
For example, the entire space multi-task model (ESMM) can be used for recommendation. ESMM adopts a parameter-sharing structure with a shared embedding; the CTR task and the CVR task use the same features and feature embeddings. ESMM uses the supervision information of CTCVR and CTR to train the network and learns CVR implicitly; the structure of ESMM is designed based on the multiplicative relationship pCTCVR = pCVR * pCTR. However, ESMM shares information only at the embedding layer; the CVR task still faces the data sparsity problem, and hard parameter sharing is not conducive to parameter learning.
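As a quick numeric illustration of this multiplicative design (the numbers below are invented for illustration): if a model predicts pCTR = 0.1 and pCVR = 0.2 for an impression, then pCTCVR = 0.02, so supervising pCTR and pCTCVR over the full impression space implicitly constrains pCVR.

```python
# Illustrative numbers for the factorization pCTCVR = pCVR * pCTR.
p_ctr = 0.1            # P(click = 1 | x)
p_cvr = 0.2            # P(conversion = 1 | click = 1, x)
p_ctcvr = p_cvr * p_ctr
print(round(p_ctcvr, 3))   # 0.02
```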
As another example, the adaptive information transfer multi-task model (AITM) is a multi-task model for conversion rate estimation with sequential dependence. AITM passes the input features, through embeddings shared across tasks, into multiple tower networks; an AIT module then uses the vector output by the current task's tower and the information passed from the previous task to learn how to fuse information between tasks. An attention mechanism is used to automatically assign weights to the transferred information and the original information, while the transferred information is learned by a function, which here may be a simple fully connected layer, used to learn what information should be transferred between two adjacent tasks. Finally, AITM applies a calibrator in the loss function to constrain the output probabilities to satisfy the sequential dependence as far as possible, i.e., the probability output by a later task in the sequence should be smaller than that output by an earlier task. However, AITM shares information only at the embedding layer, so the CVR task still faces the data sparsity problem, and hard parameter sharing is not conducive to parameter learning. Although the sequential-relation calibrator can play a regularizing role, pCVR is the value obtained from pCTCVR/pCTR, and the division makes the prediction unstable.
There are also models such as the progressive layered extraction model (Progressive layered extraction: A novel multi-task learning model for personalized recommendations, PLE). The bottom network of PLE mainly consists of shared experts and task-specific experts, and the upper layer consists of multi-task tower networks. The input of each multi-task tower network is weighted and controlled by a gating network; the input of each sub-task's gating network includes two parts, the task-specific network of that task and the shared experts of the shared part, with the input feature vector serving as the selector of the gating network. However, the CTR and CVR tasks are sequentially related, i.e., only clicked samples can possibly be converted; MMoE and PLE learn each task independently and neither captures this relation, so the CVR estimation performance is unsatisfactory. PLE designs multi-expert networks only at the bottom layer, and its tower structure is relatively simple; when conversion data is very sparse, conversion rate estimation cannot achieve good results.
Therefore, this application provides a multi-task learning framework based on hierarchical mixtures of experts. By placing mixture-of-expert structures both at the bottom layer (feature representation layer) and in the tower structures where the tasks reside (feature interaction layer), and learning feature representations and feature interactions layer by layer, the correlation between tasks can be fully exploited to help the conversion rate estimation task achieve better recommendation results. The recommendation method and the neural network training method provided in this application are described in detail below.
First, for ease of understanding, the structure of the recommendation model provided in this application is introduced.
Refer to FIG. 5, a schematic structural diagram of a recommendation model provided in this application, described as follows.
The recommendation model can be used to perform multiple tasks (FIG. 5 takes P tasks as an example), the multiple tasks being tasks related to making recommendations for a user. Each task corresponds to multiple tower expert networks and to one or more task-specific feature extraction networks, and the multiple tasks correspond to one or more shared feature extraction networks. That is, the recommendation model may include multiple tower expert networks corresponding to each task, one or more task-specific feature extraction networks corresponding to each task, and one or more shared feature extraction networks, where the parameters of the tower expert networks differ. The parameters may include internal parameters of the operations in a tower expert network, such as the parameters inside a convolution kernel, and may also include weight parameters of the outputs of the operations.
The shared feature extraction network can be used to extract features from the input data; for ease of distinction, the features it outputs are called shared features and serve as input shared by every tower expert network, i.e., the shared features are shared by the tower expert networks corresponding to the multiple tasks.
The task-specific feature extraction network can be used to perform feature extraction for the corresponding task; for ease of distinction, the features it outputs are called task-specific features, also called tower-expert shared features, and serve as input shared by the multiple tower expert networks of the corresponding single task, i.e., the tower-expert shared features are shared by the multiple tower expert networks corresponding to a single task.
The recommendation model can be used to perform multiple tasks, which may be tasks related to making recommendations for the user; the multiple tasks may or may not be correlated. Specifically, in some scenarios, the multiple tasks may include predicting the click-through rate, predicting conversion information, and the like. The conversion information may include information such as a conversion rate or a conversion duration, where the conversion rate is the probability that the user further converts on an object after clicking it, and the conversion duration is the time the user dwells after clicking an object and further converting on it.
A tower expert network is used to perform the corresponding task. For example, if the task includes click-through rate prediction, the tower expert network can be used to predict the click-through rate based on the input features to obtain a click-through rate prediction result; if the task includes conversion rate prediction, the tower expert structure can be used to predict the conversion rate based on the input features to obtain a conversion rate prediction result; if the task is object recognition, the tower expert network can be used to perform object recognition based on the input features to obtain information about the recognized objects; the specifics can be adjusted according to the actual application scenario.
Optionally, as shown in FIG. 6, one or more tower feature extraction networks can be provided separately for each tower expert network, with the input of each tower expert network further connected to the output of its corresponding tower feature extraction network. The one or more tower feature extraction networks are used to extract the features each tower needs; usually the tower feature extraction networks corresponding to different towers have different parameters, so features can be extracted adaptively for each tower, improving the accuracy of the recommendation model's final output. For ease of distinction, the features output by a tower feature extraction network are hereinafter called tower-specific features.
Optionally, a gating network can also be provided at the input of each tower expert network; the gating network is used to fuse the outputs of the task-specific feature extraction network, the shared feature extraction network, and the tower feature extraction network to obtain each tower expert network's input. In this way, the features extracted by the feature extraction networks can be fed to the tower expert networks in suitable proportions, so that the tower expert networks can output accurate task results.
In addition, after the input data is obtained, an embedding layer can convert the input data into a feature representation, for example, a feature vector, so that the subsequent feature extraction networks can extract the needed features from this feature representation.
For example, the structure of the gating network may be as shown in FIG. 7. Specifically, the gating network may include a fully connected layer and a softmax layer. Usually, the task-specific features extracted by the task-specific feature extraction network include features in the multiple dimensions needed by the tower expert networks; therefore, the feature representation of the input data can be used as the input of the fully connected layer, the fully connected layer helps the tower expert network select the features it needs, and the output of the fully connected layer is mapped by the softmax layer to weight values recognizable by the subsequent network. The tower-specific features, shared features, and task-specific features are then weighted and fused according to the weight values output by the softmax layer to obtain fused features, which serve as the input of the tower expert network.
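To make this gate concrete, the following is a minimal sketch in PyTorch; the class name GateNetwork, the tensor shapes, and the use of exactly three expert inputs are illustrative assumptions rather than details taken from this disclosure.

```python
# Minimal gating-network sketch (assumed PyTorch implementation; names and
# dimensions are illustrative, not from the original disclosure).
import torch
import torch.nn as nn

class GateNetwork(nn.Module):
    """Fully connected layer + softmax over the expert outputs; returns the
    weighted fusion that feeds the corresponding tower expert network."""
    def __init__(self, input_dim: int, num_experts: int = 3):
        super().__init__()
        self.fc = nn.Linear(input_dim, num_experts)

    def forward(self, x0: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
        # x0: (batch, input_dim); expert_outputs: (batch, num_experts, feat_dim)
        weights = torch.softmax(self.fc(x0), dim=-1)            # selector weights
        return (weights.unsqueeze(-1) * expert_outputs).sum(dim=1)

# Usage with illustrative sizes: tower-specific, task-specific, shared features.
gate = GateNetwork(input_dim=32, num_experts=3)
x0 = torch.randn(4, 32)
experts = torch.randn(4, 3, 16)
fused = gate(x0, experts)   # (4, 16), fed into the tower expert network
```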
Therefore, in the recommendation model provided in this application, multiple tower expert networks are provided for each task, and the parameters of the tower expert networks differ. When data is sparse, the multiple output results of the multiple tower expert networks can improve the stability of the model's output and avoid overfitting. Even with the data sparsity problem, the recommendation model provided in this application can output recommendation results stably, improving user experience.
The steps performed by the modules of the recommendation model are described in detail below in combination with the methods provided in this application.
Based on the above recommendation model, the methods provided in this application can be divided into two parts, a training part and an online inference part, which are introduced separately below.
I. Training part
Refer to FIG. 8, a schematic flowchart of a training method provided in this application, described as follows.
801. Obtain a training set.
The training set may be collected historical input data of one or more users, or data received from another server or client. It can be understood that the training set may include multiple samples and a label corresponding to each sample.
Specifically, the data types in the training set are related to the tasks the recommendation model performs, and the data needed for training different tasks may differ; the specifics can be adjusted according to the actual application scenario. For example, in an APP recommendation scenario, the training set may include information about APPs that a large number of users have clicked, such as APP name, application type, and application style, as well as further operations after clicking an APP, such as conversion operations like downloading, installing, and registering. As another example, in a music recommendation scenario, the training set may include information about music that a large number of users have clicked, such as music genre and artist information, as well as further operations after clicking music, such as conversion operations like playing and downloading.
802. Use the training set as the input of an initial model and iteratively train the initial model to obtain the recommendation model.
After the training set is obtained, the initial model can be iteratively trained using the training set to obtain the trained recommendation model.
The initial model may be a newly constructed model, or an existing model structure may be used as the initial model. For the structure of the initial model, refer to FIG. 5 or FIG. 6 above. The recommendation model can be used to perform multiple tasks related to making recommendations for a user; each task corresponds to multiple tower expert networks and to one or more task-specific feature extraction networks, and the multiple tasks correspond to one or more shared feature extraction networks. That is, the recommendation model may include multiple tower expert networks corresponding to each task, one or more task-specific feature extraction networks corresponding to each task, and one or more shared feature extraction networks, where the parameters of the tower expert networks differ. The parameters may include internal parameters of the operations in a tower expert network, such as parameters inside convolution kernels or inside pooling operations, and may also include weight parameters of the outputs of the operations.
The recommendation model can be used to perform multiple tasks, which may be tasks related to making recommendations for the user; the multiple tasks may or may not be correlated. Specifically, in some scenarios, the multiple tasks may include predicting the click-through rate, predicting conversion information, and the like. The conversion information may include information such as a conversion rate or a conversion duration, where the conversion rate is the probability that the user further converts on an object after clicking it, and the conversion duration is the time the user dwells after clicking an object and further converting on it.
In each iteration, a sample from the training set is used as the input of the initial model obtained in the previous iteration; a first loss value between the first output result of the initial model obtained in the previous iteration and the label of the input sample is obtained; a second loss value between each tower expert network's second output result and the first output result is obtained; and the model obtained in the previous iteration is updated using the first loss value and the second loss value corresponding to each tower expert network, to obtain the model of the current iteration.
It can be understood that, in addition to updating the overall recommendation model with the loss value between the model's overall output result and the label, the loss value between each tower expert network's output result and the model's overall output result can also be used as a constraint to update each tower expert network's parameters in a targeted manner, thereby constraining each tower expert network's output to be closer to the model's overall output.
Therefore, in the implementations of this application, when the recommendation model is trained, in addition to updating with the loss value between the model's output result and the sample label, the loss value between each tower expert network's output result and the model's overall output result can be used as a constraint to update each tower expert, making each tower expert's output closer to the model's overall output, making the model's output more accurate, accelerating the convergence of model training, and achieving efficient training.
Exemplarily, the tasks performed by the recommendation model may include CVR prediction and CTR prediction. For example, the update process of the network corresponding to the CVR prediction task may be as shown in FIG. 9. Taking CVR prediction as an example, the M towers respectively predict conversion probabilities $\hat{y}_{cvr}^{(1)},\dots,\hat{y}_{cvr}^{(M)}$, and the M prediction results are weighted and fused to obtain $\hat{y}_{cvr}$; this result is then used to calibrate each tower's predicted value. The following loss functions can be used to constrain the outputs of the tower expert networks:
1) cross-entropy: $L_{CE}=-\sum_{j=1}^{M}\left[\hat{y}_{cvr}\log\hat{y}_{cvr}^{(j)}+\left(1-\hat{y}_{cvr}\right)\log\left(1-\hat{y}_{cvr}^{(j)}\right)\right]$
2) KL distance: $L_{KL}=\sum_{j=1}^{M}\left[\hat{y}_{cvr}\log\frac{\hat{y}_{cvr}}{\hat{y}_{cvr}^{(j)}}+\left(1-\hat{y}_{cvr}\right)\log\frac{1-\hat{y}_{cvr}}{1-\hat{y}_{cvr}^{(j)}}\right]$
That is, the above cross-entropy and KL distance are used to constrain the update of the tower expert networks, making each tower expert network's output closer to the recommendation model's output.
As a specific example, the training data is impression data; the label of the CTR task is click, p(click=1|x), and the label of the CTCVR task, p(conversion&click=1|x), is click-and-conversion. The training space of the CTR task and the CTCVR task is all impression data, while the training space of the CVR task, p(conversion=1|click=1,x), is the clicked samples; since p(conversion&click=1|x) = p(conversion=1|click=1,x) * p(click=1|x), the CVR task can be modeled implicitly.
Using the label data — the click label $y_{click}$ and the click-and-conversion label $y_{click\&conversion}$ — and the predicted values, the classification cross-entropy losses $L_{ctr}$ and $L_{ctcvr}$ are obtained based on cross-entropy (LogLoss); combined with the expert calibration loss functions of module 102, the final loss function is obtained. Gradient descent with the chain rule can then complete the joint training and optimization of the parameters of the different modules, such as the hierarchical mixture-of-experts module and the embedding table. The model's parameters are continually adjusted through the model's loss function, finally yielding the optimized model. The loss function can be expressed as:
$L = L_{ctr} + L_{ctcvr} + L_{cal}^{ctr} + L_{cal}^{cvr}$
In the implementations of this application, the ensemble learning result of the experts is used to calibrate each expert's predicted value. Compared with any single expert, the multi-expert ensemble result is more robust; it can in turn calibrate each expert's predicted value, prevent the experts' learning results from diverging excessively, and improve both the stability of model convergence and the generalization performance of the model.
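A minimal sketch of the losses above follows, assuming PyTorch; the uniform fusion weights, the choice of binary cross-entropy toward the detached fused prediction, and all names are illustrative assumptions rather than the prescribed implementation.

```python
# Sketch of the combined loss L = L_ctr + L_ctcvr + L_cal^ctr + L_cal^cvr
# (assumed PyTorch implementation; uniform fusion weights assumed).
import torch
import torch.nn.functional as F

def calibration_loss(tower_preds: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
    """BCE between each tower's prediction and the (detached) fused
    prediction; pulls every tower toward the ensemble output."""
    target = fused.detach().unsqueeze(1).expand_as(tower_preds)
    return F.binary_cross_entropy(tower_preds, target)

def total_loss(ctr_towers, cvr_towers, y_click, y_click_conv):
    p_ctr = ctr_towers.mean(dim=1)            # fused pCTR
    p_cvr = cvr_towers.mean(dim=1)            # fused pCVR
    p_ctcvr = p_ctr * p_cvr                   # pCTCVR = pCTR * pCVR
    return (F.binary_cross_entropy(p_ctr, y_click)            # L_ctr
            + F.binary_cross_entropy(p_ctcvr, y_click_conv)   # L_ctcvr
            + calibration_loss(ctr_towers, p_ctr)             # L_cal^ctr
            + calibration_loss(cvr_towers, p_cvr))            # L_cal^cvr

# Toy batch: 4 samples, 2 towers per task; sigmoid keeps values in (0, 1).
ctr = torch.sigmoid(torch.randn(4, 2))
cvr = torch.sigmoid(torch.randn(4, 2))
loss = total_loss(ctr, cvr, torch.tensor([1., 0., 1., 0.]),
                  torch.tensor([1., 0., 0., 0.]))
```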
Optionally, the recommendation model further includes tower feature extraction networks in one-to-one correspondence with the multiple tower expert networks, as shown in FIG. 6 above; the tower feature extraction network is used to extract, from the input data, features related to the task performed by the corresponding tower expert network, and the tower feature extraction networks corresponding to the multiple tower expert networks have different parameters. Correspondingly, when the loss value between each tower expert network's output result and the model's overall output result is used as a constraint to update each tower expert network's parameters in a targeted manner, the tower feature extraction network corresponding to each tower expert network can also be updated, making each tower expert network's output closer to the model's overall output, making the model's output more accurate, accelerating the convergence of model training, and achieving efficient training.
II. Online inference part
Refer to FIG. 10, a schematic flowchart of a recommendation method provided in this application.
1001. Obtain input data.
The input data may be collected user information, or input data received from a client, or the like. The input data may include user information, which may specifically include user identity information, positioning information, user input data, or historical information generated by the user. The user identity information is information representing the user's identity, such as the user's name or identifier; the positioning information may include the coordinates of the user's location, which may be obtained by the user through positioning with a client; the user input data may include data from the user's input operations, such as the user opening an application market or music software, or the user clicking an app or a music icon; the historical information generated by the user includes, for example, information about apps the user has clicked or downloaded, and music the user has played or downloaded.
Optionally, the input data may also include information about objects to be recommended, such as the type of the objects to be recommended or a candidate list. For example, the input data may include the type of objects to be recommended to the user, such as apps or music; or the input data may directly include a candidate list of objects to be recommended to the user. For instance, in an app recommendation scenario, the candidate list may include information about multiple apps, so that the recommendation model can subsequently select from these apps the apps to recommend to the user; in a music recommendation scenario, the candidate list may include information about multiple songs, so that the recommendation model can subsequently select from these songs the songs to recommend to the user.
In addition, if the recommendation model provided in this application is deployed on a server and the input data is data sent by a client to the server, the information about the objects to be recommended may be sent by the client to the server, or generated by the server from locally stored data. For example, a database of objects to be recommended to users may be preset in the server; when user information is obtained, the objects in the database can be used as the candidate list. Alternatively, the server may set, for each user, the corresponding type of objects to be recommended; after receiving the input data, the server can obtain the information about the objects to be recommended from locally stored data according to the user's identity information.
For example, if the recommendation method provided in this application is deployed on a client, the user can operate directly on the client to produce the input data; for instance, when the user opens an app store, the client can generate the input data based on the user's input operation and send it to the server. If the recommendation method provided in this application is deployed on a server, the user can perform input operations directly through an input device connected to the server, such as clicking an app or opening music player software, and the server generates the input data from the data produced by the input device; or the user can establish a connection with the server using a client and perform input operations on the client, the client transmits the data generated by the user to the server, and the server thus obtains the input data.
Usually, different scenarios produce different input data. For example, in an app recommendation scenario, the input data may include data generated by the user's operation of opening an app store; in this scenario, the input data may include the user's identification information, such as the user's name or unique ID, and may also include the user's historical data, such as information about previously clicked apps, e.g., the number, types, names, or IDs of clicked apps. As another example, in a music recommendation scenario, the input data may include data generated by the user's operation of clicking "next track"; the input data may include user information and historical playback data, such as the user's name and ID, and information about the previously played track, such as track title, artist name, and music style.
1002. Use the input data as the input of the recommendation model, and output recommendation information for the user.
After the input data is obtained, it can be used as the input of the recommendation model to output recommendation information for the user.
Usually, the type of recommendation information may differ across scenarios. For example, in an app recommendation scenario, the recommendation information may include information about apps recommended to the user, such as app icons and download links. As another example, in a music recommendation scenario, the recommendation information may include information about music recommended to the user, such as track title, artist, and playback entry.
The recommendation model can be used to perform multiple tasks, which may be tasks related to making recommendations for the user; the multiple tasks may or may not be correlated. Specifically, in some scenarios, the multiple tasks may include predicting the click-through rate, predicting conversion information, and the like. The conversion information may include information such as a conversion rate or a conversion duration, where the conversion rate is the probability that the user further converts on an object after clicking it, and the conversion duration is the dwell time after the user clicks an object and further converts on it, such as the duration the user plays a video or plays music. For example, in an app recommendation scenario, the click-through rate and conversion rate of each app in the candidate list can be output, so that when generating the recommendation list, apps whose click-through rate and conversion rate are both above certain values can be selected from the candidate list as recommended apps. As another example, in a music recommendation scenario, the click-through rate and conversion duration of each track in the candidate list can be output, where the conversion duration is the dwell time when the user clicks a track and plays it, and the tracks ranking highest in both click-through rate and conversion duration can be selected from the candidate list as the user's recommended music.
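As a toy illustration of how the two task outputs could then be turned into a recommendation list in the app scenario (the thresholds and scores below are invented for illustration):

```python
# Toy filtering-and-ranking step over model outputs; all numbers are made up.
candidates = [
    {"app": "app_a", "pctr": 0.30, "pcvr": 0.12},
    {"app": "app_b", "pctr": 0.05, "pcvr": 0.40},
    {"app": "app_c", "pctr": 0.25, "pcvr": 0.20},
]
# Keep items above both thresholds, then rank by expected conversion pCTR*pCVR.
shortlist = [c for c in candidates if c["pctr"] > 0.1 and c["pcvr"] > 0.1]
ranking = sorted(shortlist, key=lambda c: c["pctr"] * c["pcvr"], reverse=True)
print([c["app"] for c in ranking])   # ['app_c', 'app_a']
```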
Specifically, referring to FIG. 5 above, the shared feature extraction network in the recommendation model can extract shared features from the input data; the shared features may include the features each tower expert network in the recommendation model needs to perform its task, and the shared features are fed into the tower expert networks. The task-specific feature extraction network in the recommendation model is used to extract, from the input data, task-specific features related to a task and feed them as input to the tower expert networks corresponding to that task; the task-specific features are features extracted for the corresponding task and enable task-oriented feature extraction. After receiving and fusing the task-specific features and the shared features, a tower expert network can perform the corresponding task based on the task-specific features and the shared features to obtain that tower expert network's output result. The recommendation model may include multiple tower expert networks, and fusing the output results of the multiple tower expert networks yields the recommendation information for the user.
Therefore, in the implementations of this application, for a recommendation model that performs multiple tasks, multiple tower expert networks are provided for each task, and the outputs of the multiple tower expert networks improve the stability of the recommendation model's output, avoiding the overfitting problem caused by data sparsity and thereby improving the accuracy of the recommendation model's output. Moreover, the model provided in this application is a multi-task model; an efficient parameter sharing scheme enables the multi-task model's two correlated tasks, click-through rate estimation and conversion rate estimation, to assist each other and achieve better results than single-task models, which directly affects platform revenue and user experience. It can both reduce the number of models deployed online, reducing model maintenance costs, and more effectively mine the information contained in correlated tasks, achieving better recommendation results.
Optionally, referring to FIG. 6 above, when multiple tower feature extraction networks are provided in the recommendation model, each tower expert network can correspond one-to-one with a tower feature extraction network; the tower feature extraction network can perform feature extraction for the corresponding tower expert network to obtain tower-specific features and feed them into the corresponding tower expert network. Thus, the input to each tower expert network can include the shared features extracted by the shared feature extraction network, the task-specific features extracted by the task-specific feature extraction network, and the tower-specific features extracted by the tower feature extraction network, so that each tower expert network can use multiple kinds of features to perform the corresponding task and thus produce more accurate output results. Therefore, in the implementations of this application, a feature extraction network is provided separately for each tower expert network, so the features each tower needs are extracted more precisely and each tower can use more precise features to obtain more accurate output results.
In addition, referring to the recommendation model shown in FIG. 6 above, after the input data is fed into the recommendation model, it is converted by the embedding layer into a feature representation, also called a feature vector, that the feature extraction networks can recognize. It is then fed into the tower feature extraction networks corresponding to the towers, the task-specific feature extraction networks corresponding to the tasks, and the shared feature extraction network, so that feature extraction is performed for each tower and each task. After the tower feature extraction networks, the task-specific feature extraction networks, and the shared feature extraction network extract their features, the features extracted by these feature extraction networks are fed into the gating network corresponding to each tower expert network, which fuses them. Usually different tower expert networks may need different features; the gating network provided for each tower can adaptively fuse the features extracted by the feature extraction networks, for example, with different weights, to obtain the features the tower expert network needs and feed them into the tower expert network.
It can be understood that the input layer passes in the data features; the corresponding embedding vector representations are taken from the embedding table using the sparsely encoded ids, and finally the embedding vector representations of all input features are concatenated in order to form the feature vector. Each feature extraction network receives the feature vector x0 as input; at the feature representation layer, the feature representations needed by the tasks are learned through the expert sharing mechanism; then, at the feature interaction layer, task-specific feature interactions are learned through the tower experts; finally, by weighted aggregation, each task fuses the results of the tower experts it owns to give the task's prediction.
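A minimal sketch of this lookup-and-concatenate step, assuming PyTorch; the field names, vocabulary sizes, and embedding dimension are illustrative assumptions.

```python
# Sketch of the embedding-table lookup and ordered concatenation into x0
# (assumed PyTorch implementation; fields and sizes are illustrative).
import torch
import torch.nn as nn

vocab_sizes = {"user_id": 1000, "item_id": 5000, "context": 20}
emb_dim = 8
tables = nn.ModuleDict({f: nn.Embedding(v, emb_dim) for f, v in vocab_sizes.items()})

def build_feature_vector(sample_ids: dict) -> torch.Tensor:
    # Look up each sparse id in its table, then concatenate in fixed order.
    parts = [tables[f](torch.tensor([sample_ids[f]])) for f in vocab_sizes]
    return torch.cat(parts, dim=-1)   # shape (1, 3 * emb_dim)

x0 = build_feature_vector({"user_id": 42, "item_id": 1234, "context": 3})
```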
Taking click-through rate (CTR) and conversion rate (CVR) prediction as an example, and referring to the recommendation model shown in FIG. 6 above, an exemplary description is given with a concrete application. The feature vector x0 of the input layer is fed into the bottom feature representation layer, which is composed of the feature extraction networks. The feature representation layer includes the shared expert layer (Shared Expert), i.e., the aforementioned shared feature extraction network; the task-specific expert layers (CTR/CVR Task-Expert), i.e., the aforementioned task-specific feature extraction networks; and the tower-specific expert layers (CTR/CVR Tower-Specific Expert), i.e., the aforementioned tower feature extraction networks. Each expert is composed of multiple sub-networks; the number of sub-networks, their dimensionality, and the network structure are all hyperparameters.
Each task in the feature interaction layer contains several tower expert (Tower Expert) networks. The input of each tower expert network is weighted and controlled by a gating network (Gate Control); the input of the gating network of each tower expert network of each task includes three parts — the output of the tower-specific expert layer under this tower, the output of the task-specific expert layer under this task, and the output of the shared expert layer — with the feature vector x0 serving as the gating network's selector (Selector).
The structure of the gating network may be a fully connected network or another deep network; the feature vector x0 is used as the selector to obtain the weights of the different sub-networks, yielding the weighted sum of the gating network for each tower expert of each task. In this way, each tower expert of the feature interaction layer performs, based on the input feature vector x0, a weighted summation over the outputs of the three expert layers — the tower-specific expert layer under this tower, the task-specific expert layer under this task, and the shared expert layer — so that each tower expert network of each task obtains a unique feature representation; passing it through the tower expert network of each sub-task then yields the output of that sub-task's tower expert network. Each task's predicted value is the weighted aggregation of the outputs of the multiple tower expert networks the sub-task contains.
Assume the feature vector is x0 and the structural type of all expert networks is MLP. Taking the learning process of one tower expert's output in the feature interaction layer of the CVR task, a sub-task of the multi-task framework, as an example, the above process is formalized as follows:
$E_{type}^{k}(x_0)=\mathrm{MLP}_{type,k}(x_0),\quad type\in\{Tower,Task,Shared\}$
$E(x_0)=\left\{E_{type}^{k}(x_0)\right\}$
$G(x_0)=\mathrm{softmax}(\mathrm{MLP}(x_0))$
$f(x_0)=\sum G(x_0)*E(x_0)$
where $E_{type}^{k}(x_0)$ denotes the output of the k-th sub-network of the bottom expert network layer of the given type, type ∈ {Tower, Task, Shared}; $E(x_0)$ is the set of outputs of all sub-networks of all types of bottom expert network layers for this tower expert; $G(x_0)$ is the weight assigned by the gating network to the output of each sub-network of each class of expert layer; and $f(x_0)$ is the feature representation learned by the mixture of experts at the feature representation layer. It is then fed into the tower expert network of the feature interaction layer to obtain the output $\hat{y}_{cvr}^{(j)}$ of the j-th tower expert network of the CVR task; the final output of the CVR task is the weighted aggregation $\hat{y}_{cvr}=\sum_{j}w_{j}\hat{y}_{cvr}^{(j)}$. The learning process of the CTR task is similar, and the final output of the CTR task is $\hat{y}_{ctr}$.
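The formalization above can be sketched for one task branch as follows, assuming PyTorch, two tower experts, and uniform fusion weights; all widths, depths, and names are illustrative assumptions rather than the disclosed configuration.

```python
# Sketch of one task branch of the hierarchical mixture of experts
# (assumed PyTorch implementation; widths, depths, and uniform fusion
# weights are illustrative, not the disclosed configuration).
import torch
import torch.nn as nn

def mlp(i: int, o: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(i, o), nn.ReLU())

class CVRBranch(nn.Module):
    def __init__(self, in_dim: int = 32, expert_dim: int = 16, num_towers: int = 2):
        super().__init__()
        self.shared = mlp(in_dim, expert_dim)        # shared expert E_Shared
        self.task = mlp(in_dim, expert_dim)          # CVR task expert E_Task
        self.tower_feat = nn.ModuleList(mlp(in_dim, expert_dim) for _ in range(num_towers))
        self.gates = nn.ModuleList(nn.Linear(in_dim, 3) for _ in range(num_towers))
        self.towers = nn.ModuleList(
            nn.Sequential(mlp(expert_dim, 8), nn.Linear(8, 1), nn.Sigmoid())
            for _ in range(num_towers))

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        shared, task = self.shared(x0), self.task(x0)
        preds = []
        for tf, gate, tower in zip(self.tower_feat, self.gates, self.towers):
            experts = torch.stack([tf(x0), task, shared], dim=1)  # E(x0)
            w = torch.softmax(gate(x0), dim=-1).unsqueeze(-1)     # G(x0)
            f = (w * experts).sum(dim=1)                          # f(x0)
            preds.append(tower(f))                                # tower output
        # Uniform fusion weights assumed for the final task prediction.
        return torch.stack(preds, dim=1).mean(dim=1)

y_cvr = CVRBranch()(torch.randn(4, 32))   # shape (4, 1)
```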
The expert networks in this application can use a variety of networks, such as arbitrary deep networks, e.g., Squeeze-and-Excitation networks, attention networks, and the like.
Therefore, in the implementations of this application, hierarchical expert structures are provided both at the bottom layer (feature representation layer) and at the task tower network layer (feature interaction layer). To fully exploit the correlation between tasks, this application designs a multi-expert parameter sharing mechanism at the feature representation layer. The feature representation layer includes three classes of expert networks — shared experts, task-specific experts, and tower-specific experts: the shared experts share the common knowledge across tasks, the task-specific experts extract the knowledge a task needs, and the tower-specific experts learn knowledge independently in service of the experts in the tower structure, each performing its own role to extract information representations efficiently. At the same scale, a single network cannot effectively learn the representations common across tasks, but after partitioning into multiple sub-networks, each sub-network can always learn some relevant, distinctive representations of some task. Therefore, for a prediction task, this application sets multiple tower experts in the task tower network to further learn feature interactions from different perspectives, improving the model's learning capacity and generalization ability, and ultimately achieving better prediction accuracy than conventional methods.
It can be understood that multiple sub-networks (tower experts) are obtained by partitioning at the feature interaction layer, and each sub-network can always learn some relevant, distinctive representations of some task; meanwhile, the feature representation layer provides a more flexible parameter sharing mechanism: the parameters of a tower-specific expert serve only that tower expert, the parameters of a task-specific expert are shared only among the tower experts of the same task, and the parameters of the shared experts are shared by all tower experts, so information representations can be extracted efficiently. Setting mixtures of experts hierarchically improves the model's learning capacity and generalization ability, ultimately achieving better prediction accuracy.
The structure of the recommendation model and the method flows provided in this application have been introduced above; the application scenarios of this application are exemplarily introduced below in combination with the foregoing recommendation model and methods.
Take the display advertising scenario in a mobile phone smart assistant's direct services as an example. Display advertisement ranking requires click-through rate prediction and conversion rate prediction; its input includes user features, item features, and context features, and the click-through rate prediction and conversion rate prediction tasks are then modeled jointly with the multi-task method. By improving the parameter sharing mechanism of the multi-task learning framework, this application improves the multi-task learning model's ability to share information, so the click-through rate and conversion rate can be estimated more accurately and more precise recommendations can be given. The hierarchical mixture-of-experts module enhances the ability to share information across tasks.
When the multi-task prediction model is trained offline, the specific procedure is as follows:
Collect system logs, perform data cleaning, and feed the resulting raw features into the model. The raw features are one-hot encoded, the corresponding embedding vector representations are then taken from the embedding table, and finally the embedding vector representations of all input features are concatenated in order to form the feature vector;
The feature vector passes through the aforementioned three kinds of experts for feature extraction; information representations are obtained from the three kinds of experts through the gating networks and then fed into the tower network layer (feature interaction layer) to learn feature interactions;
For each task, the outputs of the tower network experts are aggregated with weights and fed into the activation function, finally yielding that task's predicted value.
The predicted values and the true labels are used to compute the classification loss functions, and at the same time the calibration loss functions of module 102 are computed; the sum of the four is the final loss, and based on this loss the parameter updates of the model and of module 101 are completed.
During online inference, the model can be loaded directly for online estimation; that is, there is no need to compute loss values from predicted values and true labels or to update the network.
Exemplarily, the parts of an advertisement recommender system may be as shown in FIG. 11. The display advertisement ranking scenario is a typical scenario in machine learning applications; its main structure, as shown in FIG. 11, includes display advertising, offline logs, offline training, online inference, online ranking, and other parts.
The basic operating logic of a display advertising recommender system is as follows: the user performs a series of behaviors in the front-end display list, such as browsing, clicking, commenting, and downloading, producing behavior data that is stored in logs. The recommender system performs offline model training with data including the user behavior logs; after training converges, a prediction model is produced, the model is deployed in the online serving environment, and, based on the user's request access, item features, and context information, it gives a click-through rate estimation score Pctr and a conversion rate estimation score Pcvr. The online ranking module then combines these two scores and business logic to rank the candidate advertisements and shows the user the final recommendation list; finally, the user's feedback on the recommendation results forms user data. For example, as shown in FIG. 12, icons of recommended apps can be displayed on the display interface of the user's terminal, so that the user can further click or download the recommended apps, quickly find the apps they need, and enjoy an improved experience.
The online ranking stage requires the click-through rate estimation score Pctr and the conversion rate estimation score Pcvr; multi-task learning can mitigate the impact of sample selection bias and data sparsity on model performance. An efficient parameter sharing scheme enables the multi-task model's two correlated tasks, click-through rate estimation and conversion rate estimation, to assist each other and achieve better results than single-task models, directly affecting platform revenue and user experience. A good multi-task learning solution can both reduce the number of models deployed online, reducing model maintenance costs, and more effectively mine the information contained in correlated tasks, achieving better recommendation results.
Taking the public dataset Ali-CCP, commonly used for conversion rate estimation, and a collected industrial dataset as examples, their statistics are shown in Table 1.
Table 1
The offline evaluation metric is AUC (Area Under Curve). Experiments were conducted on the above two datasets, using a single-layer MLP as the skeleton of the bottom expert network layers and the gating networks and a multi-layer MLP as the skeleton of the tower expert layers, with both the CTR task and the CVR task using two tower expert networks. The output results are shown in Table 2.
Table 2
Clearly, as the above tables show, compared with using only an MLP, or models such as MMoE, PLE, ESMM, and AITM, the method provided in this application achieves a higher AUC.
Therefore, in some common approaches, information can only be extracted one-sidedly at the feature representation layer, and the feature interaction layer easily falls into overfitting. This application improves the parameter sharing mechanism by providing a hierarchical mixture-of-experts module. Compared with common solutions, this application proposes that multiple tower expert structures should also be provided at the feature interaction layer. Based on the idea of ensemble learning, at the same scale a single network cannot effectively learn the representations common across tasks, but after partitioning into multiple sub-networks each sub-network can always learn some relevant, distinctive representations of some task; therefore, for an individual prediction task, this application sets multiple tower experts in the task tower network to further learn feature interactions from different perspectives, improving the model's learning capacity and generalization ability and ultimately achieving better prediction accuracy than conventional methods. Meanwhile, at the feature representation layer, in addition to the task-specific expert layer and the shared expert layer, this application proposes tower-specific experts. The input of each tower expert layer is controlled by a gating network, which receives as input the feature representations learned by the tower-specific experts (whose parameters serve only that tower), the task-specific experts (whose parameters are shared only among tower experts of the same task), and the shared experts (whose parameters are shared by all tower experts), and performs a weighted summation. This allows the tower experts of the feature interaction layer, when learning feature interactions, to incorporate the network's own personalized information, the information shared within the same task, and the more general information shared across all tasks. This flexible parameter sharing mechanism extracts information representations efficiently, so the multi-task learning solution proposed in this application can fully share the correlated information between tasks. In addition, during training, the loss value between the model's overall output and each individual tower's output is used as a constraint to update the model, which makes the model converge faster, reduces the possibility of skewed learning, and improves the model's generalization performance.
The recommendation model and method flows provided in this application have been introduced above; the apparatuses provided in this application are introduced below.
Refer to FIG. 13, a schematic structural diagram of a recommendation apparatus provided in this application, configured to perform the steps in FIG. 10 to FIG. 12 above. The recommendation apparatus includes:
an obtaining module 1301, configured to obtain input data, where the input data includes information about a user;
a recommendation module 1302, configured to use the input data as the input of a recommendation model and output recommendation information for the user;
where the recommendation model is used to perform multiple tasks of making recommendations for the user; the recommendation model includes a shared feature extraction network, multiple tower expert networks corresponding to each task, and a task-specific feature extraction network corresponding to each task; the output of the shared feature extraction network is connected to the input of each tower expert network, and the inputs of the multiple tower expert networks corresponding to each task are further connected to the output of the task-specific feature extraction network corresponding to that task; the multiple tower expert networks have different parameters; the shared feature extraction network is used to extract shared features from the input data, the shared features being shared by the tower expert networks corresponding to the multiple tasks; the task-specific feature extraction network is used to extract tower-expert shared features from the input data, the tower-expert shared features being shared by the multiple tower expert networks corresponding to a single task; the multiple tower expert networks are used to perform the corresponding task based on the features extracted by the task-specific feature extraction network of the corresponding task and by the shared feature extraction network; and the outputs of the multiple tower expert networks corresponding to the multiple tasks are weighted and fused to obtain the recommendation information.
In a possible implementation, the recommendation model further includes tower feature extraction networks in one-to-one correspondence with the multiple tower expert networks; the tower feature extraction network is used to extract, from the input data, features related to the task performed by the corresponding tower expert network; and the tower feature extraction networks corresponding to the multiple tower expert networks have different parameters.
In a possible implementation, the recommendation model further includes multiple gating networks, with each tower expert network corresponding to one gating network; the gating network is used to fuse the outputs of the corresponding task-specific feature extraction network, shared feature extraction network, and tower feature extraction network, and to use the fusion result as the input of the corresponding tower expert network.
In a possible implementation, the recommendation apparatus further includes:
a training module 1303, further configured to iteratively train an initial model to obtain the recommendation model, where the initial model has the same structure as the recommendation model;
where in one iteration of training the initial model: a training sample is used as the input of the initial model to output a first output result; a first loss value between the label of the training sample and the first output result is obtained; multiple second output results output by the tower expert networks corresponding to the multiple tasks are obtained; multiple second loss values between the first output result and the multiple second output results are obtained; and the initial model is updated based on the first loss value and the second loss values, to obtain the initial model after the current iteration.
In a possible implementation, the multiple tasks include click-through rate prediction and conversion information prediction; the click-through rate is the probability that the user clicks a target object; the conversion information includes a conversion rate or a conversion duration; the conversion rate is the probability that the user performs a conversion operation on the target object after clicking it; and the conversion duration includes the time the user dwells after clicking the target object and performing a conversion operation on it.
Refer to FIG. 14, a schematic structural diagram of a training apparatus provided in this application; the training apparatus can be used to perform the steps corresponding to FIG. 8 to FIG. 9 above. The training apparatus may include:
an obtaining module 1401, configured to obtain a training set, where the training set includes multiple samples and a label corresponding to each sample;
a training module 1402, configured to use the training set as the input of an initial model and iteratively train the initial model to obtain a recommendation model;
where the recommendation model is used to perform multiple tasks of making recommendations for a user; the recommendation model includes a shared feature extraction network, multiple tower expert networks corresponding to each task, and a task-specific feature extraction network corresponding to each task; the output of the shared feature extraction network is connected to the input of each tower expert network, and the inputs of the multiple tower expert networks corresponding to each task are further connected to the output of the task-specific feature extraction network corresponding to that task; where in each iteration, a sample from the training set is used as the input of the initial model obtained in the previous iteration, a first loss value between the first output result of the model obtained in the previous iteration and the label of the input sample is obtained, a second loss value between each tower expert network's second output result and the first output result is obtained, and the model obtained in the previous iteration is updated based on the second loss value and the first loss value, to obtain the model of the current iteration.
In a possible implementation, the recommendation model further includes tower feature extraction networks in one-to-one correspondence with the multiple tower expert networks; the input of each tower expert network is further connected to the output of the corresponding tower feature extraction network; the tower feature extraction network is used to extract, from the input data, features related to the task performed by the corresponding tower expert network; and the tower feature extraction networks corresponding to the multiple tower expert networks have different parameters.
In a possible implementation, the recommendation model further includes multiple gating networks, with each tower expert network corresponding to one gating network; the gating network is used to fuse the outputs of the corresponding task-specific feature extraction network, shared feature extraction network, and tower feature extraction network, and to use the fusion result as the input of the corresponding tower expert network.
In a possible implementation, the multiple tasks include click-through rate prediction and conversion information prediction; the click-through rate is the probability that the user clicks a target object; the conversion information includes a conversion rate or a conversion duration; the conversion rate is the probability that the user performs a conversion operation on the target object after clicking it; and the conversion duration includes the time the user dwells after clicking the target object and performing a conversion operation on it.
Refer to FIG. 15, a schematic structural diagram of another recommendation apparatus provided in this application, described as follows.
The recommendation apparatus may include a processor 1501 and a memory 1502. The processor 1501 and the memory 1502 are interconnected by a line. The memory 1502 stores program instructions and data.
The memory 1502 stores the program instructions and data corresponding to the steps in FIG. 10 to FIG. 12 above.
The processor 1501 is configured to perform the method steps performed by the recommendation apparatus shown in any one of the embodiments in FIG. 10 to FIG. 12 above.
Optionally, the recommendation apparatus may further include a transceiver 1503, configured to receive or send data.
An embodiment of this application further provides a computer-readable storage medium storing a program that, when run on a computer, causes the computer to perform the steps of the method described in the embodiments shown in FIG. 10 to FIG. 12 above.
Optionally, the aforementioned recommendation apparatus shown in FIG. 15 is a chip.
Refer to FIG. 16, a schematic structural diagram of another training apparatus provided in this application, described as follows.
The training apparatus may include a processor 1601 and a memory 1602. The processor 1601 and the memory 1602 are interconnected by a line. The memory 1602 stores program instructions and data.
The memory 1602 stores the program instructions and data corresponding to the steps in FIG. 8 to FIG. 9 above.
The processor 1601 is configured to perform the method steps performed by the training apparatus shown in FIG. 8 to FIG. 9 above.
Optionally, the training apparatus may further include a transceiver 1603, configured to receive or send data.
An embodiment of this application further provides a computer-readable storage medium storing a program that, when run on a computer, causes the computer to perform the steps of the method described in the embodiments shown in FIG. 8 to FIG. 9 above.
Optionally, the aforementioned training apparatus shown in FIG. 16 is a chip.
An embodiment of this application further provides a recommendation apparatus, which may also be called a digital processing chip or chip. The chip includes a processing unit and a communication interface; the processing unit obtains program instructions through the communication interface, the program instructions are executed by the processing unit, and the processing unit is configured to perform the method steps of FIG. 10 to FIG. 12 above.
An embodiment of this application further provides a training apparatus, which may also be called a digital processing chip or chip. The chip includes a processing unit and a communication interface; the processing unit obtains program instructions through the communication interface, the program instructions are executed by the processing unit, and the processing unit is configured to perform the method steps of FIG. 8 to FIG. 9 above.
An embodiment of this application further provides a digital processing chip. The digital processing chip integrates circuits and one or more interfaces for implementing the functions of the processor 1501 and the processor 1601 above. When a memory is integrated in the digital processing chip, the digital processing chip can complete the method steps of any one or more of the foregoing embodiments. When no memory is integrated in the digital processing chip, it can be connected to an external memory through a communication interface. The digital processing chip implements, according to the program code stored in the external memory, the actions performed by the recommendation apparatus or training apparatus in the foregoing embodiments.
An embodiment of this application further provides a computer program product that, when run on a computer, causes the computer to perform the steps of the method described in the embodiments shown in FIG. 8 to FIG. 12 above.
The recommendation apparatus or training apparatus provided in the embodiments of this application may be a chip. The chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit can execute the computer-executable instructions stored in a storage unit, so that the chip in the server performs the method steps described in the embodiments shown in FIG. 8 to FIG. 12 above. Optionally, the storage unit is a storage unit within the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip within the wireless access device, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).
Specifically, the aforementioned processing unit or processor may be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Exemplarily, refer to FIG. 17, a schematic structural diagram of a chip provided in an embodiment of this application. The chip may be embodied as a neural network processing unit NPU 170; the NPU 170 is mounted as a coprocessor to a host CPU (Host CPU), which assigns tasks. The core part of the NPU is an operation circuit 1703; a controller 1704 controls the operation circuit 1703 to extract matrix data from memory and perform multiplication.
In some implementations, the operation circuit 1703 internally includes multiple processing engines (process engine, PE). In some implementations, the operation circuit 1703 is a two-dimensional systolic array. The operation circuit 1703 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1703 is a general-purpose matrix processor.
For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to matrix B from a weight memory 1702 and caches it on each PE in the operation circuit. The operation circuit fetches the data of matrix A from an input memory 1701 and performs the matrix operation with matrix B; the partial or final results of the obtained matrix are stored in an accumulator 1708.
A unified memory 1706 is used to store input data and output data. Weight data is transferred to the weight memory 1702 directly through a direct memory access controller (DMAC) 1705. Input data is also transferred to the unified memory 1706 through the DMAC.
A bus interface unit (BIU) 1710 is used for the interaction between the AXI bus and the DMAC and an instruction fetch buffer (IFB) 1709.
The bus interface unit 1710 is used by the instruction fetch buffer 1709 to obtain instructions from the external memory, and also used by the storage unit access controller 1705 to obtain the original data of input matrix A or weight matrix B from the external memory.
The DMAC is mainly used to transfer input data from the external memory DDR to the unified memory 1706, to transfer weight data to the weight memory 1702, or to transfer input data to the input memory 1701.
A vector calculation unit 1707 includes multiple operation processing units and, when needed, performs further processing on the output of the operation circuit, such as vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. It is mainly used for non-convolutional/fully connected layer network computation in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
In some implementations, the vector calculation unit 1707 can store the processed output vectors in the unified memory 1706. For example, the vector calculation unit 1707 may apply a linear function and/or a nonlinear function to the output of the operation circuit 1703, for example, linear interpolation of the feature planes extracted by the convolutional layer, or, as another example, a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 1707 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vectors can be used as activation input to the operation circuit 1703, for example, for use in subsequent layers of the neural network.
The instruction fetch buffer 1709 connected to the controller 1704 is used to store instructions used by the controller 1704.
The unified memory 1706, the input memory 1701, the weight memory 1702, and the instruction fetch buffer 1709 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The operations of the layers of the recurrent neural network can be performed by the operation circuit 1703 or the vector calculation unit 1707.
The processor mentioned in any of the foregoing may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the methods in FIG. 8 to FIG. 12 above.
It should additionally be noted that the apparatus embodiments described above are merely schematic; the units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, in the accompanying drawings of the apparatus embodiments provided in this application, the connection relationships between modules indicate that they have communication connections with each other, which may be specifically implemented as one or more communication buses or signal lines.
Through the description of the foregoing implementations, a person skilled in the art can clearly understand that this application can be implemented by software plus necessary general-purpose hardware, and certainly also by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function completed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, for example, analog circuits, digital circuits, or dedicated circuits. However, for this application, software program implementation is the better implementation in most cases. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of this application.
In the foregoing embodiments, implementation may be entirely or partially by software, hardware, firmware, or any combination thereof. When software is used for implementation, implementation may be entirely or partially in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are entirely or partially generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), a semiconductor medium (e.g., solid state disk (SSD)), or the like.
The terms "first", "second", "third", "fourth", and the like (if any) in the specification, claims, and the foregoing accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way is interchangeable in appropriate circumstances, so that the embodiments described here can be implemented in an order other than what is illustrated or described here. Furthermore, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device including a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units not clearly listed or inherent to the process, method, product, or device.

Claims (22)

  1. A recommendation method, wherein the method comprises:
    obtaining input data, wherein the input data comprises information about a user;
    using the input data as an input of a recommendation model, and outputting recommendation information for the user;
    wherein the recommendation model is configured to perform multiple tasks of making recommendations for the user; the recommendation model comprises a shared feature extraction network, multiple tower expert networks corresponding to each task, and a task-specific feature extraction network corresponding to each task; an output of the shared feature extraction network is connected to an input of each tower expert network, and the inputs of the multiple tower expert networks corresponding to each task are further connected to an output of the task-specific feature extraction network corresponding to that task;
    the multiple tower expert networks have different parameters; the shared feature extraction network is configured to extract shared features from the input data, wherein the shared features are shared by the tower expert networks corresponding to the multiple tasks; the task-specific feature extraction network is configured to extract tower-expert-network shared features from the input data, wherein the tower-expert-network shared features are shared by the multiple tower expert networks corresponding to a single task; the multiple tower expert networks are configured to perform the corresponding task based on the features extracted by the task-specific feature extraction network of the corresponding task and by the shared feature extraction network; and the outputs of the multiple tower expert networks corresponding to the multiple tasks are weighted and fused to obtain the recommendation information.
  2. The method according to claim 1, wherein the recommendation model further comprises tower feature extraction networks in one-to-one correspondence with the multiple tower expert networks; the tower feature extraction network is configured to extract, from the input data, features related to the task performed by the corresponding tower expert network; the tower feature extraction networks corresponding to the multiple tower expert networks have different parameters; and the input of each tower expert network further comprises the features extracted by the corresponding tower feature extraction network.
  3. The method according to claim 2, wherein the recommendation model further comprises multiple gating networks, with each tower expert network corresponding to one gating network; the gating network is configured to fuse the outputs of the corresponding task-specific feature extraction network, shared feature extraction network, and tower feature extraction network, and to use the fusion result as the input of the corresponding tower expert network.
  4. The method according to any one of claims 1 to 3, wherein before the obtaining input data, the method further comprises:
    iteratively training an initial model to obtain the recommendation model, wherein the initial model has the same structure as the recommendation model;
    wherein in one iteration of training the initial model:
    a training sample is used as the input of the initial model to output a first output result;
    a first loss value between the label of the training sample and the first output result is obtained;
    multiple second output results output by the tower expert networks corresponding to the multiple tasks are obtained;
    multiple second loss values between the first output result and the multiple second output results are obtained; and
    the initial model is updated based on the first loss value and the second loss values, to obtain the initial model after the current iteration.
  5. The method according to any one of claims 1 to 4, wherein the multiple tasks comprise click-through rate prediction and conversion information prediction; the click-through rate is the probability that the user clicks a target object; the conversion information comprises a conversion rate or a conversion duration; the conversion rate is the probability that the user performs a conversion operation on the target object after clicking the target object; and the conversion duration comprises the time the user dwells after clicking the target object and performing a conversion operation on it.
  6. A training method, wherein the method comprises:
    obtaining a training set, wherein the training set comprises multiple samples and a label corresponding to each sample;
    using the training set as an input of an initial model and iteratively training the initial model, to obtain a recommendation model;
    wherein the recommendation model is configured to perform multiple tasks of making recommendations for a user; the recommendation model comprises a shared feature extraction network, multiple tower expert networks corresponding to each task, and a task-specific feature extraction network corresponding to each task; an output of the shared feature extraction network is connected to an input of each tower expert network, and the inputs of the multiple tower expert networks corresponding to each task are further connected to an output of the task-specific feature extraction network corresponding to that task;
    wherein in each iteration, a sample from the training set is used as the input of the initial model obtained in the previous iteration; a first loss value between the first output result of the model obtained in the previous iteration and the label of the input sample is obtained; a second loss value between each tower expert network's second output result and the first output result is obtained; and the model obtained in the previous iteration is updated based on the second loss value and the first loss value, to obtain the model of the current iteration.
  7. The method according to claim 6, wherein the recommendation model further comprises tower feature extraction networks in one-to-one correspondence with the multiple tower expert networks; the input of each tower expert network is further connected to the output of the corresponding tower feature extraction network; the tower feature extraction network is configured to extract, from input data, features related to the task performed by the corresponding tower expert network; and the tower feature extraction networks corresponding to the multiple tower expert networks have different parameters.
  8. The method according to claim 7, wherein the recommendation model further comprises multiple gating networks, with each tower expert network corresponding to one gating network; the gating network is configured to fuse the outputs of the corresponding task-specific feature extraction network, shared feature extraction network, and tower feature extraction network, and to use the fusion result as the input of the corresponding tower expert network.
  9. The method according to any one of claims 6 to 8, wherein the multiple tasks comprise click-through rate prediction and conversion information prediction; the click-through rate is the probability that a user clicks a target object; the conversion information comprises a conversion rate or a conversion duration; the conversion rate is the probability that the user performs a conversion operation on the target object after clicking the target object; and the conversion duration comprises the time the user dwells after clicking the target object and performing a conversion operation on it.
  10. A recommendation apparatus, wherein the apparatus comprises:
    an obtaining module, configured to obtain input data, wherein the input data comprises information about a user;
    a recommendation module, configured to use the input data as an input of a recommendation model and output recommendation information for the user;
    wherein the recommendation model is configured to perform multiple tasks of making recommendations for the user; the recommendation model comprises a shared feature extraction network, multiple tower expert networks corresponding to each task, and a task-specific feature extraction network corresponding to each task; an output of the shared feature extraction network is connected to an input of each tower expert network, and the inputs of the multiple tower expert networks corresponding to each task are further connected to an output of the task-specific feature extraction network corresponding to that task;
    the multiple tower expert networks have different parameters; the shared feature extraction network is configured to extract shared features from the input data, wherein the shared features are shared by the tower expert networks corresponding to the multiple tasks; the task-specific feature extraction network is configured to extract tower-expert-network shared features from the input data, wherein the tower-expert-network shared features are shared by the multiple tower expert networks corresponding to a single task; the multiple tower expert networks are configured to perform the corresponding task based on the features extracted by the task-specific feature extraction network of the corresponding task and by the shared feature extraction network; and the outputs of the multiple tower expert networks corresponding to the multiple tasks are weighted and fused to obtain the recommendation information.
  11. The apparatus according to claim 10, wherein the recommendation model further comprises tower feature extraction networks in one-to-one correspondence with the multiple tower expert networks; the tower feature extraction network is configured to extract, from the input data, features related to the task performed by the corresponding tower expert network; the tower feature extraction networks corresponding to the multiple tower expert networks have different parameters; and the input of each tower expert network further comprises the features extracted by the corresponding tower feature extraction network.
  12. The apparatus according to claim 11, wherein the recommendation model further comprises multiple gating networks, with each tower expert network corresponding to one gating network; the gating network is configured to fuse the outputs of the corresponding task-specific feature extraction network, shared feature extraction network, and tower feature extraction network, and to use the fusion result as the input of the corresponding tower expert network.
  13. The apparatus according to any one of claims 10 to 12, wherein the apparatus further comprises:
    a training module, configured to iteratively train an initial model to obtain the recommendation model, wherein the initial model has the same structure as the recommendation model;
    wherein in one iteration of training the initial model:
    a training sample is used as the input of the initial model to output a first output result;
    a first loss value between the label of the training sample and the first output result is obtained;
    multiple second output results output by the tower expert networks corresponding to the multiple tasks are obtained;
    multiple second loss values between the first output result and the multiple second output results are obtained; and
    the initial model is updated based on the first loss value and the second loss values, to obtain the initial model after the current iteration.
  14. The apparatus according to any one of claims 10 to 13, wherein the multiple tasks comprise click-through rate prediction and conversion information prediction; the click-through rate is the probability that the user clicks a target object; the conversion information comprises a conversion rate or a conversion duration; the conversion rate is the probability that the user performs a conversion operation on the target object after clicking the target object; and the conversion duration comprises the time the user dwells after clicking the target object and performing a conversion operation on it.
  15. A training apparatus, wherein the apparatus comprises:
    an obtaining module, configured to obtain a training set, wherein the training set comprises multiple samples and a label corresponding to each sample;
    a training module, configured to use the training set as an input of an initial model and iteratively train the initial model, to obtain a recommendation model;
    wherein the recommendation model is configured to perform multiple tasks of making recommendations for a user; the recommendation model comprises a shared feature extraction network, multiple tower expert networks corresponding to each task, and a task-specific feature extraction network corresponding to each task; an output of the shared feature extraction network is connected to an input of each tower expert network, and the inputs of the multiple tower expert networks corresponding to each task are further connected to an output of the task-specific feature extraction network corresponding to that task;
    wherein in each iteration, a sample from the training set is used as the input of the initial model obtained in the previous iteration; a first loss value between the first output result of the model obtained in the previous iteration and the label of the input sample is obtained; a second loss value between each tower expert network's second output result and the first output result is obtained; and the model obtained in the previous iteration is updated based on the second loss value and the first loss value, to obtain the model of the current iteration.
  16. The apparatus according to claim 15, wherein the recommendation model further comprises tower feature extraction networks in one-to-one correspondence with the multiple tower expert networks; the tower feature extraction network is configured to extract, from input data, features related to the task performed by the corresponding tower expert network; and the tower feature extraction networks corresponding to the multiple tower expert networks have different parameters.
  17. The apparatus according to claim 16, wherein the recommendation model further comprises multiple gating networks, with each tower expert network corresponding to one gating network; the gating network is configured to fuse the outputs of the corresponding task-specific feature extraction network, shared feature extraction network, and tower feature extraction network, and to use the fusion result as the input of the corresponding tower expert network.
  18. The apparatus according to any one of claims 15 to 17, wherein the multiple tasks comprise click-through rate prediction and conversion information prediction; the click-through rate is the probability that a user clicks a target object; the conversion information comprises a conversion rate or a conversion duration; the conversion rate is the probability that the user performs a conversion operation on the target object after clicking the target object; and the conversion duration comprises the time the user dwells after clicking the target object and performing a conversion operation on it.
  19. A recommendation apparatus, comprising at least one processor and a memory, wherein the at least one processor is coupled to the memory and is configured to read and execute instructions in the memory, to perform the method according to any one of claims 1 to 5.
  20. A training apparatus, comprising at least one processor and a memory, wherein the at least one processor is coupled to the memory and is configured to read and execute instructions in the memory, to perform the method according to any one of claims 6 to 9.
  21. A computer-readable storage medium, comprising a program that, when executed by a processing unit, performs the method according to any one of claims 1 to 9.
  22. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 9.
PCT/CN2023/094227 2022-05-17 2023-05-15 Recommendation method, training method, and apparatus WO2023221928A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210536912.5 2022-05-17
CN202210536912.5A CN114997412A (zh) 2022-05-17 2022-05-17 Recommendation method, training method, and apparatus

Publications (1)

Publication Number Publication Date
WO2023221928A1 true WO2023221928A1 (zh) 2023-11-23

Family

ID=83027575

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/094227 WO2023221928A1 (zh) 2022-05-17 2023-05-15 一种推荐方法、训练方法以及装置

Country Status (2)

Country Link
CN (1) CN114997412A (zh)
WO (1) WO2023221928A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114997412A (zh) * 2022-05-17 2022-09-02 Huawei Technologies Co., Ltd. Recommendation method, training method, and apparatus
CN115630677B (zh) * 2022-11-07 2023-10-13 Beijing Baidu Netcom Science and Technology Co., Ltd. Task processing method and apparatus, electronic device, and medium
CN116244517B (zh) * 2023-03-03 2023-11-28 Beihang University Multi-scenario multi-task model training method based on hierarchical information extraction network
CN116684480B (zh) * 2023-07-28 2023-10-31 Alipay (Hangzhou) Information Technology Co., Ltd. Method and apparatus for determining an information push model and for information push
CN116805253B (zh) * 2023-08-18 2023-11-24 Tencent Technology (Shenzhen) Co., Ltd. Intervention gain prediction method and apparatus, storage medium, and computer device
CN117194652B (zh) * 2023-11-08 2024-01-23 Luzhou Youxinda Intelligent Technology Co., Ltd. Information recommendation system based on deep learning
CN117556150B (zh) * 2024-01-11 2024-03-15 Tencent Technology (Shenzhen) Co., Ltd. Multi-objective prediction method, apparatus, device, and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046926A1 (en) * 2014-05-23 2018-02-15 DataRobot, Inc. Systems for time-series predictive data analytics, and related methods and apparatus
CN108763362A (zh) * 2018-05-17 2018-11-06 Zhejiang University of Technology Local-model weighted-fusion Top-N movie recommendation method based on random anchor point pair selection
CN113901328A (zh) * 2021-11-19 2022-01-07 Beijing Fangjianghu Technology Co., Ltd. Information recommendation method and apparatus, electronic device, and storage medium
CN114463091A (zh) * 2022-01-29 2022-05-10 Beijing Wodong Tianjun Information Technology Co., Ltd. Information push model training and information push method, apparatus, device, and medium
CN114997412A (zh) * 2022-05-17 2022-09-02 Huawei Technologies Co., Ltd. Recommendation method, training method, and apparatus

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIU XIAOFAN; JIA QINGLIN; WU CHUHAN; LI JINGJIE; QUANYU DAI; BO LIN; ZHANG RUI; TANG RUIMING: "Task Adaptive Multi-learner Network for Joint CTR and CVR Estimation", ARCHITECTUAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS, IEEE COMPUTER SOCIETY PRESS, 10662 LOS VAQUEROS CIRCLE PO BOX 3014 LOS ALAMITOS, CA 90720-1264 USA, 30 April 2023 (2023-04-30) - 8 October 1987 (1987-10-08), 10662 Los Vaqueros Circle PO Box 3014 Los Alamitos, CA 90720-1264 USA , pages 490 - 494, XP059078293, ISBN: 978-0-8186-0805-6, DOI: 10.1145/3543873.3584653 *
SUN XIAN, YANG ZHU-JUN, LI JUN-XI, DIAO WEN-HUI, FU KUN: "Lightweight Fine Classification of Complex Remote Sensing Images Based on Self Knowledge Distillation", JOURNAL OF COMMAND AND CONTROL, SCHOOL OF ELECTRONIC, ELECTRICAL AND COMMUNICATION ENGINEERING, UNIVERSITY OF CHINESE ACADEMY OF SCIENCES, BEIJING 100190; INSTITUTE OF AEROSPACE INFORMATION INNOVATION, CHINESE ACADEMY OF SCIENCES, BEIJING 100190, vol. 7, no. 4, 1 December 2021 (2021-12-01), pages 365 - 373, XP093109270, ISSN: 2096-0204, DOI: 10.3969/j.issn.2096-0204.2021.04.0365 *
TANG HONGYAN VIOLATANG@TENCENT.COM; LIU JUNNING KORCHINLIU@TENCENT.COM; ZHAO MING MARCOZHAO@TENCENT.COM; GONG XUDONG XUDONGGONG@TE: "Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations", PROCEEDINGS OF THE 2020 5TH INTERNATIONAL CONFERENCE ON BIG DATA AND COMPUTING, ACMPUB27, NEW YORK, NY, USA, 22 September 2020 (2020-09-22) - 30 May 2020 (2020-05-30), New York, NY, USA , pages 269 - 278, XP058851102, ISBN: 978-1-4503-7547-4, DOI: 10.1145/3383313.3412236 *

Also Published As

Publication number Publication date
CN114997412A (zh) 2022-09-02

Similar Documents

Publication Publication Date Title
WO2023221928A1 (zh) Recommendation method, training method, and apparatus
WO2022042002A1 (zh) Training method for semi-supervised learning model, image processing method, and device
WO2022083536A1 (zh) Neural network construction method and apparatus
US20220198289A1 (en) Recommendation model training method, selection probability prediction method, and apparatus
EP4145308A1 (en) Search recommendation model training method, and search result sorting method and device
WO2022042713A1 (zh) Deep learning training method and apparatus for computing device
JP2023060820A (ja) Deep Neural Network Optimization System for Machine Learning Model Scaling
WO2022016556A1 (zh) Neural network distillation method and apparatus
CN113807399B (zh) Neural network training method, detection method, and apparatus
WO2021129668A1 (zh) Method and apparatus for training neural network
WO2024041483A1 (zh) Recommendation method and related apparatus
CN112801265A (zh) Machine learning method and apparatus
WO2023051369A1 (zh) Neural network acquisition method, data processing method, and related device
WO2022179586A1 (zh) Model training method and associated device
WO2024067373A1 (zh) Data processing method and related apparatus
WO2024002167A1 (zh) Operation prediction method and related apparatus
WO2023185925A1 (zh) Data processing method and related apparatus
WO2024067779A1 (zh) Data processing method and related apparatus
Xiang et al. Spiking siamfc++: Deep spiking neural network for object tracking
WO2024012360A1 (zh) Data processing method and related apparatus
WO2023246735A1 (zh) Item recommendation method and related device
WO2021136058A1 (zh) Method and apparatus for processing video
WO2023197910A1 (zh) User behavior prediction method and related device
WO2023197857A1 (zh) Model partitioning method and related device
WO2023273934A1 (zh) Model hyperparameter selection method and related apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23806880

Country of ref document: EP

Kind code of ref document: A1