WO2022016556A1 - Neural network distillation method and apparatus - Google Patents

Neural network distillation method and apparatus

Info

Publication number
WO2022016556A1
WO2022016556A1 PCT/CN2020/104653 CN2020104653W WO2022016556A1 WO 2022016556 A1 WO2022016556 A1 WO 2022016556A1 CN 2020104653 W CN2020104653 W CN 2020104653W WO 2022016556 A1 WO2022016556 A1 WO 2022016556A1
Authority
WO
WIPO (PCT)
Prior art keywords
data set
neural network
unbiased
samples
biased
Prior art date
Application number
PCT/CN2020/104653
Other languages
English (en)
French (fr)
Inventor
程朋祥
董振华
何秀强
张小莲
殷实
胡粤麟
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP20945809.0A priority Critical patent/EP4180991A4/en
Priority to CN202080104828.5A priority patent/CN116249991A/zh
Priority to PCT/CN2020/104653 priority patent/WO2022016556A1/zh
Publication of WO2022016556A1 publication Critical patent/WO2022016556A1/zh
Priority to US18/157,277 priority patent/US20230162005A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to a neural network distillation method and device.
  • Knowledge distillation is a model compression technique that distills the feature-representation "knowledge" learned by a complex network with strong learning ability and transfers it to a network with few parameters and weak learning ability. Knowledge distillation can transfer the knowledge of one network to another network, and the two networks can be homogeneous or heterogeneous.
  • the approach is to train a teacher network first, and then use the output of the teacher network to train the student network.
  • the training set for training the student network may be biased, which may easily lead to inaccurate output results of the student network.
  • the accuracy of the student network is limited by the accuracy of the teacher network, and the output accuracy of the student network cannot be further improved. Therefore, how to obtain a network with more accurate output has become an urgent problem to be solved.
  • the embodiments of the present application provide a neural network distillation method and device, which are used to provide a neural network with lower output bias, improve the output accuracy of the neural network, and select an appropriate distillation method according to different scenarios, giving the method strong generalization ability.
  • a first aspect of the present application provides a neural network distillation method, including: first, obtaining a sample set, where the sample set includes a biased data set and an unbiased data set, the biased data set includes biased samples, and the unbiased data set includes unbiased samples.
  • the data volume of the biased data set is larger than that of the unbiased data set; then, the first distillation method is determined according to the data characteristics of the sample set.
  • the teacher model is trained using the unbiased data set, and the student model is trained using the biased data set; then, based on the biased data set and the unbiased data set, the first neural network is trained according to the first distillation method to obtain the updated first neural network.
  • the unbiased samples included in the unbiased data set can be used to guide the knowledge distillation process of the first neural network, so that the updated first neural network can output unbiased results, correcting the bias introduced by the input samples and improving the output accuracy of the first neural network.
  • a distillation method that matches the data characteristics of the sample set can be selected, and different distillation methods can be used for different scenarios, improving the generalization ability of knowledge distillation for neural networks: different knowledge distillation methods are chosen under different conditions to maximize the benefits of knowledge distillation.
  • the first distillation method is selected from multiple preset distillation methods, and the multiple distillation methods include at least two distillation methods in which the teacher model guides the student model in different ways.
  • different distillation methods can be used for different scenarios, adapting to each scenario and improving the generalization ability of knowledge distillation for neural networks; choosing different knowledge distillation methods under different conditions maximizes the benefits of knowledge distillation.
  • the samples in the biased data set and the unbiased data set include input features and actual labels.
  • the first distillation method is to perform distillation based on the input features of the samples in the sample set.
  • the unbiased data set can guide the knowledge distillation process of the model of the biased data set in the form of samples, so that the output of the updated first neural network has a lower degree of bias.
  • the first neural network is trained according to the first distillation method to obtain the updated first neural network, which may include: alternately using the biased data set and the unbiased data set to train the first neural network to obtain the updated first neural network, where, within one alternation cycle, the number of batches used to train the first neural network with the biased data set and the number of batches used to train the first neural network with the unbiased data set are in a preset ratio, and the input features of the samples serve as the input of the first neural network.
  • the biased data set and the unbiased data set can be used alternately for training, and the samples in the unbiased data set are used to rectify the bias of the first neural network trained with the biased data set, so that the output of the updated first neural network is less biased.
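  • As an illustration of this alternating scheme, the following is a minimal sketch assuming a PyTorch-style training loop; the function name train_alternating, the 4:1 default batch ratio, and the loader/model/optimizer interfaces are assumptions for illustration, not details taken from the application:

```python
import torch

def train_alternating(model, biased_loader, unbiased_loader,
                      loss_fn, optimizer, ratio=4, epochs=1):
    """Alternately train on biased and unbiased batches at a preset ratio:
    every `ratio` biased batches are followed by one unbiased batch, so the
    unbiased samples repeatedly correct the bias picked up from the (much
    larger) biased data set."""
    unbiased_iter = iter(unbiased_loader)
    for _ in range(epochs):
        for step, (x, y) in enumerate(biased_loader):
            # One gradient step on a biased batch.
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

            # Every `ratio` biased batches, take one unbiased batch.
            if (step + 1) % ratio == 0:
                try:
                    xu, yu = next(unbiased_iter)
                except StopIteration:
                    unbiased_iter = iter(unbiased_loader)
                    xu, yu = next(unbiased_iter)
                optimizer.zero_grad()
                loss_fn(model(xu), yu).backward()
                optimizer.step()
```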
  • in one possible implementation, the difference between a first regular term and a second regular term is added to the loss function of the first neural network, where the first regular term is computed from parameters obtained by training the first neural network with the samples included in the unbiased data set, and the second regular term is computed from parameters obtained by training the first neural network with the samples included in the biased data set.
  • the first neural network can be trained by alternating 1:1 between the biased data set and the unbiased data set, performing bias correction on the first neural network so that the output of the updated first neural network is less biased.
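  • A minimal sketch of such a loss term follows, assuming an L2 form for the difference between the two regular terms and a weighting factor lam; both are illustrative assumptions, as the application does not fix the exact form here:

```python
import torch

def loss_with_param_gap(pred, target, params_unbiased, params_biased,
                        base_loss_fn, lam=0.1):
    """Add to the base loss the gap between the first regular term
    (parameters learned on the unbiased set) and the second regular term
    (parameters learned on the biased set). The L2 form of the gap and
    the weight lam are illustrative assumptions."""
    gap = sum(torch.sum((pu.detach() - pb) ** 2)
              for pu, pb in zip(params_unbiased, params_biased))
    return base_loss_fn(pred, target) + lam * gap
```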
  • the first neural network is trained according to the first distillation method to obtain the updated first neural network, which may include: setting a confidence for the samples in the biased data set, where the confidence is used to indicate the degree of bias of a sample; and training the first neural network based on the biased data set, the confidences of the samples in the biased data set, and the unbiased data set to obtain the updated first neural network, where the input features of the samples serve as the input of the first neural network when training the first neural network.
  • a confidence level representing the degree of bias can be set for each sample, so that the degree of bias of the samples can be learned when training the neural network, thereby reducing the bias of the output results of the updated neural network.
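  • A minimal sketch of confidence-weighted training follows, assuming the confidence is a per-sample weight in [0, 1] and the base loss returns per-sample values; the exact weighting scheme is an assumption for illustration:

```python
import torch

def confidence_weighted_loss(pred, target, confidence, base_loss_fn):
    """Per-sample loss weighted by a confidence in [0, 1] that indicates
    the degree of bias of each sample (higher = more trusted).
    base_loss_fn must return per-sample losses (reduction='none')."""
    per_sample = base_loss_fn(pred, target)
    return (confidence * per_sample).mean()
```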
  • the samples included in the biased data set and the unbiased data set include input features and actual labels, and the first distillation method performs distillation based on the predicted labels of the samples included in the unbiased data set; the predicted labels are output by the updated second neural network for the samples in the unbiased data set, and the updated second neural network is obtained by training the second neural network using the unbiased data set.
  • knowledge distillation can be performed on the first neural network by using the predicted labels of the samples included in the unbiased data set. It can be understood that the predicted labels that the teacher model outputs for the samples in the unbiased data set are used to complete the guidance of the student model, so that the updated first neural network can produce output results with a lower degree of bias under the guidance of the predicted labels output by the teacher model.
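  • A minimal sketch of such label distillation for a binary (CTR-style) task follows; the binary setting, the mixing weight alpha, and the use of sigmoid teacher outputs as soft targets are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def label_distillation_loss(student_logits, teacher_logits, hard_labels,
                            alpha=0.5):
    """Student loss combining the actual (hard) labels with the teacher's
    predicted (soft) labels; the teacher was trained on the unbiased set.
    alpha is an illustrative mixing weight."""
    hard = F.binary_cross_entropy_with_logits(student_logits, hard_labels)
    soft = F.binary_cross_entropy_with_logits(student_logits,
                                              torch.sigmoid(teacher_logits))
    return alpha * hard + (1 - alpha) * soft
```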
  • the sample set further includes an unobserved data set, and the unobserved data set includes a plurality of unobserved samples; training the first neural network according to the first distillation method based on the biased data set and the unbiased data set to obtain the updated first neural network may include: training the first neural network using the biased data set to obtain the trained first neural network, and training the second neural network using the unbiased data set to obtain the updated second neural network; collecting multiple samples from the sample set to obtain an auxiliary data set; and updating the trained first neural network using the auxiliary data set, with the predicted labels of the samples in the auxiliary data set as constraints, to obtain the updated first neural network, where the predicted labels of the samples in the auxiliary data set are labels output by the updated second neural network.
  • an unobserved data set can be introduced, thereby reducing the biasing effect of the biased data set on the training process of the first neural network, so that the output results of the finally obtained first neural network are less biased.
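  • A minimal sketch of the auxiliary-set constraint step follows, assuming PyTorch-style models and a loader that yields input features only; the loop structure and soft_loss_fn are assumptions for illustration:

```python
import torch

def refine_with_auxiliary(student, teacher, aux_loader, optimizer, soft_loss_fn):
    """Fine-tune the already-trained student so that its outputs on the
    auxiliary data set stay close to the teacher's predicted labels, which
    act as the constraint. The teacher was trained on the unbiased data
    set; aux_loader is assumed to yield input features only."""
    teacher.eval()
    for x in aux_loader:
        with torch.no_grad():
            y_teacher = teacher(x)   # predicted labels used as soft targets
        optimizer.zero_grad()
        soft_loss_fn(student(x), y_teacher).backward()
        optimizer.step()
```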
  • the first neural network is trained according to the first distillation method to obtain the updated first neural network, including: training the second neural network with the unbiased data set to obtain the updated second neural network; outputting the predicted labels of the samples in the biased data set through the updated second neural network; weighting and merging the predicted label of each sample with its actual label to obtain the merged label of the sample; and training the first neural network with the merged labels of the samples to obtain the updated first neural network.
  • the guidance of the unbiased data set in the process of training the first neural network can be completed by weighting and merging the predicted labels and the actual labels of the samples, so that the output results of the finally obtained first neural network are less biased.
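  • A minimal sketch of the weighted merge follows, where the weight w is an illustrative assumption:

```python
def merge_labels(predicted_label, actual_label, w=0.5):
    """Weighted merge of the teacher's predicted label and the sample's
    actual label; the student is then trained on the merged label.
    The weight w is an illustrative assumption."""
    return w * predicted_label + (1 - w) * actual_label

# e.g. y_merged = merge_labels(teacher_pred, y_actual, w=0.3)
```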
  • the data characteristics of the sample set include a first ratio, where the first ratio is the ratio between the sample size of the unbiased data set and the sample size of the biased data set, and determining the first distillation method according to the data characteristics of the sample set may include: selecting, from the multiple distillation methods, a first distillation method that matches the first ratio.
  • the first distillation method can be selected according to the ratio between the sample size of the unbiased data set and the sample size of the biased data set, so as to adapt to scenarios with different ratios between the sample size of the unbiased data set and the sample size of the biased data set.
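  • A minimal sketch of selecting a distillation method from the data characteristics of the sample set follows; all thresholds and the dispatch order are invented for illustration and are not taken from the application:

```python
def select_distillation_method(n_unbiased, n_biased, n_feature_dims,
                               pos_neg_ratio, ratio_threshold=0.05,
                               dim_threshold=100, imbalance_threshold=0.1):
    """Pick the first distillation method from the data characteristics of
    the sample set. Thresholds and dispatch order are illustrative only."""
    if n_unbiased / n_biased >= ratio_threshold:
        # Enough unbiased samples to guide training at the sample level.
        return "sample_distillation"
    if n_feature_dims >= dim_threshold:
        # Many feature dimensions: distill based on (stable) features.
        return "feature_distillation"
    if pos_neg_ratio < imbalance_threshold:
        # Heavily imbalanced positive/negative samples.
        return "model_structure_distillation"
    return "label_distillation"
```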
  • the first distillation method includes: training the teacher model based on the features extracted from the unbiased data set to obtain a trained teacher model, and performing knowledge distillation on the student model using the trained teacher model and the biased data set.
  • the features extracted from the unbiased data set can be used to train the teacher model, obtaining a more stable teacher model with a lower degree of bias, whose output is less biased.
  • the first neural network is trained according to the first distillation method to obtain the updated first neural network, which may include: screening out the input features of some samples from the unbiased data set using a preset algorithm, where the preset algorithm may be a deep global balancing regression (DGBR) algorithm; training the second neural network according to the screened input features to obtain the updated second neural network; and, using the updated second neural network as the teacher model and the first neural network as the student model, performing knowledge distillation on the first neural network with the biased data set to obtain the updated first neural network.
  • the stable features of the unbiased data set can be computed and used to train the second neural network; the updated second neural network then serves as the teacher model and the first neural network as the student model, and knowledge distillation is performed on the first neural network with the biased data set, so as to obtain an updated first neural network whose output results are less biased and more robust.
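  • A minimal pipeline sketch follows, with the DGBR-style feature screening abstracted behind a hypothetical select_stable_features callable and scikit-learn-style fit/distill placeholders; none of these interfaces are specified by the application:

```python
def feature_distillation_pipeline(unbiased_X, unbiased_y, biased_loader,
                                  select_stable_features, teacher, student,
                                  distill):
    """Screen stable input features from the unbiased data set (e.g. with a
    DGBR-style algorithm, abstracted here as select_stable_features), train
    the teacher (second neural network) on them, then distill the student
    (first neural network) on the biased data set. All callables are
    hypothetical placeholders."""
    stable_idx = select_stable_features(unbiased_X, unbiased_y)
    teacher.fit(unbiased_X[:, stable_idx], unbiased_y)  # teacher on stable features
    return distill(teacher, student, biased_loader)     # knowledge distillation
```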
  • the data characteristics of the sample set include the number of feature dimensions, and the above-mentioned determining of the first distillation method according to the data characteristics of the sample set may include: selecting, from the multiple distillation methods, a first distillation method that matches the number of feature dimensions.
  • the feature-distillation-based method can be selected according to the number of feature dimensions included in the unbiased data set and the biased data set, which adapts to scenarios with a large number of feature dimensions and yields a student model whose output is less biased.
  • the first neural network is trained according to the first distillation method to obtain the updated first neural network, which may include: training the second neural network with the unbiased data set to obtain the updated second neural network; and, using the updated second neural network as the teacher model and the first neural network as the student model, performing knowledge distillation on the first neural network with the biased data set to obtain the updated first neural network.
  • a conventional neural network knowledge distillation process can be used, with the unbiased data set training the teacher model, which reduces the output bias of the teacher model; the teacher model then performs knowledge distillation on the student model, thereby reducing the output bias of the student model.
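  • A minimal sketch of this conventional knowledge distillation step follows, assuming a softmax classification task with a temperature T and mixing weight alpha (both illustrative choices):

```python
import torch
import torch.nn.functional as F

def model_structure_distillation(teacher, student, biased_loader,
                                 optimizer, alpha=0.5, T=2.0):
    """Conventional knowledge distillation: the teacher (trained on the
    unbiased set) guides the student over the biased set. alpha and the
    temperature T are illustrative choices."""
    teacher.eval()
    for x, y in biased_loader:
        with torch.no_grad():
            t_logits = teacher(x)
        s_logits = student(x)
        hard = F.cross_entropy(s_logits, y)                  # actual labels
        soft = F.kl_div(F.log_softmax(s_logits / T, dim=-1), # teacher guidance
                        F.softmax(t_logits / T, dim=-1),
                        reduction="batchmean") * T * T
        optimizer.zero_grad()
        (alpha * hard + (1 - alpha) * soft).backward()
        optimizer.step()
```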
  • determining the first distillation method according to the data characteristics of the sample set may include: the data characteristics of the sample set include a second ratio, the second ratio between the number of positive samples and the number of negative samples included in the unbiased data set is calculated, and a first distillation method matching the second ratio is selected from the multiple distillation methods; or, the data characteristics of the sample set include a third ratio, the third ratio between the number of positive samples and the number of negative samples included in the biased data set is calculated, and a first distillation method matching the third ratio is selected from the multiple distillation methods.
  • the conventional distillation method based on the model structure can be selected according to the ratio of positive to negative samples in the unbiased data set or the biased data set, so as to adapt to scenarios with different ratios of positive to negative samples in the unbiased data set or the biased data set.
  • the types of samples included in the biased data set are different from the types of samples included in the unbiased data set.
  • the types of samples included in the biased data set being different from those included in the unbiased data set can be understood as the samples in the two data sets belonging to data from different fields, so that data from different fields can be used for guidance and training, and the updated first neural network can output data in a field different from that of the input data, achieving cross-domain recommendation.
  • the above method may further include: acquiring at least one sample of a target user; using the at least one sample as the input of the updated first neural network, and outputting at least one tag of the target user, where the at least one tag is used to construct a user portrait of the target user, and the user portrait is used to determine samples matching the target user.
  • one or more tags of the user can be output through the updated first neural network, and the representative features of the user can be determined according to the one or more tags, thereby constructing a user portrait of the target user; the user portrait is used to describe the target user, so that in subsequent recommendation scenarios, samples matching the target user can be determined through the user portrait.
  • the present application provides a recommendation method, including: obtaining a recommendation model by training the first neural network according to the first distillation method based on the biased data set and the unbiased data set in a sample set.
  • the biased data set includes biased samples, the unbiased data set includes unbiased samples, and the first distillation method is determined according to the data characteristics of the sample set.
  • the samples in the biased data set include the information of a first user, the information of a first recommended object, and an actual label, where the actual label of a sample in the biased data set is used to indicate whether the first user has an operation action on the first recommended object.
  • the samples in the unbiased data set include the information of a second user, the information of a second recommended object, and an actual label, where the actual label of a sample in the unbiased data set is used to indicate whether the second user has an operation action on the second recommended object.
  • the recommendation model is obtained by using the teacher model trained with unbiased data to guide the student model trained with biased data, so that a recommendation model with low output bias can be used to recommend matching recommendation objects for users, making the recommendation results more accurate and improving the user experience.
  • the unbiased data set is obtained under the condition that the candidate recommended objects in the candidate recommended object set have the same probability of being displayed, and the second recommended object is a candidate recommended object in the candidate recommended object set.
  • the unbiased data set being obtained under the condition that the candidate recommended objects in the candidate recommended object set have the same probability of being displayed includes: the samples in the unbiased data set are obtained when the candidate recommended objects in the candidate recommended object set are randomly displayed to the second user; or, the samples in the unbiased data set are obtained when the second user searches for the second recommended object.
  • the samples in the unbiased data set belong to the data of the source domain, and the samples in the biased data set belong to the data of the target domain.
  • the present application provides a recommendation method, comprising: displaying a first interface, where the first interface includes a learning list of at least one application program, the learning list includes at least one option, and each option in the at least one option is associated with an application; sensing a first operation of the user on the first interface; and, in response to the first operation, enabling or disabling, in a first application, the cross-domain recommendation function of the applications associated with some or all of the options in the learning list.
  • the user interaction history records of both the source domain and the target domain are incorporated into learning, so that the recommendation model can better learn the user's preferences, fit the user's interest preferences in the target domain well, recommend results that meet those interests, realize cross-domain recommendation, and alleviate the cold-start problem.
  • one or more recommended objects are determined by inputting the user's information and the information of the candidate recommended objects into the recommendation model, and predicting the probability that the user has an operation action on the candidate recommended objects.
  • the recommendation model is obtained by training the first neural network using the biased data set and the unbiased data set in the sample set according to the first distillation method, where the biased data set includes biased samples and the unbiased data set includes unbiased samples.
  • the first distillation method is determined according to the data characteristics of the sample set.
  • the samples in the biased data set include the information of the first user, the information of the first recommended object, and an actual label, where the actual label of a sample in the biased data set is used to indicate whether the first user has an operation action on the first recommended object.
  • the samples in the unbiased data set include the information of the second user, the information of the second recommended object, and an actual label, where the actual label of a sample in the unbiased data set is used to indicate whether the second user has an operation action on the second recommended object.
  • the present application provides a neural network distillation device, which has the function of implementing the neural network distillation method of the first aspect.
  • This function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the present application provides a recommending device, the recommending device having the function of implementing the above-mentioned recommending method in the second aspect.
  • This function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the present application provides an electronic device having the function of implementing the recommendation method in the third aspect.
  • This function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • an embodiment of the present application provides a neural network distillation apparatus, including: a processor and a memory, where the processor and the memory are interconnected through a line, and the processor invokes the program code in the memory to execute the processing-related functions in the neural network distillation method shown in any one of the above first aspects.
  • an embodiment of the present application provides a recommendation device, including: a processor and a memory, where the processor and the memory are interconnected through a line, and the processor invokes the program code in the memory to execute the processing-related functions in the recommendation method shown in any one of the above second aspects.
  • an embodiment of the present application provides an electronic device, including: a processor and a memory, where the processor and the memory are interconnected through a line, and the processor invokes the program code in the memory to execute the processing-related functions in the recommendation method shown in any one of the above third aspects.
  • an embodiment of the present application provides a neural network distillation device.
  • the neural network distillation device may also be called a digital processing chip or a chip.
  • the chip includes a processing unit and a communication interface.
  • the processing unit obtains program instructions through the communication interface; the program instructions are executed by the processing unit, and the processing unit is configured to perform the processing-related functions in the first aspect or any optional implementation manner of the first aspect.
  • an embodiment of the present application provides a recommendation device.
  • the recommendation device may also be called a digital processing chip or a chip.
  • the chip includes a processing unit and a communication interface.
  • the processing unit obtains program instructions through the communication interface; the program instructions are executed by the processing unit, and the processing unit is configured to execute the processing-related functions in the second aspect or any optional implementation manner of the second aspect.
  • an embodiment of the present application provides an electronic device, which may also be referred to as a digital processing chip or a chip.
  • the chip includes a processing unit and a communication interface.
  • the processing unit obtains program instructions through the communication interface; the program instructions are executed by the processing unit, and the processing unit is configured to execute the processing-related functions in the third aspect or any optional implementation manner of the third aspect.
  • an embodiment of the present application provides a computer-readable storage medium, including instructions which, when run on a computer, cause the computer to execute the method in the first aspect or any optional implementation manner of the first aspect, the second aspect or any optional implementation manner of the second aspect, or the third aspect or any optional implementation manner of the third aspect.
  • an embodiment of the present application provides a computer program product including instructions which, when run on a computer, enable the computer to execute the method in the first aspect or any optional implementation manner of the first aspect, the second aspect or any optional implementation manner of the second aspect, or the third aspect or any optional implementation manner of the third aspect.
  • FIG. 1 is a schematic diagram of the main framework of artificial intelligence applied by the present application.
  • FIG. 2 is a schematic diagram of a system architecture provided by the present application.
  • FIG. 3 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of another convolutional neural network provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of another system architecture provided by the present application.
  • FIG. 6 is a schematic flowchart of a neural network distillation method provided by the present application.
  • FIG. 7 is a schematic diagram of the relationship between click-through rate and recommended position provided by the present application.
  • FIG. 8 is a schematic diagram of a neural network distillation architecture provided by the present application.
  • FIG. 9 is a schematic diagram of another neural network distillation architecture provided by the present application.
  • FIG. 10 is a schematic diagram of another neural network distillation architecture provided by the present application.
  • FIG. 12 is a schematic diagram of an application scenario of the recommendation method provided by the present application.
  • FIG. 13 is a schematic diagram of an application scenario of the recommendation method provided by the present application.
  • FIG. 14 is a schematic diagram of an application scenario of the recommendation method provided by the present application.
  • FIG. 15 is a schematic diagram of an application scenario of the recommendation method provided by the present application.
  • FIG. 16 is a schematic diagram of an application scenario of the recommendation method provided by the present application.
  • FIG. 17 is a schematic diagram of an application scenario of the recommendation method provided by the present application.
  • FIG. 19 is a schematic diagram of an application scenario of the recommendation method provided by the present application.
  • FIG. 21 is a schematic structural diagram of a neural network distillation apparatus provided by the present application.
  • FIG. 22 is a schematic structural diagram of a recommendation apparatus provided by the present application.
  • FIG. 24 is a schematic structural diagram of another neural network distillation apparatus provided by the present application.
  • FIG. 25 is a schematic structural diagram of another recommendation apparatus provided by the present application.
  • FIG. 26 is a schematic structural diagram of another electronic device provided by the present application.
  • FIG. 27 is a schematic diagram of a chip structure provided by the present application.
  • Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new type of intelligent machine that responds in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
  • Figure 1 shows a schematic structural diagram of the main frame of artificial intelligence.
  • the above-mentioned artificial intelligence main framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, data has gone through the process of "data-information-knowledge-wisdom".
  • the "IT value chain” reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of human intelligence, information (providing and processing technology implementation) to the industrial ecological process of the system.
  • the infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and is supported by the basic platform. Communication with the outside world is achieved through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the basic platform includes distributed computing frameworks, networks, and other related platform guarantees and support, which can include cloud storage and computing, interconnection networks, and so on. For example, sensors communicate with the outside to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image identification, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, productizing intelligent information decision-making to achieve practical applications. Application areas mainly include intelligent terminals, intelligent transportation, smart healthcare, autonomous driving, safe city, and so on.
  • the embodiments of the present application involve a large number of related applications of neural networks.
  • the related terms and concepts of the neural networks that may be involved in the embodiments of the present application are first introduced below.
  • a neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes x_s and an intercept 1 as input, and the output of the operation unit can be shown in formula (1-1):

  h_{W,b}(x) = f(W^T x) = f(∑_{s=1}^{n} W_s · x_s + b)    (1-1)

  • where W_s is the weight of x_s, and b is the bias of the neural unit.
  • f is an activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
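  • A minimal sketch of a single neural unit per formula (1-1) follows, with a sigmoid activation:

```python
import math

def neuron(x, w, b):
    """Single neural unit from formula (1-1): weighted sum of inputs plus
    bias, passed through a sigmoid activation function."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# e.g. neuron([0.5, 1.0], [0.2, -0.4], b=0.1)
```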
  • a neural network is a network formed by connecting a plurality of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • a deep neural network also known as a multi-layer neural network, can be understood as a neural network with multiple intermediate layers.
  • according to the positions of different layers, the layers of a DNN can be divided into three categories: input layer, intermediate layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and all layers in between are intermediate layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of convolutional layers and subsampling layers, which can be viewed as a filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • in a convolutional layer of a convolutional neural network, a neuron can be connected to only some of the neurons in the adjacent layers.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights can be understood as meaning that the way image information is extracted is independent of location.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • Recurrent neural networks are used to process sequence data.
  • in an ordinary neural network, the layers are fully connected, while the nodes within each layer are unconnected.
  • although this ordinary neural network solves many problems, it is still powerless for many others. For example, to predict the next word of a sentence, the previous words are generally needed, because the words in a sentence are not independent of one another. RNNs are called recurrent neural networks because the current output of a sequence is also related to the previous outputs.
  • RNN can process sequence data of any length.
  • the training of RNN is the same as the training of traditional CNN or DNN.
  • An additive neural network is a neural network that contains almost no multiplication. Unlike convolutional neural networks, additive neural networks use the L1 distance to measure the correlation between features and filters in the neural network. Since the L1 distance only includes addition and subtraction, a large number of multiplication operations in the neural network can be replaced by addition and subtraction, which greatly reduces the computational cost of the neural network.
  • in one possible implementation, the output feature map of the additive neural network can be expressed as:

  Y(m,n,t) = −∑_{i=0}^{d} ∑_{j=0}^{d} ∑_{k=0}^{C} |X(m+i, n+j, k) − F(i, j, k, t)|

  • where |·| is the absolute value operation, ∑(·) is the summation operation, Y(m,n,t) is the element in the m-th row, n-th column, and t-th page of the at least one output sub-feature map, X(m+i, n+j, k) is the element in the (m+i)-th row, (n+j)-th column, and k-th page of the input feature map, F(i,j,k,t) is the element in the i-th row, j-th column, and k-th page of the feature extraction kernel, t is the number of channels of the feature extraction kernel, d is the number of rows of the feature extraction kernel, C is the number of channels of the input feature map, and d, C, i, j, k, m, n, and t are all integers.
  • an additive neural network needs to use only addition. By changing the metric used to compute features in convolution to the L1 distance, features can be extracted in the neural network using only addition, building an additive neural network.
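  • A minimal NumPy sketch of this L1-distance feature extraction follows (stride 1 and no padding are assumed for illustration):

```python
import numpy as np

def adder_feature(X, F_kernel):
    """L1-distance feature extraction of an additive neural network:
    Y(m, n, t) = -sum_{i,j,k} |X(m+i, n+j, k) - F(i, j, k, t)|.
    X has shape (H, W, C); F_kernel has shape (d, d, C, T)."""
    H, W, C = X.shape
    d, _, _, T = F_kernel.shape
    out = np.zeros((H - d + 1, W - d + 1, T))
    for m in range(H - d + 1):
        for n in range(W - d + 1):
            patch = X[m:m + d, n:n + d, :]        # d x d x C input window
            for t in range(T):
                # Only subtraction, absolute value, and addition are used.
                out[m, n, t] = -np.abs(patch - F_kernel[:, :, :, t]).sum()
    return out
```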
  • in training a deep neural network, the difference between the predicted value and the target value is measured by a loss function or an objective function, both of which are used to quantify this difference.
  • taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference; training the deep neural network then becomes the process of reducing this loss as much as possible.
  • the difference between the objective function and the loss function is that the objective function may include a constraint function in addition to the loss function, which is used to constrain the update of the neural network, so that the updated neural network is closer to the desired one.
  • the neural network can use the error back propagation (BP) algorithm to correct the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller.
  • the input signal is passed forward until the output will generate error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so as to make the error loss converge.
  • the back-propagation algorithm is a back-propagation movement dominated by error loss, aiming to obtain the parameters of the optimal neural network model, such as the weight matrix.
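  • A minimal sketch of one back-propagation step follows (the two-layer model and random data are illustrative only):

```python
import torch

# Illustrative two-layer model and random training data.
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(),
                            torch.nn.Linear(8, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(16, 4), torch.randn(16, 1)

loss = torch.nn.functional.mse_loss(model(x), y)  # error loss at the output
opt.zero_grad()
loss.backward()   # propagate the error loss backwards through the network
opt.step()        # update parameters so the error loss converges
```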
  • an embodiment of the present application provides a system architecture 200 .
  • the system architecture includes a database 230 and a client device 240 .
  • the data collection device 260 is used to collect data and store it in the database 230 , and the training module 202 generates the target model/rule 201 based on the data maintained in the database 230 .
  • the following will describe in more detail how the training module 202 obtains the target model/rule 201 based on the data.
  • the target model/rule 201 is the neural network constructed in the following embodiments of the present application. For details, please refer to the relevant descriptions in FIGS. 6-20 below.
  • the computing module may include a training module 202, and the target model/rule obtained by the training module 202 may be applied to different systems or devices.
  • the execution device 210 is configured with a transceiver 212, which can be a wireless transceiver, an optical transceiver, a wired interface (such as an I/O interface), or the like, to perform data interaction with external devices, and a "user" can input data to the transceiver 212 through the client device 240.
  • the client device 240 can send target tasks to the execution device 210, request the execution device to build a neural network, and send the execution device 210 a database for training.
  • the execution device 210 can call data, codes, etc. in the data storage system 250 , and can also store data, instructions, etc. in the data storage system 250 .
  • the calculation module 211 uses the target model/rule 201 to process the input data. Specifically, the calculation module 211 is used to: obtain a biased data set and an unbiased data set, where the biased data set includes biased samples, the unbiased data set includes unbiased samples, and the data volume of the biased data set is greater than that of the unbiased data set; select the first distillation method from the preset multiple distillation methods according to at least one of the data included in the biased data set or the data included in the unbiased data set, where the multiple distillation methods include at least two distillation methods in which the teacher model guides the student model in different ways, and the model trained with the unbiased data set guides the model trained with the biased data set; and, based on the biased data set and the unbiased data set, train the first neural network according to the first distillation method to obtain the updated first neural network.
  • transceiver 212 returns the constructed neural network to client device 240 to deploy the neural network in client device 240 or other devices.
  • the training module 202 can obtain corresponding target models/rules 201 based on different data for different tasks, so as to provide users with better results.
  • the data input into the execution device 210 can be determined according to the input data of the user, for example, the user can operate in the interface provided by the transceiver 212 .
  • the client device 240 can automatically input data to the transceiver 212 and obtain the result. If the client device 240 automatically inputs data and needs to obtain the authorization of the user, the user can set the corresponding permission in the client device 240 .
  • the user can view the result output by the execution device 210 on the client device 240, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 240 can also act as a data collection end to store the collected data associated with the target task into the database 230 .
  • FIG. 2 is only an exemplary schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 250 is an external memory relative to the execution device 210 . In other scenarios, the data storage system 250 may also be placed in the execution device 210 .
  • the training or update process mentioned in this application may be performed by the training module 202 .
  • the training process of the neural network is to learn the way to control the spatial transformation, more specifically, to learn the weight matrix.
  • the purpose of training a neural network is to make its output as close to the expected value as possible, so the predicted value of the current network can be compared with the expected value, and the weight vector of each layer of the neural network is then updated according to the difference between the two (of course, before the first update the weight vectors are usually initialized, that is, parameters are pre-configured for each layer of the deep neural network).
  • for example, if the predicted value of the network is too high, the values of the weights in the weight matrix are adjusted to lower the predicted value; after continuous adjustment, the value output by the neural network approaches or equals the expected value.
  • the difference between the predicted value and the expected value of the neural network can be measured by a loss function or an objective function. Taking the loss function as an example, the higher the output value of the loss function (loss), the greater the difference.
  • the training of the neural network can be understood as the process of reducing the loss as much as possible. For the process of updating the weight of the starting point network and training the serial network in the following embodiments of the present application, reference may be made to this process, which will not be repeated below.
  • a target model/rule 201 is obtained by training by the training module 202, and the target model/rule 201 may be the first neural network in this embodiment of the present application.
  • the first neural network, the second neural network, the teacher model, or the student model may be a deep convolutional neural network (DCNN), a recurrent neural network (RNN), and so on.
  • the neural network mentioned in this application may be of various types, such as a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a residual network, or other neural networks.
  • an embodiment of the present application further provides a system architecture 500 .
  • the execution device 210 is implemented by one or more servers, and optionally cooperates with other computing devices, such as data storage, routers, load balancers, and other devices; the execution device 210 may be arranged at one physical site or distributed across multiple physical sites.
  • the execution device 210 can use the data in the data storage system 250, or call the program code in the data storage system 250 to implement the steps of the training set processing method corresponding to FIG. 6 below in this application.
  • a user may operate respective user devices (eg, local device 501 and local device 502 ) to interact with execution device 210 .
  • Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, etc.
  • Each user's local device can interact with the execution device 210 through any communication mechanism/standard communication network, which can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • the communication network may include a wireless network, a wired network, or a combination of a wireless network and a wired network, and the like.
  • the wireless network includes but is not limited to: the fifth generation mobile communication technology (5th-Generation, 5G) system, the long term evolution (long term evolution, LTE) system, the global system for mobile communication (global system for mobile communication, GSM) or code division Multiple access (code division multiple access, CDMA) network, wideband code division multiple access (wideband code division multiple access, WCDMA) network, wireless fidelity (wireless fidelity, WiFi), Bluetooth (bluetooth), Zigbee protocol (Zigbee), Any one or a combination of radio frequency identification technology (radio frequency identification, RFID), long range (Long Range, Lora) wireless communication, and near field communication (near field communication, NFC).
  • the wired network may include an optical fiber communication network or a network composed of coaxial cables, and the like.
  • one or more aspects of the execution device 210 may be implemented by each local device, for example, the local device 501 may provide the execution device 210 with local data or feedback calculation results.
  • the data processing method provided by the embodiment of the present application may be executed on a server, and may also be executed on a terminal device.
  • the terminal device can be a mobile phone with image processing function, tablet personal computer (TPC), media player, smart TV, laptop computer (LC), personal digital assistant (PDA) ), a personal computer (PC), a camera, a video camera, a smart watch, a wearable device (WD), or an autonomous vehicle, etc., which are not limited in this embodiment of the present application.
  • knowledge distillation can transfer the knowledge of one network to another network, and the two networks can be homogeneous or heterogeneous.
  • the approach is to first train a teacher network, or a teacher model, and then use the output of the teacher network to train a student network, or a student model.
  • another simple network can be trained by using a pre-trained complex network, so that the simple network can have the same or similar data processing capabilities as the complex network.
  • Knowledge distillation can quickly and easily implement some small networks. For example, a complex network model can be trained with a large amount of data on cloud servers or enterprise-level servers, and then knowledge distillation can be performed to obtain a small model that implements the same function; the small model can be compressed and migrated to small devices (such as mobile phones, smart bracelets, etc.). For another example, by collecting a large number of users' data on a smart bracelet and performing complex and time-consuming network training on a cloud server, a user behavior recognition model is obtained; the model is then compressed and migrated to the small carrier of the smart bracelet. This makes it possible to train the model quickly and improve the user experience while ensuring the protection of user privacy.
  • the output accuracy of the student model is usually limited by the output accuracy of the teacher model, so that the output accuracy of the student model cannot be further improved.
  • a biased data set is usually used, so that the output of the trained student model is biased, that is, the output result is inaccurate.
  • to this end, the present application provides a neural network distillation method, which selects an appropriate guidance method for the data sets used for training to complete the knowledge distillation of the neural network, and uses the model trained on the unbiased data set to guide the model trained on the biased data set, thereby reducing the output bias of the student model and improving the output accuracy of the student model.
  • the neural network distillation method provided in this application can be applied to recommendation systems, user portraits, image recognition or other debiasing scenarios.
  • the recommender system can be used to recommend applications (Application, App), music, images, videos or commodities, etc. to the user.
  • the user portrait is used to reflect the user's characteristics or preferences.
  • FIG. 6 is a schematic flowchart of a neural network distillation method provided by the present application.
  • first, a sample set is obtained, where the sample set includes a biased data set and an unbiased data set.
  • the sample set includes at least a biased data set and an unbiased data set.
  • the biased data set includes samples with bias (hereinafter referred to as biased samples), and the unbiased data set includes samples without bias (hereinafter referred to as unbiased samples); usually, the amount of data in the biased data set is larger than that in the unbiased data set.
  • a biased sample can be understood as a sample that is biased from a user's actual usage sample.
  • recommender systems usually face various bias problems, such as position bias, popularity bias, and pre-order model bias, so that the collected feedback data does not reflect the real preferences of users.
  • the bias of the sample may also be different in different scenarios, such as location bias, selection bias or popularity bias.
  • position bias can be understood as the user's tendency to interact with an item shown in a better position, regardless of whether the item meets the user's actual needs.
  • Selection bias occurs when the "studied group" cannot represent the "target group", so that measurements of the risks or benefits of the studied group cannot accurately characterize the target group, and the conclusions obtained cannot be generalized effectively.
  • For example, FIG. 7 shows the click-through rate of the same APP at each recommended position under a random delivery strategy: the further back the recommended position, the lower the click-through rate, which illustrates the effect of position bias on the click-through rate. Position bias results in higher click-through rates for apps at higher recommended positions and lower click-through rates for apps at lower positions. If such click data is used to train the model, it will aggravate the Matthew effect of the trained model, leading to polarization of the model's output. For example, suppose a user searches for APPs in the recommendation system, the APPs that meet the user's needs include APP1 and APP2, and APP2 better matches the user's search needs.
• However, if the recommended position of APP1 is better, the user may click on APP1 but not on APP2. In that case, the historical data records the user's click on APP1 (i.e., a biased sample), while the user's actual need corresponds to APP2 (i.e., an unbiased sample), which may lead to inaccurate recommendations for the user.
  • Unbiased data can be collected by means of random traffic (uniform data). Take the recommendation system as an example.
• The specific process of collecting an unbiased data set may include: randomly sampling from the full candidate set, randomly displaying the sampled samples, collecting feedback data for the randomly displayed samples, and obtaining unbiased samples from the feedback data. It can be understood that all samples in the candidate set have an equal opportunity to be shown to the user for selection, so the collected data can be regarded as a good unbiased proxy. A minimal collection sketch is given below.
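• As a minimal sketch only (show_fn and feedback_fn are assumed stand-ins for the display and feedback-logging interfaces, not interfaces defined in this application), collecting unbiased samples with random traffic could look like:

    import random

    def collect_unbiased(candidates, k, show_fn, feedback_fn):
        shown = random.sample(candidates, k)  # every candidate has an equal chance
        random.shuffle(shown)                 # random display order
        show_fn(shown)                        # display the sampled items to the user
        # label each displayed item with the observed feedback (e.g., clicked or not)
        return [(item, feedback_fn(item)) for item in shown]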
• The first distillation method may be determined according to the data features included in the sample set. Specifically, after the biased data set and the unbiased data set are obtained, a matching distillation method is selected from a plurality of preset distillation methods based on the biased data set and/or the unbiased data set, to obtain the first distillation method.
  • the first distillation method is selected from multiple preset distillation methods, and the multiple distillation methods include at least two distillation methods in which the teacher model guides the student model in different ways.
• Generally, the unbiased data set is used to train the teacher model, and the biased data set is used to train the student model; that is, the model trained with the unbiased data set is used to guide the model trained with the biased data set.
  • the preset multiple distillation methods may include, but are not limited to, one or more of the following: sample distillation, label distillation, feature distillation, or model structure distillation, and the like.
• Sample distillation refers to performing distillation using the samples in the biased data set and the unbiased data set, for example, using samples from the unbiased data set to guide the knowledge distillation of the student model.
  • Label distillation refers to distilling the student model using the predicted labels based on the samples in the unbiased dataset as a guide.
  • the predicted labels are output by the teacher model, which is trained on the unbiased dataset.
  • Feature distillation refers to training the teacher model based on the features extracted from the unbiased dataset, and performing knowledge distillation through the teacher model and the biased dataset.
• Model structure distillation refers to training the teacher model using the unbiased data set, and then using the teacher model and the biased data set to perform knowledge distillation on the student model to obtain the updated student model.
• Specifically, a matching distillation method may be selected as the first distillation method based on the sample size of the unbiased data set and the sample size of the biased data set, the ratio between positive and negative samples in the unbiased data set, the ratio between positive and negative samples in the biased data set, or the number of feature dimensions of the data included in the unbiased data set and the biased data set, etc.
• The data types of the input features of the samples in the sample set may be different; each data type can be understood as one dimension, and the number of feature dimensions is the number of data types included in the sample set.
  • the method of selecting the distillation method may include, but is not limited to:
• Condition 1: Calculate the first ratio between the sample size of the unbiased data set and the sample size of the biased data set; when the first ratio is smaller than the first threshold, select sample distillation as the first distillation method.
• Condition 2: When the first ratio is not less than the first threshold, select label distillation as the first distillation method.
• Condition 3: Calculate the second ratio between the number of positive samples and the number of negative samples included in the unbiased data set; when the second ratio is greater than the second threshold, select model structure distillation as the first distillation method. Alternatively, calculate the third ratio between the number of positive samples and the number of negative samples included in the biased data set; when the third ratio is greater than the third threshold, select model structure distillation as the first distillation method.
• Condition 4: Calculate the number of feature dimensions included in the unbiased data set and the biased data set; when the number of feature dimensions is greater than a preset dimension, select feature distillation as the first distillation method.
  • the priority of each distillation method can be preset, and when the above-mentioned multiple conditions are satisfied at the same time, an appropriate distillation method can be selected according to the priority.
• For example, if the priority of feature distillation > the priority of model structure distillation > the priority of sample distillation > the priority of label distillation, then when the unbiased data set and the biased data set satisfy both condition 3 and condition 4, feature distillation is selected as the first distillation method. A minimal selection sketch follows.
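• As a minimal sketch only (the thresholds t1, t2, t3 and dim_max are hypothetical stand-ins for the first/second/third thresholds and the preset dimension, not values from this application), the condition 1-4 selection logic with the example priority above could look like:

    def select_distillation(n_unbiased, n_biased, pn_unbiased, pn_biased,
                            n_feature_dims, t1=0.1, t2=5.0, t3=5.0, dim_max=100):
        # condition 4: many feature dimensions -> feature distillation
        matched = ["feature"] if n_feature_dims > dim_max else []
        # condition 3: positive/negative ratio above threshold -> model structure distillation
        if pn_unbiased > t2 or pn_biased > t3:
            matched.append("model_structure")
        # condition 1 / condition 2: small unbiased/biased size ratio -> sample distillation,
        # otherwise label distillation
        matched.append("sample" if n_unbiased / n_biased < t1 else "label")
        priority = ["feature", "model_structure", "sample", "label"]
        return min(matched, key=priority.index)  # highest-priority match wins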
• The teacher model and the student model referred to in this application may be models with different structures, or models with the same structure trained using different data sets; this is not limited in this application.
  • knowledge distillation can be performed on the first neural network according to the guidance method included in the first distillation method to obtain an updated first neural network.
• The unbiased data set collected through uniform data is not affected by the preceding model and conforms to the sample attributes expected by the model, that is, all candidates in the candidate set have an equal opportunity to be displayed to the user for selection. Therefore, an unbiased data set can be regarded as a good unbiased proxy.
• Generally, a model trained with an unbiased data set is more unbiased but has larger variance, while a model trained with biased data is biased but has relatively small variance. Therefore, in the embodiments of the present application, the unbiased data set and the biased data set are effectively combined for training, so that the unbiased data set guides the training on the biased data set; the finally obtained first neural network thus has a lower degree of bias and improved output accuracy.
  • step 603 is described in detail below by taking several distillation methods as examples.
• The samples in the biased data set and the unbiased data set include input features and actual labels. The input features of the samples in the unbiased data set can be used as the input of the teacher model to train the teacher model, and the input features of the samples in the biased data set are used as the input of the student model, where the student model is the first neural network, so as to complete the knowledge distillation of the first neural network and obtain the updated first neural network.
• For example, in sample distillation, the specific process of performing knowledge distillation may include: alternately using the biased data set and the unbiased data set to train the first neural network to obtain the updated first neural network, where, during one alternation, the number of batch trainings using the biased data set and the number of batch trainings using the unbiased data set are in a preset ratio; when the first neural network is trained, the input features of the samples are used as the input of the first neural network.
• Therefore, the first neural network can be trained by alternately using the biased data set and the unbiased data set, so that the unbiased data set corrects the bias introduced by the biased data set, making the final output result of the first neural network less biased and more accurate.
• Optionally, the difference between a first regular term and a second regular term is added to the loss function of the first neural network, where the first regular term is obtained from the parameters of the first neural network trained using the samples included in the unbiased data set, and the second regular term is obtained from the parameters of the first neural network trained using the samples included in the biased data set.
• In another example, the specific process of performing knowledge distillation may include: setting a confidence level for all or part of the samples in the biased data set, where the confidence level is used to represent the bias degree of a sample; and training the first neural network based on the biased data set, the confidence levels of the samples in the biased data set, and the unbiased data set to obtain the updated first neural network, where the input features of the samples are used as the input of the first neural network during training.
• Alternatively, the second neural network can be trained using the unbiased data set, the predicted labels of the samples in the biased data set are then output through the trained second neural network, and the predicted labels are used as constraints to train the first neural network to obtain the updated first neural network.
• Optionally, the aforementioned sample set further includes an unobserved data set, and the unobserved data set includes a plurality of unobserved samples. In this case, the specific process of performing knowledge distillation may include: using the biased data set to train the first neural network to obtain the trained first neural network, and training the second neural network through the unbiased data set to obtain the updated second neural network; sampling a plurality of samples from the full data set (the combination of the biased data set, the unbiased data set, and the unobserved data set) to obtain an auxiliary data set; and using the auxiliary data set, with the predicted labels of the samples in the auxiliary data set as constraints, to update the trained first neural network and obtain the updated first neural network.
• The samples in the auxiliary data set have at least two predicted labels, which are respectively output by the trained first neural network and the updated second neural network.
• Therefore, the unobserved data set can be introduced, and the samples included in the unobserved data set can be used to reduce the biasing influence of the biased data set on the training of the first neural network, as well as the degree of bias of the output of the updated first neural network.
• In another example, the specific process of performing knowledge distillation may include: training the second neural network through the unbiased data set to obtain the updated second neural network; obtaining, through the updated second neural network, the predicted labels of the samples in the biased data set; weighting and combining the predicted label of each sample with the actual label of the sample to obtain the combined label of the sample; and training the first neural network using the combined labels of the samples to obtain the updated first neural network.
• Therefore, the predicted labels of the samples in the biased data set output by the second neural network can be merged with the actual labels of the samples to update the first neural network; that is, the teacher model guides the update of the first neural network by means of predicted labels, thereby reducing the bias of the output result of the updated first neural network and improving the accuracy of that output result.
  • stable features can be extracted from the unbiased data set, and then a second neural network can be trained based on the stable features to obtain an updated second neural network. Then use the biased data set to train the first neural network, and use the updated second neural network as the teacher model and the first neural network as the student model to perform knowledge distillation to obtain the updated first neural network.
• For example, in feature distillation, the specific process of performing knowledge distillation may include: outputting the input features of part of the samples of the unbiased data set through a preset algorithm (for example, the DGBR algorithm), where these input features can be understood as the stable features of the unbiased data set; training the second neural network according to the input features of this part of the samples to obtain the updated second neural network; and, with the updated second neural network as the teacher model and the first neural network as the student model, using the biased data set to perform knowledge distillation on the first neural network to obtain the updated first neural network.
• Therefore, the stable features in the unbiased data set can be used to train the second neural network to obtain the updated second neural network, that is, the teacher model, so that the output of the teacher model is more stable and more accurate. On this basis, using this teacher model for knowledge distillation, the output of the obtained student model is also more stable and more accurate.
• In model structure distillation, the second neural network can be trained using the unbiased data set to obtain the updated second neural network. Then, with the updated second neural network as the teacher model and the first neural network as the student model, the biased data set and the output results of the intermediate layers of the teacher model are used to perform knowledge distillation on the first neural network to obtain the updated first neural network.
• Therefore, the unbiased samples included in the unbiased data set can be used to guide the knowledge distillation process of the first neural network, so that the updated first neural network can output unbiased results, realizing deviation correction for the input samples and improving the output accuracy of the first neural network.
• In the embodiments of the present application, a distillation method that matches the unbiased data set and the biased data set can be selected, and different distillation methods can be used in different scenarios, improving the generalization ability of knowledge distillation. Different knowledge distillation methods are selected under different conditions, adapting to the sizes of the data sets, the ratios of positive to negative samples, and the proportions of different data, so as to maximize the benefits of knowledge distillation.
• Optionally, the types of samples in the unbiased data set and the types of samples in the biased data set may be different; for example, the sample type included in the unbiased data set is music, while the sample type included in the biased data set is video. Therefore, in the embodiments of the present application, data from different fields can be used to perform knowledge distillation, so as to train a cross-domain neural network, realize cross-domain user recommendation, and improve user experience.
• At least one sample of the target user may be obtained and used as the input of the updated first neural network, and at least one tag of the target user is output; the user portrait of the target user is then constructed using the at least one tag, and the user portrait is used to describe the target user or to recommend matching samples for the user.
• For example, the APPs clicked by user A can be obtained and used as the input of the updated first neural network, and one or more labels of user A are output, where each label indicates the probability that the user clicks the corresponding APP. When the probability exceeds a preset probability, the features of the corresponding APP can be used as the features of user A, so as to construct a user portrait of user A; the features included in the user portrait are used to describe the user or to recommend matching APPs for the user.
  • the updated first neural network can be used to generate a user portrait, so as to describe the user through the user portrait, or recommend matching samples for the user. Because the updated neural network is a neural network after bias correction, the bias of the output result can be reduced, thereby making the obtained user portrait more accurate and improving the user's recommendation experience.
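• As a minimal sketch only (model.predict and app.features are hypothetical interfaces, and 0.8 is a stand-in for the preset probability), the portrait construction described above could look like:

    def build_user_portrait(model, clicked_apps, candidate_apps, threshold=0.8):
        portrait = []
        for app in candidate_apps:
            p_click = model.predict(clicked_apps, app)  # label = click probability
            if p_click > threshold:                     # exceeds the preset probability
                portrait.extend(app.features)           # adopt the APP's features
        return portrait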
  • a biased dataset 801 and an unbiased dataset 802 are obtained.
  • the preset distillation methods may include sample distillation 803 , label distillation 804 , feature distillation 805 and model structure distillation 806 .
  • the biased dataset 801 may include constructed or collected samples.
  • the biased data set 801 may be APPs clicked or downloaded by the user; music clicked or played by the user; videos clicked or played by the user; pictures clicked or saved by the user, and the like.
  • the biased dataset will be referred to as S c below .
• The unbiased data set 802 may be a data set collected using a random traffic (uniform data) method, that is, by randomly sampling multiple samples from the candidate set and then collecting the user's feedback on the randomly displayed samples. Taking recommending pictures to a user as an example, multiple pictures can be randomly sampled from the candidate set, the thumbnails of the multiple pictures can be randomly arranged and displayed in the recommendation interface, and the pictures clicked or downloaded by the user are then collected to obtain an unbiased data set. For ease of understanding, the unbiased data set will be referred to as S t below.
  • S c and S t may be data in different fields.
• For example, S c may be collected music clicked or played by the user, and S t may be pictures or videos clicked by the user. Cross-domain knowledge distillation can subsequently be implemented, so that the first neural network can output prediction results in a domain different from that of the input data. In this way, a user's preference for one type of item can be used to predict the user's preference for another type of item, alleviating the cold start problem of new application scenarios and improving user experience.
• Then, an appropriate distillation method is selected from the multiple distillation methods based on S c and S t .
• For example, when the ratio of positive to negative samples in S t or in S c is greater than the corresponding threshold, model structure distillation can be selected as the distillation method.
• When the number of feature dimensions of the samples is larger, the finally trained model also becomes more complex, and the output effect of the model improves accordingly. Therefore, when the number of feature dimensions of the samples included in S t and S c is large, feature distillation can be selected, so that the final model output effect is better.
  • knowledge distillation can be performed on the first neural network by using the distillation method to obtain an updated first neural network.
  • sample distillation can be divided into various ways, and several possible implementations are exemplarily introduced below.
  • the same model can be trained alternately with S c and S t , and the training results with S t are used to constrain the training with S c .
  • the structure of the first neural network is selected first, and the first neural network may be a CNN or an ANN or the like.
  • This first neural network is then trained using S c and S t alternately.
• The model trained with S t is represented as M t , and the model trained with S c is represented as M c ; M t can be understood as the teacher model, and M c can be understood as the student model.
  • an objective function can be used to train the first neural network.
• In addition to loss terms, the objective function can also include a constraint term, which constrains the update of the first neural network so that, during the alternating training, the parameters of M c and M t become close or identical. Then, based on the value of the objective function, gradients are computed with respect to the weight parameters and the structure parameters and updates are performed, so as to obtain the updated parameters, such as the weight parameters or the structure parameters, and thus the updated first neural network.
  • the objective function can be expressed as:
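• The equation itself is not preserved in this text; based on the terms described below, a plausible form (with θ c and θ t denoting the parameters obtained when training on S c and S t , L c and L t the corresponding loss functions, Ω(·) the regular terms, and γ an assumed weight for the squared-difference term) is:

    L = L c (θ c ; S c ) + L t (θ t ; S t ) + λ c *Ω(θ c ) + λ t *Ω(θ t ) + γ*||θ c - θ t ||^2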
• Here, λ c and λ t refer to the weight parameters of the regular terms of the M c and M t models, respectively, and a further coefficient refers to the weight of the squared-difference term of the parameters. In addition to the loss functions for S c and S t , this objective function therefore also includes the regular terms for S c and S t and the squared-difference term of the parameters, so that a constraint is formed when the parameters of the first neural network are subsequently updated, making the parameters of M c and M t closer or consistent.
• Therefore, S c and S t can be used to train the first neural network alternately, so that the model trained with S t guides the model trained with S c , completing the correction of the student model and reducing the bias of the output of the student model.
• The distillation method of this strategy is similar to the aforementioned causal representation strategy, except that the aforementioned causal representation strategy is trained alternately with a 1:1 batch training ratio, while this strategy uses an s:1 batch training ratio, where s is an integer greater than 1; for example, s can take an integer in the range of 1 to 20.
  • the number of training batches can be understood as the number of iterations for iterative training of the neural network in each round of training.
• Generally, the training process of a neural network is divided into multiple epochs, and each epoch contains multiple batches, where each batch corresponds to one batch training. For example, if the data set used for training includes 6000 pictures, each epoch uses all 6000 pictures; if one batch uses 600 pictures, an epoch includes a total of 10 batches, that is, the number of batch trainings is 10.
• For example, the objective function of the first neural network can be set so that the alternation is controlled by the batch training counts, where S t step represents the number of batch trainings using S t to train the first neural network, S c step represents the number of batch trainings using S c to train the first neural network, and the ratio between S c step and S t step may be s:1. A minimal training-loop sketch follows.
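• As a minimal sketch only (train_one_batch is an assumed helper that performs one batch update, and the batch sources are cycled so they never run out), the s:1 alternation could look like:

    from itertools import cycle

    def alternate_train(model, s_c_batches, s_t_batches, s=5, rounds=1000):
        s_c_iter, s_t_iter = cycle(s_c_batches), cycle(s_t_batches)
        for _ in range(rounds):
            for _ in range(s):                       # S c step : S t step = s : 1
                train_one_batch(model, next(s_c_iter))
            train_one_batch(model, next(s_t_iter))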
• In this strategy, a confidence variable α ij is added to all or part of the samples in S c and S t ; its value lies in the range [0, 1], and α ij is used to indicate the bias degree of a sample.
  • the objective function used to update the first neural network can be expressed as:
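• The equation is not preserved here; a plausible confidence-weighted form, consistent with the description below (with ℓ ij denoting the per-sample loss and α ij the confidence variable), is:

    L = Σ (i,j)∈S t α ij *ℓ ij + Σ (i,j)∈S c α ij *ℓ ij , with α ij = 1 for the samples in S t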
  • the confidence variable for samples in S t can be set to 1.
• The confidence of a sample in S c is set by two different mechanisms: in the global mechanism, the confidence is set to a predefined value in [0,1]; in the local mechanism, each sample is associated with an independent confidence variable, which is learned during model training.
• The confidence variable is used to constrain the training on S c , so that in the training process of the first neural network, the samples in S c and S t can be used together with the confidences of the samples to train the first neural network.
• Therefore, the confidence variable can be used to reflect the bias degree of the samples in S c , so that in the subsequent training process, the training using S c can be constrained by the confidence variable to achieve a de-biasing effect and reduce the bias of the output result of the updated first neural network.
  • label distillation refers to using the predicted label based on the sample in the unbiased dataset as a guide to distill the student model, the predicted label is output by the teacher model, and the teacher model is trained based on the unbiased dataset.
  • label distillation can also use a variety of strategies, and several possible strategies are exemplified.
• First, an unobserved data set including multiple unobserved samples is obtained. For example, taking APPs recommended to a user as an example, the icons of the recommended APPs can be displayed in the recommendation interface; an APP that the user has clicked or downloaded can be understood as the aforementioned biased sample, while an APP that the user has not clicked in the recommendation interface is an unobserved sample.
• The combination of S c , S t and the unobserved data set is hereinafter referred to as the full data set, and a plurality of samples are randomly sampled from the full data set to obtain the auxiliary data set S a ; S a may therefore include unobserved samples.
• When updating the first neural network, S a can be used for training, thereby constraining the prediction results of M c and M t on the samples in S a to be the same or close.
  • the objective functions used can include:
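• The equations are not preserved here; a plausible form of the correction term described below (with |S a | the number of samples in S a , ŷ ij Mc and ŷ ij Mt the labels predicted on S a by the two models, ℓ(·,·) an error function, and γ an assumed weight) is:

    L = L c (S c ) + (γ/|S a |)*Σ (i,j)∈S a ℓ( ŷ ij Mc , ŷ ij Mt )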
• Here, |S a | refers to the number of samples in the auxiliary data set S a , and the added term is an error function, on the samples of S a , between the labels predicted by the model trained on S c and by the model trained on S t : one prediction is the output of the first neural network on S a , and the other is the output of the second neural network on S a . Accordingly, in this strategy, the unobserved data is introduced as a correction to reduce the difference between the M c model and the M t model; the error function between the labels predicted by M c and M t on the samples of S a is introduced into the objective function to constrain the training of the first neural network, thereby reducing the bias of the output result of the first neural network.
• In another strategy, M t is pre-trained using S t , and M t is then used to predict labels for the samples in S c , obtaining predicted labels; the predicted labels are weighted and combined with the actual labels of S c ; and the new combined labels are used to train M c . Note that since the predicted labels and the actual labels of S c may differ in distribution, the predicted labels need to be normalized to reduce the difference between the predicted labels and the actual labels.
  • the objective function used for training the first neural network can be expressed as:
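• The equation is not preserved here; a plausible form of the combined label, with w as the assumed weight coefficient of the normalized predicted label, is:

    ỹ ij = w*norm( ŷ ij Mt ) + (1-w)*y ij

  M c is then trained on S c using the combined labels ỹ ij .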
• Here, one coefficient represents the weight of the predicted label, y ij represents the actual label of a sample in S c , and the corresponding predicted label is the label of the sample in S c output by M t .
• A stable feature can be understood as follows: if different data sets are used to train the neural network, different neural networks are obtained; the features for which the output results of these different neural networks differ only slightly can be understood as stable features.
• Here, DGBR refers to the deep global balancing regression algorithm.
• The specific process of using feature distillation to perform knowledge distillation on the first neural network can be as follows: the DGBR algorithm is used to screen out samples with stable features from S t , and the second neural network is trained based on the samples with stable features; then, with the trained second neural network as the teacher model and the first neural network as the student model, S c is used to train the first neural network and perform knowledge distillation on it, obtaining the updated first neural network.
• For example, the student model and the teacher model may be different types of networks that include the same number of neural network layers; the first neural network layer in the student model is the Nth layer counting from the input layer, the second neural network layer in the teacher model is also the Nth layer counting from the input layer, and the first neural network layer and the second neural network layer are corresponding neural network layers.
  • the above-mentioned neural network layer may include an intermediate layer and an output layer.
• During knowledge distillation, the student model and the teacher model respectively process the data to be processed, the outputs of the corresponding neural network layers are used to construct a loss function, and knowledge distillation is performed on the student model based on this loss function until preset conditions are met. In this way, the outputs of the corresponding neural network layers become similar or the same, so that the student model after knowledge distillation has the same or similar data processing capabilities as the teacher model.
• For example, the outputs of the first neural network layer and the second neural network layer become similar; since there can be multiple pairs of corresponding neural network layers, some or all of the neural network layers of the student model after knowledge distillation have the same or similar data processing capabilities as those of the teacher model.
• Therefore, the teacher model can be obtained by training on stable features, and the student model can be distilled using the teacher model obtained based on stable-feature training, so that under the guidance of the teacher model, the subsequently obtained student model also outputs unbiased or less biased results.
• Specifically, M t can be obtained by training with S t , and the outputs of the intermediate layers of M t are then used to guide the training of M c .
• For example, to align the embedding features of M c and M t , the embedding features of M t trained on S t can be used in the initialization of the embedding variables of M c : the embedding variables of M c are first randomly initialized, a weighted operation is performed on the M t embedding values and the random initialization values, and the weighted result is used to train M c , obtaining the trained M c .
• Alternatively, the Hint layers that need to be aligned can be selected from M c and M t for pairing (one or more pairs, and the network layer indices in M c and M t do not need to be consistent), and the paired items are then added to the objective function of M c .
• A paired item can be expressed as β*y t +(1-β)*y c , β∈(0.5,1), where y t represents the output result of the Hint layer of M t , y c represents the output result of the Hint layer of M c , and β represents the proportion of y t .
• Alternatively, the soft labels predicted by M t can be used, that is, the outputs of the network layer of M t before its softmax layer; during the training of M c , the output of the network layer of M c before its softmax layer is constrained to be the same as or close to the soft labels output by the network layer of M t before its softmax layer.
• In this case, a corresponding paired item can also be added to the objective function of M c , expressed as β*y t +(1-β)*y c , β∈(0.5,1), where y t represents the output result of the network layer of M t before the softmax layer, y c represents the output result of the network layer of M c before the softmax layer, and β represents the proportion of y t .
• Therefore, the intermediate layers of the teacher model can be used to guide the training of the intermediate layers of the student model. Because the teacher model is trained using an unbiased data set, in the process of guiding the student model, the teacher model constrains the output result of the student model, reducing the bias of the output result of the student model and improving the accuracy of the output result of the student model.
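• As a minimal sketch only (the exact loss form is an assumption; y_t is the teacher Hint-layer output and y_c the paired student-layer output), the paired item β*y t +(1-β)*y c could be used as follows:

    import torch
    import torch.nn.functional as F

    def paired_item_loss(y_t: torch.Tensor, y_c: torch.Tensor, beta: float = 0.7) -> torch.Tensor:
        # paired item: mixture of teacher and student outputs, detached so it acts as a fixed target
        target = beta * y_t.detach() + (1.0 - beta) * y_c.detach()
        return F.mse_loss(y_c, target)  # pulls the student layer toward the mixture

    def m_c_objective(task_loss, hint_pairs, beta=0.7):
        # objective of M_c: task loss on S_c plus one paired item per selected Hint layer
        return task_loss + sum(paired_item_loss(y_t, y_c, beta) for y_t, y_c in hint_pairs)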
• After knowledge distillation is performed through one of the above methods to obtain the updated first neural network, the first neural network can be used for subsequent prediction; for example, it can be used in recommendation scenarios to recommend music, videos, or images for users.
  • a "lifelong learning project" for users can be established, based on the user's historical data in the fields of video, music, news, etc., through various models and algorithms, imitating the mechanism of the human brain, building a cognitive brain, and building a user's lifelong learning system framework .
• The lifelong learning project is exemplarily divided into four stages, namely learning from the user's historical data (stage 1), monitoring the user's real-time data (stage 2), predicting the user's future data (stage 3), and making decisions for the user (stage 4).
  • the neural network distillation method provided in this application can be applied to the first stage, the third stage or the fourth stage.
• For example, user data, including terminal-side text messages, photos, and email events, as well as data from multi-domain platforms such as music APPs, video APPs, and browser APPs, can be used to construct the user's personal knowledge graph.
• When a user interacts with the system, a recommendation request is triggered, and the recommendation system inputs the request and its related information into the prediction model to predict the user's click-through rate on the products in the system. The products are then arranged in descending order of the predicted click-through rate, and the recommendation system displays the products in different positions in this order as the recommendation result for the user. The user browses the different positions, and user behaviors such as browsing and clicking to download occur.
  • the actual behavior of the user will be stored in the log as training data, and the parameters of the prediction model will be continuously updated through offline training to improve the prediction effect of the model.
  • This application corresponds to the offline training of the recommendation system, and at the same time changes the prediction logic of the prediction model.
• For example, when the user opens the mobile browser APP, the browser's recommendation module is triggered; based on the user's historical download records, the user's click records, the candidates' own characteristics, and environmental feature information such as time and location, it predicts the likelihood that the user will download or click each given candidate news item or article. The browser then displays the candidates in order of this likelihood, ranking the news more likely to be downloaded at the top and the news less likely to be downloaded at the bottom, so as to improve the download probability.
  • the user's behavior is also stored in the log, and the parameters of the prediction model are trained and updated through offline training.
• The neural network distillation method provided by this application can be introduced into lifelong learning. Taking a recommendation system applied to a terminal as an example, FIG. 10 shows a schematic diagram of the framework of a recommendation system provided by this application.
• Generally, various APPs are installed on the terminal, such as third-party APPs (for example, video APPs, music APPs, browser APPs, or application market APPs), or system APPs of the terminal such as text messages, emails, photos, and calendars.
• When users use the APPs installed on the terminal, user behavior data, such as text messages, photos, email events, videos, and browsing records, can be obtained by collecting the data generated by the user.
  • Both unbiased data sets and biased data sets can be collected through the above APP.
• For another example, taking a music APP as an example, some music can be randomly sampled from the music candidate set, the information of the sampled music (such as the title and the singer) can then be randomly displayed in the recommendation interface, and the information of the music that the user clicks can then be collected.
  • unobserved data sets may also be collected. For example, if 100 APPs are selected for recommendation and only 10 APP icons are displayed in the recommendation interface, the remaining 90 APPs are unobserved samples.
• Then, the knowledge distillation counterfactual recommendation (knowledge distillation counterfactual recommendation, KDCRec) module is used to perform knowledge distillation, obtaining the trained first neural network, that is, the memory model shown in FIG. 10.
• Optionally, knowledge distillation can also be performed by collecting unobserved data sets; please refer to the relevant introduction of label distillation 804 in the aforementioned FIG. 8, which will not be repeated here. Therefore, in the process of knowledge distillation, the neural network distillation method provided in this application can be used to correct the bias problems of the user's historical data (including location bias, selection bias, popularity bias, etc.) and obtain the real data distribution of the user.
• Subsequently, one or more predicted labels corresponding to the user can be output through the memory model, and the one or more labels can be used to construct the user portrait. For example, a label can be used to indicate the probability that the user clicks on an APP; when the probability is greater than a preset probability value, the features of the sample corresponding to the label can be added to the user portrait as features of the user.
  • the tags included in the user portrait are used to describe the user, such as the user's preferred APP type, music type, and the like.
  • the user's characteristic knowledge data and knowledge inference data can also be output, that is, user characteristics are mined through association analysis, cross-domain learning, causal reasoning and other technologies, and knowledge-based reasoning and presentation are realized with the help of external general knowledge graphs.
• Then, the feature extension based on general knowledge is input into the enhanced user portrait module, and the user portrait is enhanced in a visual and dynamic way.
  • the service server can determine information such as music, APP or video recommended for the user based on the enhanced user portrait, complete the accurate recommendation for the user, and improve the user experience.
  • this application provides a counterfactual learning method based on generalized knowledge distillation for realizing unbiased cross-domain recommendation, building an unbiased user portrait system and an unbiased personal knowledge graph.
  • Experiments are carried out on this application, including cross-domain recommendation, interest mining based on causal reasoning, and building a user portrait system.
• The results of the offline experiments are as follows: in the user portrait, the gender prediction algorithm is more than 3% more accurate than the baseline, and the age multi-classification task is nearly 8% more accurate than the baseline; the introduction of counterfactual causal learning improves the accuracy of each age group, and the variance is reduced by 50%.
  • User interest mining based on counterfactual causal inference replaces the algorithm based on association rule learning, effectively reducing the user's effective action set and providing interpretability for user preference labels.
• Generally, multiple lists can be displayed in the recommendation interface of the application market; the user's click probability on the candidate-set products can be predicted according to the user, the candidate-set products, and the context features, and the candidate products are sorted in descending order according to this probability, with the APPs most likely to be downloaded ranked first. After seeing the recommendation results of the application market, users choose to browse, click, or download according to their personal interests, and these user behaviors are stored in the log.
  • the collected user data has problems such as location bias and selection bias.
• To solve this problem, random traffic data (uniform data) is introduced and, combined with the decision-making mechanism module 101 proposed by the present invention, an appropriate distillation method is selected from 803-806 in the aforementioned FIG. 8; the user log data, that is, the biased data, is then combined to jointly train the recommendation model, that is, the first neural network.
• Compared with the baseline, the counterfactual technology based on label distillation achieves an 8.7% improvement in the area under the ROC curve (area under the ROC curve, AUC), the counterfactual causal learning technology based on sample distillation achieves a 6% improvement, and the counterfactual causal learning technique based on model structure distillation achieves a 5% improvement.
  • the flow and application scenarios of the neural network distillation method provided by the present application are described in detail above.
  • the first neural network obtained by the foregoing method can be applied to a recommendation scenario, and the recommendation method provided by this application will be described in detail below in combination with the foregoing method.
  • FIG. 11 shows a schematic diagram of a recommendation method 1100 provided by an embodiment of the present application.
• The method shown in FIG. 11 can be executed by a recommendation device, which can be a cloud service device or a terminal device, for example, a device with sufficient computing power to execute the recommendation method, such as a computer or a server.
  • the method 1100 may be performed by the execution device 210 in FIG. 2 or FIG. 5 or the local device in FIG. 5 .
  • the method 1100 may be specifically executed by the execution device 210 as shown in FIG. 3 , and the target users and candidate recommendation objects in the method 1100 may be the data in the database 230 as shown in FIG. 3 .
  • the method 1100 includes steps S1110 and S1120. Steps S1110 to S1120 will be described in detail below.
• S1110: Acquire the information of the target user and the information of the candidate recommendation objects.
• For example, when a user enters the recommendation system, a recommendation request is triggered; the recommendation system may take the user who triggers the recommendation request as the target user, and take the recommendation objects that can be displayed to the user in the recommendation system as the candidate recommendation objects.
• The information of the target user may include an identifier of the user, such as the target user ID, and may also include personalized attribute information of the user, such as the gender of the target user, the age of the target user, the occupation of the target user, the income of the target user, the hobbies of the target user, or the education of the target user.
  • the information of the candidate recommendation object may include an identifier of the candidate recommendation object, such as an ID of the candidate recommendation object.
  • the information of the candidate recommendation object may also include some attribute information of the candidate recommendation object, for example, the name of the candidate recommendation object or the type of the candidate recommendation object, and the like.
• S1120: Input the information of the target user and the information of the candidate recommendation objects into the recommendation model, and predict the probability that the target user has an operation action on a candidate recommendation object.
  • the recommendation model is the updated first neural network obtained in the aforementioned FIG. 6 .
  • the updated first neural network is hereinafter referred to as the recommendation model.
• For the training method of the recommendation model, please refer to the relevant description in the aforementioned steps 601-603, which will not be repeated here.
  • the candidate recommendation objects in the candidate recommendation set may be sorted by predicting the probability that the target user has an operation action on the candidate recommendation object, so as to obtain the recommendation result of the candidate recommendation object.
  • the candidate recommendation object with the highest selection probability is displayed to the user.
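• As a minimal sketch only (model.predict_prob is an assumed interface, not one defined in this application), step S1120 followed by ranking could look like:

    def recommend(model, target_user, candidates, top_k=10):
        # predict the probability that the target user has an operation action on each candidate
        scored = [(c, model.predict_prob(target_user, c)) for c in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # descending probability
        return [c for c, _ in scored[:top_k]]                # display the top candidates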
  • the candidate recommendation object may be a candidate recommendation application.
  • FIG. 12 shows a "recommended" page in the application market.
  • the list may include premium applications and premium games.
• Specifically, the recommendation system of the application market predicts the probability that the user will download (install) each candidate recommended application according to the user's information and the information of the candidate recommended applications, and sorts the candidate recommended applications in descending order of this probability, ranking the apps that are most likely to be downloaded first.
• For example, the recommendation result may be that, in the premium games list, App5 is located in recommended position one, App6 in recommended position two, App7 in recommended position three, and App8 in recommended position four.
  • the application market shown in Figure 12 can obtain training data from user behavior logs to train a recommendation model.
• The recommendation model is obtained by training the first neural network using the biased data set and the unbiased data set in the sample set according to the first distillation method, where the biased data set includes biased samples, the unbiased data set includes unbiased samples, and the first distillation method is determined according to the data characteristics of the sample set.
• The samples in the biased data set include the information of the first user, the information of the first recommended object, and the actual label, where the actual label of a sample in the biased data set is used to indicate whether the first user has an operation action on the first recommended object.
• The samples in the unbiased data set include the information of the second user, the information of the second recommended object, and the actual label, where the actual label of a sample in the unbiased data set is used to indicate whether the second user has an operation action on the second recommended object.
• The unbiased data set is obtained under the condition that the candidate recommended objects in the candidate recommended object set have the same probability of being shown, and the second recommended object is a candidate recommended object in the candidate recommended object set.
• Specifically, that the unbiased data set is obtained under the condition that the candidate recommended objects in the candidate recommended object set have the same probability of being displayed may include: the samples in the unbiased data set are obtained when the candidate recommended objects in the candidate recommended object set are randomly presented to the second user; or the samples in the unbiased data set are obtained when the second user searches for the second recommended object.
  • the samples in the unbiased data set belong to the data of the source domain, and the samples in the biased data set belong to the data of the target domain.
• The method corresponding to FIG. 6 is the training phase of the recommendation model (the phase performed by the training module 202 shown in FIG. 3), and the specific training adopts the updated first neural network provided in the method corresponding to FIG. 6, that is, the recommendation model. The method corresponding to FIG. 11 can be understood as the application phase of the recommendation model (the phase performed by the execution device 210 shown in FIG. 3): using the recommendation model trained by the method corresponding to FIG. 6, and according to the information of the target user and the information of the candidate recommendation objects, the output result is obtained, that is, the probability that the target user has an operation action on a candidate recommendation object.
• Three examples (Example 1, Example 2, and Example 3) are used below to illustrate the application of the solutions of the embodiments of the present application to different scenarios.
  • the training method of the recommendation model described below can be regarded as one of the methods corresponding to the aforementioned FIG. 6 .
  • the recommended method described below can be regarded as a specific implementation manner of FIG. 11 .
  • the repeated description is appropriately omitted when introducing the three examples of the embodiments of the present application.
• In a recommendation scenario, the recommendation system usually needs to perform multiple processes, such as recall, fine ranking, or manual rules, on all items in the full library based on the user portrait to generate the final recommendation result, which is then displayed to the user. Since the number of items recommended to users is much smaller than the total number of items, various bias problems, such as location bias and selection bias, are introduced in the process.
• A user portrait refers to a collection of tags for a user's personalized preferences; for example, user portraits can be generated from a user's interaction history.
  • Selection bias refers to the bias in the collected data due to different probabilities of items being displayed.
  • the ideal training data is obtained when items are presented to users with the same probability of presentation. In reality, due to the limited number of display slots, not all items can be displayed.
• The recommendation system usually makes recommendations according to the predicted user selection rate of each item. The user can only interact with the displayed items; items that have not been displayed cannot be selected, that is, they cannot participate in the interaction, so the opportunities for items to be shown are not the same. The entire recommendation process, including recall, fine ranking, and other stages, involves truncation operations, that is, some recommended objects are selected from the candidate recommended objects for display.
• Location bias refers to the bias in the collected data caused by the different positions at which items are displayed.
  • Recommender systems usually display the recommendation results in order from top to bottom or from left to right. According to people's browsing habits, the items located in the front are easier to see, and the user's selection rate is higher.
• For the same application (application, APP), the download rate when the APP is displayed in the first position is much higher than the download rate when it is displayed in the last position.
• As shown in FIG. 13, the fine ranking process causes differences in the display positions of the items, thereby introducing location bias.
• An item with more display opportunities has a higher probability of being selected by the user, and an item with a higher selection probability is more likely to be recommended to the user in subsequent recommendations, thereby obtaining more display opportunities and more clicks from other users. This aggravates the influence of the bias problem, causes the Matthew effect, and worsens the long tail problem.
• The long tail problem makes it impossible to meet the personalized needs of most niche users, which affects the user experience.
  • many items in the recommender system cannot generate actual commercial value because they have no opportunity to be exposed, consuming storage resources and computing resources, resulting in a waste of resources.
  • a lifelong learning project refers to a project based on the user's historical data in video, music, news and other fields, through various models and algorithms, imitating the mechanism of the human brain to build a cognitive brain and achieve the goal of lifelong learning.
• A schematic diagram of a lifelong learning framework is shown in FIG. 14.
  • This framework includes multiple recommendation scenarios such as video APP, reading APP and browser APP.
• The traditional recommendation learning scheme learns the hidden rules of users from the historical behavior within each recommendation scenario or domain and then recommends according to the learned rules; the entire learning and implementation process does not consider inter-domain knowledge transfer and sharing at all.
  • Cross-domain recommendation is a recommendation method that learns user preferences in the source domain and applies it to the target domain. Through cross-domain recommendation, the rules learned in the source domain can be used to guide the recommendation results in the target domain, realize knowledge transfer and sharing between domains, and solve the cold start problem.
  • the cold start problem of the user in the recommendation scene of the music app can be solved.
• For example, when books are recommended for user A, user A's interests and preferences in the recommendation scenario of the reading APP can be learned based on user A's interaction history data. Based on the interest preferences in this recommendation scenario, neighbor users who have the same interests as user A can be determined. When music is recommended for user A, the interest preferences of the neighbor users in the music APP recommendation scenario can be learned based on their interaction history data in that scenario, and the learned interest preferences can then guide the recommendation results for user A in the recommendation scenario of the music APP. In this process, the recommendation scenario of the reading APP is the source domain, and the recommendation scenario of the music APP is the target domain.
  • the data distribution of the source domain and the target domain are often inconsistent, so the data distribution of the source domain is biased relative to the data distribution of the target domain.
• Directly using the above association rules to implement cross-domain recommendation will introduce bias into the learning process: the model will rely more on the user's interests and preferences in the source domain to make recommendations, that is to say, the trained model is biased, so that a model learned on the data of the source domain cannot be effectively generalized to the target domain, and there is a risk of distortion.
• An implementation manner of step S1110 is described below, taking the recommendation scenario of the reading APP as the source domain and the recommendation scenario of the video APP as the target domain as an example.
• The recommendation scenario of the reading APP refers to the scenario of recommending books for the user, and the recommendation scenario of the video APP refers to the scenario of recommending videos for the user.
  • the biased samples are obtained according to the user's interaction history in the recommendation scene (target domain) of the video APP.
  • Table 1 shows the data obtained based on the user's interaction history (eg, user behavior log) in the recommendation scenario of the video APP.
  • a row in Table 1 is a sample.
• A biased sample includes the information of the first user and the information of the first recommended object. The information of the first user includes the ID of the first user; the first recommended object is a video, and the information of the first recommended object includes the ID of the first recommended object, the label of the first recommended object, the producer of the first recommended object, the actors of the first recommended object, and the rating of the first recommended object. That is to say, the biased sample includes a total of 6 types of features.
• Table 1 is only for illustration; the information of the user and the information of the recommended object may include more or fewer items than Table 1, or more or fewer types of feature information.
  • the processed data is stored in libSVM format, for example, the data in Table 1 can be stored in the following form:
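• As a rough illustration of the libSVM storage form (a minimal sketch; the field encodings and values below are hypothetical, not the actual records of Table 1), each sample is written as a label followed by sparse index:value feature pairs:

    # Minimal sketch: write biased samples (label + 6 encoded features) in
    # libSVM format; the feature encodings are hypothetical illustrations.
    samples = [
        # (label, [user_id, video_id, tag_id, producer_id, actor_id, rating])
        (1, [1001, 205, 7, 12, 88, 4]),   # label 1: the user operated on the video
        (0, [1002, 311, 2, 19, 73, 3]),   # label 0: no operation
    ]
    with open("biased_train.libsvm", "w") as f:
        for label, feats in samples:
            pairs = " ".join(f"{i + 1}:{v}" for i, v in enumerate(feats))
            f.write(f"{label} {pairs}\n")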
  • n biased samples can be obtained to form a biased data set.
  • the unbiased samples are obtained based on the interaction history of users in the recommended scene (source domain) of the reading APP.
  • the data in the source domain may also include data of other recommended scenarios, and may also include data of multiple recommended scenarios.
  • the data of the source domain may include data in the recommended scenarios of the reading APP.
• FIG. 16 is for illustration only; alternatively, the unbiased samples may not be used as data in the validation set.
  • Table 2 shows the data obtained based on the user's interaction history (eg, user behavior log) in the recommended scenario of reading the APP.
  • a row in Table 2 is a training sample.
  • the sample is an unbiased sample, and the unbiased sample includes information of the second user and information of the second recommended object.
  • the information of the second user includes the ID of the second user
  • the second recommendation object is a book
• The information of the second recommendation object includes the ID of the second recommendation object, the label of the second recommendation object, the publisher of the second recommendation object, and other attributes of the second recommendation object.
• Table 2 is only for illustration; the information of the user and the information of the recommended object may include more or fewer items than Table 2, or more or fewer types of feature information.
  • the processed data is stored in libSVM format, for example, the data in Table 2 can be stored in the following form:
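• Reading such records back is symmetric; the sketch below (the file name and parsing conventions are assumptions) parses each line into a label and a sparse feature dictionary:

    # Minimal sketch: parse libSVM-format lines into (label, {index: value}).
    def parse_libsvm_line(line: str):
        parts = line.split()
        label = int(parts[0])
        feats = {int(i): float(v) for i, v in (p.split(":") for p in parts[1:])}
        return label, feats

    with open("unbiased_train.libsvm") as f:
        dataset = [parse_libsvm_line(line) for line in f if line.strip()]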
  • the recommendation model can be applied in the target domain, for example, the recommendation scene of the video APP in FIG. 16 .
  • the user interaction data in the recommendation scene of the reading APP is richer, and the data distribution can more accurately reflect the user's preference.
  • the solutions of the embodiments of the present application can enable the recommendation model to better grasp the user's personalized preference in the reading scene, thereby guiding the recommendation result in the video scene, and improving the accuracy of the recommendation result.
• In the solutions of the embodiments of the present application, the user interaction history records of both the source domain (for example, the recommendation scene of the reading APP) and the target domain (for example, the recommendation scene of the video APP) are incorporated into the learning, so that the trained model has a better evaluation effect in the source domain. The trained model can then well capture the user's interest preferences in the source domain; since in similar recommendation scenes the user's interest preferences are also similar, the recommendation model can likewise fit the user's interest preferences well in the target domain, recommend results that match those interests, realize cross-domain recommendation, and alleviate the cold start problem.
  • the recommendation model can predict the probability that the user has an action on the recommended object in the target domain, that is, the probability that the user selects the recommended object.
  • the target recommendation model is deployed in the target domain (for example, in the recommendation scene of a video APP), and the recommendation system can determine the recommendation result based on the output of the target recommendation model and display it to the user.
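• As a hedged sketch of this deployment step (the scoring function and feature layout are illustrative assumptions, not the patent's concrete implementation), the recommendation system can score each candidate with the target recommendation model and display the highest-probability results:

    # Minimal ranking sketch: score_fn stands in for the deployed recommendation
    # model and returns the predicted probability of an operation action.
    def recommend(score_fn, user_feats, candidates, k=10):
        scored = sorted(candidates, key=lambda c: score_fn(user_feats, c), reverse=True)
        return scored[:k]

    # Toy usage with a stand-in scorer.
    top = recommend(lambda u, c: 1.0 / (1 + len(c)), {"user_id": 1},
                    ["video_a", "documentary_b", "clip"], k=2)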
• As noted above, the traditional recommendation learning scheme learns the hidden rules in users' historical behavior within each recommendation scenario or domain, and then makes recommendations according to the learned rules.
• The entire learning and implementation process completely ignores knowledge transfer and sharing between domains.
• When an application makes recommendations for a user, it usually learns the user's preferences based only on the user's interaction data in that application, regardless of the user's interaction data in other applications.
  • the embodiments of the present application provide a recommendation method and electronic device, which can make recommendations for users by learning preferences of users in other domains, thereby improving the accuracy of prediction results and improving user experience.
• Terms such as user behavior data and user interaction data can be considered to express the same meaning, and can be understood as data related to the user's operation behavior when a recommended object is displayed to the user.
  • FIG. 17 is a schematic diagram of a group of graphical user interfaces (graphical user interface, GUI) provided by an embodiment of the present application.
• The user can perform a click operation on the settings application in the mobile phone; in response to the click operation, the mobile phone enters the main interface 301 of the settings application, and the main interface of the settings application can be displayed as shown in (a) of FIG. 17.
  • the main interface 301 may include a batch management control, a cross-domain recommendation management control for each application, a sidebar alphabetical sorting index control, and the like.
  • the main interface 301 may also display the cross-domain recommendation function "enabled” or "disabled” of each application (eg, music APP, reading APP, browser APP, news APP, video APP, or shopping APP, etc.).
• The cross-domain recommendation management controls of each application displayed in the main interface 301 may be displayed in the order of the first letter of the application name from "A" to "Z", where each application corresponds to its own cross-domain recommendation management control. It should be understood that the main interface 301 may also include other, more or fewer, or similar display contents, which is not limited in this application.
  • the mobile phone can display the cross-domain recommendation management interface corresponding to the application.
• The user performs a click operation on the cross-domain recommendation management control of the browser APP shown in (a) of FIG. 17, and in response to the click operation, the mobile phone enters the browser APP cross-domain recommendation management interface 302.
  • the cross-domain recommendation management interface 302 may display content as shown in (b) of FIG. 17 .
  • a control to allow cross-domain recommendations may be included.
  • the cross-domain recommendation management interface 302 may also include other more or less similar display contents, and the cross-domain recommendation management interface 302 may also include different display contents according to different applications, which are not limited in this embodiment of the present application .
  • the default state of the cross-domain recommendation management control may be a closed state.
• When the control for allowing cross-domain recommendation is enabled, the cross-domain recommendation function of the browser APP is enabled. Accordingly, the browser APP can obtain user interaction data from multiple APPs and learn from it to recommend relevant content to the user. Further, when the cross-domain recommendation control is enabled, the cross-domain recommendation management interface 302 may also present a learning list of the browser APP, and the learning list includes multiple options. An option on the cross-domain recommendation management interface 302 can be understood as the name of an application and its corresponding switch control.
• The cross-domain recommendation management interface 302 includes a control for allowing cross-domain recommendation and a plurality of options. Each option in the plurality of options is associated with an application, and the option associated with an application is used to control turning on and off the permission for the browser APP to obtain user behavior data from that application. It can also be understood that an option associated with a certain application is used to control whether the browser APP performs the cross-domain recommendation function based on the user behavior data in that application. For ease of understanding, in the following embodiments, switch controls are still used to represent the meaning of options.
  • the learning list includes multiple options, that is, the names of multiple applications and their corresponding switch controls are presented on the cross-domain recommendation management interface 302 .
• For example, when the switch control corresponding to the video APP is turned on, the browser APP can obtain user behavior data from the video APP and perform learning to make recommendations for the user.
• For applications with the cross-domain recommendation function enabled, the cross-domain recommendation interface 302 may also display whether obtaining user data from each application (for example, a music APP, reading APP, browser APP, news APP, video APP, or shopping APP) is "Allowed" or "Forbidden".
• When the switch control corresponding to an application is turned on, as shown in (b) of FIG. 17, that application is in the "Allowed" state.
• A plurality of switch controls are presented on the first interface, respectively corresponding to applications such as the music APP, reading APP, shopping APP, video APP, news APP, and chat APP.
• Taking the control corresponding to the music APP as an example: when the control corresponding to the music APP is turned on, that is, the music APP is in the "Allowed" state, the browser APP can obtain user behavior data from the music APP and learn from it to make recommendations for the user.
• When the control is turned off, the mobile phone presents the content shown in (c) of FIG. 17, and the browser APP no longer obtains user behavior data from the music APP; that is, the browser APP is not allowed to obtain user behavior data in the music APP.
• The browser APP will disable the cross-domain recommendation function in response to the closing operation; that is, the browser APP is not allowed to obtain user interaction data in other APPs. For example, the user performs a click operation on the control for allowing cross-domain recommendation as shown in (b) of FIG. 17.
  • the cross-domain recommendation management interface can display the content as shown in (d) in FIG. 17 , and the cross-domain recommendation function of the browser APP in all applications in the learning list is disabled. In this way, management efficiency can be improved and user experience can be improved.
  • the content recommended by the application for the user is the recommended object, and the recommended object can be displayed in the application.
  • a recommendation request can be triggered, and the recommendation model recommends relevant content for the user according to the recommendation request.
  • the information flow recommended by the browser APP for the user may be displayed on the main interface of the browser APP.
• The mobile phone displays the main interface 303 of the browser APP as shown in (a) of FIG. 18.
  • a recommendation list of one or more recommended contents can be displayed in the browser APP, and the one or more recommended contents are the recommended objects in the browser APP.
  • the main interface 303 of the browser APP may also include other more or less display contents, which is not limited in this application.
• The user may perform certain operations on the content presented in the recommendation list on the main interface 303 of the browser APP to view the recommended content, delete (or ignore) the recommended content, view relevant information of the recommended content, and the like. For example, the user clicks on a certain recommended content, and in response to the click operation, the mobile phone can open the recommended content. As another example, the user swipes left (or right) on a certain recommended content, and in response to the operation, the mobile phone can delete the recommended content from the recommendation list. As yet another example, the user long-presses a certain recommended content, and in response to the long-press operation, the mobile phone may display relevant information of the recommended content, as shown in (a) of FIG. 18.
• The user performs a long-press operation as shown in (a) of FIG. 18, and in response to the long-press operation, the mobile phone may display a prompt box as shown in the figure.
• Prompt information is displayed in the prompt box, and the prompt information is used to prompt the user that the recommended content is recommended based on user interaction data in other applications.
  • the prompt information is used to prompt the user that the recommended content is the content recommended for the user based on the data of the user in the video APP.
• It should be understood that the user can open the video or delete the recommended content in other ways, and can also call up the relevant information of the recommended content in other ways, such as a slow left-right slide, which is not limited in the embodiments of the present application.
• When the user clicks on the browser APP, in response to the click operation, the mobile phone can also display the main interface 304 of the browser APP as shown in (b) of FIG. 18. The main interface 304 can display a recommendation list of one or more recommended contents and prompt information corresponding to the one or more recommended contents, and the one or more recommended contents are the recommended objects in the browser APP.
  • the main interface 304 may also include other more or less display contents, which is not limited in this application.
  • the prompt information is used to prompt the user that the recommended content is recommended based on user interaction data in other applications.
• The user may perform certain operations on the videos presented in the recommendation list of the main interface 304 to view the recommended content, delete (or ignore) the recommended content, and the like. For example, the user clicks on a certain recommended content, and in response to the click operation, the mobile phone can open the recommended content. As another example, the user swipes left (or right) on a certain recommended content, and in response to the operation, the mobile phone can delete the recommended content from the recommendation list. It should be understood that, in some other embodiments, the user can open or delete the recommended content in other ways, and can also call up the relevant information of the recommended content in other ways, such as a slow left-right slide, which is not limited in the embodiments of the present application.
  • the prompt information mainly provides reference information for the user, so that the user knows that the current recommendation object is obtained based on the function of cross-domain recommendation, and the content of the prompt information may also have other forms, which are not limited in this embodiment of the present application.
  • the user when the user deletes the recommended content on the main interface, it can be understood that the user only deletes a certain recommended content from the recommendation list on the main interface, that is, the user is not interested in the recommended content.
  • This behavior can be recorded in the user behavior log and used as training data for the recommendation model. For example, as a biased sample in the aforementioned method.
  • the cross-domain recommendation function of the application can be turned on.
  • the cross-domain recommendation function of an application can be turned on or off in the following two ways.
• One way is to turn the cross-domain recommendation function of an application on or off individually with a single click.
• Enabling or disabling the control for allowing cross-domain recommendation turns the cross-domain recommendation function of that single application on or off.
  • FIG. 19(a) shows the same interface as FIG. 17(a) .
• The user performs a click operation on the batch management control shown in (a) of FIG. 19.
• The user enters the batch management interface 305, which may include a search application control, a cross-domain recommendation master switch control, and cross-domain recommendation switch controls for the various applications.
• Users can turn the cross-domain recommendation master switch control (that is, the switch after "All" in the figure) on or off to enable or disable the cross-domain recommendation function of all applications as a whole.
  • the batch management interface 305 also includes cross-domain recommendation switch controls for each application, and the user can also turn on or off the cross-domain recommendation function of a single application by controlling the opening and closing of the cross-domain recommendation switch control of a certain application.
• The cross-domain recommendation switch controls of each application displayed in the batch management interface 305 may be displayed in the order of the first letter of the application name from "A" to "Z", and the cross-domain recommendation function of each application is controlled by its respective cross-domain recommendation switch control.
• Terms such as "turning off cross-domain recommendation" and "closing cross-domain recommendation" may be considered to express the same meaning: the cross-domain recommendation function of the application is turned off, and the application no longer performs cross-domain recommendation.
• Likewise, terms such as "turning on cross-domain recommendation" and "enabling cross-domain recommendation" can be considered to express the same meaning: the cross-domain recommendation function of the application is turned on, and the application can perform cross-domain recommendation.
  • FIG. 20 is a schematic flowchart of a recommendation method provided by an embodiment of the present application. As shown in FIG. 20 , the method 1200 may include the following steps:
• The first interface may include a learning list of at least one application program, and the learning list of the first application program in the learning lists of the at least one application program includes at least one option, where each option in the at least one option is associated with an application program.
  • the first interface may be the cross-domain recommendation management interface 302 of the browser APP.
  • the cross-domain recommendation management interface 302 is used to control the opening and closing of the cross-domain recommendation function of the browser APP.
  • the learning list of the first application may be the learning list of the browser APP.
  • the at least one option may be the same as the application name, such as a “shopping” option, a “map” option, a “health” option, a “video” option, and the like.
• Each option in the at least one option is associated with an application, and the option associated with an application is used to control turning on and off the function of learning the user's behavior in that application.
  • the options associated with the application are used to control whether the first application is allowed to obtain the data of the application for cross-domain recommendation.
  • the first operation may be a click operation, a double-click operation, a long-press operation, a sliding operation, or the like.
• S1230: in response to the first operation, enable or disable the cross-domain recommendation function of the first application in the applications associated with some or all of the options in the learning list of the first application.
• The first application is allowed to acquire user behavior data in the applications associated with some or all of the options, and to learn the user's preferences in those applications, so as to make recommendations for the user in the first application.
  • the user can see from the interface that the cross-domain recommendation function of the first application is on or off.
  • the first operation acts on the first option, and in response to the user's first operation on the first option, enables or disables the cross-domain recommendation function of the first application in the application associated with the first option; Wherein, the first option is located in the study list of the first application.
  • the first option may be the “music” option on the first interface. It should be understood that the first option may be any option associated with the application in the learning list of the first application on the first interface, such as a "music” option, a “shopping” option, a “browser” option, and so on.
  • the first operation may be an opening or closing operation of the switch control corresponding to the first option.
• The first operation may be used to turn off the switch control corresponding to the first option; correspondingly, the cross-domain recommendation function of the first application in the application associated with the first option is disabled.
• When the switch control corresponding to the first option is in an off state, the first operation may be used to turn on the switch control corresponding to the first option; correspondingly, the cross-domain recommendation function of the first application in the application associated with the first option is enabled.
• In this way, the user can individually control the cross-domain recommendation function of the first application in each of the other applications.
• The first operation acts on the switch control corresponding to the learning list of the first application; in response to the user's first operation on the switch control, the cross-domain recommendation function of the first application in the applications associated with all the options in the learning list of the first application is enabled or disabled.
  • the first operation may be a closing operation of the control allowing cross-domain recommendation.
  • the first operation may be an opening operation for the control for allowing cross-domain recommendation. In this way, the user can overall control the cross-domain recommendation function of the first application, improve management efficiency, and improve user experience.
• The method 1200 further includes: displaying a second interface, where the second interface is used to present one or more recommended objects and prompt information of the one or more recommended objects.
• The prompt information is used to indicate that the one or more recommended objects are determined based on user behavior data in an application program in the at least one application program.
  • the second interface may be the main interface 303 of the browser APP.
  • the second interface may be the main interface 304 of the browser APP.
  • the prompt information may be used to prompt the user that the currently recommended content is obtained based on the data of the video APP.
  • the one or more recommended objects are determined by inputting the user's information and the information of the candidate recommended objects into the recommendation model, and predicting the probability that the user has an operation action on the candidate recommended objects.
  • the user behavior data in the video APP is used as the data of the source domain
  • the user behavior in the browser APP is used as the data of the target domain
• The aforementioned method 1100 is executed to obtain a recommendation model, and the recommendation model can be used to predict the probability that the user has an operation action on a candidate recommended object; the recommended content is determined based on the probability value, and then the content shown in FIG. 14 is displayed.
  • the recommendation model is obtained by training the first neural network using the biased data set and the unbiased data set in the sample set according to the first distillation method.
• The biased data set includes biased samples,
• and the unbiased data set includes unbiased samples.
  • the first distillation method is determined according to the data characteristics of the sample set.
  • the samples in the biased data set include the information of the first user, the information of the first recommended object and the actual label.
  • the actual label of the sample is used to indicate whether the first user has an action on the first recommended object.
  • the samples in the unbiased data set include the information of the second user, the information of the second recommended object and the actual label.
  • the actual label is used to indicate whether the second user has an operation action on the second recommended object.
  • the first application may obtain user behavior data from the application associated with the first option, and use the user behavior data in the application associated with the first option as the source domain data.
  • the data of the source domain may also include user behavior data in other applications.
• The first application may acquire user behavior data from the applications associated with all the options, and the obtained user behavior data is taken as the data of the source domain.
  • the recommendation model may use the updated first neural network trained in the aforementioned FIG. 6 .
• Before displaying the first interface, the method further includes: displaying a third interface, where the third interface includes switch controls corresponding to at least one application; detecting a third operation performed by the user on the switch control of the first application among the switch controls corresponding to the at least one application on the third interface; and in response to the third operation, displaying the first interface.
  • the third interface may be the setting application main interface 301 .
  • the switch control of the first application may be the cross-domain recommendation management control of the browser APP.
• The third operation may be a click operation on the switch control corresponding to the first application, and in response to the click operation, the interface shown in (b) of FIG. 17 is displayed.
• According to the solutions of the embodiments of the present application, the user interaction history records of both the source domain and the target domain are incorporated into the learning, so that the recommendation model can better learn the user's preferences. The recommendation model can thus fit the user's interest preferences well in the target domain, recommend results that match those interests, realize cross-domain recommendation, and alleviate the cold start problem.
• FIG. 21 is a schematic structural diagram of a neural network distillation apparatus provided by the present application.
  • the neural network distillation apparatus may include:
• The acquisition module 2101 is used to obtain a sample set.
• The sample set includes a biased data set and an unbiased data set.
• The biased data set includes biased samples, and the unbiased data set includes unbiased samples.
• The sample size of the biased data set is larger than the sample size of the unbiased data set;
• The decision-making module 2102 is used to determine the first distillation method according to the data features of the sample set, where, when different distillation methods are used for knowledge distillation, the teacher model guides the student model in different ways; the teacher model is obtained by training with the unbiased data set, and the student model is obtained by training with the biased data set;
  • the training module 2103 is configured to train the first neural network according to the first distillation method based on the biased data set and the unbiased data set to obtain an updated first neural network.
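• A hedged end-to-end sketch of how these three modules might cooperate (the module interfaces, method names, and callables below are assumptions for illustration only, not the patented procedure):

    # Illustrative flow of modules 2101-2103: acquire the sample set, pick a
    # distillation method from its data features, then train teacher/student.
    def run_distillation(sample_set, choose_method, methods):
        biased, unbiased = sample_set["biased"], sample_set["unbiased"]
        name = choose_method(sample_set)      # decision-making module 2102
        distill = methods[name]               # e.g. feature- or label-based
        return distill(biased, unbiased)      # training module 2103 -> student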
  • the samples in the sample set include input features and actual labels
  • the first distillation method is to perform distillation based on the input features of the samples in the biased data set and the unbiased data set.
• The training module 2103 is specifically configured to alternately use the biased data set and the unbiased data set to train the first neural network to obtain an updated first neural network, where in one alternation the number of batch-training iterations using the biased data set and the number of batch-training iterations using the unbiased data set are in a preset ratio, and the samples' input features serve as the input of the first neural network.
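• A minimal sketch of this alternating schedule (step_fn and the batch containers are stand-ins, and the ratio value is illustrative):

    # Alternate `ratio` biased batches with one unbiased batch until either
    # iterator is exhausted; step_fn performs one batch-training step.
    def alternate_train(biased_batches, unbiased_batches, step_fn, ratio=4):
        b_iter, u_iter = iter(biased_batches), iter(unbiased_batches)
        while True:
            try:
                for _ in range(ratio):
                    step_fn(next(b_iter))   # batches from the biased data set
                step_fn(next(u_iter))       # one corrective unbiased batch
            except StopIteration:
                break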
• When the preset ratio is 1, the difference between a first regular term and a second regular term is added to the loss function of the first neural network.
• The first regular term is a parameter obtained by training the first neural network with the samples included in the unbiased data set,
• and the second regular term is a parameter obtained by training the first neural network with the samples included in the biased data set.
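• One way to read this regularized objective (a hedged sketch; the weighting factor λ and the exact form of the regular terms are assumptions, since the text only states that their difference is added to the loss):

    L_total(θ) = L(θ; D_biased) + λ · (Ω_u(θ) − Ω_b(θ))

  where Ω_u(θ) is the first regular term derived from parameters trained on the unbiased data set, and Ω_b(θ) is the second regular term derived from parameters trained on the biased data set.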
• The training module 2103 is specifically used to set a confidence for the samples in the biased data set, where the confidence is used to represent the degree of bias of a sample; and to train the first neural network based on the biased data set, the confidences of the samples in the biased data set, and the unbiased data set to obtain an updated first neural network, where the samples' input features serve as the input of the first neural network during training.
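• A minimal sketch of confidence weighting (the loss values and confidences are illustrative; a confidence near 1 marks a sample treated as nearly unbiased):

    # Weight each biased sample's loss by its confidence before averaging,
    # so heavily biased samples contribute less to the training signal.
    def confidence_weighted_loss(per_sample_losses, confidences):
        weighted = [l * c for l, c in zip(per_sample_losses, confidences)]
        return sum(weighted) / len(weighted)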
• The samples included in the biased data set and the unbiased data set include input features and actual labels.
• The first distillation method is to perform distillation based on the predicted labels of the samples included in the unbiased data set; the predicted labels are output by the updated second neural network for the samples in the unbiased data set, and the updated second neural network is obtained by training the second neural network using the unbiased data set.
• The sample set further includes an unobserved data set, and the unobserved data set includes a plurality of unobserved samples.
• The training module 2103 is specifically used to: train the first neural network with the biased data set to obtain a trained first neural network, and train the second neural network with the unbiased data set to obtain an updated second neural network; collect multiple samples from the sample set to obtain an auxiliary data set; and update the trained first neural network using the auxiliary data set, with the predicted labels of the samples in the auxiliary data set as constraints, to obtain the updated first neural network, where the predicted labels of the samples in the auxiliary data set are output by the updated second neural network.
• The training module 2103 is specifically configured to: train the second neural network with the unbiased data set to obtain an updated second neural network; output the predicted labels of the samples in the biased data set through the updated second neural network; weight and combine the predicted label of a sample and the actual label of the sample to obtain a combined label for the sample; and train the first neural network using the combined labels of the samples to obtain the updated first neural network.
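• A minimal sketch of the label-combination step (alpha is a hypothetical mixing weight; teacher_predict stands in for the updated second neural network):

    # Blend the teacher's predicted label with the observed actual label;
    # the student is then trained against the combined label.
    def combined_label(teacher_predict, sample, actual_label, alpha=0.5):
        predicted = teacher_predict(sample)             # soft label in [0, 1]
        return alpha * predicted + (1.0 - alpha) * actual_label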
• The decision-making module 2102 is specifically configured to calculate a first ratio between the sample size of the unbiased data set and the sample size of the biased data set, and to select from multiple distillation methods a first distillation method that matches the first ratio, where the data features of the sample set include the first ratio.
  • the first distillation method includes: training the teacher model based on the features extracted from the unbiased data set, and performing knowledge distillation on the student model by using the teacher model and the biased data set.
• The training module 2103 is specifically configured to: output the features of the unbiased data set through a preset algorithm; train the second neural network according to the features of the unbiased data set to obtain an updated second neural network; and, taking the updated second neural network as the teacher model and the first neural network as the student model, use the biased data set to perform knowledge distillation on the first neural network to obtain the updated first neural network.
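• A hedged sketch of this feature-based route (select_features and fit are caller-supplied stand-ins; select_features plays the role of the preset algorithm, for example a DGBR-style stable-feature selection mentioned later in this application):

    # Train the teacher on features screened from the unbiased set, then
    # distill the student on the biased set under the teacher's guidance.
    def feature_based_distillation(unbiased, biased, select_features, fit):
        stable_feats = select_features(unbiased)
        teacher = fit(unbiased, features=stable_feats)   # second neural network
        student = fit(biased, teacher=teacher)           # first neural network
        return student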
• The decision-making module 2102 is specifically used to: obtain the number of feature dimensions included in the unbiased data set and the biased data set; and select from multiple distillation methods a first distillation method that matches the number of feature dimensions, where the data features of the sample set include the number of feature dimensions.
• The training module 2103 is specifically used to: update the second neural network with the unbiased data set to obtain an updated second neural network; and, taking the updated second neural network as the teacher model and the first neural network as the student model, perform knowledge distillation on the first neural network using the biased data set to obtain an updated first neural network.
• The decision-making module 2102 is specifically configured to: calculate a second ratio between the number of positive samples and the number of negative samples included in the unbiased data set, and select from multiple distillation methods a first distillation method matching the second ratio, where the data features of the sample set include the second ratio; or calculate a third ratio between the number of positive samples and the number of negative samples included in the biased data set, and select from multiple distillation methods a first distillation method matching the third ratio, where the data features of the sample set include the third ratio.
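• A hedged sketch of this selection logic (the thresholds and the mapping from data features to methods are illustrative assumptions; the application only states that a matching method is selected):

    # Pick a distillation method from simple data features of the sample set.
    def choose_method(n_unbiased, n_biased, n_feature_dims, n_pos, n_neg):
        if n_unbiased / n_biased < 0.01:
            return "sample-based"          # unbiased data is very scarce
        if n_feature_dims > 1000:
            return "feature-based"         # high-dimensional feature space
        if n_pos / max(n_neg, 1) < 0.1:
            return "model-structure-based" # heavily skewed label ratio
        return "label-based"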
  • the types of samples included in the biased data set are different from the types of samples included in the unbiased data set.
  • the apparatus further includes:
• The output module 2104 is used to obtain at least one sample of the target user, take the at least one sample as the input of the updated first neural network, and output at least one label of the target user; the at least one label forms a user portrait of the target user, and the user portrait is used to identify samples that match the target user.
  • FIG. 22 is a schematic structural diagram of a recommending device provided by the present application, as described below.
• The obtaining unit 2201 is used to obtain the information of the target user and the information of the candidate recommendation object;
  • the processing unit 2202 is used to input the information of the target user and the information of the candidate recommendation object into the recommendation model, and predict the probability that the target user has an operation action on the candidate recommendation object;
  • the recommendation model is obtained by training the first neural network using the biased data set and unbiased data set in the sample set according to the first distillation method.
  • the biased data set includes biased samples
  • the unbiased data set includes unbiased samples.
  • the first distillation method is determined according to the data characteristics of the sample set.
• The samples in the biased data set include the information of the first user, the information of the first recommended object, and an actual label, where the actual label of a sample in the biased data set is used to indicate whether the first user has an operation action on the first recommended object.
• The samples in the unbiased data set include the information of the second user, the information of the second recommended object, and an actual label,
• where the actual label of a sample in the unbiased data set is used to indicate whether the second user has an operation action on the second recommended object.
  • the unbiased data set is obtained under the condition that the probabilities of being shown the candidate recommended objects in the candidate recommended object set are the same, and the second recommended object is a candidate recommended object in the candidate recommended object set .
  • the unbiased data set is obtained under the condition that the candidate recommended objects in the candidate recommended object set have the same probability of being displayed, which may include: the samples in the unbiased data set are in the candidate recommended object set The candidate recommended objects in are obtained when the second user is randomly presented; or the samples in the unbiased data set are obtained when the second user searches for the second recommended object.
  • the samples in the unbiased data set belong to the data of the source domain, and the samples in the biased data set belong to the data of the target domain.
  • FIG. 23 is a schematic structural diagram of an electronic device provided by the present application, as described below.
• The display unit 2301 is used to display a first interface, where the first interface includes a learning list of at least one application program, the learning list of a first application program in the learning lists of the at least one application program includes at least one option, and each option in the at least one option is associated with an application program;
• the processing unit 2302 is used to perceive a first operation performed by the user on the first interface;
  • the display unit is further configured to, in response to the first operation, enable or disable the cross-domain recommendation function of the first application in the application associated with some or all of the options in the learning list of the first application.
  • one or more recommended objects are determined by inputting the user's information and the information of the candidate recommended objects into the recommendation model, and predicting the probability that the user has an operation action on the candidate recommended objects.
  • the recommendation model is obtained by training the first neural network using the biased data set and the unbiased data set in the sample set according to the first distillation method, and the biased data set includes biased samples,
  • the unbiased data set includes unbiased samples.
  • the first distillation method is determined according to the data characteristics of the sample set.
• The samples in the biased data set include the information of the first user, the information of the first recommended object, and an actual label.
• The actual labels of the samples in the biased data set are used to indicate whether the first user has an operation action on the first recommended object.
• The samples in the unbiased data set include the information of the second user, the information of the second recommended object, and an actual label.
• The actual label of a sample in the unbiased data set is used to indicate whether the second user has an operation action on the second recommended object.
  • FIG. 24 is a schematic structural diagram of another neural network distillation apparatus provided by the present application, as described below.
  • the neural network distillation apparatus may include a processor 2401 and a memory 2402 .
  • the processor 2401 and the memory 2402 are interconnected by wires.
  • the memory 2402 stores program instructions and data.
  • the memory 2402 stores program instructions and data corresponding to the aforementioned steps in FIG. 6 .
  • the processor 2401 is configured to perform the method steps performed by the neural network distillation apparatus shown in any of the foregoing embodiments in FIG. 6 .
  • the neural network distillation apparatus may further include a transceiver 2403 for receiving or sending data.
• Embodiments of the present application also provide a computer-readable storage medium, where a program is stored in the computer-readable storage medium; when the program runs on a computer, the computer is caused to execute the steps of the method described in the embodiment shown in FIG. 6 above.
  • the aforementioned neural network distillation apparatus shown in FIG. 24 is a chip.
  • FIG. 25 is a schematic structural diagram of another recommended device provided by the present application, as described below.
  • the recommending apparatus may include a processor 2501 and a memory 2502 .
  • the processor 2501 and the memory 2502 are interconnected by wires.
  • the memory 2502 stores program instructions and data.
  • the memory 2502 stores program instructions and data corresponding to the aforementioned steps in FIG. 11 .
  • the processor 2501 is configured to execute the method steps executed by the recommending apparatus shown in any of the foregoing embodiments in FIG. 11 .
  • the recommending apparatus may further include a transceiver 2503 for receiving or sending data.
• Embodiments of the present application further provide a computer-readable storage medium, where a program is stored in the computer-readable storage medium; when the program runs on a computer, the computer is caused to execute the steps of the method described in the embodiment shown in FIG. 11 above.
  • the aforementioned recommended device shown in FIG. 25 is a chip.
  • FIG. 26 is a schematic structural diagram of another electronic device provided by the present application, as described below.
  • the electronic device may include a processor 2601 and a memory 2602.
  • the processor 2601 and the memory 2602 are interconnected by wires.
  • the memory 2602 stores program instructions and data.
  • the memory 2602 stores program instructions and data corresponding to the aforementioned steps in FIG. 20 .
  • the processor 2601 is configured to execute the aforementioned method steps performed by the electronic device shown in FIG. 20 .
  • the electronic device may further include a transceiver 2603 for receiving or transmitting data.
• Embodiments of the present application further provide a computer-readable storage medium, where a program is stored in the computer-readable storage medium; when the program runs on a computer, the computer is caused to execute the steps of the method described in the embodiment shown in FIG. 20 above.
  • the aforementioned electronic device shown in FIG. 26 is a chip.
  • the embodiments of the present application also provide a neural network distillation device.
  • the neural network distillation device may also be called a digital processing chip or a chip.
  • the chip includes a processing unit and a communication interface.
• The processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit.
• The processing unit is configured to execute the method steps of the aforementioned FIGS. 6 to 20.
  • the embodiments of the present application also provide a digital processing chip.
• The digital processing chip integrates circuits and one or more interfaces for implementing the functions of the processor 2401, the processor 2501, or the processor 2601 described above.
  • the digital processing chip can perform the method steps of any one or more of the foregoing embodiments.
• When the digital processing chip does not integrate a memory, it can be connected with an external memory through the communication interface.
  • the digital processing chip implements the actions performed by the neural network distillation apparatus, the recommendation apparatus or the electronic device in the above embodiments according to the program codes stored in the external memory.
• The embodiments of the present application also provide a computer program product that, when running on a computer, causes the computer to execute the steps of the methods described in the embodiments shown in the foregoing FIGS. 6 to 20.
  • the neural network distillation apparatus may be a chip, and the chip includes: a processing unit and a communication unit, the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit, etc. .
  • the processing unit can execute the computer-executed instructions stored in the storage unit, so that the chip in the server executes the training set processing method described in the embodiments shown in FIG. 6 to FIG. 10 .
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
• The storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM), etc.
• The aforementioned processing unit or processor may be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general purpose processor may be a microprocessor or it may be any conventional processor or the like.
• FIG. 27 is a schematic structural diagram of a chip provided by an embodiment of the application.
• The chip may be represented as a neural network processor NPU 270.
• The NPU 270 is mounted as a co-processor to the host CPU, and the host CPU allocates tasks to it.
  • the core part of the NPU is the arithmetic circuit 2703, which is controlled by the controller 2704 to extract the matrix data in the memory and perform multiplication operations.
  • the arithmetic circuit 2703 includes multiple processing units (process engines, PEs). In some implementations, the arithmetic circuit 2703 is a two-dimensional systolic array. The arithmetic circuit 2703 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2703 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 2702 and buffers it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches the data of matrix A and matrix B from the input memory 2701 to perform matrix operation, and stores the partial result or final result of the matrix in the accumulator 2708 .
  • Unified memory 2706 is used to store input data and output data.
• The weight data is transferred to the weight memory 2702 through a direct memory access controller (DMAC) 2705.
  • Input data is also moved to unified memory 2706 via the DMAC.
• A bus interface unit (BIU) 2710 is used for the interaction between the AXI bus, the DMAC, and an instruction fetch buffer (IFB) 2709.
• The bus interface unit 2710 is used for the instruction fetch memory 2709 to acquire instructions from the external memory, and also for the storage unit access controller 2705 to acquire the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 2706 , the weight data to the weight memory 2702 , or the input data to the input memory 2701 .
  • the vector calculation unit 2707 includes a plurality of operation processing units, and further processes the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc., if necessary. It is mainly used for non-convolutional/fully connected layer network computations in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
  • the vector computation unit 2707 can store the vector of processed outputs to the unified memory 2706 .
  • the vector calculation unit 2707 may apply a linear function and/or a non-linear function to the output of the operation circuit 2703, such as linear interpolation of the feature plane extracted by the convolution layer, such as a vector of accumulated values, to generate activation values.
  • the vector computation unit 2707 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 2703, eg, for use in subsequent layers in a neural network.
  • the instruction fetch memory (instruction fetch buffer) 2709 connected to the controller 2704 is used to store the instructions used by the controller 2704;
• The unified memory 2706, the input memory 2701, the weight memory 2702, and the instruction fetch memory 2709 are all on-chip memories. The external memory is a memory external to the NPU hardware architecture.
• The operations of each layer in a recurrent neural network can be performed by the operation circuit 2703 or the vector calculation unit 2707.
  • the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the above-mentioned methods in FIGS. 6-20 .
• The device embodiments described above are only illustrative; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, which can be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
• The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
• The stored software product includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the various embodiments of the present application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
• The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner.
• The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center integrating one or more available media.
• The usable media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid state disks (SSDs)), and the like.


Abstract

A neural network distillation method and apparatus in the field of artificial intelligence, used to provide a neural network with lower output bias and to improve the output accuracy of the neural network, where a suitable distillation method can be selected according to different scenarios, giving strong generalization ability. The method includes: obtaining a sample set, the sample set including a biased data set and an unbiased data set (601), where the biased data set includes biased samples and the unbiased data set includes unbiased samples; determining a first distillation method according to the data features of the sample set (602), where in the first distillation method a teacher model is trained with the unbiased data set and a student model is trained with the biased data set; and, based on the biased data set and the unbiased data set, training the first neural network according to the first distillation method to obtain an updated first neural network (603).

Description

A neural network distillation method and apparatus

Technical Field

The present application relates to the field of artificial intelligence, and in particular, to a neural network distillation method and apparatus.

Background

Knowledge distillation is a model compression technology in which the feature-representation "knowledge" learned by a complex network with strong learning ability is distilled out and transferred to a network with few parameters and weak learning ability. Knowledge distillation can transfer the knowledge of one network to another network, and the two networks can be homogeneous or heterogeneous. The approach is to first train a teacher network, and then use the output of the teacher network to train the student network.

However, the training set used to train the student network may be biased, which easily leads to inaccurate output results of the student network. Moreover, when the student network is guided by the teacher network, the accuracy of the student network is limited by the accuracy of the teacher network, leaving little room to further improve the output accuracy of the student network. Therefore, how to obtain a network with more accurate output has become an urgent problem to be solved.
Summary

Embodiments of the present application provide a neural network distillation method and apparatus, which are used to provide a neural network with lower output bias and to improve the output accuracy of the neural network, and which can select a suitable distillation method according to different scenarios, giving strong generalization ability.

In view of this, a first aspect of the present application provides a neural network distillation method, including: first, obtaining a sample set, where the sample set includes a biased data set and an unbiased data set, the biased data set includes biased samples, and the unbiased data set includes unbiased samples; typically, the data amount of the biased data set is larger than that of the unbiased data set; then, determining a first distillation method according to the data features of the sample set, where in the first distillation method the teacher model is obtained by training with the unbiased data set and the student model is obtained by training with the biased data set; and then, based on the biased data set and the unbiased data set, training the first neural network according to the first distillation method to obtain an updated first neural network.

Therefore, in the present application, the unbiased samples included in the unbiased data set can be used to guide the knowledge distillation process of the first neural network, so that the updated first neural network can output unbiased results, correct the bias of input samples, and improve the output accuracy of the first neural network. In addition, in the neural network distillation method provided by the present application, a distillation method matching the data features of the sample set can be selected, and different distillation methods can be used for different scenarios, adapting to the scenarios and improving the generalization ability of performing knowledge distillation on neural networks. Selecting different knowledge distillation methods under different conditions maximizes the benefit of knowledge distillation.
In a possible implementation, the first distillation method is selected from multiple preset distillation methods, and the multiple distillation methods include at least two distillation methods in which the teacher model guides the student model in different ways.

Therefore, in the embodiments of the present application, different distillation methods can be used for different scenarios, adapting to the scenarios and improving the generalization ability of performing knowledge distillation on neural networks. Selecting different knowledge distillation methods under different conditions maximizes the benefit of knowledge distillation.

In a possible implementation, the samples in the biased data set and the unbiased data set include input features and actual labels, and the first distillation method is to perform distillation based on the input features of the samples in the sample set.

In the embodiments of the present application, the unbiased data set can guide, in the form of samples, the knowledge distillation process of the model trained on the biased data set, so that the output of the obtained updated first neural network has a lower degree of bias.

In a possible implementation, training the first neural network according to the first distillation method based on the biased data set and the unbiased data set to obtain an updated first neural network may include: alternately using the biased data set and the unbiased data set to train the first neural network to obtain the updated first neural network, where in one alternation the number of batch-training iterations performed with the biased data set and the number of batch-training iterations performed with the unbiased data set are in a preset ratio, and the samples include input features serving as the input of the first neural network. Therefore, in the embodiments of the present application, the biased data set and the unbiased data set can be used alternately for training, and the samples in the unbiased data set are used to correct the bias of the first neural network trained with the biased data set, so that the output of the updated first neural network has a lower degree of bias.

In a possible implementation, when the preset ratio is 1, the difference between a first regular term and a second regular term is added to the loss function of the first neural network, where the first regular term is a parameter obtained by training the first neural network with the samples included in the unbiased data set, and the second regular term is a parameter obtained by training the first neural network with the samples included in the biased data set.

Therefore, in the embodiments of the present application, the first neural network can be trained by alternating the biased data set and the unbiased data set in a 1:1 manner, and the samples in the unbiased data set are used to correct the bias of the first neural network trained with the biased data set, so that the output of the updated first neural network has a lower degree of bias.
In a possible implementation, training the first neural network according to the first distillation method based on the biased data set and the unbiased data set to obtain an updated first neural network may include: setting a confidence for the samples in the biased data set, where the confidence is used to represent the degree of bias of a sample; and training the first neural network based on the biased data set, the confidences of the samples in the biased data set, and the unbiased data set to obtain the updated first neural network, where the samples include input features serving as the input of the first neural network during training.

In the embodiments of the present application, a confidence representing the degree of bias can be set for each sample, so that the degree of bias of the samples is learned when training the neural network, thereby reducing the bias of the output results of the updated neural network.

In a possible implementation, the samples included in the biased data set and the unbiased data set include input features and actual labels, and the first distillation method is to perform distillation based on the predicted labels of the samples included in the unbiased data set, where the predicted labels are output by the updated second neural network for the samples in the unbiased data set, and the updated second neural network is obtained by training the second neural network with the unbiased data set.

Therefore, in the embodiments of the present application, knowledge distillation can be performed on the first neural network through the predicted labels of the samples included in the unbiased data set. It can be understood that the predicted labels of the samples in the unbiased data set, output by the teacher model, can be used to complete the guidance of the student model, so that under the guidance of the predicted labels output by the teacher model, the updated first neural network obtains output results with a lower degree of bias.

In a possible implementation, the sample set further includes an unobserved data set, and the unobserved data set includes multiple unobserved samples; the above training of the first neural network according to the first distillation method based on the biased data set and the unbiased data set to obtain an updated first neural network may include: training the first neural network with the biased data set to obtain a trained first neural network, and training the second neural network with the unbiased data set to obtain an updated second neural network; collecting multiple samples from the sample set to obtain an auxiliary data set; and updating the trained first neural network using the auxiliary data set, with the predicted labels of the samples in the auxiliary data set as constraints, to obtain the updated first neural network, where the predicted labels of the samples in the auxiliary data set are labels output by the updated second neural network.

In the embodiments of the present application, an unobserved data set can be introduced to reduce the influence of the bias of the biased data set on the training process of the first neural network, so that the output results of the finally obtained first neural network have a lower degree of bias.

In a possible implementation, training the first neural network according to the first distillation method based on the biased data set and the unbiased data set to obtain an updated first neural network includes: training the second neural network with the unbiased data set to obtain an updated second neural network; outputting the predicted labels of the samples in the biased data set through the updated second neural network; weighting and combining the predicted label of a sample and the actual label of the sample to obtain a combined label for the sample; and training the first neural network with the combined labels of the samples to obtain the updated first neural network.

In the embodiments of the present application, the guidance of the unbiased data set in the process of training the first neural network can be completed by weighting and combining the predicted labels of the samples and the actual labels of the samples, so that the output results of the finally obtained first neural network have a lower degree of bias.
在一种可能的实施方式中,样本集的数据特征包括第一比例,该第一比例为所述无偏数据集的样本量和所述有偏数据集的样本量之间的比例,根据样本集的数据特征确定第一蒸馏方式,可以包括:从多种蒸馏方式中选择与第一比例匹配的第一蒸馏方式。
因此,在本申请实施方式中,可以通过无偏数据集的样本量和有偏数据集的样本量之间的比例,来选择第一蒸馏方式,从而可以适应无偏数据集的样本量和有偏数据集的样本量之间的不同比例的场景。
在一种可能的实施方式中,第一蒸馏方式包括:基于从无偏数据集中提取到的特征训练老师模型,得到训练后的老师模型,并通过训练后的老师模型以及有偏数据集对学生模型进行知识蒸馏。
因此,本申请实施方式中,可以使用从无偏数据集中提取到的特征训练老师模型,得到偏置程度更低、更稳定的老师模型,在此基础上,进一步使通过老师模型指导得到的学生模型的输出结果偏置程度更低。
在一种可能的实施方式中,基于有偏数据集和无偏数据集,按照第一蒸馏方式对第一神经网络进行训练,得到更新后的第一神经网络,可以包括:通过预设算法从无偏数据集中筛选出部分样本的输入特征,该预设算法可以是深度全局平衡回归算法(deep global balancing regression,DGBR)算法;根据该部分样本的输入特征对第二神经网络进行训练,得到更新后的第二神经网络;将更新后的第二神经网络作为老师模型,第一神经网络作为学生模型,使用有偏数据集对第一神经网络进行知识蒸馏,得到更新后的第一神经网络。
因此,在本申请实施方式中,可以通过计算无偏数据集的稳定特征,并使用该稳定特征训练第二神经网络,得到输出结果偏置程度更低、鲁棒性更优的更新后的第二神经网络,并使用更新后的第二神经网络作为老师模型,第一神经网络作为学生模型,使用有偏数据集对第一神经网络进行知识蒸馏,从而得到输出偏置程度更低的更新后的第一神经网络。
在一种可能的实施方式中,样本集的数据特征包括特征维度数量,上述的根据样本集的数据特征确定第一蒸馏方式,可以包括:从多种蒸馏方式中选择与特征维度数量匹配的第一蒸馏方式。
因此,在本申请实施方式中,可以根据无偏数据集以及有偏数据集中所包括的特征维度数量,来选择基于特征蒸馏的方式,可以适应特征维度数量更大的场景,得到输出偏置程度更低的学生模型。
在一种可能的实施方式中,基于有偏数据集和无偏数据集,按照第一蒸馏方式对第一神经网络进行训练,得到更新后的第一神经网络,可以包括:通过无偏数据集更新第二神经网络,得到更新后的第二神经网络;将更新后的第二神经网络作为老师模型,第一神经网络作为学生模型,使用有偏数据集对第一神经网络进行知识蒸馏,得到更新后的第一神经网络。
因此,在本申请实施方式中,可以使用常规的神经网络知识蒸馏过程,使用无偏数据集来训练老师模型,降低了老师模型的输出偏置程度,并通过该老师模型,使用有偏数据库对学生模型进行知识蒸馏,从而降低学生模型的输出偏置程度。
在一种可能的实施方式中,上述的根据样本集的数据特征确定第一蒸馏方式,可以包括:样本集的数据特征包括第二比例,计算无偏数据集中包括的正样本的数量和负样本的数量的第二比例,从多种蒸馏方式中选择与第二比例匹配的第一蒸馏方式;或者,样本集的数据特征包括第三比例,计算有偏数据集中包括的正样本的数量和负样本的数量的第三比例,从多种蒸馏方式中选择与第三比例匹配的第一蒸馏方式。
因此,在本申请实施方式中,可以通过无偏数据集或者有偏数据集中的正负样本比例来选择常规的基于模型结构的蒸馏方式,从而适应于无偏数据集或者有偏数据集中的正负样本比例的场景。
在一种可能的实施方式中,有偏数据集包括的样本的类型,和无偏数据集包括的样本的类型不相同。
因此,在本申请实施方式中,有偏数据集包括的样本的类型和无偏数据集包括的样本的类型不相同,可以理解为有偏数据集包括的样本和无偏数据集包括的样本属于不同领域的数据,从而可以使用不同领域的数据来进行指导和训练,使得到的更新后的第一神经网络可以输出与输入数据不同领域的数据,如在推荐场景中,可以实现跨领域的推荐。
在一种可能的实施方式中,在得到更新后的第一神经网络之后,上述方法可以还包括:获取目标用户的至少一个样本;将至少一个样本作为更新后的第一神经网络的输入,输出目标用户的至少一个标签,至少一个标签用于构建目标用户的用户画像,用户画像用于确定与目标用户匹配的样本。
因此,在本申请实施方式中,可以通过更新后的第一神经网络,输出用户的一个或者多个标签,根据该一个或者多个标签来确定用户的代表特征,从而构建目标用户的用户画像,该用户画像用于描述目标用户,从而在后续的推荐场景中,可以通过该用户画像确定与目标用户匹配的样本。
第二方面,本申请提供一种推荐方法,包括:
获取目标用户的信息和候选推荐对象的信息;将目标用户的信息和候选推荐对象的信息输入至推荐模型,预测目标用户对候选推荐对象有操作动作的概率;其中,推荐模型是使用样本集中的有偏数据集和无偏数据集按照第一蒸馏方式对第一神经网络进行训练得到,有偏数据集中包括有偏置的样本,无偏数据集中包括无偏置的样本,第一蒸馏方式是根据样本集的数据特征确定的,有偏数据集中的样本包括第一用户的信息和第一推荐对象的信息以及实际标签,有偏数据集中的样本的实际标签用于表示第一用户是否对第一推荐对象有操作动作,无偏数据集中的样本包括第二用户的信息和第二推荐对象的信息以及实际标签,无偏数据集中的样本的实际标签用于表示第二用户是否对第二推荐对象有操作动作。
其中,推荐模型可以是使用无偏数据训练得到的老师模型,对有偏数据训练得到的学生模型进行指导得到的,从而可以使用输出偏置程度低的推荐模型,来为用户推荐匹配的推荐对象,使推荐结果更准确,提高用户体验。
在一种可能的实施方式中,无偏数据集是在候选推荐对象集合中的候选推荐对象被展示的概率相同的情况下获得的,第二推荐对象为候选推荐对象集合中的一个候选推荐对象。
在一种可能的实施方式中,无偏数据集是在候选推荐对象集合中的候选推荐对象被展示的概率相同的情况下获得的,包括:无偏数据集中的样本是在候选推荐对象集合中的候选推荐对象被随机展示给第二用户的情况下获得的;或者无偏数据集中的样本是在第二用户搜索第二推荐对象的情况下获得的。
在一种可能的实施方式中,无偏数据集中的样本属于源域的数据,有偏数据集中的样本属于目标域的数据。
第三方面,本申请提供一种推荐方法,其特征在于,包括:显示第一界面,第一界面包括至少一个应用程序的学习列表,该至少一个应用程序的学习列表中的第一应用程序的学习列表包括至少一个选项,至少一个选项中的选项关联一个应用程序;感知到用户在第一界面上的第一操作;响应于第一操作,打开或关闭第一应用程序在第一应用程序的学习列表中部分或全部选项所关联的应用程序中的跨域推荐功能。
根据本申请实施例中的方案,通过在不同域间进行知识(例如,用户的兴趣偏好)进行迁移和共享,将源域和目标域的用户交互历史记录都纳入到学习中,以使推荐模型能够更好地学习用户的偏好,使推荐模型在目标域也能很好的拟合用户的兴趣偏好,给用户推荐符合其兴趣的推荐结果,实现跨域推荐,缓解冷启动问题。
在一种可能的实施方式中,一个或多个推荐对象是通过将用户的信息和候选推荐对象的信息输入推荐模型中,预测用户对候选推荐对象有操作动作的概率确定的。
在一种可能的实施方式中,推荐模型是使用样本集中的有偏数据集和无偏数据集按照第一蒸馏方式对第一神经网络进行训练得到,有偏数据集中包括有偏置的样本,无偏数据集中包括无偏置的样本,第一蒸馏方式是根据样本集的数据特征确定的,有偏数据集中的样本包括第一用户的信息和第一推荐对象的信息以及实际标签,有偏数据集中的样本的实际标签用于表示第一用户是否对第一推荐对象有操作动作,无偏数据集中的样本包括第二用户的信息和第二推荐对象的信息以及实际标签,无偏数据集中的样本的实际标签用于表示第二用户是否对第二推荐对象有操作动作。
第四方面,本申请提供一种神经网络蒸馏装置,该神经网络蒸馏装置具有实现上述第 一方面神经网络蒸馏方法的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块。
第五方面,本申请提供一种推荐装置,该推荐装置具有实现上述第二方面推荐方法的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块。
第六方面,本申请提供一种电子设备,该电子设备具有实现上述第一方面推荐方法的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块。
第七方面,本申请实施例提供一种神经网络蒸馏装置,包括:处理器和存储器,其中,处理器和存储器通过线路互联,处理器调用存储器中的程序代码用于执行上述第一方面任一项所示的神经网络蒸馏方法中与处理相关的功能。
第八方面,本申请实施例提供一种推荐装置,包括:处理器和存储器,其中,处理器和存储器通过线路互联,处理器调用存储器中的程序代码用于执行上述第二方面任一项所示的推荐方法中与处理相关的功能。
第九方面,本申请实施例提供一种电子设备,包括:处理器和存储器,其中,处理器和存储器通过线路互联,处理器调用存储器中的程序代码用于执行上述第三方面任一项所示的推荐方法中与处理相关的功能。
第十方面,本申请实施例提供了一种神经网络蒸馏装置,该神经网络蒸馏装置也可以称为数字处理芯片或者芯片,芯片包括处理单元和通信接口,处理单元通过通信接口获取程序指令,程序指令被处理单元执行,处理单元用于执行如上述第一方面或第一方面任一可选实施方式中与处理相关的功能。
第十一方面,本申请实施例提供了一种推荐装置,该推荐装置也可以称为数字处理芯片或者芯片,芯片包括处理单元和通信接口,处理单元通过通信接口获取程序指令,程序指令被处理单元执行,处理单元用于执行如上述第二方面或第二方面任一可选实施方式中与处理相关的功能。
第十二方面,本申请实施例提供了一种电子设备,该电子设备也可以称为数字处理芯片或者芯片,芯片包括处理单元和通信接口,处理单元通过通信接口获取程序指令,程序指令被处理单元执行,处理单元用于执行如上述第三方面或第三方面任一可选实施方式中与处理相关的功能。
第十三方面,本申请实施例提供了一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行上述第一方面、第一方面任一可选实施方式、第二方面、第二方面任一可选实施方式、第三方面或第三方面任一可选实施方式中的方法。
第十四方面,本申请实施例提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面、第一方面任一可选实施方式、第二方面、第二方面任一可选实施方式、第三方面或第三方面任一可选实施方式中的方法。
附图说明
图1为本申请应用的一种人工智能主体框架示意图;
图2为本申请提供的一种系统架构示意图;
图3为本申请实施例提供的一种卷积神经网络结构示意图;
图4为本申请实施例提供的另一种卷积神经网络结构示意图;
图5为本申请提供的另一种系统架构示意图;
图6为本申请提供的一种神经网络蒸馏方法的流程示意图;
图7为本申请提供的一种点击率和推荐位置的关系示意图;
图8为本申请提供的一种神经网络蒸馏架构的示意图;
图9为本申请提供的另一种神经网络蒸馏架构的示意图;
图10为本申请提供的另一种神经网络蒸馏架构的示意图;
图11为本申请提供的一种推荐方法的流程示意图;
图12为本申请提供的推荐方法的一种应用场景示意图;
图13为本申请提供的推荐方法的一种应用场景示意图;
图14为本申请提供的推荐方法的一种应用场景示意图;
图15为本申请提供的推荐方法的一种应用场景示意图;
图16为本申请提供的推荐方法的一种应用场景示意图;
图17为本申请提供的推荐方法的一种应用场景示意图;
图18为本申请提供的推荐方法的一种应用场景示意图;
图19为本申请提供的推荐方法的一种应用场景示意图;
图20为本申请提供的另一种推荐方法的流程示意图;
图21为本申请提供的一种神经网络蒸馏装置的结构示意图;
图22为本申请提供的一种推荐装置的结构示意图;
图23为本申请提供的一种电子设备的结构示意图;
图24为本申请提供的另一种神经网络蒸馏装置的结构示意图
图25为本申请提供的另一种推荐装置的结构示意图;
图26为本申请提供的另一种电子设备的结构示意图;
图27为本申请提供的一种芯片结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请提供的训练集处理方法可以应用于人工智能(artificial intelligence,AI)场景中。AI是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类 智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。
首先对人工智能系统总体工作流程进行描述,请参见图1,图1示出的为人工智能主体框架的一种结构示意图,下面从“智能信息链”(水平轴)和“IT价值链”(垂直轴)两个维度对上述人工智能主体框架进行阐述。其中,“智能信息链”反映从数据的获取到处理的一系列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。“IT价值链”从人工智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程,反映人工智能为信息技术产业带来的价值。
(1)基础设施
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片(CPU、NPU、GPU、ASIC、FPGA等硬件加速芯片)提供;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能终端、智能交通、智能医疗、自动驾驶、平安城市等。
本申请实施例涉及了大量神经网络的相关应用,为了更好地理解本申请实施例的方案,下面先对本申请实施例可能涉及的神经网络的相关术语和概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以x_s和截距1为输入的运算单元,该运算单元的输出可以如公式(1-1)所示:

$$h_{W,b}(x)=f\left(W^{T}x\right)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)\qquad(1-1)$$

其中,s=1、2、……n,n为大于1的自然数,W_s为x_s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(2)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有多层中间层的神经网络。按照不同层的位置对DNN进行划分,DNN内部的神经网络可以分为三类:输入层,中间层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是中间层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。
(3)卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器,该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。示例性地,卷积神经网络的结构可以参阅图3以及图4所示的结构。
(4)循环神经网络(recurrent neural networks,RNN)是用来处理序列数据的。在传统的神经网络模型中,是从输入层到中间层再到输出层,层与层之间是全连接的,而对于每一层层内之间的各个节点是无连接的。这种普通的神经网络虽然解决了很多难题,但是却仍然对很多问题无能无力。例如,你要预测句子的下一个单词是什么,一般需要用到前面的单词,因为一个句子中前后单词并不是独立的。RNN之所以称为循环神经网路,即一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用于当前输出的计算中,即中间层本层之间的节点不再无连接而是有连接的,并且中间层的输入不仅包括输入层的输出还包括上一时刻中间层的输出。理论上,RNN能够对任 何长度的序列数据进行处理。对于RNN的训练和对传统的CNN或DNN的训练一样。
(5)加法神经网络(Adder Neural Network,ANN)
加法神经网络是一种几乎不包含乘法的神经网络。不同于卷积神经网络,加法神经网络使用L1距离来度量神经网络中特征和滤波器之间的相关性。由于L1距离中只包含加法和减法,神经网络中大量的乘法运算可以被替换为加法和减法,从而大大减少了神经网络的计算代价。
在ANN中,通常使用一种只具有加法的度量函数,即L1距离,来代替卷积神经网络中的卷积计算。通过使用L1距离,输出的特征可以被重新计算为:
$$Y(m,n,t)=-\sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{C}\left|X(m+i,n+j,k)-F(i,j,k,t)\right|$$

或,

$$Y(m,n,t)=\sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{C}\left|X(m+i,n+j,k)-F(i,j,k,t)\right|$$
其中,|(·)|为取绝对值运算,∑(·)为求和运算,Y(m,n,t)为所述至少一个输出子特征图,Y(m,n,t)为所述输出特征图中第m行、第n列及第t页的元素,X(m+i,n+j,k)为所述至少一个输入子特征图中的第i行、第j列及第k页的元素,F(i,j,k,t)为所述特征提取核中的第i行、第j列及第k页的元素,t为所述特征提取核的通道数,d为所述特征提取核的行数,C为所述输入特征图的通道数,d、C、i、j、k、m、n、t均为整数。
可以看到,ANN只需要使用加法,通过将卷积中计算特征的度量方式改为L1距离,可以只使用加法来提取神经网络中的特征,并构建加法神经网络。
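为便于直观理解上述以L1距离代替卷积乘加的特征计算方式,下面给出一段示意性的Python(PyTorch)代码。该代码仅按上式逐元素实现,未考虑padding、stride等工程细节,函数名与张量维度约定均为示例假设,并非本申请限定的实现:

```python
import torch

def adder2d_feature(X, F):
    """以L1距离代替卷积乘加,计算加法神经网络的输出特征图。
    X: 输入特征图 (C, H, W);F: 特征提取核 (T, C, d, d)。"""
    T, C, d, _ = F.shape
    _, H, W = X.shape
    out = torch.zeros(T, H - d + 1, W - d + 1)
    for t in range(T):
        for m in range(H - d + 1):
            for n in range(W - d + 1):
                patch = X[:, m:m + d, n:n + d]
                # Y(m,n,t) = -Σ|X(m+i,n+j,k) - F(i,j,k,t)|,只含加减法与取绝对值
                out[t, m, n] = -(patch - F[t]).abs().sum()
    return out
```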
(6)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断地调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。在本申请实施方式中,目标函数与损失函数的区别在于,目标函数中除了包括损失函数,还可能包括约束函数,用于对神经网络的更新进行约束,使更新得到的神经网络更接近期望的神经网络。
(7)反向传播算法
神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的神经网络模型中参数的大小,使得神经网络模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的神经 网络模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的神经网络模型的参数,例如权重矩阵。
参见附图2,本申请实施例提供了一种系统架构200。该系统架构中包括数据库230、客户设备240。数据采集设备260用于采集数据并存入数据库230,训练模块202基于数据库230中维护的数据生成目标模型/规则201。下面将更详细地描述训练模块202如何基于数据得到目标模型/规则201,目标模型/规则201即本申请以下实施方式中构建得到的神经网络,具体参阅以下图6-图20中的相关描述。
计算模块可以包括训练模块202,训练模块202得到的目标模型/规则可以应用不同的系统或设备中。在附图2中,执行设备210配置收发器212,该收发器212可以是无线收发器、光收发器或有线接口(如I/O接口)等,与外部设备进行数据交互,“用户”可以通过客户设备240向收发器212输入数据,例如,本申请以下实施方式,客户设备240可以向执行设备210发送目标任务,请求执行设备构建神经网络,并向执行设备210发送用于训练的数据库。
执行设备210可以调用数据存储系统250中的数据、代码等,也可以将数据、指令等存入数据存储系统250中。
计算模块211使用目标模型/规则201对输入的数据进行处理。具体地,计算模块211用于:获取有偏数据集和无偏数据集,有偏数据集中包括有偏置的样本,无偏数据集中包括无偏置的样本,有偏数据集的数据量大于无偏数据集的数据量;根据有偏数据集所包括的数据或无偏数据集所包括的数据中的至少一种,从预设的多种蒸馏方式中选择第一蒸馏方式,多种蒸馏方式在进行知识蒸馏时老师模型对学生模型的指导方式不相同,且使用无偏数据集训练得到的模型指导使用有偏数据集训练得到的模型;基于有偏数据集和无偏数据集,按照第一蒸馏方式对第一神经网络进行训练,得到更新后的第一神经网络。
最后,收发器212将构建得到的神经网络返回给客户设备240,以在客户设备240或者其他设备中部署该神经网络。
更深层地,训练模块202可以针对不同的任务,基于不同的数据得到相应的目标模型/规则201,以给用户提供更佳的结果。
在附图2中所示情况下,可以根据用户的输入数据确定输入执行设备210中的数据,例如,用户可以在收发器212提供的界面中操作。另一种情况下,客户设备240可以自动地向收发器212输入数据并获得结果,若客户设备240自动输入数据需要获得用户的授权,用户可以在客户设备240中设置相应权限。用户可以在客户设备240查看执行设备210输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备240也可以作为数据采集端将采集到与目标任务关联的数据存入数据库230。
需要说明的是,附图2仅是本申请实施例提供的一种系统架构的示例性的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制。例如,在附图2中,数据存储系统250相对执行设备210是外部存储器,在其它场景中,也可以将数据存储系统250置于执行设备210中。
在本申请所提及的训练或者更新过程可以由训练模块202来执行。可以理解的是,神 经网络的训练过程即学习控制空间变换的方式,更具体即学习权重矩阵。训练神经网络的目的是使神经网络的输出尽可能接近期望值,因此可以通过比较当前网络的预测值和期望值,再根据两者之间的差异情况来更新神经网络中的每一层神经网络的权重向量(当然,在第一次更新之前通常可以先对权重向量进行初始化,即为深度神经网络中的各层预先配置参数)。例如,如果网络的预测值过高,则调整权重矩阵中的权重的值从而降低预测值,经过不断的调整,直到神经网络输出的值接近期望值或者等于期望值。具体地,可以通过损失函数(loss function)或目标函数(objective function)来衡量神经网络的预测值和期望值之间的差异。以损失函数举例,损失函数的输出值(loss)越高表示差异越大,神经网络的训练可以理解为尽可能缩小loss的过程。本申请以下实施方式中更新起点网络的权重以及对串行网络进行训练的过程可以参阅此过程,以下不再赘述。
如图2所示,根据训练模块202训练得到目标模型/规则201,该目标模型/规则201在本申请实施例中可以是本申请中的第一神经网络,具体的,本申请实施例提供的第一神经网络、第二神经网络、老师模型或者学生模型等,可以是深度卷积神经网络(deep convolutional neural networks,DCNN),循环神经网络(recurrent neural network,RNNS)等等。本申请提及的神经网络可以包括多种类型,如深度神经网络(deep neural network,DNN)、卷积神经网络(convolutional neural network,CNN)、循环神经网络(recurrent neural networks,RNN)或残差网络其他神经网络等。
参见附图5,本申请实施例还提供了一种系统架构500。执行设备210由一个或多个服务器实现,可选的,与其它计算设备配合,例如:数据存储、路由器、负载均衡器等设备;执行设备210可以布置在一个物理站点上,或者分布在多个物理站点上。执行设备210可以使用数据存储系统250中的数据,或者调用数据存储系统250中的程序代码实现本申请以下图6对应的训练集处理方法的步骤。
用户可以操作各自的用户设备(例如本地设备501和本地设备502)与执行设备210进行交互。每个本地设备可以表示任何计算设备,例如个人计算机、计算机工作站、智能手机、平板电脑、智能摄像头、智能汽车或其他类型蜂窝电话、媒体消费设备、可穿戴设备、机顶盒、游戏机等。
每个用户的本地设备可以通过任何通信机制/通信标准的通信网络与执行设备210进行交互,通信网络可以是广域网、局域网、点对点连接等方式,或它们的任意组合。具体地,该通信网络可以包括无线网络、有线网络或者无线网络与有线网络的组合等。该无线网络包括但不限于:第五代移动通信技术(5th-Generation,5G)系统,长期演进(long term evolution,LTE)系统、全球移动通信系统(global system for mobile communication,GSM)或码分多址(code division multiple access,CDMA)网络、宽带码分多址(wideband code division multiple access,WCDMA)网络、无线保真(wireless fidelity,WiFi)、蓝牙(bluetooth)、紫蜂协议(Zigbee)、射频识别技术(radio frequency identification,RFID)、远程(Long Range,Lora)无线通信、近距离无线通信(near field communication,NFC)中的任意一种或多种的组合。该有线网络可以包括光纤通信网络或同轴电缆组成的网络等。
在另一种实现中,执行设备210的一个方面或多个方面可以由每个本地设备实现,例如,本地设备501可以为执行设备210提供本地数据或反馈计算结果。
本申请实施例提供的数据处理方法可以在服务器上被执行,还可以在终端设备上被执行。其中该终端设备可以是具有图像处理功能的移动电话、平板个人电脑(tablet personal computer,TPC)、媒体播放器、智能电视、笔记本电脑(laptop computer,LC)、个人数字助理(personal digital assistant,PDA)、个人计算机(personal computer,PC)、照相机、摄像机、智能手表、可穿戴式设备(wearable device,WD)或者自动驾驶的车辆等,本申请实施例对此不作限定。
通常,知识蒸馏可以将一个网络的知识转移到另一个网络,两个网络可以是同构或者异构。做法是先训练一个teacher网络,或者称为老师模型,然后使用这个teacher网络的输出训练student网络,或者称为学生模型。在进行知识蒸馏时,可以通过采用预先训练好的复杂网络去训练另外一个简单的网络,以使得简单的网络可以具有和复杂网络相同或相似的数据处理能力。
知识蒸馏可以快速方便地实现一些小型的网络,例如,可以在云服务器或企业级服务器训练大量数据的复杂网络模型,然后进行知识蒸馏得到实现了相同功能的小型模型,并将该小型模型压缩并迁移到小型设备(如手机、智能手环等)上。又例如,通过收集大量用户在智能手环上的数据,在云服务器上进行复杂并耗时的网络训练,得到用户行为识别模型,再把该模型压缩并迁移到智能手环这一小型载体,可以在保证保护用户隐私的同时,快速训练模型,并提升用户体验。
然而,在使用老师模型对学生模型进行指导时,通常学生模型的输出精度受到老师模型的输出精度的限制,导致学生模型的输出精度无法更大空间的提升。并且,在进行知识蒸馏时,通常使用的是存在偏置的数据集,从而使训练得到的学生模型的输出存在偏置,即输出结果不准确。
因此,本申请提供一种神经网络蒸馏方法,用于针对训练所使用的数据集选择合适的指导方式,完成神经网络的知识蒸馏,并且,使用无偏数据集所训练的模型对有偏数据集训练得到的模型进行指导,降低学生模型的输出偏置程度,提高学生模型的输出准确度。
本申请提供的神经网络蒸馏方法,可以应用于推荐系统、用户画像、图像识别或者其他去偏置场景等。该推荐系统可以用于为用户推荐应用程序(Application,App)、音乐、图像、视频或商品等。该用户画像用于反映用户的特征或者喜好等。
下面对本申请提供神经网络蒸馏方法进行详细介绍。参阅图6,本申请提供的一种神经网络蒸馏方法的流程示意图。
601、获取样本集,样本集中包括有偏数据集和无偏数据集。
其中,样本集至少包括有偏数据集和无偏数据集,有偏数据集中包括了存在偏置的样本(以下称为有偏样本),无偏数据集中包括了无偏置的样本(以下称为无偏样本),且通常有偏数据集的数据量大于无偏数据集的数据量。
为便于理解,存在偏置的样本可以理解为与用户的实际使用样本存在偏差的样本。例如,推荐系统作为反馈循环(feedback loop)系统,通常会面临着各种偏置问题,如位置 偏置,流行偏置和前序模型偏置,这些偏置的存在使得推荐系统收集到的用户反馈数据并不能反映用户的真实偏好。
而样本的偏置在不同的场景中也可能不相同,如位置偏置、选择偏置或者流行偏置等。例如,以为用户推荐物品的场景为例,位置偏置可以理解为:用户倾向性地选择处于更好位置的物品进行交互,这种倾向性与物品是否满足用户的实际需求无关。选择偏置可以理解为:发生于“被研究群体”不能够代表“目标群体”,以至于对“被研究群体”的风险或者收益的衡量不能够准确地表征“目标群体”,导致所获结论不能够被有效地泛化。
示例性地,以为用户推荐APP的场景为例,图7表示在随机投放策略下,同一个APP在各个推荐位置的点击率,可以看出,推荐位置越靠后,该APP的点击率越低,说明了位置偏置对于点击率的影响。位置偏置导致推荐位置更靠前的APP的点击率更高,而推荐位置更靠后的APP的点击率更低,若使用这样的点击数据训练模型,则会加剧训练得到的模型的马太效应,导致模型的输出结果两极分化。例如,若用户在推荐系统中搜索APP,若符合用户需求的APP包括了APP1和APP2,且APP2更符合用户的搜索需求。但因APP1的点击率较高,因此APP1的推荐位置更优,导致用户点击了APP1,而未点击APP2,后续在为用户推荐APP时,结合了用户点击APP1的历史数据(即有偏样本)来进行推荐,而用户的实际需求应该与APP2(即无偏样本)相关联,则可能导致对用户的推荐不准确。
无偏数据可以通过随机流量(uniform data)的方式采集。以推荐系统为例。采集无偏数据集的具体过程可以包括:从所有的候选集中进行随机采样,然后将随机采样得到的样本进行随机展示,然后收集针对该随机展示的样本的反馈数据,并从反馈数据中获取到无偏样本。可以理解为候选集中的所有样本均有机会均等地展示给用户进行选择,因此它可以被视为一个好的无偏代理。
602、根据样本集的数据特征确定第一蒸馏方式。
其中,可以根据样本集所包括的数据特征,确定第一蒸馏方式。具体地,在得到有偏数据集和无偏数据集之后,基于有偏数据集和/或无偏数据集,从预设的多种蒸馏方式中选择匹配的蒸馏方式,得到第一蒸馏方式。
通常,第一蒸馏方式是从预设的多种蒸馏方式中选择得到,多种蒸馏方式中包括老师模型对学生模型的指导方式不相同的至少两种蒸馏方式。通常,无偏数据集用于训练老师模型,有偏数据集用于训练学生模型,即使用无偏数据集训练得到的模型来指导使用有偏数据集得到的模型。
可选地,该预设的多种蒸馏方式可以包括但不限于以下一种或者多种:样本蒸馏、标签蒸馏、特征蒸馏或模型结构蒸馏等。
其中,样本蒸馏是指使用有偏数据集和无偏数据集中的样本来进行蒸馏。如使用无偏数据集中的样本来指导对学生模型的知识蒸馏。
标签蒸馏是指使用基于无偏数据集中的样本的预测标签作为指导对学生模型进行蒸馏,该预测标签由老师模型输出,该老师模型是基于无偏数据集进行训练得到。
特征蒸馏是指基于从无偏数据集中提取到的特征训练老师模型,并通过老师模型以及 有偏数据集进行知识蒸馏。
模型结构蒸馏,即使用无偏数据集训练得到老师模型,使用老师模型以及有偏数据集,对学生模型进行知识蒸馏,得到更新后的学生模型。
具体地,针对前述的多种蒸馏方式的更详细的介绍可以参阅以下图8的介绍,此处不再赘述。
在一些可能的实施方式中,可以基于无偏数据集的样本量和有偏数据集的样本量、无偏数据集中正样本和负样本之间的比例、有偏数据集中正样本和负样本之间的比例、或者无偏数据集和有偏数据集所包括的数据的特征维度数量等来选择匹配的蒸馏方式作为第一蒸馏方式。例如,样本集中的样本的输入特征的数据类型可能不相同,每一种数据类型即可理解为一个特征维度,特征维度数量即样本集中所包括的数据类型的种类。
示例性地,选择蒸馏方式的方式可以包括但不限于:
条件1:计算无偏数据集的样本和有偏数据集的样本量之间的第一比例,当该第一比例小于第一阈值时,选择样本蒸馏作为第一蒸馏方式。
条件2:当第一比例不小于第一阈值时,选择标签蒸馏作为第一蒸馏方式。
条件3:计算无偏数据集中包括的正样本的数量和负样本的数量的第二比例,当该第二比例大于第二阈值时,选择模型结构蒸馏作为第一蒸馏方式;或者,计算有偏数据集中包括的正样本的数量和负样本的数量的第三比例,当该第三比例大于第三阈值时,选择模型结构蒸馏作为第一蒸馏方式。
条件4:计算无偏数据集以及有偏数据集中所包括的特征维度数量,当该特征维度数量大于预设维度时,选择特征蒸馏作为第一蒸馏方式。
其中,可以预先设定每种蒸馏方式的优先级,当上述的多种条件同时满足时,可以按照优先级选择合适的蒸馏方式。例如,特征蒸馏的优先级>模型结构蒸馏的优先级>样本蒸馏的优先级=标签蒸馏的优先级,当有偏数据集和无偏数据集同时满足条件3和条件4时,则选择特征蒸馏作为第一蒸馏方式。
当然,每种蒸馏方式的优先级在不同的场景中可以不同,此处仅仅是示例性说明,不作为限定。
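为便于理解上述条件1至条件4以及优先级的组合决策过程,下面给出一段示意性的Python代码。其中的阈值t1、t2、t3、dim_threshold以及优先级顺序均为示例假设,实际取值与排序可以根据场景调整:

```python
def select_distillation_method(n_unbiased, n_biased,
                               pos_unbiased, neg_unbiased,
                               pos_biased, neg_biased,
                               n_feature_dims,
                               t1=0.05, t2=10.0, t3=10.0, dim_threshold=100):
    """按照前述条件1-4从预设的多种蒸馏方式中选择第一蒸馏方式。"""
    candidates = []
    # 条件4:特征维度数量大于预设维度时,候选特征蒸馏
    if n_feature_dims > dim_threshold:
        candidates.append('feature')
    # 条件3:无偏或有偏数据集中正负样本比例大于阈值时,候选模型结构蒸馏
    if (pos_unbiased / max(neg_unbiased, 1) > t2 or
            pos_biased / max(neg_biased, 1) > t3):
        candidates.append('model_structure')
    # 条件1/2:按无偏与有偏样本量之间的第一比例选择样本蒸馏或标签蒸馏
    if n_unbiased / n_biased < t1:
        candidates.append('sample')
    else:
        candidates.append('label')
    # 按预先设定的优先级,取命中条件中优先级最高的蒸馏方式
    priority = ['feature', 'model_structure', 'sample', 'label']
    return min(candidates, key=priority.index)
```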
还需要说明的是,本申请所指的老师模型和学生模型,可以是结构不相同的模型,也可以是针对相同结构的模型,使用不同的数据集得到的模型,具体可以根据实际应用场景进行调整,本申请对此并不作限定。
603、基于有偏数据集和所述无偏数据集,按照第一蒸馏方式对第一神经网络进行训练,得到更新后的第一神经网络。
在选择了第一蒸馏方式之后,即可按照该第一蒸馏方式所包括的指导方式,对第一神经网络进行知识蒸馏,得到更新后的第一神经网络。
为便于理解,以一种场景为例,通过Uniform data采集到的无偏数据集不受到前序模型的影响,并且符合期望的模型的样本属性,即所有候选集机会均等地展示给用户进行选择。因此无偏数据集可以被视为一个好的无偏代理。但由于无偏数据集的样本量少,无法直接用于训练线上模型。且通过无偏数据集训练的模型更无偏,但方差较大,而有偏数据 训练的模型有偏置,但方差相对较小,因此,本申请实施方式中,有效结合了无偏数据集和有偏数据集进行训练,使无偏数据集对有偏数据集的训练形成指导,从而使最终得到的第一神经网络的输出结果的偏置程度更低,提高第一神经网络的输出结果的准确度。
具体地,下面以几种蒸馏方式为例,对步骤603进行详细说明。
一、当第一蒸馏方式为样本蒸馏。
其中,基于数据集中的样本蒸馏的方式可以有多种,有偏数据集和无偏数据集中的样本包括输入特征和实际标签,可以将无偏数据集中的样本的输入特征作为老师模型的输入,训练老师模型,将有偏数据集中样本的输入特征作为学生模型的输入,学生模型即第一神经网络,从而完成对第一神经网络的知识蒸馏,得到更新后的第一神经网络。
在一种可能的实施方式中,进行知识蒸馏的具体过程可以包括:交替使用有偏数据集和无偏数据集对第一神经网络进行训练,得到更新后的第一神经网络,其中,在一个交替过程中,使用有偏数据集对第一神经网络进行训练的批训练次数,和使用无偏数据集对第一神经网络进行训练的批训练次数为预设比例,且在训练第一神经网络时,样本的输入特征作为第一神经网络的输入。
因此,在本申请实施方式中,可以通过交替使用有偏数据集和无偏数据集对第一神经网络进行训练,在通过无偏数据集进行训练时,可以纠正使用有偏数据集进行训练时产生的偏置,使最终得到的第一神经网络的输出结果的偏置程度更低,输出结果更准确。
在一种可能的实施方式中,当预设比例为1时,在第一神经网络的损失函数中增加第一正则项和第二正则项的差值,第一正则项是使用无偏数据集包括的样本对第一神经网络进行训练得到的参数,第二正则项是使用有偏数据集包括的样本对第一神经网络进行训练得到的参数。
因此,在本申请实施方式中,可以在以1:1的比例交替训练时,在损失函数中引入第一正则项和第二正则项的差值,约束使用无偏数据集训练得到的参数和使用有偏数据集训练得到的参数相互接近,从而对使用有偏数据集进行的训练形成纠偏,使更新后的第一神经网络的输出的偏置程度更低。
在一种可能的实施方式中,进行知识蒸馏的具体过程可以包括:为有偏数据集中的全部或者部分样本设置置信度,该置信度用于表示样本的偏置程度;基于有偏数据集、有偏数据集中的样本的置信度和无偏数据集,对第一神经网络进行训练,得到更新后的第一神经网络,且在对第一神经网络进行训练时样本包括输入特征作为第一神经网络的输入。
二、当第一蒸馏方式为标签蒸馏。
其中,可以使用无偏数据集对第二神经网络进行训练,然后通过训练后的第二神经网络输出有偏数据集中的样本的预测标签,然后使用该预测标签作为约束,对第一神经网络进行训练,得到更新后的第一神经网络。
在一种可能的实施方式中,前述的样本集中还包括未观测数据集,未观测数据集中包括多个未观测样本,进行知识蒸馏的具体过程可以包括:通过有偏数据集对第一神经网络进行训练,得到训练后的第一神经网络,以及通过无偏数据集对第二神经网络进行训练,得到更新后的第二神经网络;从全量数据集中采集多个样本,得到辅助数据集;使用辅助数据集,以辅助数据集中样本的预测标签作为约束,更新训练后的第一神经网络,得到更新后的第一神经网络。通常,辅助数据集中样本具有至少两个预测标签,且该至少两个预测标签由更新后的第一神经网络和第二神经网络分别输出。
因此,本申请实施方式中,可以通过引入未观测数据集的方式,通过未观测数据集所包括的样本来降低偏置数据集在对第一神经网络训练时的偏置影响,降低更新后的第一神经网络的输出结果的偏置程度。
在一种可能的实施方式中,进行知识蒸馏的具体过程可以包括:通过无偏数据集对第二神经网络进行训练,得到更新后的第二神经网络;通过更新后的第二神经网络输出有偏数据集中样本的预测标签;将样本的预测标签和样本的实际标签进行加权合并,得到样本的合并标签;使用样本的合并标签训练第一神经网络,得到更新后的第一神经网络。
因此,本申请实施方式中,可以使用第二神经网络输出的有偏数据集中的样本的预测标签与样本的实际标签进行合并后的标签对第一神经网络进行更新,可以理解为老师模型使用预测标签的方式对第一神经网络的更新进行了指导,从而降低更新后的第一神经网络的输出结果的偏置程度,提高更新后的第一神经网络的输出结果的准确度。
三、当第一蒸馏方式为特征蒸馏。
其中,可以从无偏数据集中提取稳定特征,然后基于该稳定特征训练第二神经网络,得到更新后的第二神经网络。然后使用有偏数据集对第一神经网络进行训练,并将更新后的第二神经网络作为老师模型,将第一神经网络作为学生模型,进行知识蒸馏,得到更新后的第一神经网络。
在一种可能的实施方式中,进行知识蒸馏的具体过程可以包括:通过预设算法输出无偏数据集的部分样本的输入特征,该部分样本的输入特征可以理解为无偏数据集中的稳定特征,该预设算法可以是DGBR算法;根据该部分样本的输入特征对第二神经网络进行训练,得到更新后的第二神经网络;将更新后的第二神经网络作为老师模型,第一神经网络作为学生模型,使用有偏数据集对第一神经网络进行知识蒸馏,得到更新后的第一神经网络。
因此,在本申请实施方式中,可以使用无偏数据集中的稳定特征来训练第二神经网络,得到更新后的第二神经网络,即老师模型,因此,老师模型的输出更稳定,准确度更高。在此基础上,使用该老师模型进行知识蒸馏,得到的学生模型的输出也更稳定,准确度更高。
四、当第一蒸馏方式为模型结构蒸馏。
其中,可以使用无偏数据集对第二神经网络进行训练,得到更新后的第二神经网络。然后将更新后的第二神经网络作为老师模型,第一神经网络作为学生模型,使用有偏数据集,以及老师模型的中间层的输出结果,对第一神经网络进行知识蒸馏,得到更新后的第一神经网络。
因此,在本实施方式中,可以使用无偏数据集所包括的无偏样本,对第一神经网络的知识蒸馏过程进行指导,从而使更新后的第一神经网络可以输出无偏置的结果,实现对输入样本的纠偏,提高第一神经网络的输出准确率。
因此,在本申请中,可以使用无偏数据集所包括的无偏样本,对第一神经网络的知识蒸馏过程进行指导,从而使更新后的神经网络可以输出无偏置的结果,实现对输入样本的纠偏,提高第一神经网络的输出准确率。此外,本申请提供的神经网络蒸馏方法中,可以 选择与无偏数据集和有偏数据集匹配的蒸馏方式,针对不同的场景可以使用不同的蒸馏方式,适应不同的场景,提高了对神经网络进行知识蒸馏的泛化能力。在不同条件下选择不同的知识蒸馏方式,根据数据集的大小、正负例比、不同数据的占比等条件进行适配,最大化知识蒸馏的效益。
在一种可能的实施方式中,无偏数据集中的样本的类型和有偏数据集中的样本的类型不相同。例如,无偏数据集所包括的样本类型为音乐,有偏数据集所包括的样本的类型为视频。因此,本申请实施方式中,可以使用不同领域的数据进行知识蒸馏,实现跨领域的神经网络的训练,从而可以实现跨领域的用户推荐,提高用户体验。
在一种可能的实施方式中,在得到更新后的第一神经网络之后,可以获取目标用户的至少一个样本,将该至少一个样本作为该更新后的第一神经网络的输入,输出目标用户的至少一个标签,使用该至少一个标签构建目标用户的用户画像,该用户画像用于描述该目标用户,或者为用户推荐匹配的样本。例如,可以获取用户A点击过的APP,将用户点击过的APP作为更新后的第一神经网络的输入,输出用户A的一个或者多个标签,该一个或者多个标签可以用于表示用户点击对应的APP的概率,当该概率超过预设概率时,即可使用对应的APP的特征作为用户A的特征,从而构建用户A的用户画像,该用户画像所包括的特征用于描述用户,或者为用户推荐匹配的APP等。
本申请实施方式中,可以使用更新后的第一神经网络,生成用户画像,从而通过该用户画像来描述用户,或者为用户推荐匹配的样本。因该更新后的神经网络是进行了纠偏后的神经网络,可以降低输出结果的偏置,从而使得到的用户画像更准确,提高用户的推荐体验。
前述对本申请提供的神经网络蒸馏方法的流程进行了介绍,下面结合更具体的应用场景,对本申请提供的神经网络蒸馏方法进行更详细的介绍。
首先,获取有偏数据集801和无偏数据集802。
预先设定的蒸馏方式可以包括样本蒸馏803、标签蒸馏804、特征蒸馏805和模型结构蒸馏806。
结合有偏数据集801和无偏数据集802从样本蒸馏803、标签蒸馏804、特征蒸馏805和模型结构蒸馏806中选择匹配的蒸馏方式,然后进行知识蒸馏807,得到更新后的第一神经网络。
下面对本申请实施例中所涉及的数据和步骤进行详细介绍。
具体地,有偏数据集801可以包括构建或者采集到的样本。例如,该有偏数据集801可以是用户点击或者下载过的APP;用户点击或者播放过的音乐;用户点击或者播放过的视频;用户点击或者保存过的图片等。为便于理解,以下将有偏数据集称为S_c。
无偏数据集可以是使用随机流量(uniform data)方式采集到的数据集,即从候选集中随机采样多个样本,然后随机展示该多个样本并收集用户的反馈。例如,以为用户推荐APP为例,可以从候选集中随机采样多个APP,并在推荐界面中对该多个APP的图片进行随机排列显示,然后收集用户点击或者下载过的APP,得到无偏置的样本,组成无偏数据集。又例如,以为用户推荐图片的场景为例,可以从候选集中随机采样多个图片,并在推荐界面中对该多个图片的缩略图进行随机排列显示,然后收集用户点击或者下载过的图片,得到无偏数据集。为便于理解,以下将无偏数据集称为S_t。
可选地,S_c和S_t可以是不同领域的数据,例如,S_c可以是采集到的用户点击或者播放过的音乐,S_t可以是用户点击过的图片或者视频等。因此,后续可以实现跨领域的知识蒸馏,从而使第一神经网络可以输出与输入数据不同领域的预测结果。例如,在跨领域的推荐系统中,可以根据用户对一类物品的喜好,预测其对另一类物品的喜好,缓解新应用场景的冷启动问题,提高用户体验。
在得到S_c和S_t之后,基于该S_c和S_t从多种蒸馏方式中选择合适的蒸馏方式。
例如,计算S_t与S_c样本量的比例,当S_t的样本量所占的比例较小时,使用S_t训练的模型的方差将较大,不适合标签蒸馏,而更加适合样本蒸馏,即选择样本蒸馏803作为蒸馏方式。当S_t样本量所占的比例较大时,选择标签蒸馏804作为蒸馏方式。
又例如,计算S_t中的正负样本的比例,当该比例较大时,由于样本分布不均匀,样本蒸馏或者标签蒸馏效果变差,此时可以选择模型结构蒸馏作为蒸馏方式。或者,计算S_c中的正负样本的比例,当该比例较大时,由于样本分布不均匀,样本蒸馏或者标签蒸馏效果变差,此时可以选择模型结构蒸馏作为蒸馏方式。
还例如,通常,数据集中所包括的样本的特征维度数量越大,最终训练得到的模型也将越复杂,模型的输出效果也将提升。因此,当S_t和S_c所包括的样本的特征维度数量较大时,可以选择特征蒸馏,从而使最终得到的模型输出效果更好。
在选择了合适的蒸馏方式之后,即可使用该蒸馏方式对第一神经网络进行知识蒸馏,得到更新后的第一神经网络。
下面对使用各种蒸馏方式进行蒸馏的具体过程进行详细介绍。
一、样本蒸馏
其中,进行样本蒸馏可以分为多种方式,下面示例性地对几种可能的实施方式进行介绍。
1、因果表征(causal embedding)策略
其中,可以使用S_c和S_t对相同的模型进行交替训练,并使用S_t的训练结果约束使用S_c进行的训练。
具体地,首先选定第一神经网络的结构,该第一神经网络可以是CNN,也可以是ANN等。然后交替使用S_c和S_t对该第一神经网络进行训练。为便于理解,以一次交替过程为例,将使用S_t训练得到的模型表示为M_t,将使用S_c训练得到的模型表示为M_c,M_t可以理解为老师模型,M_c可以理解为学生模型。
其中,在进行训练时,可以使用目标函数对第一神经网络进行训练,该目标函数除了包括损失函数,还可以包括约束项,该约束项用于对第一神经网络的更新形成约束,以使在交替训练的过程中,M_c和M_t的参数接近或者一致。然后基于目标函数的值对权重参数和结构参数进行求导以及梯度更新等,得到更新后的参数,如权重参数或者结构参数等,从而得到更新后的第一神经网络。
例如,该目标函数可以表示为:

$$\min_{\theta_c,\theta_t}\ \frac{1}{|S_c|}\sum_{(x,y)\in S_c}\ell\big(y,\hat{y}^{c}\big)+\frac{1}{|S_t|}\sum_{(x,y)\in S_t}\ell\big(y,\hat{y}^{t}\big)+\lambda_c\,\Omega(\theta_c)+\lambda_t\,\Omega(\theta_t)+\lambda_{\|t-c\|}\,\big\|\theta_t-\theta_c\big\|_2^2$$

其中,|S_c|和|S_t|分别指的是S_c和S_t的样本量,ŷ^c是S_c代入第一神经网络的输出,ŷ^t是S_t代入第一神经网络的输出,ℓ(y,ŷ^c)和ℓ(y,ŷ^t)分别指的是S_c和S_t代入第一神经网络之后的损失函数的值,该损失函数可以是二值、交叉熵或者均值误差损失等。θ_c和θ_t分别指的是S_c和S_t模型的参数,Ω(θ_c)和Ω(θ_t)分别指的是S_c和S_t模型参数的正则项,λ_c和λ_t分别指M_c和M_t模型正则项的权重参数,λ_||t-c||指的是参数的平方差项的权重参数。在该目标函数中,除了包括针对S_c和S_t的损失函数之外,还可以包括针对S_c和S_t的正则项,以及参数的平方差项,从而在后续更新第一神经网络的参数时形成约束,使M_c和M_t的参数更接近或者一致。
因此,在本申请实施方式中,可以使用S_c和S_t对第一神经网络交替进行训练,从而使用S_t训练得到的模型对使用S_c进行训练的模型进行指导,完成对学生模型的纠偏,降低学生模型的输出结果的偏置。
2、交替组合(delayed combination)策略
此策略的蒸馏方式与前述因果表征策略类似,区别在于,前述的因果表征策略可以以1:1的批训练次数比例进行交替训练,而在本策略中,可以使用s:1的批训练次数比例进行交替训练,该s为大于1的整数。例如,s可以取1-20范围内的整数。该批训练次数可以理解为每一轮训练过程中,对神经网络进行迭代训练的迭代次数。通常,神经网络的训练过程分为多个epoch,每个epoch包含多个batch,该batch就是批训练。例如,若训练使用的数据集中包括了6000张图片,每个epoch训练所使用的图片数量可以是6000张,一个batch过程使用了600张图片,总共包括10个batch,即批训练次数为10。
相应地,第一神经网络的目标函数可以设置为与前述因果表征策略相同的形式,区别在于一个交替过程中两类数据集的批训练次数之比满足:

$$S_c^{step}:S_t^{step}=s:1$$

其中,S_t^step表示使用S_t对第一神经网络进行训练的批训练次数,S_c^step表示使用S_c对第一神经网络进行训练的批训练次数,该比例可以是s:1。
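为便于理解上述按预设批训练次数比例交替使用两类数据集进行训练的过程,下面给出一段示意性的Python(PyTorch)训练循环。其中假设占s份批训练次数的一方为有偏数据集S_c,该方向以及s、rounds的取值均为示例假设;s=1时即对应前述因果表征策略的1:1交替:

```python
from itertools import cycle
import torch

def alternating_train(model, loss_fn, optimizer, loader_c, loader_t, s=4, rounds=1000):
    """交替训练示意:每个交替过程中,先用有偏数据集S_c训练s个batch,
    再用无偏数据集S_t训练1个batch,以无偏数据对有偏训练过程纠偏。"""
    it_c, it_t = cycle(loader_c), cycle(loader_t)   # 循环读取两类数据集的batch
    for _ in range(rounds):
        for _ in range(s):                          # 使用有偏数据集S_c的批训练
            x, y = next(it_c)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        x, y = next(it_t)                           # 使用无偏数据集S_t的批训练
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
```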
3、加权组合(weighted combination)策略
其中,为S_c和S_t中全部或者部分样本添加一个置信度变量α_ij,取值在[0,1]范围内,该α_ij用于指示样本的偏置程度。
例如,针对第一神经网络进行更新所使用的目标函数可以表示为:

$$\min_{\theta}\ \frac{1}{|S_c|}\sum_{(i,j)\in S_c}\alpha_{ij}\,\ell\big(y_{ij},\hat{y}_{ij}\big)+\frac{1}{|S_t|}\sum_{(i,j)\in S_t}\ell\big(y_{ij},\hat{y}_{ij}\big)+\lambda\,\Omega(\theta)$$

通常,S_t中的样本的置信度变量可以设置为1。S_c的样本的置信度通过两种不同的机制来设置:在全局机制中,该置信度被设置成一个预定义的在[0,1]内的值;在局部机制中,样本关联一个独立的置信度,并在模型训练过程中学习。该置信度变量用于在使用S_c对第一神经网络进行训练时,针对S_c进行约束,从而在对第一神经网络的训练过程中,可以使用S_c和S_t中的样本结合样本的置信度,对第一神经网络进行训练。可以理解为,可以使用置信度变量来反映S_c中的样本的偏置程度,从而在后续的训练过程中,通过该置信度变量对使用S_c进行的训练进行约束,实现去偏置的效果,降低更新后的第一神经网络的输出结果的偏置程度。
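下面以一段示意性的Python(PyTorch)代码说明上述按置信度加权的损失计算方式。其中以二分类交叉熵作为损失函数仅为示例;alpha_c在全局机制中可为预定义常数,在局部机制中可定义为torch.nn.Parameter随模型一起学习,这些均为示例假设:

```python
import torch
import torch.nn.functional as F

def weighted_combination_loss(model, batch_c, batch_t, alpha_c):
    """加权组合策略的损失示意:S_t样本的置信度固定为1,
    S_c样本的置信度alpha_c∈[0,1]用于刻画其偏置程度。"""
    x_c, y_c = batch_c
    x_t, y_t = batch_t
    loss_c = F.binary_cross_entropy_with_logits(model(x_c), y_c, reduction='none')
    loss_t = F.binary_cross_entropy_with_logits(model(x_t), y_t, reduction='none')
    # 有偏样本的逐样本损失按置信度加权,再与无偏样本损失相加
    return (alpha_c * loss_c).mean() + loss_t.mean()
```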
二、标签蒸馏
其中,标签蒸馏是指使用基于无偏数据集中的样本的预测标签作为指导对学生模型进行蒸馏,该预测标签由老师模型输出,该老师模型是基于无偏数据集进行训练得到。
具体地,标签蒸馏也可以使用多种策略,示例性地对几种可能的策略进行说明。
1、搭桥(Bridge)策略
在本策略中,使用S_c和S_t分别进行训练得到M_c和M_t。
引入未观测数据集,该未观测数据集中包括了多个未观测样本。例如,以为用户推荐APP为例,可以在推荐界面中展示为用户推荐的APP的图标,用户点击或者下载过的APP即可以理解为前述的偏置样本,而推荐界面中用户未点击过的APP即未观测样本。
以下将S_c、S_t和未观测数据集的组合称为全量数据集,然后对该全量数据集进行随机采样,采集多个样本,得到辅助数据集S_a。通常,由于数据稀疏性,S_a中的大部分数据都属于未观测样本。
其中,在对第一神经网络进行更新时,可以使用S_a来进行训练,从而约束M_c和M_t对S_a中的样本的预测结果相同或者接近。所使用的目标函数可以包括:

$$\min_{\theta_c}\ \frac{1}{|S_c|}\sum_{(i,j)\in S_c}\ell\big(y_{ij},\hat{y}_{ij}^{c}\big)+\frac{\lambda_a}{|S_a|}\sum_{(i,j)\in S_a}\ell\big(\hat{y}_{ij}^{c},\hat{y}_{ij}^{t}\big)$$

其中,|S_a|指的是辅助数据集S_a的样本量,ℓ(ŷ_ij^c,ŷ_ij^t)表示S_a中的样本在S_c训练的模型和S_t训练的模型上的预测标签的误差函数,ŷ_ij^c为S_a代入第一神经网络的输出结果,ŷ_ij^t为S_a代入第二神经网络的输出结果。因此,在本策略中,引入了未观测数据集来进行纠偏,减少了M_c模型和M_t模型之间的差异,在目标函数中引入了S_a中的样本在M_c模型和M_t模型上的预测标签的误差函数,来对第一神经网络的训练形成约束,从而降低第一神经网络的输出结果的偏置。
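下面给出搭桥策略目标函数的一段示意性Python(PyTorch)实现。其中以均方误差作为S_a上两个模型预测标签的误差函数、以lam作为该约束项的权重,均为示例假设:

```python
import torch
import torch.nn.functional as F

def bridge_loss(model_c, model_t, batch_c, batch_a, lam=1.0):
    """搭桥策略示意:在S_c上的常规损失之外,增加辅助数据集S_a上
    M_c与M_t预测标签的误差项,约束两个模型在未观测样本上预测一致。"""
    x_c, y_c = batch_c
    x_a = batch_a                                  # S_a中的样本多为未观测样本,无实际标签
    loss_sup = F.binary_cross_entropy_with_logits(model_c(x_c), y_c)
    p_c = torch.sigmoid(model_c(x_a))
    with torch.no_grad():                          # 老师模型M_t不参与梯度更新
        p_t = torch.sigmoid(model_t(x_a))
    loss_pair = F.mse_loss(p_c, p_t)               # 两个模型预测标签的误差函数
    return loss_sup + lam * loss_pair
```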
2、重定义(Refine)策略
使用S_t预训练得到M_t,然后使用M_t对S_c进行预测,得到S_c中样本的预测标签;将该预测标签和S_c的真实标签进行加权合并;然后使用新标签训练M_c。注意由于预测标签和S_c的实际标签可能存在分布上的差异,需要对预测标签进行归一化,从而减少预测标签和实际标签之间的差异。
具体地,对第一神经网络进行训练使用的目标函数可以表示为:

$$\min_{\theta_c}\ \frac{1}{|S_c|}\sum_{(i,j)\in S_c}\ell\Big(\alpha\,\sigma\big(\hat{y}_{ij}^{t}\big)+(1-\alpha)\,y_{ij},\ \hat{y}_{ij}^{c}\Big)$$

其中,α表示预测标签的权重系数,σ(ŷ_ij^t)表示对于预测标签ŷ_ij^t的归一化处理,y_ij表示S_c中的样本的实际标签,ŷ_ij^t表示M_t输出的S_c中的样本的预测标签。
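下面以一段示意性的Python(PyTorch)代码说明上述预测标签与实际标签的加权合并过程。其中采用min-max方式对预测标签做归一化、α取0.3等均为示例假设:

```python
import torch
import torch.nn.functional as F

def refine_labels(model_t, x_c, y_c, alpha=0.3):
    """重定义策略示意:用老师模型M_t输出S_c样本的预测标签,
    归一化后与实际标签按权重alpha加权合并,得到训练M_c用的新标签。"""
    with torch.no_grad():
        p = torch.sigmoid(model_t(x_c))            # M_t输出的预测标签
        # 对预测标签做归一化,减少其与实际标签之间的分布差异
        p = (p - p.min()) / (p.max() - p.min() + 1e-8)
    return alpha * p + (1 - alpha) * y_c           # 合并标签

# 训练M_c时以合并标签为目标,例如:
# loss = F.binary_cross_entropy_with_logits(model_c(x_c), refine_labels(model_t, x_c, y_c))
```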
三、特征蒸馏
其中,可以从S_t中筛选出稳定特征,然后使用该稳定特征进行训练得到M_t,即老师模型,然后在S_c上训练一个M_c,并使用M_t对M_c进行知识蒸馏,得到蒸馏后的M_c。
为便于理解,该稳定特征可以理解为:使用不同的数据集对神经网络进行训练,得到不同的神经网络,而该不同的神经网络的输出结果相差较小,则该不同的数据集中的相同的特征,即可理解为有代表性的稳定特征。例如,可以使用深度全局平衡回归(deep global balancing regression,DGBR)算法从S_t中筛选出有代表性的稳定特征。
使用特征蒸馏方式对第一神经网络进行知识蒸馏的具体过程例如:可以通过DGBR算法从S_t中筛选出具有稳定特征的样本,然后基于该具有稳定特征的样本训练第二神经网络,并将训练后的第二神经网络作为老师模型,第一神经网络作为学生模型,使用S_c训练第一神经网络,并对该第一神经网络进行知识蒸馏,得到更新后的第一神经网络。具体例如,确定学生模型和老师模型中一部分神经网络层之间的对应关系,需要说明的是,这里的对应关系是指该神经网络层在学生模型和老师模型中的相对位置是相同或相似的,例如,若学生模型和老师模型是不同类型的网络,而学生模型和老师模型包括的神经网络层的数量是相同的,在这种情况下,学生模型中的第一神经网络层为从输入层开始计数的第N层,老师模型中的第二神经网络层为从输入层开始计数的第N层,此时第一神经网络层和第二神经网络层为具有对应关系的神经网络层;其中,上述神经网络层可以包括中间层和输出层,在进行知识蒸馏时,学生模型和老师模型分别对待处理数据进行处理,并使用具有对应关系的神经网络层的输出来构建损失函数,通过损失函数对学生模型进行知识蒸馏,直到满足预设条件。此时,知识蒸馏后的学生模型和老师模型在对相同的待处理数据进行处理时,上述具有对应关系的神经网络层的输出是相似或相同的,以此,使得知识蒸馏后的学生模型可以具有和老师模型相同或相似的数据处理能力。以上述第一神经网络层和第二神经网络层为例,知识蒸馏后的学生模型和老师模型在对相同的待处理数据进行处理时,第一神经网络层和第二神经网络层的输出是相似的,由于具有对应关系的神经网络层的数量可以是多个,使得知识蒸馏后的学生模型和老师模型中的一部分或全部神经网络层具有相同或相似的数据处理能力,进而使得知识蒸馏后的学生模型和老师模型具有相同或相似的数据处理能力。
因此,在本蒸馏方式中,可以使用稳定特征训练得到老师模型,从而使用该基于稳定特征训练得到的老师模型来对学生模型进行蒸馏,使后续得到的学生模型也可以在老师模型的指导下,输出无偏置或者偏置较低的结果。
四、模型结构蒸馏
在本蒸馏方式中,可以使用S_t进行训练得到M_t。然后使用M_t的中间层的输出结果,对M_c的训练进行指导。
例如,为了对齐M_t和M_c的特征嵌入(feature embedding),可以在S_t上训练得到M_t的特征嵌入,然后使用该特征嵌入作为M_c的变量的初始化值;也可以在S_c上训练得到特征嵌入,并对M_c的变量进行随机初始化,然后对该初始化值和随机初始化的值进行加权运算,使用加权运算的结果去训练M_c,得到训练后的M_c。
又例如,可以在M_c和M_t中选择需要进行对齐的Hint层进行配对(一个或者多个,且M_c和M_t对应的网络层索引可以无需保持一致),然后添加配对项到M_c的目标函数中,该配对项可以表示为α·y_t+(1-α)·y_c,α∈(0.5,1),y_t表示M_t的Hint层的输出结果,y_c表示M_c的Hint层的输出结果,α表示y_t所占的比例。
还例如,可以引入温度变量和softmax操作得到M_t预测的软标签,即由M_t的softmax层之前的网络层输出的标签,然后在训练M_c的过程中,约束M_c的softmax层之前的网络层输出的标签与M_t的softmax层之前的网络层输出的标签相同或者接近。如可以在M_c的目标函数中添加相应的配对项,该配对项可以表示为ω·y_t+(1-ω)·y_c,ω∈(0.5,1),y_t表示M_t的softmax层之前的网络层的输出结果,y_c表示M_c的softmax层之前的网络层的输出结果,ω表示y_t所占的比例。
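对于上述基于温度变量和softmax的软标签配对项,下面给出一段示意性的Python(PyTorch)实现。其中temperature与ω的取值为示例假设;经典知识蒸馏中软标签项常再乘以温度的平方以平衡梯度,此处为简洁从略:

```python
import torch
import torch.nn.functional as F

def soft_label_pair_loss(logits_c, logits_t, labels, temperature=2.0, omega=0.7):
    """软标签配对项示意:引入温度变量得到M_t的软标签,
    约束M_c的softmax前输出与之接近,并与真实标签损失加权合并。"""
    soft_t = F.softmax(logits_t.detach() / temperature, dim=-1)   # 老师M_t的软标签
    log_soft_c = F.log_softmax(logits_c / temperature, dim=-1)
    loss_soft = F.kl_div(log_soft_c, soft_t, reduction='batchmean')
    loss_hard = F.cross_entropy(logits_c, labels)
    # 配对项:ω*软标签损失 + (1-ω)*真实标签损失
    return omega * loss_soft + (1 - omega) * loss_hard
```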
因此,在本蒸馏方式中,可以使用老师模型的中间层,来指导学生模型的中间层的训练,因老师模型是使用无偏数据集训练得到的,因此,老师模型在对学生模型的指导过程中,将对学生模型的输出结果形成约束,降低学生模型的输出结果的偏置,提高学生模型的输出结果的准确性。
在通过上述的其中一种方式进行知识蒸馏,得到更新后的第一神经网络之后,即可使用该第一神经网络进行后续的预测。例如,可以应用于推荐场景中,为用户推荐音乐、视频或者图像等。
前述对本申请提供的神经网络蒸馏方法的流程进行了详细介绍,下面结合前述流程,对本申请提供的神经网络蒸馏方法的应用场景进行示例性说明。
例如,可以建立针对用户的“终身学习项目”,基于用户在视频、音乐、新闻等域的历史数据,通过各种模型和算法,仿照人脑机制,构建认知大脑,搭建用户终身学习系统框架。
该终身学习项目示例性地划分为四个阶段,即使用用户的历史数据进行学习(第一阶段),监控用户的实时数据(第二阶段),预测用户的未来数据(第三阶段),以及为用户进行决策(第四阶段)。本申请提供的神经网络蒸馏方法,可以应用于第一阶段、第三阶段或者第四阶段。
例如,可以根据音乐APP、视频APP和浏览器APP等多域平台获取用户的数据(包含 端侧短信、照片、邮件事件等信息),一方面使用获取到的数据构建用户画像,另一方面实现基于用户信息过滤、关联分析、跨域推荐、因果推理等的学习与记忆模块,构建用户个人知识图谱。
示例性地,如图9所示,若用户进入推荐系统的界面,会触发一个推荐的请求,推荐系统会将该请求及其相关信息输入到预测模型,然后预测用户对系统内的商品的点击率。下一步,根据预测的点击率或基于该点击率的某个函数将商品降序排列,推荐系统按顺序将商品展示在不同的位置作为对用户的推荐结果。用户浏览不同的位置并发生用户行为,如浏览、点击下载等。同时,用户的实际行为会存入日志中作为训练数据,通过离线训练不断更新预测模型的参数,提高模型的预测效果。本申请对应于推荐系统的离线训练,同时会改变预测模型的预测逻辑。具体例如,用户打开手机浏览器APP即可触发浏览器的推荐模块,浏览器的推荐模块根据用户的历史下载记录、用户点击记录,应用的自身特征,时间、地点等环境特征信息,预测用户下载给定的各个候选新闻/文章的可能性。根据计算的结果,浏览器按照可能性按序展示,达到提高应用下载概率的效果。具体来说,将更有可能下载的新闻排在靠前的位置,将不太可能下载的新闻排列在靠后的位置。而用户的行为也会存入日志并通过离线训练对预测模型的参数进行训练和更新。
更具体地,可以将本申请提供的神经网络蒸馏方法引入终身学习中,以应用于终端上的一种推荐系统为例,如图10所示,本申请提供的一种推荐系统的框架示意图。其中,终端上安装有各种APP,如第三方APP、视频APP、音乐APP、浏览器APP或者应用市场APP等,或者短信、邮件、照片、日历或者其他终端的系统APP等。用户在使用终端上安装的APP时,可以通过收集用户使用时产生的数据,从而获取到用户行为数据,如短信、照片、邮件事件、视频、浏览记录等信息。当然,采集APP的数据之前,可以获取采集权限,以保障用户的隐私。
无偏数据集和有偏数据集都可以通过上述APP采集得到。例如,在采集无偏数据集时,以在应用市场中推荐APP为例,可以从APP候选集中随机采样部分APP来为用户进行推荐,并在推荐界面中随机展示采样到的APP的图标,然后获取用户点击过的APP的信息。又例如,以音乐APP为例,可以从音乐候选集中随机采样部分音乐,然后在推荐界面中随机展示采样到的音乐的信息,如标题、歌手等信息,随后获取用户点击过的音乐的信息。例如,在采集有偏数据集时,可以按照预先设定的推荐规则,如为用户推荐与用户的标签关联度更高的APP、音乐或者视频等,采集用户点击或者下载过的音乐、APP或者视频等,得到有偏数据集。
可选地,还可以采集未观测数据集,例如,若选择了100个APP进行推荐,而推荐界面中仅显示了10个APP的图标,则剩余的90个APP即为未观测样本。
在采集到无偏数据集和有偏数据集之后,即可使用该无偏数据集和有偏数据集进行知识蒸馏,即将无偏数据集和有偏数据集输入至图10中所示的基于知识蒸馏的反事实推荐(knowledge distillation counterfactual recommend,KDCRec)模块,以进行知识蒸馏,得到训练后的第一神经网络,即图10中所示的记忆模型。知识蒸馏的过程可以参阅前述图8的介绍,此处不再赘述。可选地,还可以集合未观测数据集进行知识蒸馏,参阅前述图8 中的标签蒸馏804的相关介绍,此处不再赘述。因此,在知识蒸馏的过程中,可以通过本申请提供的神经网络蒸馏方法,纠正用户历史数据的偏置问题(包括位置偏置、选择偏置及流行偏置等),得出用户的真实数据分布。
在得到记忆模型之后,即可通过该记忆模型输出用户对应的一个或者多个预测标签,该一个或者多个标签用于构建用户画像。例如,该标签可以用于表示用户点击某个APP的概率,当该概率大于预设概率值时,即可将该标签对应的样本的特征作为用户的特征加入用户画像中。该用户画像中所包括的标签用于描述用户,如用户偏好的APP类型、音乐类型等。
可选地,还可以输出用户的特征知识化数据和知识可推理数据等,即通过关联分析、跨域学习、因果推理等技术挖掘用户特征,借助外部通用知识图谱实现基于知识的推理和呈现,并将基于通用知识的特征扩展输入到增强用户画像模块,通过可视化、动态化的方式,增强用户画像。
然后,业务服务器可以基于增强的用户画像,确定为用户推荐的音乐、APP或者视频等信息,完成针对用户的精准推荐,提高用户体验。
可以理解为,本申请提供了一种基于广义知识蒸馏的反事实学习方法,用于实现无偏的跨域推荐,构建无偏的用户画像系统和无偏的个人知识图谱。对本申请展开实验,包括跨域推荐、基于因果推理的兴趣挖掘和构建用户画像系统。离线实验结果如下:在用户画像中,基于性别预测的算法较基线准确率提升超3%,年龄多分类任务较基线准确率提升近8%,引入反事实因果学习使得各年龄段的准确率方差降低50%。基于反事实因果推理的用户兴趣挖掘,替换了基于关联规则学习的算法,有效降低了用户的有效动作集,并提供了对于用户喜好标签的可解释性。
例如,以某一个应用市场为例,可以在应用市场的推荐界面中,显示多个榜单,根据用户、候选集商品和上下文特征预测用户对候选集商品的点击概率,并以此概率将候选商品降序排列,将最可能被下载的应用排在最靠前的位置。用户看到应用市场的推荐结果之后,根据个人的兴趣,选择浏览、点击或者下载等操作,这些用户行为被存入日志。
将这些累积的用户行为日志作为训练数据训练点击率预测模型,离线训练点击率预测模型时,需要用到用户行为日志。而收集的用户数据存在位置偏置、选择偏置等问题,为消除这些偏置对于点击率预测模型的影响,我们引入随机流量数据uniform data,结合本发明提出的101决策机制模块,从前述图8中的803-806中选择合适的蒸馏方式,联合用户日志数据即有偏数据,共同训练推荐模型,即第一神经网络。基于标签蒸馏的反事实技术相较于基线在受试者工作特征曲线与坐标轴围成面积(the area under the roc curve,AUC)上有8.7%的提升,基于样本蒸馏的反事实因果学习技术较基线有6%的提升,基于模型结构蒸馏的反事实因果学习技术较基线有5%的提升。
前述对本申请提供的神经网络蒸馏方法的流程以及应用场景进行了详细说明。针对前述方法得到的第一神经网络,可以应用于推荐场景,下面结合前述的方法,对本申请提供的推荐方法进行详细介绍。
图11示出了本申请实施例提供的推荐方法1100的示意图。图11所示的方法可以由推荐装置来执行,该装置可以是云服务设备,也可以是终端设备,例如,电脑、服务器等运 算能力足以用来执行推荐方法的装置,也可以是由云服务设备和终端设备构成的系统。示例性地,方法1100可以由图2或图5中的执行设备210或图5中的本地设备执行。
例如,方法1100具体可以由如图3所示的执行设备210执行,所述方法1100中的目标用户和候选推荐对象可以是如图3中的数据库230中的数据。
方法1100包括步骤S1110和步骤S1120。下面对步骤S1110至步骤S1120进行详细介绍。
S1110,获取目标用户的信息和候选推荐对象的信息。
例如,当用户进入推荐系统时,会触发推荐请求。推荐系统可以将触发该推荐请求的用户作为目标用户,将推荐系统中可以展示给用户的推荐对象作为候选推荐对象。
示例性地,目标用户的信息可以包括用户的标识,例如目标用户ID,目标用户的信息还可以包括用户个性化的一些属性信息,例如,目标用户的性别、目标用户的年龄、目标用户的职业、目标推荐用户的收入、目标用户的爱好或目标用户的教育情况等。
示例性地,候选推荐对象的信息可以包括候选推荐对象的标识,例如候选推荐对象ID。候选推荐对象的信息还可以包括候选推荐对象的一些属性信息,例如,候选推荐对象的名称或候选推荐对象的类型等。
S1120,将目标用户的信息和候选推荐对象的信息输入至推荐模型,预测目标用户对候选推荐对象有操作动作的概率。
其中,推荐模型是前述图6中得到的更新后的第一神经网络,为便于理解,以下将该更新后的第一神经网络称为推荐模型,该推荐模型的训练方式可以参阅前述步骤601-603中的相关描述,此处不再赘述。
示例性地,可以通过预测目标用户对候选推荐对象有操作动作的概率对候选推荐集合中的候选推荐对象进行排序,从而得到候选推荐对象的推荐结果。例如,选择概率最高的候选推荐对象展示给用户。比如,候选推荐对象可以是候选推荐应用程序。
如图12所示,图12示出了应用市场中的“推荐”页,该页面上可以有多个榜单,比如,榜单可以包括精品应用和精品游戏。以精品游戏为例,应用市场的推荐系统根据用户的信息和候选推荐应用程序的信息预测用户对候选推荐应用程序有下载(安装)行为的概率,并以此概率将候选推荐应用程序降序排列,将最可能被下载的应用程序排在最靠前的位置。
示例性地,在精品应用中推荐结果可以是App5位于精品游戏中的推荐位置一、App6位于精品游戏中的推荐位置二、App7位于精品游戏中的推荐位置三、App8位于精品游戏中的推荐位置四。当用户看到应用市场的推荐结果之后,可以根据自身的兴趣爱好对上述推荐结果进行操作动作,用户的操作动作执行后会被存入用户行为日志中。
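为便于理解上述根据预测概率对候选推荐对象降序排列并展示的过程,下面给出一段示意性的Python(PyTorch)代码。其中将用户特征与候选对象特征直接拼接作为推荐模型的输入、top_k取4等均为示例假设:

```python
import torch

def recommend(model, user_feat, candidate_feats, top_k=4):
    """推荐阶段示意:将目标用户信息与每个候选推荐对象的信息拼接后
    输入推荐模型,预测操作(如下载)概率,并按概率降序取前top_k展示。"""
    inputs = torch.stack([torch.cat([user_feat, c]) for c in candidate_feats])
    with torch.no_grad():
        probs = torch.sigmoid(model(inputs)).squeeze(-1)   # 预测点击/下载概率
    order = torch.argsort(probs, descending=True)          # 概率降序即推荐位次序
    return order[:top_k].tolist(), probs[order[:top_k]].tolist()
```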
图12所示的应用市场可以通过用户行为日志得到训练数据训练推荐模型。
应理解,上述举例说明是为了帮助本领域技术人员理解本申请实施例,而非要将本申请实施例限于所例示的具体数值或具体场景。本领域技术人员根据所给出的上述举例说明,显然可以进行各种等价的修改或变化,这样的修改或变化也落入本申请实施例的范围内。
推荐模型是使用样本集中的有偏数据集和无偏数据集按照第一蒸馏方式对第一神经网络进行训练得到,有偏数据集中包括有偏置的样本,无偏数据集中包括无偏置的样本,第一蒸馏方式是根据样本集的数据特征确定的,有偏数据集中的样本包括第一用户的信息和第一推荐对象的信息以及实际标签,有偏数据集中的样本的实际标签用于表示第一用户是否对第一推荐对象有操作动作,无偏数据集中的样本包括第二用户的信息和第二推荐对象的信息以及实际标签,无偏数据集中的样本的实际标签用于表示第二用户是否对第二推荐对象有操作动作。
在一种可能的实施方式中,无偏数据集是在候选推荐对象集合中的候选推荐对象被展示的概率相同的情况下获得的,第二推荐对象为候选推荐对象集合中的一个候选推荐对象。
在一种可能的实施方式中,无偏数据集是在候选推荐对象集合中的候选推荐对象被展示的概率相同的情况下获得的,可以包括:无偏数据集中的样本是在候选推荐对象集合中的候选推荐对象被随机展示给第二用户的情况下获得的;或者无偏数据集中的样本是在第二用户搜索第二推荐对象的情况下获得的。
在一种可能的实施方式中,无偏数据集中的样本属于源域的数据,有偏数据集中的样本属于目标域的数据。
可以理解图6对应的方法为该推荐模型的训练阶段(如图3所示的训练模块202执行的阶段),具体训练是采用由图对应6的方法中提供的更新后的第一神经网络,即推荐模型进行的;而图11对应的方法则可以理解为是该推荐模型的应用阶段(如图3所示的执行设备210执行的阶段),具体可以体现为采用由图6对应的方法训练得到的推荐模型,并根据目标用户的信息和候选推荐对象的信息,从而得到输出结果,即目标用户对候选推荐对象有操作的概率。
下面通过三个示例(示例1、示例2和示例3)对本申请实施例的方案应用于不同场景进行说明,应理解,下面描述的推荐模型的训练方法可以视为前述图6对应的方法的一种具体实现方式,下面描述的推荐方法可以视为图11对应的方法的一种具体实现方式,为了避免不必要的重复,下面在介绍本申请实施例的三个示例时适当省略重复的描述。
示例1:
如图13所示,对于每个推荐请求,推荐系统中通常需要基于用户画像对全量库中的所有物品执行召回、精排或者人工规则等多个流程来生成最终的推荐结果,然后展示给用户。被推荐给用户的物品的数量远远小于物品的总数量,在该过程中引入了多种偏置问题,例如,位置偏置和选择偏置。
用户画像指的是用户个性化偏好的标签集合。例如,用户画像可以由用户的交互历史生成。
选择偏置指的是由于物品被展示的概率不同导致采集到的数据有偏置。理想的训练数据是在将物品按照相同的展示概率展示给用户的情况下得到的。现实情况中,由于展示位的数量限制,无法展示所有物品。推荐系统通常根据预测的用户对物品的选择率为用户进行推荐,用户只能与被展示出来的物品中进行交互,而没有得到展示的机会的物品无法被选择,即无法参与交互,这样导致物品得到展示的机会并不相同。在整个推荐流程中,例如召回、精排等多个流程,均会出现截断操作,即从候选推荐对象中选择部分推荐对象进 行展示。
位置偏置指的是由于物品展示的位置不同导致采集到的数据有偏置。推荐系统通常按照从上到下或从左到右的先后顺序依次展示推荐结果。按照人们的浏览习惯,位于前面的物品更容易被看到,用户的选择率更高。例如,在应用市场的一个榜单中,同一个应用程序(application,APP)可以展示在第一位,也可以展示在最后一位。通过随机投放策略可以验证,该APP展示在第一位的下载率远高于展示在最后一位的下载率。如图13所示,通过执行精排流程,导致物品展示位置出现差异,由此引入位置偏置。
由于偏置问题的存在,导致展示机会多的物品被用户选择的概率更高,用户选择的概率越高,在之后的推荐中该物品更容易被推荐给用户,进而获得更多的展示机会,容易被其他用户点击,加剧了偏置问题的影响,造成马太效应,导致长尾问题的加剧。长尾问题导致绝大部分小众的个性化需求无法得到满足,影响了用户体验。此外,推荐系统中的很多物品由于没有曝光机会,也就无法产生实际的商业价值,空耗存储资源和计算资源,造成资源的浪费。
示例2:
终生学习项目指的是基于用户在视频、音乐、新闻等多个领域的历史数据,通过各种模型和算法,仿照人脑机制,构建认知大脑,实现终生学习的目标的项目。
图14中示出了一种终生学习框架的示意图。在该框架中包括视频APP、阅读APP和浏览器APP等多个推荐场景。传统的推荐学习方案是在每个推荐场景或者说每个域中学习用户在该域的历史行为中隐藏的规律,然后根据学习到的规律进行推荐,整个学习与实施过程完全不考虑域间的知识迁移与共享。
然而,在新的推荐场景上线初期,用户的交互历史匮乏,如果仅基于本域的交互历史学习得到的推荐模型难以发现用户的历史行为中隐藏的规律,进而导致预测结果不准确,也就是新的推荐场景中存在冷启动问题。
跨域推荐是学习用户在源域的偏好并应用于目标域的推荐方法。通过跨域推荐能够利用在源域学习到的规律指导目标域中的推荐结果,实现域间的知识迁移和共享,解决冷启动问题。
例如,通过根据用户在阅读App的推荐场景中的阅读喜好来预测其对音乐、视频的偏好,从而解决该用户在音乐App的推荐场景中的冷启动问题。
如图15所示,在阅读APP的推荐场景中,为用户A推荐图书,基于用户A的交互历史数据可以学习用户A在阅读APP的推荐场景中的兴趣偏好,基于该用户A在阅读APP的推荐场景中的兴趣偏好可以确定与该用户A的兴趣相同的邻居用户。在音乐APP的推荐场景中,为用户推荐音乐,基于邻居用户在音乐APP的推荐场景中的交互历史数据学习邻居用户的在音乐APP的推荐场景中的兴趣偏好,然后基于学习到的兴趣偏好指导在音乐APP的推荐场景中为用户A提供推荐结果。阅读APP的推荐场景即为源域,音乐APP的推荐场景即为目标域。然而,源域和目标域的数据分布往往不一致,那么源域的数据分布相对于目标域的数据分布是有偏的,直接利用上述关联规则等方式实现跨域推荐会导致在学习过程中引入偏置。模型会更多的考虑用户在源域的兴趣偏好来进行推荐,也就是说训练出来的 模型是有偏的,这样导致在源域的数据上学习的模型不能在目标域得到有效泛化,模型存在失真风险。
示例3:
下面以阅读APP的推荐场景作为源域、视频APP的推荐场景作为目标域为例对步骤S1110的一种实现方式进行说明。
阅读APP的推荐场景指的是为用户推荐图书的推荐场景,视频APP的推荐场景指的是为用户推荐视频的推荐场景。
如图16所示,有偏样本是根据视频APP的推荐场景(目标域)中的用户的交互历史得到的。
表1示出了视频APP的推荐场景中基于用户的交互历史(例如,用户行为日志)获得的数据。
表1
| 标签 | 用户ID | 视频ID | 标签 | 制片商 | 演员 | 评分 |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 13718bbd | 5316a17f | 惊悚 | 1号制片商 | 张三 | 6.8 |
| 0 | 0b153874 | 93bad2c0 | 文艺 | 2号制片商 | 李四 | 7.1 |
表1中的一行即为一个样本。以该训练样本为有偏样本为例,有偏样本包括第一用户的信息和第一推荐对象的信息。第一用户的信息包括第一用户的ID,第一推荐对象为视频,第一推荐对象的信息包括第一推荐对象的ID、第一推荐对象的标签、第一推荐对象的制片商、第一推荐对象的演员和第一推荐对象的评分。也就是说有偏样本中共包括6类特征。
应理解,表1仅为示意,用户的信息以及推荐对应的信息还可以包括比表1更多或更少项的信息,或者说更多或更少类特征信息。
进一步地,可以以libSVM格式(即“标签 特征索引:特征取值 …”形式的稀疏文本格式)逐行存储处理后的数据,例如表1中的每一行样本都可以按照该格式存储为一行文本。
基于上述数据可以得到n个有偏样本,组成有偏数据集。
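下面给出将上述表格行编码为libSVM格式的一段示意性Python代码。其中通过字典为类别特征分配one-hot索引、数值特征直接保留取值的编码方式,均为示例假设,并非本申请限定的数据处理流程:

```python
def to_libsvm(label, features, vocab):
    """将一行样本编码为libSVM格式"标签 索引:取值"的示意实现。"""
    parts = [str(label)]
    for name, value in features:
        if isinstance(value, (int, float)):            # 数值特征直接保留取值
            idx = vocab.setdefault(name, len(vocab) + 1)
            parts.append(f"{idx}:{value}")
        else:                                          # 类别特征按one-hot取值1
            idx = vocab.setdefault((name, value), len(vocab) + 1)
            parts.append(f"{idx}:1")
    return " ".join(parts)

vocab = {}
row = [("用户ID", "13718bbd"), ("视频ID", "5316a17f"), ("标签", "惊悚"),
       ("制片商", "1号制片商"), ("演员", "张三"), ("评分", 6.8)]
print(to_libsvm(1, row, vocab))   # 输出形如 "1 1:1 2:1 3:1 4:1 5:1 6:6.8"
```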
如图16所示,无偏样本是根据阅读APP的推荐场景(源域)中的用户的交互历史得到的。需要说明的是,图16中仅为示意,源域中的数据还可以包括其他推荐场景的数据,也可以包括多个推荐场景的数据,例如,源域的数据可以包括阅读APP的推荐场景中的用户历史数据和音乐APP的推荐场景中的用户历史数据。
应理解,图16中仅为示意,也可以不将无偏样本作为验证集中的数据。
表2示出了阅读APP的推荐场景中基于用户的交互历史(例如,用户行为日志)获得的数据。
表2
| 标签 | 用户ID | 图书ID | 标签 | 出版社 | 作者 | 评分 |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 25c83c98 | 68fd1e64 | 悬疑 | 1号出版社 | 张三 | 6.8 |
| 0 | efea433b | 0b153874 | 艺术 | 2号出版社 | 李四 | 7.1 |
表2中的一行即为一个训练样本。该样本为无偏样本,无偏样本包括第二用户的信息和第二推荐对象的信息。第二用户的信息包括第二用户的ID,第二推荐对象为图书,第二推荐对象的信息包括第二推荐对象的ID、第二推荐对象的标签、第二推荐对象的出版社、第二推荐对象的作者和第二推荐对象的评分。也就是说无偏样本中共包括6类特征。
应理解,表2仅为示意,用户的信息以及推荐对应的信息还可以包括比表2更多或更少项的信息,或者说更多或更少类特征信息。
进一步地,表2中的数据同样可以按照上述libSVM格式逐行存储。
该推荐模型可以应用于目标域中,例如,图16中的视频APP的推荐场景中。
相对于视频APP的推荐场景,阅读APP的推荐场景中的用户的交互数据更丰富,数据分布更能准确反映用户的偏好,根据直观推理以及用户在阅读场景与视频场景的兴趣的互通性,通过本申请实施例的方案,能够使推荐模型更好地把握用户在阅读场景的个性化偏好,进而指导在视频场景中的推荐结果,提高推荐结果的准确率。
通过在不同域间进行知识(例如,用户的兴趣偏好)进行迁移和共享,将源域(例如,阅读APP的推荐场景)和目标域(例如,视频APP的推荐场景)的用户交互历史记录都纳入到学习中,这样训练得到的模型在源域有较好的评估效果,此时训练得到的模型很好的捕获了用户在源域的兴趣偏好,而在近似的推荐场景中,用户的兴趣偏好也类似,因此推荐模型在目标域也能很好的拟合用户的兴趣偏好,给用户推荐符合其兴趣的推荐结果,实现跨域推荐,缓解冷启动问题。
推荐模型可以在目标域中预测用户对推荐对象有操作动作的概率,也就是预测用户选择该推荐对象的概率。将目标推荐模型部署于目标域(例如,视频APP的推荐场景中),推荐系统可以基于该目标推荐模型的输出确定推荐结果展示给用户。
如前所述,传统的推荐学习方案是在每个推荐场景或者说每个域中学习用户在该域的历史行为中隐藏的规律,然后根据学习到的规律进行推荐,整个学习与实施过程完全不考虑域间的知识迁移与共享。
目前,很多电子设备例如手机、平板电脑中都具有多个应用程序,每个应用程序均可以视为一个应用场景。应用程序在为用户进行推荐时,通常仅基于用户在该应用程序中的交互数据学习用户的偏好,进而为用户进行推荐,不考虑用户在其他应用程序中的交互数 据。
然而,在用户刚下载的应用程序中,用户的交互数据匮乏,如果仅基于本域的交互历史学习得到的推荐模型难以发现用户的历史行为中隐藏的规律,进而导致预测结果不准确,影响用户体验,也就是新的推荐场景中存在冷启动问题。
本申请实施例提供了一种推荐方法和电子设备,可以通过学习其他域中的用户的偏好,为用户进行推荐,从而提高预测结果的准确率,提升用户体验。
应理解,本申请实施例中,“用户行为数据”、“用户交互数据”、“交互数据”、“行为数据”等可以认为表达相同的含义,均可以理解为当推荐对象被展示给用户时,与用户的操作行为相关的数据。
为了便于理解,本申请将以手机作为电子设备,首先介绍本申请的一些人机交互实施例。图17是本申请实施例提供的一组图形用户界面(graphical user interface,GUI)示意图。
用户可以执行对手机中的设置应用程序的点击操作,响应于该点击操作,手机进入设置应用程序的主界面301,设置应用程序的主界面301可以显示如图17的(a)图所示的内容。在该主界面301中,可以包括批量管理控件、各个应用程序的跨域推荐管理控件以及侧边栏字母排序索引控件等。主界面301中还可以显示各个应用程序(例如音乐APP、阅读APP、浏览器APP、新闻APP、视频APP或购物APP等)的跨域推荐功能“已开启”或“已关闭”。在一些实施例中,主界面301中显示的各个应用程序的跨域推荐管理控件可以按照应用名称首字母从“A”到“Z”的顺序显示,其中每个应用程序都对应着各自的跨域推荐管理控件。应理解,主界面301还可以包括其他更多或更少或类似的显示内容,本申请对此不作限定。
当用户点击某个应用程序的跨域推荐管理控件时,手机可以显示该应用程序对应的跨域推荐管理界面。示例性地,用户执行图17中的(a)图中所示的对浏览器APP的跨域推荐管理控件的点击操作,响应于该点击操作,手机进入浏览器APP跨域推荐管理界面302,跨域推荐管理界面302可以显示如图17中的(b)图所示的内容。在跨域推荐管理界面302中,可以包括允许跨域推荐控件。应理解,跨域推荐管理界面302还可以包括其他更多或更少类似的显示内容,跨域推荐管理界面302也可以依据应用的不同而包括不同的显示内容,本申请实施例对此不作限定。
可选地,跨域推荐管理控件的默认状态可以为关闭状态。
示例性地,如图17中的(b)图所示,允许跨域推荐控件处于开启状态,浏览器APP的跨域推荐功能开启,相应地,浏览器APP可以从多个APP中获取用户交互数据,并进行学习,以便为用户推荐相关视频。进一步地,当允许跨域推荐控件处于开启状态,跨域推荐管理界面302还可以呈现浏览器APP的学习列表,学习列表中包括多个选项。跨域推荐管理界面302上的一个选项可以理解为一个应用的名称及其对应的开关控件。因此,也可以说,跨域推荐管理界面302包括允许跨域推荐控件和多个选项,该多个选项中的每个选项关联一个应用程序,与该应用相关联的选项用于控制浏览器APP从该应用中获取用户行为数据的权限的开启和关闭。也可以理解为与某个应用相关联的选项用于控制浏览器APP 基于该应用中的用户行为数据执行跨域推荐功能。为了方便理解,以下实施例中仍以开关控件来示意选项之义。
如前所述,学习列表中包括多个选项,也就是说跨域推荐管理界面302上呈现多个应用的名称及其对应的开关控件。如图17中的(b)所示,当一个应用对应的开关控件开启状态,视频APP可以从该APP中获取用户行为数据,并进行学习,以便为用户进行推荐。跨域推荐界面302中还可以显示“已允许”或“已禁止”开启跨域推荐功能的应用程序获取各个应用程序(例如音乐APP、阅读APP、浏览器APP、新闻APP、视频APP或购物APP等)中的用户数据。如图17中的(b)图所示,当允许跨域推荐控件处于开启状态,第一界面上呈现多个开关控件,该多个开关控件分别与音乐APP、阅读APP、购物APP、视频APP、新闻APP和聊天APP等应用程序对应。以音乐APP对应的控件为例,当“音乐APP对应的控件处于开启状态,即该音乐APP下方处于“已允许”状态,浏览器APP可以从音乐APP中获取用户行为数据,并进行学习,以为用户进行推荐。
若用户执行对音乐APP对应的控件的关闭操作,响应于该关闭操作,手机呈现如图17中的(c)所示的内容,浏览器APP不再从音乐APP中获取用户行为数据,即不允许浏览器APP获取音乐APP中的用户行为数据。若用户执行对允许跨域推荐控件的关闭操作,响应于该关闭操作,浏览器APP将关闭跨域推荐功能,即不允许浏览器APP获取其他APP中的用户交互数据。例如,用户执行如图17中的(b)图所示的对允许跨域推荐控件的点击操作,响应于该点击操作,手机执行关闭浏览器APP的跨域推荐功能。跨域推荐管理界面可以显示如图17中(d)图所示的内容,浏览器APP在该学习列表中的所有应用中的跨域推荐功能被关闭。这样,可以提高管理效率,提升用户体验。
应用程序为用户推荐的内容即为推荐对象,推荐对象可以在应用程序中显示。当用户进入应用程序,可以触发一条推荐请求,由推荐模型针对该推荐请求为用户推荐相关内容。
示例性地,浏览器APP为用户推荐的信息流可以在浏览器APP的主界面中显示。
示例性地,当用户执行对浏览器APP的点击操作,响应于该点击操作,手机显示如图18的(a)中所示的浏览器APP的主界面303,该浏览器APP的主界面303中可以显示一个或多个推荐内容的推荐列表,该一个或多个推荐内容即为浏览器APP中的推荐对象。应理解,浏览器APP的主界面303中还可以包括其他更多或更少的显示内容,本申请对此不作限定。
用户可以对浏览器APP的主界面303的推荐列表所呈现的内容执行一定操作以查看推荐内容、删除(或忽略)推荐内容或查看推荐内容的相关信息等。例如用户点击某个推荐内容,响应于该点击操作,手机可以打开该推荐内容。再如用户向左快滑(或向右快滑)某个推荐内容,响应于该操作,手机可以将该推荐内容从推荐列表中删除。又如,用户长按某个推荐内容,响应于该长按操作,手机可以显示该推荐内容的相关信息。如图18中的(a)图所示,用户执行如图18中的(a)所示的长按操作,响应于该长按操作,手机可以显示如图所示的提示框。选择框中显示了提示信息,该提示信息用于提示用户该推荐内容是基于其他应用程序中的用户交互数据推荐的。如图18中的(a)所示,该提示信息用于提示用户该推荐内容是基于用户在视频APP中的数据为用户推荐的内容。
应理解,在一些其他实施例中,用户可以通过其他方式打开视频或删除推荐内容,也可以通过其他方式例如左右慢滑方式调出该推荐内容的相关信息,本申请实施例不作限定。
示例性地,当用户对浏览器APP的点击操作,响应于该点击操作,手机还可以显示如图18的(b)中所示的浏览器APP的主界面304,该主界面304中可以显示一个或多个推荐内容的推荐列表以及该一个或多个推荐内容对应的提示信息,该一个或多个推荐内容即为浏览器APP中的推荐对象。应理解,主界面304中还可以包括其他更多或更少的显示内容,本申请对此不作限定。该提示信息用于提示用户该推荐内容是基于其他应用程序中的用户交互数据推荐的。
用户可以对主界面304的推荐列表所呈现的视频执行一定操作以查看推荐内容、删除(或忽略)推荐内容等。例如用户点击某个推荐内容,响应于该点击操作,手机可以打开该推荐内容。再如用户向左快滑(或向右快滑)某个推荐内容,响应于该操作,手机可以将该推荐内容从推荐列表中删除。应理解,在一些其他实施例中,用户可以通过其他方式打开推荐内容或删除推荐内容,也可以通过其他方式例如左右慢滑方式删除该推荐内容的相关信息,本申请实施例不作限定。
应理解,提示信息主要是为用户提供参考信息,以便用户知晓当前推荐对象是基于跨域推荐的功能得到的,其提示信息的内容还可以有其他形式,本申请实施例不作限定。
需要说明的是,本申请实施例中,用户在主界面中删除推荐内容,可以理解为用户只是在主界面的推荐列表中删除了某个推荐内容,也就是说用户对该推荐内容不感兴趣。该行为可以被记录在用户行为日志中用作推荐模型的训练数据。例如,作为前述方法中的有偏样本。
当手机上存在大量的应用时,对于一些需要跨域推荐的应用,可以打开应用程序的跨域推荐功能。示例性地,可以通过以下两种方式打开或关闭应用程序的跨域推荐功能。
一种是单点关闭或打开某个应用的跨域推荐功能。例如,如图13所示,在应用程序对应的跨域推荐管理界面中,开启或关闭允许跨域推荐控件,可以实现单点打开或关闭该应用程序的跨域推荐功能。
另一种是批量关闭或打开全部应用的跨域推荐功能。例如,如图19的(a)图显示的是与图17的(a)图相同的界面。用户执行图19的(a)图中所示的批量管理控件的点击操作,响应于该点击操作,用户进入批量管理界面305中,可以包括搜索应用控件、跨域推荐总开关控件、各个应用程序的跨域推荐开关控件或侧边栏字母排序索引控件等。用户可以通过控制跨域推荐总开关控件(即图中“全部”后的开关)的打开和关闭,实现整体打开全部应用程序的跨域学习功能或整体关闭全部应用程序的跨域推荐功能。在批量管理界面305还包括各个应用的跨域推荐开关控件,用户也可以通过控制某个应用程序的跨域推荐开关控件的打开和关闭,实现单个应用程序的跨域推荐功能的打开或关闭。在一些实施例中,批量管理界面305中显示的各个应用程序的跨域推荐开关控件可以按照应用名称首字母从“A”到“Z”的顺序显示,每个应用的跨域推荐功能都由各自的跨域推荐开关控件控制。
应理解,本申请实施例中,“关闭跨域推荐”、“关闭应用的跨域推荐”、“关闭跨 域推荐功能”、“关闭应用的跨域推荐功能”可以认为是表达相同的含义,均可以理解为关闭了应用的跨域推荐功能,该应用程序不再进行跨域推荐。同理,“开启跨域推荐”、“开启应用的跨域推荐”、“打开跨域推荐功能”、“打开应用的跨域推荐功能”可以认为是表达相同的含义,均可以理解为打开了应用程序的跨域推荐功能,应用程序可以进行跨域推荐。
结合上述实施例及相关附图,本申请实施例提供了一种推荐方法,该方法可以在电子设备(例如手机、平板电脑等)中实现。图20是本申请实施例提供的推荐方法的示意性流程图,如图20所示,该方法1200可以包括以下步骤:
S1210,显示第一界面。
该第一界面可以包括至少一个应用程序的学习列表,该至少一个应用程序的学习列表中的第一应用程序的学习列表包括至少一个选项,该至少一个选项中的每个选项关联一个应用程序。
示例性地,如图17中的(b)图所示,第一界面可以为浏览器APP的跨域推荐管理界面302。该跨域推荐管理界面302用于控制浏览器APP的跨域推荐功能的开启和关闭。
示例性地,如图17中的(b)图所示,第一应用程序的学习列表可以为浏览器APP的学习列表。
示例性地,如图17中的(b)图所示,该至少一个选项可以与应用名称相同,例如“购物”选项、“地图”选项、“健康”选项、“视频”选项等。该至少一个选项中的每个选项关联一个应用程序,与应用相关联的选项用于控制在该应用程序中学习用户的行为的功能的开启和关闭。换言之,与应用相关联的选项用于控制是否允许第一应用程序获取该应用程序的数据以进行跨域推荐。
S1220,感知到用户在第一界面上的第一操作。
第一操作可以为点击操作、双击操作、长按操作或滑动操作等。
S1230,响应于第一操作,打开或关闭第一应用程序在第一应用程序的学习列表中部分或全部选项所关联的应用程序中的跨域推荐功能。
也就是说,允许第一应用程序在部分或全部选项所关联的应用程序中获取用户行为数据,学习在该应用程序中的用户的偏好,以在第一应用程序中为用户进行推荐。
第一操作后,用户可以从界面上看出第一应用程序的跨域推荐功能处于打开状态或关闭状态。
在一个实施例中,第一操作作用于第一选项,响应于用户对第一选项的第一操作,打开或关闭第一应用程序在第一选项所关联的应用程序中的跨域推荐功能;其中,第一选项位于第一应用程序的学习列表中。
示例性地,如图17中的(c)图所示,该第一选项可以为第一界面上的“音乐”选项。应理解,该第一选项可以为第一界面上第一应用程序的学习列表中的任意一个与应用相关联的选项,例如“音乐”选项、“购物”选项、“浏览器”选项等等。
示例性的,如图17中的(c)图所示,该第一操作可以是对第一选项所对应的开关控件的打开或关闭操作。例如当第一选项所对应的开关控件处于打开状态时,第一操作可以 用于将第一选项所对应的开关控件关闭,相应地,关闭了第一应用程序在第一选项所关联的应用程序中进行跨域推荐的功能。例如,当第一选项所对应的开关控件处于关闭状态时,第一操作可以用于将第一选项所对应的开关控件打开,相应地,打开了第一应用程序在第一选项所关联的应用程序中进行跨域推荐的功能。这样,用户可以单独控制第一应用程序在其他每个应用程序中的跨域推荐功能的开和关。
一个实施例中,第一操作作用于第一应用程序的学习列表对应的开关控件,响应于用户对开关控件的第一操作,打开或关闭第一应用程序在第一应用程序的学习列表中全部选项所关联的应用程序中的跨域推荐功能。
示例性地,如图17中的(b)图所示,该第一操作可以是对允许跨域推荐控件的关闭操作。可选地,若允许跨域推荐控件在第一操作之前处于关闭状态,则第一操作可以是对允许跨域推荐控件的打开操作。这样,用户可以整体控制第一应用程序的跨域推荐功能,提高管理效率,提升用户体验。
一个实施例中,方法1200还包括:显示第二界面,所述第二界面用于呈现一个或多个推荐对象以及所述一个或多个推荐对象的提示信息,所述一个或多个推荐对象的提示信息用于指示所述一个或多个推荐对象是基于所述至少一个应用程序中的应用程序中的用户行为数据确定的。
示例性地,如图18中的(a)图所示,第二界面可以为浏览器APP的主界面303。
示例性地,如图18中的(b)图所示,第二界面可以为浏览器APP的主界面304。
示例性地,如图18所示,该提示信息可以用于提示用户当前推荐内容是基于视频APP的数据得到的。
一个实施例中,一个或多个推荐对象是通过将用户的信息和候选推荐对象的信息输入推荐模型中,预测用户对候选推荐对象有操作动作的概率确定的。
例如,将该视频APP中的用户行为数据作为源域的数据,将浏览器APP中的用户行为数据作为目标域的数据,按照前述图6对应的方法进行训练可得到推荐模型,利用该推荐模型可以预测用户对候选推荐对象有操作动作的概率,基于该概率值确定推荐内容,进而显示如图18所示的内容。
在一个实施例中,推荐模型是使用样本集中的有偏数据集和无偏数据集按照第一蒸馏方式对第一神经网络进行训练得到,有偏数据集中包括有偏置的样本,无偏数据集中包括无偏置的样本,第一蒸馏方式是根据样本集的数据特征确定的,有偏数据集中的样本包括第一用户的信息和第一推荐对象的信息以及实际标签,有偏数据集中的样本的实际标签用于表示第一用户是否对第一推荐对象有操作动作,无偏数据集中的样本包括第二用户的信息和第二推荐对象的信息以及实际标签,无偏数据集中的样本的实际标签用于表示第二用户是否对第二推荐对象有操作动作。
例如,当用户允许第一应用程序开启跨域推荐功能时,第一应用程序可以从第一选项关联的应用程序中获取用户行为数据,将第一选择关联的应用程序中的用户行为数据作为源域的数据。应理解,源域的数据还可以包括其他应用中的用户行为数据。例如,当用户允许第一应用程序在第一应用程序的学习列表中的所有选项关联的应用程序中进行跨域学 习时,第一应用程序可以从所有选项关联的应用程序中获取用户行为数据,并将获取到的用户行为数据均作为源域的数据。
示例性地,推荐模型可以采用前述图6训练得到的更新后的第一神经网络。具体描述可以参见前述图6所示的方法的步骤,此处不再赘述。
一个实施例中,在显示第一界面之前,还包括:显示第三界面,该第三界面包括至少一个应用对应的开关控件;在第三界面上检测用户对该至少一个应用程序对应的开关控件中的第一应用程序的开关控件的第三操作;响应于该第三操作,显示第一界面。
示例性地,如图17中的(a)图所示,第三界面可以为设置应用程序主界面301。
示例性的,如图17中的(a)图所示,第一应用程序的开关控件可以为浏览器APP的跨域推荐管理控件。
示例性的,如图17中的(a)图所示,该第三操作可以是对第一应用程序对应的开关控件的点击操作,响应于该点击操作,显示如图17中的(b)所示的界面。
根据本申请实施例中的方案,通过在不同域间进行知识(例如,用户的兴趣偏好)进行迁移和共享,将源域和目标域的用户交互历史记录都纳入到学习中,以使推荐模型能够更好地学习用户的偏好,使推荐模型在目标域也能很好的拟合用户的兴趣偏好,给用户推荐符合其兴趣的推荐结果,实现跨域推荐,缓解冷启动问题。
前述对本申请提供的神经网络蒸馏方法以及推荐方法的流程进行了详细介绍,下面结合前述的方法的流程,对本申请提供的装置进行说明。
参阅图21,本申请提供的一种神经网络蒸馏装置的结构示意图。
该神经网络蒸馏装置可以包括:
采集模块2101,用于获取样本集,样本集包括有偏数据集和无偏数据集,有偏数据集中包括有偏置的样本,无偏数据集中包括无偏置的样本,通常,有偏数据集的样本量大于无偏数据集的样本量;
决策模块2102,用于根据样本集的数据特征确定第一蒸馏方式,其中,不同的蒸馏方式在进行知识蒸馏时老师模型对学生模型的指导方式不相同,老师模型是使用无偏数据集训练得到的,学生模型是使用有偏数据集训练得到;
训练模块2103,用于基于有偏数据集和无偏数据集,按照第一蒸馏方式对第一神经网络进行训练,得到更新后的第一神经网络。
在一种可能的实施方式中,样本集中的样本包括输入特征和实际标签,第一蒸馏方式为基于有偏数据集和无偏数据集中的样本的输入特征进行蒸馏。
在一种可能的实施方式中,训练模块2103,具体用于交替使用有偏数据集和无偏数据集对第一神经网络进行训练,得到更新后的第一神经网络,其中,在一个交替过程中,使用有偏数据集对第一神经网络进行训练的批训练次数,和使用无偏数据集对第一神经网络进行训练的批训练次数为预设比例,且样本包括输入特征作为第一神经网络的输入。
在一种可能的实施方式中,当预设比例为1时,在第一神经网络的损失函数中增加第一正则项和第二正则项的差值,第一正则项是使用无偏数据集包括的样本对第一神经网络进行训练得到的参数,第二正则项是使用有偏数据集包括的样本对第一神经网络进行训练 得到的参数。
在一种可能的实施方式中,训练模块2103,具体用于为有偏数据集中的样本设置置信度,置信度用于表示样本的偏置程度;基于有偏数据集、有偏数据集中的样本的置信度和无偏数据集,对第一神经网络进行训练,得到更新后的第一神经网络,且在对第一神经网络进行训练时样本包括输入特征作为第一神经网络的输入。
在一种可能的实施方式中,有偏数据集和无偏数据集所包括的样本包括输入特征和实际标签,第一蒸馏方式为基于无偏数据集所包括的样本的预测标签进行蒸馏,预测标签由更新后的第二神经网络针对无偏数据集中的样本输出,更新后的第二神经网络为使用无偏数据集对第二神经网络进行训练得到。
在一种可能的实施方式中,样本集中还包括未观测数据集,未观测数据集中包括多个未观测样本;训练模块2103,具体用于:通过有偏数据集对第一神经网络进行训练,得到训练后的第一神经网络,以及通过无偏数据集对第二神经网络进行训练,得到更新后的第二神经网络;从样本集中采集多个样本,得到辅助数据集;使用辅助数据集,以辅助数据集中样本的预测标签作为约束,更新训练后的第一神经网络,得到更新后的第一神经网络,辅助数据集中样本的预测标签由更新后的第二神经网络输出。
在一种可能的实施方式中,训练模块2103,具体用于:通过无偏数据集对第二神经网络进行训练,得到更新后的第二神经网络;通过更新后的第二神经网络输出有偏数据集中样本的预测标签;将样本的预测标签和样本的实际标签进行加权合并,得到样本的合并标签;使用样本的合并标签训练第一神经网络,得到更新后的第一神经网络。
在一种可能的实施方式中,决策模块2102,具体用于计算无偏数据集的样本量和有偏数据集的样本量之间的第一比例,从多种蒸馏方式中选择与第一比例匹配的第一蒸馏方式,样本集的数据特征包括第一比例。
在一种可能的实施方式中,第一蒸馏方式包括:基于从无偏数据集中提取到的特征训练老师模型,并通过老师模型以及有偏数据集对学生模型进行知识蒸馏。
在一种可能的实施方式中,训练模块2103,具体用于:通过预设算法输出无偏数据集的特征;根据无偏数据集的特征对第二神经网络进行训练,得到更新后的第二神经网络;将第二神经网络作为老师模型,第一神经网络作为学生模型,使用有偏数据集对第一神经网络进行知识蒸馏,得到更新后的第一神经网络。
在一种可能的实施方式中,训练模块2103,具体用于:获取无偏数据集以及有偏数据集中所包括的特征维度数量;从多种蒸馏方式中选择与特征维度数量匹配的第一蒸馏方式,样本集的数据特征包括特征维度数量。
在一种可能的实施方式中,训练模块2103,具体用于:通过无偏数据集更新第二神经网络,得到更新后的第二神经网络;将更新后的第二神经网络作为老师模型,第一神经网络作为学生模型,使用有偏数据集对第一神经网络进行知识蒸馏,得到更新后的第一神经网络。
在一种可能的实施方式中,决策模块2102,具体用于:根据有偏数据集所包括的数据或无偏数据集所包括的数据中的至少一种,计算无偏数据集中包括的正样本的数量和负样本的数量的第二比例,从多种蒸馏方式中选择与第二比例匹配的第一蒸馏方式,样本集的数据特征包括第二比例;或者,计算有偏数据集中包括的正样本的数量和负样本的数量的第三比例,从多种蒸馏方式中选择与第三比例匹配的第一蒸馏方式,样本集的数据特征包括第三比例。
在一种可能的实施方式中,有偏数据集包括的样本的类型,和无偏数据集包括的样本的类型不相同。
在一种可能的实施方式中,在得到更新后的第一神经网络之后,装置还包括:
输出模块2104,用于获取目标用户的至少一个样本;将至少一个样本作为更新后的第一神经网络的输入,输出目标用户的至少一个标签,至少一个标签组成目标用户的用户画像,该用户画像用于确定与目标用户匹配的样本。
请参阅图22,本申请提供的一种推荐装置的结构示意图,如下所述。
获取单元2201,用于获取目标用户的信息和候选推荐对象的信息;
处理单元2202,用于将目标用户的信息和候选推荐对象的信息输入至推荐模型,预测目标用户对候选推荐对象有操作动作的概率;
其中,推荐模型是使用样本集中的有偏数据集和无偏数据集按照第一蒸馏方式对第一神经网络进行训练得到,有偏数据集中包括有偏置的样本,无偏数据集中包括无偏置的样本,第一蒸馏方式是根据样本集的数据特征确定的,有偏数据集中的样本包括第一用户的信息和第一推荐对象的信息以及实际标签,无偏数据集中的样本的实际标签用于表示第一用户是否对第一推荐对象有操作动作,无偏数据集中的样本包括第二用户的信息和第二推荐对象的信息以及实际标签,无偏数据集中的样本的实际标签用于表示第二用户是否对第二推荐对象有操作动作。
在一种可能的实施方式中,无偏数据集是在候选推荐对象集合中的候选推荐对象被展示的概率相同的情况下获得的,第二推荐对象为候选推荐对象集合中的一个候选推荐对象。
在一种可能的实施方式中,无偏数据集是在候选推荐对象集合中的候选推荐对象被展示的概率相同的情况下获得的,可以包括:无偏数据集中的样本是在候选推荐对象集合中的候选推荐对象被随机展示给第二用户的情况下获得的;或者无偏数据集中的样本是在第二用户搜索第二推荐对象的情况下获得的。
在一种可能的实施方式中,无偏数据集中的样本属于源域的数据,有偏数据集中的样本属于目标域的数据。
请参阅图23,本申请提供的一种电子设备的结构示意图,如下所述。
显示单元2301,用于显示第一界面,第一界面包括至少一个应用程序的学习列表,该至少一个应用程序的学习列表中的第一应用程序的学习列表包括至少一个选项,至少一个选项中的选项关联一个应用程序;
处理单元2302,处理单元用于感知到用户在第一界面上的第一操作;
显示单元还用于响应于第一操作,打开或关闭第一应用程序在第一应用程序的学习列表中部分或全部选项所关联的应用程序中的跨域推荐功能。
在一种可能的实施方式中,一个或多个推荐对象是通过将用户的信息和候选推荐对象 的信息输入推荐模型中,预测用户对候选推荐对象有操作动作的概率确定的。
在一种可能的实施方式中,推荐模型是使用样本集中的有偏数据集和无偏数据集按照第一蒸馏方式对第一神经网络进行训练得到,有偏数据集中包括有偏置的样本,无偏数据集中包括无偏置的样本,第一蒸馏方式是根据样本集的数据特征确定的,有偏数据集中的样本包括第一用户的信息和第一推荐对象的信息以及实际标签,无偏数据集中的样本的实际标签用于表示第一用户是否对第一推荐对象有操作动作,无偏数据集中的样本包括第二用户的信息和第二推荐对象的信息以及实际标签,无偏数据集中的样本的实际标签用于表示第二用户是否对第二推荐对象有操作动作。
请参阅图24,本申请提供的另一种神经网络蒸馏装置的结构示意图,如下所述。
该神经网络蒸馏装置可以包括处理器2401和存储器2402。该处理器2401和存储器2402通过线路互联。其中,存储器2402中存储有程序指令和数据。
存储器2402中存储了前述图6中的步骤对应的程序指令以及数据。
处理器2401用于执行前述图6中任一实施例所示的神经网络蒸馏装置执行的方法步骤。
可选地,该神经网络蒸馏装置还可以包括收发器2403,用于接收或者发送数据。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于程序,当其在计算机上运行时,使得计算机执行如前述图6所示实施例描述的方法中的步骤。
可选地,前述的图24中所示的神经网络蒸馏装置为芯片。
请参阅图25,本申请提供的另一种推荐装置的结构示意图,如下所述。
该推荐装置可以包括处理器2501和存储器2502。该处理器2501和存储器2502通过线路互联。其中,存储器2502中存储有程序指令和数据。
存储器2502中存储了前述图11中的步骤对应的程序指令以及数据。
处理器2501用于执行前述图11中任一实施例所示的推荐装置执行的方法步骤。
可选地,该推荐装置还可以包括收发器2503,用于接收或者发送数据。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有程序,当其在计算机上运行时,使得计算机执行如前述图11所示实施例描述的方法中的步骤。
可选地,前述的图25中所示的推荐装置为芯片。
请参阅图26,本申请提供的另一种电子设备的结构示意图,如下所述。
该电子设备可以包括处理器2601和存储器2602。该处理器2601和存储器2602通过线路互联。其中,存储器2602中存储有程序指令和数据。
存储器2602中存储了前述图20中的步骤对应的程序指令以及数据。
处理器2601用于执行前述图20所示的电子设备执行的方法步骤。
可选地,该电子设备还可以包括收发器2603,用于接收或者发送数据。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有程序,当其在计算机上运行时,使得计算机执行如前述图20所示实施例描述的方法中的步骤。
可选地,前述的图26中所示的电子设备为芯片。
本申请实施例还提供了一种神经网络蒸馏装置,该神经网络蒸馏装置也可以称为数字 处理芯片或者芯片,芯片包括处理单元和通信接口,处理单元通过通信接口获取程序指令,程序指令被处理单元执行,处理单元用于执行前述图6-图20的方法步骤。
本申请实施例还提供一种数字处理芯片。该数字处理芯片中集成了用于实现上述处理器2401、处理器2501或者处理器2601的功能的电路和一个或者多个接口。当该数字处理芯片中集成了存储器时,该数字处理芯片可以完成前述实施例中的任一个或多个实施例的方法步骤。当该数字处理芯片中未集成存储器时,可以通过通信接口与外置的存储器连接。该数字处理芯片根据外置的存储器中存储的程序代码来实现上述实施例中神经网络蒸馏装置、推荐装置或者电子设备执行的动作。
本申请实施例中还提供一种计算机程序产品,当其在计算机上运行时,使得计算机执行如前述图6-图20所示实施例描述的方法的步骤。
本申请实施例提供的神经网络蒸馏装置可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使服务器内的芯片执行上述图6-图10所示实施例描述的训练集处理方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体地,前述的处理单元或者处理器可以是中央处理器(central processing unit,CPU)、网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processing unit,GPU)、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)或现场可编程逻辑门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者也可以是任何常规的处理器等。
示例性地,请参阅图27,图27为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU 270,NPU 270作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路2703,通过控制器2704控制运算电路2703提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路2703内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路2703是二维脉动阵列。运算电路2703还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路2703是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器2702中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器2701中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)2708中。
统一存储器2706用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(direct memory access controller,DMAC)2705被搬运到权重存储器2702中。输入数据也通过DMAC被搬运到统一存储器2706中。
总线接口单元(bus interface unit,BIU)2710,用于AXI总线与DMAC和取指存储器(instruction fetch buffer,IFB)2709的交互。
总线接口单元2710还用于取指存储器2709从外部存储器获取指令,以及用于存储单元访问控制器2705从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器2706,或将权重数据搬运到权重存储器2702中,或将输入数据搬运到输入存储器2701中。
向量计算单元2707包括多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如批归一化(batch normalization),像素级求和,对特征平面进行上采样等。
在一些实现中,向量计算单元2707能将经处理的输出的向量存储到统一存储器2706。例如,向量计算单元2707可以将线性函数和/或非线性函数应用到运算电路2703的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元2707生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路2703的激活输入,例如用于在神经网络中的后续层中的使用。
控制器2704连接的取指存储器(instruction fetch buffer)2709,用于存储控制器2704使用的指令;
统一存储器2706,输入存储器2701,权重存储器2702以及取指存储器2709均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,循环神经网络中各层的运算可以由运算电路2703或向量计算单元2707执行。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述图6-图20的方法的程序执行的集成电路。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是 更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、只读存储器(read only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等(如果存在)是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
最后应说明的是:以上,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (40)

  1. 一种神经网络蒸馏方法,其特征在于,包括:
    获取样本集,所述样本集包括有偏数据集和无偏数据集,所述有偏数据集中包括有偏置的样本,所述无偏数据集中包括无偏置的样本;
    根据所述样本集的数据特征确定第一蒸馏方式,其中,在所述第一蒸馏方式中,使用所述无偏数据集训练老师模型,使用所述有偏数据集训练学生模型;
    基于所述有偏数据集和所述无偏数据集,按照所述第一蒸馏方式对第一神经网络进行训练,得到更新后的第一神经网络。
  2. 根据权利要求1所述的方法,其特征在于,所述样本集的样本包括输入特征和实际标签,所述第一蒸馏方式为使用所述样本集中的样本的输入特征进行蒸馏。
  3. 根据权利要求2所述的方法,其特征在于,所述基于所述有偏数据集和所述无偏数据集,按照所述第一蒸馏方式对第一神经网络进行训练,得到更新后的第一神经网络,包括:
    交替使用所述有偏数据集和所述无偏数据集对所述第一神经网络进行训练,得到所述更新后的第一神经网络,其中,在一个交替过程中,使用所述有偏数据集对所述第一神经网络进行训练的批训练次数,和使用所述无偏数据集对所述第一神经网络进行训练的批训练次数为预设比例,且所述样本集中的样本的输入特征作为所述第一神经网络的输入。
  4. 根据权利要求2所述的方法,其特征在于,所述基于所述有偏数据集和所述无偏数据集,按照所述第一蒸馏方式对第一神经网络进行训练,得到更新后的第一神经网络,包括:
    为所述有偏数据集中的样本设置置信度,所述置信度用于表示所述样本的偏置程度;
    基于所述有偏数据集、所述有偏数据集中的样本的置信度和所述无偏数据集,对第一神经网络进行训练,得到所述更新后的第一神经网络,且在对所述第一神经网络进行训练时所述样本包括输入特征作为所述第一神经网络的输入。
  5. 根据权利要求1所述的方法,其特征在于,所述第一蒸馏方式为基于所述无偏数据集所包括的样本的预测标签进行蒸馏,所述预测标签由更新后的第二神经网络针对所述无偏数据集中的样本输出,所述更新后的第二神经网络为使用所述无偏数据集对第二神经网络进行训练得到。
  6. 根据权利要求5所述的方法,其特征在于,所述样本集中还包括未观测数据集,所述未观测数据集中包括多个未观测样本;
    所述基于所述有偏数据集和所述无偏数据集,按照所述第一蒸馏方式对第一神经网络 进行训练,得到更新后的第一神经网络,包括:
    通过所述有偏数据集对第一神经网络进行训练,得到训练后的第一神经网络,以及通过所述无偏数据集对第二神经网络进行训练,得到所述更新后的第二神经网络;
    从所述样本集中采集多个样本,得到辅助数据集;
    使用所述辅助数据集,以所述辅助数据集中样本的预测标签作为约束,更新所述训练后的第一神经网络,得到所述更新后的第一神经网络,所述辅助数据集中样本的预测标签包括所述更新后的第二神经网络输出的标签。
  7. 根据权利要求5所述的方法,其特征在于,所述基于所述有偏数据集和所述无偏数据集,按照所述第一蒸馏方式对第一神经网络进行训练,得到更新后的第一神经网络,包括:
    通过所述无偏数据集对第二神经网络进行训练,得到所述更新后的第二神经网络;
    通过所述更新后的第二神经网络输出所述有偏数据集中样本的预测标签;
    将样本的预测标签和样本的实际标签进行加权合并,得到所述样本的合并标签;
    使用所述样本的合并标签训练所述第一神经网络,得到所述更新后的第一神经网络。
  8. 根据权利要求2-7中任一项所述的方法,其特征在于,所述样本集的数据特征包括第一比例,所述第一比例为所述无偏数据集的样本量和所述有偏数据集的样本量之间的比例,所述根据所述样本集的数据特征确定第一蒸馏方式,包括:
    从多种蒸馏方式中选择与所述第一比例匹配的所述第一蒸馏方式。
  9. 根据权利要求1所述的方法,其特征在于,所述第一蒸馏方式包括:基于从所述无偏数据集中提取到的特征训练所述老师模型,得到训练后的所述老师模型,并通过训练后的所述老师模型以及所述有偏数据集对所述学生模型进行知识蒸馏。
  10. 根据权利要求9所述的方法,其特征在于,所述基于所述有偏数据集和所述无偏数据集,按照所述第一蒸馏方式对第一神经网络进行训练,得到更新后的第一神经网络,包括:
    通过深度全局平衡回归DGBR算法从所述无偏数据集中筛选出部分样本的输入特征;
    根据所述部分样本的输入特征对第二神经网络进行训练,得到更新后的第二神经网络;
    将更新后的所述第二神经网络作为所述老师模型,所述第一神经网络作为所述学生模型,使用所述有偏数据集对所述第一神经网络进行知识蒸馏,得到所述更新后的第一神经网络。
  11. 根据权利要求9或10所述的方法,其特征在于,所述样本集的数据特征包括所述样本集的特征维度数量,所述根据所述样本集的数据特征确定第一蒸馏方式,包括:
    从多种蒸馏方式中选择与所述特征维度数量匹配的所述第一蒸馏方式。
  12. 根据权利要求1-11中任一项所述的方法,其特征在于,所述第一蒸馏方式是从预设的多种蒸馏方式中选择得到,所述多种蒸馏方式中包括所述老师模型对所述学生模型的指导方式不相同的至少两种蒸馏方式。
  13. 一种推荐方法,其特征在于,包括:
    获取目标用户的信息和候选推荐对象的信息;
    将所述目标用户的信息和所述候选推荐对象的信息输入至推荐模型,预测所述目标用户对所述候选推荐对象有操作动作的概率;
    其中,所述推荐模型是使用样本集中的有偏数据集和无偏数据集按照第一蒸馏方式对第一神经网络进行训练得到,所述有偏数据集中包括有偏置的样本,所述无偏数据集中包括无偏置的样本,所述第一蒸馏方式是根据所述样本集的数据特征确定的,所述有偏数据集中的样本包括第一用户的信息和第一推荐对象的信息以及实际标签,所述有偏数据集中的样本的实际标签用于表示所述第一用户是否对所述第一推荐对象有操作动作,所述无偏数据集中的样本包括第二用户的信息和第二推荐对象的信息以及实际标签,所述无偏数据集中的样本的实际标签用于表示所述第二用户是否对所述第二推荐对象有操作动作。
  14. 根据权利要求13所述的方法,其特征在于,所述无偏数据集是在候选推荐对象集合中的所述候选推荐对象被展示的概率相同的情况下获得的,所述第二推荐对象为所述候选推荐对象集合中的一个候选推荐对象。
  15. 根据权利要求14所述的方法,其特征在于,所述无偏数据集是在候选推荐对象集合中的所述候选推荐对象被展示的概率相同的情况下获得的,包括:
    所述无偏数据集中的样本是在所述候选推荐对象集合中的候选推荐对象被随机展示给所述第二用户的情况下获得的;
    或者所述无偏数据集中的样本是在所述第二用户搜索所述第二推荐对象的情况下获得的。
  16. 一种推荐方法,其特征在于,包括:显示第一界面,所述第一界面包括至少一个应用程序的学习列表,所述该至少一个应用程序的学习列表中的第一应用程序的学习列表包括至少一个选项,所述至少一个选项中的选项关联一个应用程序;
    感知到用户在所述第一界面上的第一操作;
    响应于所述第一操作,打开或关闭所述第一应用程序在所述第一应用程序的学习列表中部分或全部选项所关联的应用程序中的跨域推荐功能。
  17. 根据权利要求16所述的方法,其特征在于,所述一个或多个推荐对象是通过将所述用户的信息和候选推荐对象的信息输入推荐模型中,预测所述用户对所述候选推荐对象有操作动作的概率确定的。
  18. 根据权利要求17所述的方法,其特征在于,所述推荐模型是使用样本集中的有偏数据集和无偏数据集按照第一蒸馏方式对第一神经网络进行训练得到,所述有偏数据集中包括有偏置的样本,所述无偏数据集中包括无偏置的样本,所述第一蒸馏方式是根据所述样本集的数据特征确定的,所述有偏数据集中的样本包括第一用户的信息和第一推荐对象的信息以及实际标签,所述有偏数据集中的样本的实际标签用于表示所述第一用户是否对所述第一推荐对象有操作动作,所述无偏数据集中的样本包括第二用户的信息和第二推荐对象的信息以及实际标签,所述无偏数据集中的样本的实际标签用于表示所述第二用户是否对所述第二推荐对象有操作动作。
  19. 一种神经网络蒸馏装置,其特征在于,包括:
    采集模块,用于获取样本集,所述样本集包括有偏数据集和无偏数据集,所述有偏数据集中包括有偏置的样本,所述无偏数据集中包括无偏置的样本;
    决策模块,用于根据所述样本集的数据特征确定第一蒸馏方式,其中,在所述第一蒸馏方式中,老师模型是使用所述无偏数据集训练得到的,学生模型是使用所述有偏数据集训练得到;
    训练模块,用于基于所述有偏数据集和所述无偏数据集,按照所述第一蒸馏方式对第一神经网络进行训练,得到更新后的第一神经网络。
  20. 根据权利要求19所述的装置,其特征在于,所述样本集中的样本包括输入特征和实际标签,所述第一蒸馏方式为使用所述样本集中的样本的输入特征进行蒸馏。
  21. 根据权利要求20所述的装置,其特征在于,
    所述训练模块,具体用于交替使用所述有偏数据集和所述无偏数据集对所述第一神经网络进行训练,得到所述更新后的第一神经网络,其中,在一个交替过程中,使用所述有偏数据集对所述第一神经网络进行训练的批训练次数,和使用所述无偏数据集对所述第一神经网络进行训练的批训练次数为预设比例,且所述样本集中的样本的输入特征作为所述第一神经网络的输入。
  22. 根据权利要求20所述的装置,其特征在于,
    所述训练模块,具体用于为所述有偏数据集中的样本设置置信度,所述置信度用于表示所述样本的偏置程度;基于所述有偏数据集、所述有偏数据集中的样本的置信度和所述无偏数据集,对第一神经网络进行训练,得到所述更新后的第一神经网络,且在对所述第一神经网络进行训练时所述样本包括输入特征作为所述第一神经网络的输入。
  23. 根据权利要求19所述的装置,其特征在于,所述第一蒸馏方式为基于所述无偏数据集所包括的样本的预测标签进行蒸馏,所述预测标签由更新后的第二神经网络针对所述无偏数据集中的样本输出,所述更新后的第二神经网络为使用所述无偏数据集对第二神经网络进行训练得到。
  24. 根据权利要求23所述的装置,其特征在于,所述样本集中还包括未观测数据集, 所述未观测数据集中包括多个未观测样本;
    所述训练模块,具体用于:
    通过所述有偏数据集对第一神经网络进行训练,得到训练后的第一神经网络,以及通过所述无偏数据集对第二神经网络进行训练,得到所述更新后的第二神经网络;
    从所述样本集中采集多个样本,得到辅助数据集;
    使用所述辅助数据集,以所述辅助数据集中样本的预测标签作为约束,更新所述训练后的第一神经网络,得到所述更新后的第一神经网络,所述辅助数据集中样本的预测标签包括所述更新后的第二神经网络输出的标签。
  25. 根据权利要求23所述的装置,其特征在于,所述训练模块,具体用于:
    通过所述无偏数据集对第二神经网络进行训练,得到所述更新后的第二神经网络;
    通过所述更新后的第二神经网络输出所述有偏数据集中样本的预测标签;
    将样本的预测标签和样本的实际标签进行加权合并,得到所述样本的合并标签;
    使用所述样本的合并标签训练所述第一神经网络,得到所述更新后的第一神经网络。
  26. 根据权利要求20-25中任一项所述的装置,其特征在于,所述样本集的数据特征包括第一比例,所述第一比例为所述无偏数据集的样本量和所述有偏数据集的样本量之间的比例;
    所述决策模块,具体用于从多种蒸馏方式中选择与所述第一比例匹配的所述第一蒸馏方式。
  27. 根据权利要求19所述的装置,其特征在于,所述第一蒸馏方式包括:基于从所述无偏数据集中提取到的特征训练所述老师模型,得到训练后的所述老师模型,并通过训练后的所述老师模型以及所述有偏数据集对所述学生模型进行知识蒸馏。
  28. 根据权利要求27所述的装置,其特征在于,所述训练模块,具体用于:
    通过深度全局平衡回归DGBR算法从所述无偏数据集中筛选出部分样本的输入特征;
    根据所述部分样本的输入特征对第二神经网络进行训练,得到更新后的第二神经网络;
    将所述第二神经网络作为所述老师模型,所述第一神经网络作为所述学生模型,使用所述有偏数据集对所述第一神经网络进行知识蒸馏,得到所述更新后的第一神经网络。
  29. 根据权利要求27或28所述的装置,其特征在于,所述样本集的数据特征包括所述样本集的特征维度数量,所述训练模块,具体用于:
    从多种蒸馏方式中选择与所述特征维度数量匹配的所述第一蒸馏方式。
  30. 根据权利要求19-29中任一项所述的装置,其特征在于,
    所述第一蒸馏方式是从预设的多种蒸馏方式中选择得到,所述多种蒸馏方式包括所述老师模型对所述学生模型的指导方式不相同的至少两种蒸馏方式。
  31. 一种推荐装置,其特征在于,包括:
    获取单元,用于获取目标用户的信息和候选推荐对象的信息;
    处理单元,用于将所述目标用户的信息和所述候选推荐对象的信息输入至推荐模型,预测所述目标用户对所述候选推荐对象有操作动作的概率;
    其中,所述推荐模型是使用样本集中的有偏数据集和无偏数据集按照第一蒸馏方式对第一神经网络进行训练得到,所述有偏数据集中包括有偏置的样本,所述无偏数据集中包括无偏置的样本,所述第一蒸馏方式是根据所述样本集的数据特征确定的,所述有偏数据集中的样本包括第一用户的信息和第一推荐对象的信息以及实际标签,所述有偏数据集中的样本的实际标签用于表示所述第一用户是否对所述第一推荐对象有操作动作,所述无偏数据集中的样本包括第二用户的信息和第二推荐对象的信息以及实际标签,所述无偏数据集中的样本的实际标签用于表示所述第二用户是否对所述第二推荐对象有操作动作。
  32. 如权利要求31所述的装置,其特征在于,所述无偏数据集是在候选推荐对象集合中的所述候选推荐对象被展示的概率相同的情况下获得的,所述第二推荐对象为所述候选推荐对象集合中的一个候选推荐对象。
  33. 根据权利要求32所述的装置,其特征在于,所述无偏数据集是在候选推荐对象集合中的所述候选推荐对象被展示的概率相同的情况下获得的,包括:
    所述无偏数据集中的样本是在所述候选推荐对象集合中的候选推荐对象被随机展示给所述第二用户的情况下获得的;
    或者所述无偏数据集中的样本是在所述第二用户搜索所述第二推荐对象的情况下获得的。
  34. 一种电子设备,其特征在于,包括:
    显示单元,所述显示单元用于显示第一界面,所述第一界面包括至少一个应用程序的学习列表,所述该至少一个应用程序的学习列表中的第一应用程序的学习列表包括至少一个选项,所述至少一个选项中的选项关联一个应用程序;
    处理单元,所述处理单元用于感知到用户在所述第一界面上的第一操作;
    所述显示单元还用于响应于所述第一操作,打开或关闭所述第一应用程序在所述第一应用程序的学习列表中部分或全部选项所关联的应用程序中的跨域推荐功能。
  35. 根据权利要求34所述的电子设备,其特征在于,所述一个或多个推荐对象是通过将所述用户的信息和候选推荐对象的信息输入推荐模型中,预测所述用户对所述候选推荐对象有操作动作的概率确定的。
  36. 如权利要求35所述的电子设备,其特征在于,所述推荐模型是使用样本集中的有偏数据集和无偏数据集按照第一蒸馏方式对第一神经网络进行训练得到,所述有偏数据集中包括有偏置的样本,所述无偏数据集中包括无偏置的样本,所述第一蒸馏方式是根据所述样本集的数据特征确定的,所述有偏数据集中的样本包括第一用户的信息和第一推荐对象的信息以及实际标签,所述有偏数据集中的样本的实际标签用于表示所述第一用户是否对所述第一推荐对象有操作动作,所述无偏数据集中的样本包括第二用户的信息和第二推荐对象的信息以及实际标签,所述无偏数据集中的样本的实际标签用于表示所述第二用户是否对所述第二推荐对象有操作动作。
  37. 一种神经网络蒸馏装置,其特征在于,包括处理器,所述处理器和存储器耦合,所述存储器存储有程序,当所述存储器存储的程序指令被所述处理器执行时实现权利要求1至12中任一项所述的方法。
  38. 一种推荐装置,其特征在于,包括至少一个处理器和存储器,所述至少一个处理器与所述存储器耦合,用于读取并执行所述存储器中的指令,以执行如权利要求13-15中任一项所述的推荐方法。
  39. 一种电子设备,其特征在于,包括:处理器;存储器;所述存储器存储一个或多个计算机程序,所述一个或多个计算机程序包括指令,当所述指令被所述一个或多个处理器执行时,使得所述电子设备执行如权利要求16-18中任一项所述的方法。
  40. 一种计算机可读存储介质,包括程序,当其被处理单元所执行时,执行如权利要求1至12、13-15或者16-18中任一项所述的方法。
PCT/CN2020/104653 2020-07-24 2020-07-24 一种神经网络蒸馏方法以及装置 WO2022016556A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP20945809.0A EP4180991A4 (en) 2020-07-24 2020-07-24 METHOD AND DEVICE FOR THE DISTILLATION OF A NEURAL NETWORK
CN202080104828.5A CN116249991A (zh) 2020-07-24 2020-07-24 Neural network distillation method and apparatus
PCT/CN2020/104653 WO2022016556A1 (zh) 2020-07-24 2020-07-24 Neural network distillation method and apparatus
US18/157,277 US20230162005A1 (en) 2020-07-24 2023-01-20 Neural network distillation method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/104653 WO2022016556A1 (zh) 2020-07-24 2020-07-24 Neural network distillation method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/157,277 Continuation US20230162005A1 (en) 2020-07-24 2023-01-20 Neural network distillation method and apparatus

Publications (1)

Publication Number Publication Date
WO2022016556A1 (zh)

Family

ID=79729768

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/104653 WO2022016556A1 (zh) 2020-07-24 2020-07-24 一种神经网络蒸馏方法以及装置

Country Status (4)

Country Link
US (1) US20230162005A1 (zh)
EP (1) EP4180991A4 (zh)
CN (1) CN116249991A (zh)
WO (1) WO2022016556A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116911956A (zh) * 2023-09-12 2023-10-20 深圳须弥云图空间科技有限公司 Recommendation model training method and apparatus based on knowledge distillation, and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10713540B2 (en) * 2017-03-07 2020-07-14 Board Of Trustees Of Michigan State University Deep learning system for recognizing pills in images
CN110351318A (zh) * 2018-04-04 2019-10-18 腾讯科技(深圳)有限公司 Application recommendation method, terminal, and computer storage medium
CN111105008A (zh) * 2018-10-29 2020-05-05 富士通株式会社 Model training method, data recognition method, and data recognition apparatus
CN111310053A (zh) * 2020-03-03 2020-06-19 上海喜马拉雅科技有限公司 Information recommendation method, apparatus, device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4180991A4 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743043A (zh) * 2022-03-15 2022-07-12 北京迈格威科技有限公司 Image classification method, electronic device, storage medium, and program product
CN114743043B (zh) * 2022-03-15 2024-04-26 北京迈格威科技有限公司 Image classification method, electronic device, storage medium, and program product
WO2023184185A1 (zh) * 2022-03-29 2023-10-05 西门子股份公司 Application orchestration method and apparatus
CN114822510A (zh) * 2022-06-28 2022-07-29 中科南京智能技术研究院 Voice wake-up method and system based on a binary convolutional neural network
CN114822510B (zh) * 2022-06-28 2022-10-04 中科南京智能技术研究院 Voice wake-up method and system based on a binary convolutional neural network
CN114970375A (zh) * 2022-07-29 2022-08-30 山东飞扬化工有限公司 Distillation process monitoring method based on real-time sampled data
CN115759027A (zh) * 2022-11-25 2023-03-07 上海苍阙信息科技有限公司 Text data processing system and method
CN115759027B (zh) * 2022-11-25 2024-03-26 上海苍阙信息科技有限公司 Text data processing system and method
CN117009830A (zh) * 2023-10-07 2023-11-07 之江实验室 Knowledge distillation method and system based on embedded feature regularization
CN117009830B (zh) * 2023-10-07 2024-02-13 之江实验室 Knowledge distillation method and system based on embedded feature regularization

Also Published As

Publication number Publication date
US20230162005A1 (en) 2023-05-25
CN116249991A (zh) 2023-06-09
EP4180991A4 (en) 2023-08-09
EP4180991A1 (en) 2023-05-17

Similar Documents

Publication Publication Date Title
WO2022016556A1 (zh) Neural network distillation method and apparatus
WO2021047593A1 (zh) Recommendation model training method, and method and apparatus for predicting selection probability
WO2021233199A1 (zh) Search recommendation model training method, and search result sorting method and apparatus
WO2022016522A1 (zh) Recommendation model training method, recommendation method, apparatus, and computer-readable medium
WO2023221928A1 (zh) Recommendation method, training method, and apparatus
WO2023185925A1 (zh) Data processing method and related apparatus
WO2024002167A1 (zh) Operation prediction method and related apparatus
US11853901B2 (en) Learning method of AI model and electronic apparatus
WO2024041483A1 (zh) Recommendation method and related apparatus
CN115879508A (zh) Data processing method and related apparatus
CN116108267A (zh) Recommendation method and related device
WO2024067779A1 (zh) Data processing method and related apparatus
CN113590976A (zh) Recommendation method based on a spatially adaptive graph convolutional network
WO2024012360A1 (zh) Data processing method and related apparatus
CN117217284A (zh) Data processing method and apparatus
CN116910357A (zh) Data processing method and related apparatus
CN116843022A (zh) Data processing method and related apparatus
CN116204709A (zh) Data processing method and related apparatus
WO2023050143A1 (zh) Recommendation model training method and apparatus
CN116308640A (zh) Recommendation method and related apparatus
CN116467594A (zh) Recommendation model training method and related apparatus
CN114707070A (zh) User behavior prediction method and related device
CN115545738A (zh) Recommendation method and related apparatus
WO2022262561A1 (zh) Multimedia resource processing method, apparatus, device, and storage medium
WO2023051678A1 (zh) Recommendation method and related apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20945809

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020945809

Country of ref document: EP

Effective date: 20230207

NENP Non-entry into the national phase

Ref country code: DE