CN111640425A - Model training and intention recognition method, device, equipment and storage medium - Google Patents

Model training and intention recognition method, device, equipment and storage medium

Info

Publication number
CN111640425A
Authority
CN
China
Prior art keywords
training
model
network
target
distillation
Prior art date
Legal status
Granted
Application number
CN202010444204.XA
Other languages
Chinese (zh)
Other versions
CN111640425B (en)
Inventor
王晶
彭程
罗雪峰
王健飞
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010444204.XA priority Critical patent/CN111640425B/en
Publication of CN111640425A publication Critical patent/CN111640425A/en
Application granted granted Critical
Publication of CN111640425B publication Critical patent/CN111640425B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The application discloses a model training and intention recognition method, device, equipment and storage medium, and relates to the technical field of artificial intelligence. The model training method comprises the following steps: performing precipitation training at least twice on the underlying network of a pre-training model according to a training task data set to obtain a reinforced model of the pre-training model, wherein the training objects of each precipitation training at least comprise the underlying network and a prediction layer network, and include a successively decreasing number of middle-high layer networks; taking at least two networks in the reinforced model as target networks, and constructing a distillation model according to the target networks, wherein the target networks comprise a feature recognition network and the prediction layer network, and the feature recognition network at least comprises the underlying network; extracting target knowledge of the training task data set through the target network of the reinforced model; and training the distillation model according to the target knowledge and the training task data set to obtain a target learning model, so as to improve the efficiency and accuracy of prediction of the target learning model.

Description

Model training and intention recognition method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to an artificial intelligence technology.
Background
With the development of artificial intelligence technology, deep learning models are increasingly widely applied in the field of human-computer interaction. The pre-training model, as one kind of deep learning model, has a complex structure and a huge number of model parameters, so it may be time-consuming and slow at the inference stage. To improve the response speed of the pre-training model, the prior art generally requires research and development personnel to manually select network layers with smaller weight values from the pre-training model and cut them out, so as to compress the pre-training model and reduce its structural complexity. However, a pre-training model pruned in this way is strongly affected by human factors and has low accuracy, which seriously degrades the human-computer interaction effect and urgently needs to be improved.
Disclosure of Invention
A model training and intention recognition method, apparatus, device and storage medium are provided.
According to a first aspect, there is provided a knowledge-based distillation model training method, the method comprising:
performing precipitation training at least twice on an underlying network of a pre-training model according to a training task data set to obtain a reinforced model of the pre-training model; wherein the training objects of each precipitation training at least comprise the underlying network and a prediction layer network, and include a successively decreasing number of middle-high layer networks, and the pre-training model comprises, from bottom to top, the underlying network, at least one middle-high layer network and the prediction layer network;
taking at least two networks in the reinforced model as target networks, and constructing a distillation model according to the target networks, wherein the target networks comprise a feature recognition network and the prediction layer network; the feature recognition network comprises at least the underlying network;
extracting target knowledge of the training task data set through a target network of the reinforced model;
and training the distillation model according to the target knowledge and the training task data set to obtain a target learning model.
According to a second aspect, there is provided an intent recognition method, the method comprising:
acquiring user voice data acquired by a man-machine interaction device;
inputting the user voice data into a target learning model to obtain a user intention recognition result output by the target learning model; wherein the target learning model is determined based on training of a knowledge distillation-based model training method according to any embodiment of the application;
and determining a response result of the human-computer interaction equipment according to the user intention identification result.
According to a third aspect, there is provided a knowledge-based distillation model training apparatus, the apparatus comprising:
the precipitation training module is used for performing precipitation training at least twice on the underlying network of the pre-training model according to the training task data set to obtain a reinforced model of the pre-training model; wherein the training objects of each precipitation training at least comprise the underlying network and a prediction layer network, and include a successively decreasing number of middle-high layer networks, and the pre-training model comprises, from bottom to top, the underlying network, at least one middle-high layer network and the prediction layer network;
the distillation model building module is used for taking at least two networks in the reinforced model as target networks and building a distillation model according to the target networks, wherein the target networks comprise a feature recognition network and the prediction layer network; the feature recognition network comprises at least the underlying network;
the target knowledge extraction module is used for extracting target knowledge of the training task data set through a target network of the reinforced model;
and the distillation model training module is used for training the distillation model according to the target knowledge and the training task data set to obtain a target learning model.
According to a fourth aspect, there is provided an intention recognition apparatus, the apparatus comprising:
the voice data acquisition module is used for acquiring user voice data acquired by the human-computer interaction equipment;
the intention recognition module is used for inputting the user voice data into a target learning model so as to obtain a user intention recognition result output by the target learning model; wherein the target learning model is determined based on training of a knowledge distillation-based model training method according to any embodiment of the application;
and the response result determining module is used for determining the response result of the human-computer interaction equipment according to the user intention recognition result.
According to a fifth aspect, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a knowledge distillation based model training method or an intent recognition method as described in any embodiment of the present application.
According to a sixth aspect, a non-transitory computer readable storage medium having computer instructions stored thereon is provided. The computer instructions are for causing the computer to perform a knowledge distillation based model training method or an intent recognition method as described in any embodiment of the present application.
According to the technology of the embodiment of the application, the problem that the accuracy is low due to the fact that a pre-training model is manually compressed in the prior art is solved, and a high-precision target learning model can be trained through low-cost automatic compression so as to improve the man-machine interaction effect.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1A is a flow chart of a knowledge-based distillation model training method provided in accordance with an embodiment of the present application;
FIG. 1B is a schematic diagram of a network structure of a pre-training model according to an embodiment of the present application;
FIG. 2 is a flow chart of another knowledge-based distillation model training method provided in accordance with an embodiment of the present application;
FIG. 3 is a flow chart of another knowledge-based distillation model training method provided in accordance with an embodiment of the present application;
FIGS. 4-5 are flow charts of two knowledge-based distillation model training methods provided according to embodiments of the present application;
FIG. 6A is a flow chart of another knowledge-based distillation model training method provided in accordance with an embodiment of the present application;
FIG. 6B is a schematic diagram of a distillation model training scheme according to an embodiment of the present disclosure;
FIG. 7 is a flow chart of a knowledge-based distillation model training method provided in accordance with an embodiment of the present application;
FIG. 8 is a flow chart of an intent recognition method provided in accordance with an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a knowledge distillation-based model training apparatus according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an intention recognition apparatus according to an embodiment of the present application;
FIG. 11 is a block diagram of an electronic device for implementing the knowledge-based distillation model training method or the intent recognition method of the embodiments of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1A is a flow chart of a knowledge-based distillation model training method provided in accordance with an embodiment of the present application; fig. 1B is a schematic diagram of a network structure of a pre-training model according to an embodiment of the present application. The method is suitable for the condition that a pre-training model with a complex network structure is compressed and trained into a target learning model with a simple network structure based on the knowledge distillation technology. The embodiment may be performed by a knowledge distillation-based model training apparatus configured in an electronic device, which may be implemented in software and/or hardware. As shown in fig. 1A-1B, the method includes:
and S101, performing precipitation training at least twice on the underlying network of the pre-training model according to the training task data set to obtain the reinforced model of the pre-training model.
The training task data set of the embodiment of the application may be obtained by acquiring, according to the prediction task to be executed by the pre-training model, sample data related to that prediction task to serve as the training task data set. For example, if the prediction task that the pre-training model needs to execute is to perform intent recognition on user speech data in shopping platform A, all historical user speech data in shopping platform A may be obtained, and the training task data set corresponding to the prediction task is obtained after relevant processing (such as tagging and deleting invalid data).
The pre-training model can be built based on a deep learning framework and trained with mass data; it is a high-precision model capable of executing a certain learning task, and generally has characteristics such as a deep network layer number, a wide dimensionality of each network layer and a large number of model parameters. The pre-training model may be trained by the user with a large amount of sample data, or may be obtained directly from a pre-training model database, and this embodiment is not limited in this respect. Optionally, the pre-training model may include, from bottom to top, an underlying network, at least one middle-high layer network, and a prediction layer network. The underlying network and the middle-high layer network are used for feature recognition; the prediction layer network is configured to perform task prediction based on the recognized features. The underlying network is generally used to recognize simple features, and the middle-high layer network is generally used to abstract complex features from simple features. For example, if the pre-training model is a bert model that performs intent recognition, the underlying network of the bert model is typically used to recognize simpler grammatical features, while the middle-high layer networks are typically used to abstract complex features from those syntactic features. The prediction layer network is used to predict tasks according to the features recognized by the underlying network and the middle-high layer networks. Optionally, the pre-training model of the embodiment of the present application may be a bert model.
Illustratively, the pre-training model 1 shown in fig. 1B is composed of 12 network layers, wherein the 1 st to 3 rd network layers are underlying networks 10, the 4 th to 11 th network layers are medium-high networks 11, and the 12 th network layer is a prediction layer network 12, wherein the medium-high networks 11 further include a medium-layer network 110 (i.e., the 4 th to 7 th network layers) and a high-layer network 111 (i.e., the 8 th to 11 th network layers).
Optionally, in general, the complex features abstracted by the middle-high layer networks of the pre-training model have low relevance to the prediction task itself, and accurate completion of the prediction task mainly depends on the underlying network. Therefore, this operation may perform precipitation training multiple times on the underlying network of the pre-training model, and continuously adjust the training object (namely, the network layers that need to be trained in the pre-training model) across the multiple precipitation trainings. The training objects of each precipitation training at least comprise the underlying network and the prediction layer network, and include a successively decreasing number of middle-high layer networks. That is, although this embodiment focuses on performing precipitation training on the underlying network of the pre-training model, in order to ensure the accuracy of the training result, the object of each training at least includes the underlying network and the prediction layer network, and for the middle-high layer networks, the number of middle-high layers included in the training object decreases as the number of precipitation trainings increases. For example, if five precipitation trainings are performed on the underlying network 10 of the pre-training model 1 shown in fig. 1B, the training objects of all five precipitation trainings include the underlying network 10 and the prediction layer network 12; for the middle-high layers, the training object of the first precipitation training may include all network layers; the training object of the second precipitation training may be decreased to include network layers 4 through 9; the training object of the third precipitation training may be decreased again to include network layers 4 through 7, and so on; by the fifth precipitation training, the training object may have been decreased to no longer include the middle-high layer network 11.
Specifically, in this step, when precipitation training is performed at least twice on the underlying network of the pre-training model according to the training task data set, a part of the training task data set may be input into the pre-training model each time, one precipitation training may be performed on the training object with that part of the data, and the pre-training model after multiple precipitation trainings is then used as the reinforced model. Because the middle-high layers in the training object gradually decrease as the number of trainings increases, this operation trains and updates the underlying network more and more precisely as the precipitation trainings proceed, so that the parameters of the underlying network become increasingly accurate. That is to say, compared with the pre-training model, the network structure of the reinforced model of this embodiment does not change: if the pre-training model is a bert model, the reinforced model after precipitation training is also a bert model, but the network parameters of its underlying network are more accurate.
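To make the layer-freezing idea concrete, the following Python sketch (using PyTorch) builds a stand-in 12-layer model following the split of FIG. 1B and shrinks the trainable middle-high span over successive precipitation rounds. The class name, layer types, schedule and optimizer settings are illustrative assumptions, not the patent's exact procedure.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the pre-training model of FIG. 1B:
# layers[0:3]  -> underlying (bottom) network
# layers[3:11] -> middle-high layer network
# prediction   -> prediction layer network (12th layer)
class PreTrainedModel(nn.Module):
    def __init__(self, hidden=128, num_intents=10):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(11)])
        self.prediction = nn.Linear(hidden, num_intents)

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return self.prediction(x)

def set_precipitation_round(model, top_mid_high):
    """Mark as trainable: the bottom layers (indices 0-2), the middle-high
    layers with index below `top_mid_high`, and the prediction layer."""
    for i, layer in enumerate(model.layers):
        trainable = i < 3 or i < top_mid_high
        for p in layer.parameters():
            p.requires_grad = trainable
    for p in model.prediction.parameters():
        p.requires_grad = True

model = PreTrainedModel()
# Each round keeps bottom + prediction layers and shrinks the trainable
# middle-high span; by the last round only the bottom network remains.
for round_idx, top_mid_high in enumerate([11, 9, 7, 5, 3], start=1):
    set_precipitation_round(model, top_mid_high)
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable_params, lr=1e-4)
    # ... run one precipitation training pass over this round's share of
    # the training task data set with `optimizer` ...
```

Restricting updates via `requires_grad` is only one possible way to limit training to the current training object; the description does not prescribe a particular mechanism.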
S102, taking at least two networks in the reinforced model as target networks, and constructing a distillation model according to the target networks.
The target network may be a network selected from networks included in the reinforced model and required for completing the current prediction task. The target network comprises a feature recognition network and a prediction layer network; the feature recognition network is a network for performing feature recognition, and in the embodiment of the present application, the feature recognition network at least includes an underlying network. Optionally, the feature recognition network may include a part or all of a middle-high network in addition to the underlying network, which is not limited in this embodiment.
Optionally, in this embodiment, at least two networks may be selected from the enhanced model as the target network, where if two networks are selected, the two networks are an underlying network and a predicted layer network of the enhanced model, and at this time, the feature recognition network in the target network only includes the underlying network; if three or more networks are selected, the remaining networks may be selected from the medium-high layer networks on the basis of the selection of the underlying network and the prediction layer network, and the feature network in the target network may include at least one medium-high layer network in addition to the underlying network. Whether the middle-high level network of the enhanced model is used as the feature recognition network of the target network or not can be determined by integrating factors such as an actual prediction task and the type of target knowledge to be extracted subsequently. This embodiment is not limited to this.
Optionally, when the distillation model is constructed according to the target network, a distillation model also including the target network may be constructed according to the target network. It should be noted that the network type of the target network of the distillation model constructed in this step is the same as the type of the target network of the reinforcement model. Specifically, the distillation model also includes a prediction layer network and a feature recognition network, and regarding the feature recognition network, if the feature recognition network selected from the enhanced model only includes an underlying network, the feature recognition network of the constructed distillation model also only includes the underlying network; if the feature recognition network selected from the enhanced model includes not only the underlying network but also the middle network in the middle-high network, the feature recognition network of the constructed distillation model also includes the underlying network and the middle network.
Optionally, when the distillation model is constructed according to the target network, a distillation model having the same structure as the enhanced model, that is, an isomorphic model of the enhanced model, may be constructed by combining the network layer structure of the target network of the enhanced model. For example, if the augmented model is a bert model, the distillation model constructed is a bert model that contains only the target network structures in the augmented model. The distillation model may be constructed in a different structure than the enhancement model, but also includes the target network type of the enhancement model, i.e., the heterogeneous model of the enhancement model. For example, the enhanced model is a bert model, and the constructed distillation model is a CNN model, but the CNN model also includes the same type of target network as the enhanced model. The method of how to construct the homogeneous or heterogeneous distillation model will be described in detail in the examples which follow.
It should be noted that the distillation model constructed by this operation can be a machine learning model or a small neural-network-based model, characterized by few parameters, fast inference speed and good portability.
S103, extracting target knowledge of the training task data set through a target network of the reinforced model.
The target knowledge can be a result obtained after a target network in the reinforcement model processes the training task data set, and the target knowledge is used for being subsequently injected into the distillation model and used as a supervision signal during the training of the distillation model.
Optionally, when extracting target knowledge of the training task data set, the step may be to use the training task data set as an input of the enhanced model, and obtain a first data feature representation output by a feature recognition network of the enhanced model and a first prediction probability representation output by a prediction layer network of the enhanced model; and using the obtained first data feature representation and the first prediction probability representation as target knowledge of the training task data set. Specifically, the training task data set may be divided into multiple parts according to a preset size, such as the size of batch _ size. And then inputting each piece of divided training task data into the reinforced model, operating the reinforced model, and acquiring the characteristic representation output by the characteristic recognition network of the reinforced model as a first data characteristic representation. If the feature identification network only has an underlying network, the first data feature representation is only the feature representation output by the underlying network; if the feature recognition network comprises an underlying network and a portion of a higher network, the first data feature representation comprises not only a feature representation of an output of the underlying network, but also a feature representation of an output of the higher network in the portion. And acquiring a characteristic representation, such as a prediction probability value, of the prediction layer network output of the reinforced model as a first prediction probability representation, and further taking the acquired first data characteristic representation and the first prediction probability representation as target knowledge corresponding to the training task data input at this time.
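As a rough illustration of this step, the sketch below (reusing the hypothetical `PreTrainedModel` from the earlier precipitation sketch, with layers 0-2 as the feature recognition network) collects the first data feature representation and the first prediction probability representation for each batch; the function name and the batching are assumptions.

```python
import torch

def extract_target_knowledge(enhanced_model, task_batches):
    """Run each batch of training task data through the (hypothetical)
    reinforced model and collect (feature representation, prediction
    probability) pairs to be used later as supervision signals."""
    knowledge = []
    enhanced_model.eval()
    with torch.no_grad():
        for x in task_batches:
            h = x
            # First data feature representation: output of the feature
            # recognition network (here, the bottom layers 0-2).
            for layer in enhanced_model.layers[:3]:
                h = torch.relu(layer(h))
            feature_rep = h
            # Continue through the remaining layers to the prediction
            # layer to get the first prediction probability representation.
            for layer in enhanced_model.layers[3:]:
                h = torch.relu(layer(h))
            probs = torch.softmax(enhanced_model.prediction(h), dim=-1)
            knowledge.append((feature_rep, probs))
    return knowledge
```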
S104, training the distillation model according to the target knowledge and the training task data set to obtain a target learning model.
Optionally, in this operation, the target knowledge obtained in S103 is used as a supervision signal for training the distillation model, and the distillation model is induced to train on the training task data set, so that the target knowledge is migrated to the distillation model during training and the distillation model learns the prediction task of the reinforced model. Specifically, this step may calculate a soft supervision label according to the data feature representation and prediction probability representation in the target knowledge together with the data feature representation and prediction probability representation obtained by the distillation model when processing the training task data, calculate a hard supervision label according to the processing result of the distillation model on the training task data, and then combine the soft supervision label with the hard supervision label to perform distillation training on the distillation model with higher learning efficiency using less training task data. The process of how to calculate the hard and soft supervision labels, and how to perform distillation training based on the two kinds of supervision labels, will be described in detail in the examples that follow.
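One possible way to combine the two supervision signals is sketched below; the choice of mean-squared error for the feature term, KL divergence for the probability term, and the weighting factor `alpha` are assumptions for illustration, and the feature representations are assumed to have compatible shapes.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feat, student_logits,
                      teacher_feat, teacher_probs,
                      hard_labels, alpha=0.5):
    """Soft supervision: align the distillation model's feature
    representation and predicted distribution with the target knowledge.
    Hard supervision: ordinary cross-entropy against the true labels."""
    soft_feature = F.mse_loss(student_feat, teacher_feat)
    soft_prediction = F.kl_div(
        F.log_softmax(student_logits, dim=-1), teacher_probs,
        reduction="batchmean")
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * (soft_feature + soft_prediction) + (1 - alpha) * hard
```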
Optionally, the distillation model is trained in this step, the trained distillation model is a target learning model, and the target learning model is obtained by knowledge distillation of the pre-training model, so that the target learning model can accurately execute the prediction task of the pre-training model, and the target learning model has a simple structure relative to the pre-training model, so that the time consumption is short and the response speed is high when the prediction task is executed.
Optionally, in this embodiment, after the target learning model is obtained through training, the target learning model may be deployed in the actual human-computer interaction field to perform prediction of the online task. Preferably, if the pre-training model and the target learning model are models for performing intent recognition, after the distillation model is trained according to the target knowledge and the training task data set to obtain the target learning model, the embodiment of the present application may further: and deploying the target learning model into the human-computer interaction equipment so as to perform intention recognition on user voice data acquired by the human-computer interaction equipment in real time. Specifically, after the target learning model is deployed in the human-computer interaction device, the human-computer interaction device transmits the user voice data to the target learning model after acquiring the user voice data, the target learning model performs intention recognition on the input user voice data and feeds an intention recognition result back to the human-computer interaction device, and the human-computer interaction device generates a response result corresponding to the user voice data according to the intention recognition result of the target learning model and feeds the response result back to the user. The scheme of the embodiment of the application is that the target learning model is obtained through knowledge distillation training, the network structure is simpler than a pre-training model, the prediction effect can approach to the complex pre-training model, and rapid and accurate intention recognition can be realized so as to meet the real-time response requirement of the human-computer interaction equipment.
According to the technical scheme of the embodiment, according to a training task data set, a base network, a prediction layer network and a gradually decreased middle-high layer network are used as training objects, and at least two times of precipitation training are carried out on the base network of a pre-training model to obtain a reinforced model; and constructing a distillation model according to the target network determined from the reinforced model. Extracting target knowledge of the training task data set through a target network of the reinforced model; and training the distillation model based on the extracted target knowledge and the training task data set to obtain a target learning model. In the embodiment, the mode of gradually decreasing the middle-high network is adopted to perform multiple sedimentation training on the underlying network of the pre-training model, so that the parameters of the underlying network of the pre-training model are more accurate. And subsequently, a distillation model is constructed at least according to the settled accurate underlying network and the prediction layer network, and distillation training is carried out on the distillation model based on the extracted target knowledge, so that the target learning model distilled from the pre-training model simplifies the network structure, simultaneously retains the prediction accuracy of the pre-training model, and further realizes the improvement of the generalization capability of the model. The whole distillation process is not influenced by human factors, and the target learning model is deployed in the human-computer interaction equipment, so that the task can be quickly and accurately executed, and the real-time response requirement of the human-computer interaction equipment is met.
Optionally, the pre-training model in the embodiment of the present application is a trained model that can execute a certain prediction task; when the prediction task covers a wide range, the pre-training model may perform task prediction in multiple ranges, but for a specific range the prediction effect may not be very good. For example, if the pre-training model is a model for intention recognition, it can recognize the intention of users' voices in many fields such as shopping, business handling and intelligent furniture control, but the prediction effect may not be accurate for some specific fields. For this situation, in this embodiment, before performing precipitation training at least twice on the underlying network of the pre-training model according to the training task data set, domain training may be performed on the pre-training model according to a training domain data set, and the pre-training model is updated accordingly.
Specifically, the training domain data set may be obtained by acquiring, according to the work domain in which the pre-training model is to be deployed, sample data related to that domain. For example, if the pre-training model needs to perform intent recognition of user voices in the shopping domain, the voice data of each shopping platform may be subjected to related processing (such as tagging, deleting invalid data, and the like) to obtain the training domain data set corresponding to that domain. The training domain data set is input into the pre-training model so as to update and train the pre-training model for that domain and fine-tune its parameters, so that the updated pre-training model can execute the prediction task in that domain more accurately. By performing domain training on the pre-training model according to the training domain data set and then performing the precipitation training operation of S101 on the updated pre-training model, the prediction accuracy of the pre-training model in the domain to which the prediction task belongs is greatly improved, which provides a guarantee for distilling an accurate target learning model based on the pre-training model.
FIG. 2 is a flow chart of another knowledge-based distillation model training method provided in accordance with an embodiment of the present application; based on the above embodiments, the present embodiment performs further optimization, and gives a specific description of performing at least two settling trainings on the underlying network of the pre-trained model according to the training task data set. As shown in fig. 2, the method includes:
S201, dividing a training task data set to determine a plurality of training data subsets.
Optionally, this operation may divide the training task data set into a plurality of training data subsets according to a preset precipitation strategy, for example, the number of network layers from which knowledge is extracted each time. For example, if the pre-training model is the model shown in fig. 1B and the precipitation strategy is to extract the knowledge of one network layer at a time, the training task data set may be divided into 12 parts. The number of divided training data subsets is less than or equal to the total number of layers of the pre-training model. For example, when the total number of layers of the pre-training model is N, the number of parts K into which the training task data set is divided in this step may be equal to half of the total number of layers N. Optionally, the amount of training data in each divided training data subset may be the same or different, and this embodiment is not limited thereto.
S202, determining training objects corresponding to each training data subset according to the set precipitation training times.
The training object may be the network layers that need to be trained in the pre-training model each time precipitation training is performed. In the embodiment of the application, the training objects corresponding to the training data subsets are different. Specifically, the training object corresponding to each training data subset includes the underlying network, middle-high layer networks and the prediction layer network of the pre-training model, and the number of middle-high layers included is inversely proportional to the order of the precipitation training. The middle-high layers included in each training object are the network layers adjacent to the underlying network and continuing upwards from it. That is, in the training objects corresponding to the training data subsets, the underlying network and the prediction layer network are kept unchanged, while the number of middle-high layers is gradually reduced from top to bottom as the precipitation training order corresponding to the training data subsets moves backward. Optionally, as the number of precipitation trainings increases, the number of middle-high layers included in the training object decreases to zero. Therefore, as the number of precipitation trainings increases, finally only the underlying network is updated and trained.
Optionally, in this embodiment, when determining the training object corresponding to each training data subset, the underlying network and the prediction layer network are not changed, and the number of middle-high layers in the training object corresponding to each training data subset may be determined according to the total number of layers of the pre-training model and the precipitation training order corresponding to that training data subset. For example, if the total number of layers of the pre-training model is N and the precipitation training order of a certain training data subset is the k-th, the highest middle-high layer included in the training object corresponding to that training data subset is layer S = N - 2 × k, that is, the middle-high layers at or below layer S all belong to the training object corresponding to that training data subset.
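The rule in this example can be written out as a small schedule-computation sketch (pure Python); the 12-layer, 3-bottom-layer configuration and the subset count are assumptions taken from the examples above.

```python
def training_objects(total_layers=12, bottom_layers=3, num_subsets=5):
    """For the k-th precipitation round (1-based), the trainable
    middle-high layers are those up to layer S = total_layers - 2 * k,
    never dropping below the bottom network; the bottom network and the
    prediction layer are always part of the training object."""
    schedule = []
    for k in range(1, num_subsets + 1):
        top = max(total_layers - 2 * k, bottom_layers)
        trainable = list(range(1, bottom_layers + 1))          # bottom network
        trainable += list(range(bottom_layers + 1, top + 1))   # middle-high layers
        trainable.append(total_layers)                         # prediction layer
        schedule.append((k, trainable))
    return schedule

for k, layers in training_objects():
    print(f"precipitation round {k}: trainable layers {layers}")
```

Running it prints, for each precipitation round, the layer indices that remain trainable, with the middle-high span shrinking to nothing by the last round.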
S203, performing precipitation training on the training objects corresponding to the training data subsets in the pre-training model according to each training data subset to obtain a reinforced model of the pre-training model.
Optionally, after the training object corresponding to each divided training data subset is determined, each training data subset may be sequentially input into the pre-training model according to the precipitation training sequence corresponding to each training data subset, each network layer corresponding to the current training object in the pre-training model is trained by using the input training data subset, and the parameters of each network layer corresponding to the training object are updated. Because the number of layers of the middle-high network in the training object corresponding to each training data subset of the embodiment is gradually decreased along with the increase of the number of times of the precipitation training, in the process of the multiple precipitation training, the updated parameters of the middle-high network are less and less, the training process is gradually concentrated on the underlying network, the underlying network of the pre-training model is more and more accurate after the multiple precipitation training, and the pre-training model after the multiple precipitation training can be used as a reinforced model.
S204, taking at least two networks in the reinforced model as target networks, and constructing a distillation model according to the target networks.
Wherein the target network comprises a feature recognition network and the prediction layer network; the feature recognition network includes at least an underlying network.
S205, extracting target knowledge of the training task data set through a target network of the reinforced model.
S206, training the distillation model according to the target knowledge and the training task data set to obtain a target learning model.
According to the technical scheme of the embodiment, a training task data set is divided into a plurality of training data subsets, a training object of each training data subset is determined based on the principle that the number of layers of a medium-high network in the training object is in inverse proportion to the sequence of precipitation training, and the training object is subjected to precipitation training once according to each divided training data subset to obtain the reinforced model. Constructing a distillation model according to the reinforced model, and extracting target knowledge; and training the distillation model based on the extracted target knowledge and the training task data set to obtain a target learning model. The embodiment determines the training object for each precipitation training based on the principle that the number of layers of a medium-high network in the training object is inversely proportional to the sequence of the precipitation training, so that the parameters of the underlying network of the pre-training model after multiple times of precipitation training are more accurate. Provides a new idea for the knowledge precipitation operation of the knowledge distillation process. And the subsequent operation of the knowledge distillation training target learning model is guaranteed.
Fig. 3 is a flowchart of another knowledge-based distillation model training method provided in an embodiment of the present application, which is further optimized based on the above embodiments to show details of when to obtain the reinforced model of the pre-training model during multiple precipitation trainings of the underlying network of the pre-training model. As shown in fig. 3, the method includes:
and S301, successively carrying out precipitation training on the underlying network of the pre-training model according to the training task data set.
It should be noted that the specific implementation of successively performing precipitation training on the underlying network of the pre-training model has been described in detail in the foregoing embodiments and is not repeated here.
S302, testing the pre-training model after the precipitation training according to the test task data set.
The test task data set may be test data for testing whether the pre-training model after the precipitation training can accurately complete the prediction task. Optionally, sample data related to the prediction task may be obtained according to the prediction task that needs to be executed by the pre-training model, and then the sample data is divided into two parts, one part is used as the training task data set of the embodiment of the present application, and the other part is used as the test task data set of the embodiment of the present application.
Optionally, in this embodiment, the test task data set may be input into the pre-training model after the multiple precipitation training in S301, so as to obtain a prediction result output by the pre-training model after the precipitation training based on the test task data, and finally, the prediction result is analyzed according to the real label in the test task data, so as to calculate an evaluation index value representing whether the output result of the pre-training model after the multiple precipitation training is accurate, and the evaluation index value is used as the test result. Optionally, the evaluation index value may be determined according to a prediction task, such as accuracy, precision, recall rate, and the like of an output result of a pre-training model after multiple precipitation training.
Optionally, in order to ensure the accuracy of the test result, in this embodiment, multiple sets of test task data sets may be used to perform multiple tests on the pre-trained model after the precipitation training, and the final test result is determined according to the multiple test results.
S303, if the test result meets the precipitation end condition, taking the pre-training model after the precipitation training as the reinforced model.
The precipitation end condition may be a judgment condition for determining whether the pre-training model after multiple precipitation trainings qualifies as the reinforced model. Specifically, it may be an index threshold corresponding to the evaluation index value in the test result.
Optionally, in this embodiment, the pre-training model after precipitation training is tested in S302, and the obtained test result (i.e., the evaluation index value) is compared with the index threshold in the precipitation end condition. If the evaluation index value satisfies the index threshold, the test result satisfies the precipitation end condition, and the pre-training model after precipitation training may be used as the reinforced model; if the test result does not meet the precipitation end condition, the method returns to S301 to continue performing precipitation training successively on the underlying network of the pre-training model according to the training task data set until the test result meets the precipitation end condition.
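Putting the loop together, a schematic sketch under assumed names: `precipitation_round` and `evaluate` stand in for the training and testing operations described above, and a fixed accuracy threshold is only one possible evaluation index and precipitation end condition.

```python
def train_until_precipitation_ends(model, data_subsets, test_set,
                                   precipitation_round, evaluate,
                                   metric_threshold=0.9, max_passes=10):
    """Alternate precipitation training with testing on the test task
    data set; stop once the evaluation index meets the precipitation end
    condition and return the model as the reinforced model."""
    for _ in range(max_passes):
        for k, subset in enumerate(data_subsets, start=1):
            precipitation_round(model, subset, order=k)  # assumed callable
        metric = evaluate(model, test_set)               # assumed callable
        if metric >= metric_threshold:
            return model  # reinforced model
    return model  # fall back to the last precipitated state
```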
S304, taking at least two networks in the reinforced model as target networks, and constructing a distillation model according to the target networks.
Wherein the target network comprises a feature recognition network and the prediction layer network; the feature recognition network includes at least an underlying network.
S305, extracting target knowledge of the training task data set through a target network of the reinforced model.
S306, training the distillation model according to the target knowledge and the training task data set to obtain a target learning model.
According to the technical scheme, after the underlying network of the pre-training model is subjected to multiple precipitation trainings according to the training task data set, the pre-training model after the precipitation training is tested according to the test task data set, and if it passes the test, it can be used as the reinforced model. A distillation model is then constructed according to the reinforced model and target knowledge is extracted; the distillation model is trained based on the extracted target knowledge and the training task data set to obtain the target learning model. This embodiment tests the pre-training model after knowledge precipitation to determine whether the knowledge precipitation achieves the expected effect of precipitation training, and only when the expected effect is achieved can the precipitated model be used as the reinforced model, which ensures the accuracy of the underlying network parameters of the reinforced model and provides a guarantee for the subsequent operation of training the target learning model by knowledge distillation.
Optionally, the foregoing embodiment introduces the determination of when to obtain the reinforced model of the pre-training model in the process of performing multiple precipitation trainings on the underlying network of the pre-training model. Similarly, in the process of training the distillation model according to the target knowledge and the training task data set, a similar method may be used to determine whether the training of the distillation model is completed, so as to obtain the target learning model. Specifically, when training the distillation model according to the target knowledge and the training task data set to obtain the target learning model, the embodiment of the application may: train the distillation model according to the target knowledge and the training task data set; test the trained distillation model according to the test task data set; and if the test result meets the training end condition, take the trained distillation model as the target learning model. It should be noted that the process of testing the trained distillation model according to the test task data set is similar to the process, described in the above embodiment, of testing the pre-training model after precipitation training according to the test task data set. For example, the test task data set may be input into the trained distillation model, and an evaluation index value is calculated according to the prediction result output by the trained distillation model and the real labels of the test task data set; if the evaluation index value meets the index threshold in the training end condition, the test result of the trained distillation model meets the training end condition, and the distillation model after this training may be used as the target learning model. The advantage of this arrangement is that, by testing the trained distillation model, it is determined whether the task prediction precision of the trained distillation model achieves the expected effect, and only when the expected effect is achieved can the trained distillation model be used as the final target learning model, thereby improving the accuracy of the target learning model distilled based on the knowledge distillation technology.
FIGS. 4-5 are flow charts of two knowledge-based distillation model training methods provided according to the embodiments of the present application, which are further optimized based on the above embodiments and give descriptions of two specific implementations of constructing a distillation model according to a target network.
Alternatively, fig. 4 shows an implementation of constructing a distillation model having the same structure as the enhanced model according to the target network, specifically:
S401, performing precipitation training at least twice on the underlying network of the pre-training model according to the training task data set to obtain a reinforced model of the pre-training model.
The training objects of each precipitation training at least comprise an underlying network and a prediction layer network and a gradually decreasing middle-high layer network, and the pre-training model comprises the underlying network, at least one middle-high layer network and the prediction layer network from bottom to top.
S402, taking at least two networks in the reinforced model as target networks, and acquiring network structure blocks of the target networks.
The target network comprises the feature recognition network and the prediction layer network of the reinforced model; the feature recognition network includes at least the underlying network of the reinforced model. Optionally, part or all of the middle-high layer networks of the reinforced model may also be included. Since the network structure of the distillation model constructed in this embodiment is simpler than that of the reinforced model, in general, the feature recognition network of the target network in the embodiment of the present application does not include, or includes only a small number of, middle-high layer networks. The network structure block may be obtained by encapsulating the network structures of one or more network layers in the reinforced model. For example, assuming that the reinforced model of this embodiment is obtained by performing precipitation training on the pre-training model shown in fig. 1B, the network structure of the reinforced model is also as shown in fig. 1B; at this time, the network structures of the 1st to 3rd network layers in fig. 1B may be encapsulated as the network structure block of the underlying network 10; the network structures of the 4th to 7th network layers are encapsulated into the network structure block of the middle layer network 110; the network structures of the 8th to 11th network layers are encapsulated into the network structure block of the higher-layer network 111; and the network structure of the 12th network layer is encapsulated into the network structure block of the prediction layer network 12.
Alternatively, if a distillation model having the same structure as the reinforcement model is to be constructed, the network structure block of the target network in the reinforcement model may be obtained after the target network is selected from the reinforcement model. For example, if the underlying network and the predicted network in the reinforcement model are used as the target network, the network structure blocks of the underlying network and the network structure blocks of the predicted network may be used as the network structure blocks of the target network.
S403, constructing a distillation model with the same structure as the reinforced model according to the obtained network structure block.
Optionally, the target network corresponds to at least two networks in the reinforced model, so the obtained network structure blocks are also the network structure blocks of at least two networks. This step may arrange the at least two network structure blocks from bottom to top in the order they occupy in the reinforced model, and use the output of each lower network structure block as the input of the adjacent network structure block above it, thereby forming a new model composed of the target network, which is the constructed distillation model.
For example, assuming that the network structure blocks of the underlying network 10 and the predicted network 12 in fig. 1B are obtained in S402, since the underlying network 10 is located below the predicted network 12, the network structure blocks of the underlying network 10 may be placed below the network structure blocks of the predicted network 12, and the output of the network structure blocks of the underlying network 10 may be connected to the input of the network structure blocks of the predicted network 12, so as to generate a distillation model composed of the network structure blocks of the underlying network 10 and the network structure blocks of the predicted network 12. Similarly, if the network structure blocks of the underlying network 10, the middle network 110, and the prediction layer network 12 are obtained in S402, the network structure block of the underlying network 10 may be located at the bottom, the network structure block of the middle network 110 may be located at the middle, the network structure block of the prediction layer network 12 may be located at the top, the output of the network structure block of the underlying network 10 may be connected to the input of the network structure block of the middle network 110, and the output of the network structure block of the middle network 110 may be connected to the input of the network structure block of the prediction layer network 12, so as to generate a distillation model composed of the network structure block of the underlying network 10, the network structure block of the middle network 110, and the network structure block of the prediction layer network 12.
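A minimal sketch of this homogeneous assembly, assuming the network structure blocks are available as `torch.nn.Module` objects; the placeholder block shapes and layer types are illustrative only.

```python
import torch
import torch.nn as nn

class BlockDistillationModel(nn.Module):
    """Distillation model assembled from the reinforced model's network
    structure blocks, ordered bottom-to-top (e.g. bottom-network block,
    optional middle-network block, prediction-layer block)."""
    def __init__(self, structure_blocks):
        super().__init__()
        self.blocks = nn.ModuleList(structure_blocks)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)   # output of the lower block feeds the block above
        return x

# Illustrative assembly with placeholder blocks:
bottom_block = nn.Sequential(nn.Linear(128, 128), nn.ReLU(),
                             nn.Linear(128, 128), nn.ReLU())
prediction_block = nn.Linear(128, 10)
distillation_model = BlockDistillationModel([bottom_block, prediction_block])
```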
S404, extracting target knowledge of the training task data set through a target network of the reinforced model.
S405, training the distillation model according to the target knowledge and the training task data set to obtain a target learning model.
Alternatively, fig. 5 shows an implementation of constructing a distillation model with a structure different from that of the enhanced model according to the target network, specifically:
S501, performing precipitation training at least twice on the underlying network of the pre-training model according to the training task data set to obtain the reinforced model of the pre-training model.
The training objects of each precipitation training at least comprise the underlying network and the prediction layer network, together with a successively decreasing number of middle-high layer networks; the pre-training model comprises, from bottom to top, the underlying network, at least one middle-high layer network and the prediction layer network.
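A minimal, non-authoritative sketch of this successive precipitation training is given below. It assumes, purely for illustration, that the pre-training model exposes an underlying attribute, an nn.ModuleList mid_high ordered from bottom to top and a prediction attribute, that the per-round schedule of retained middle-high layers (e.g. [8, 4, 0]) is chosen by the caller, and that each round optimizes a plain cross-entropy task loss; the embodiment itself only requires that the number of middle-high layers in the training object decreases round by round:

import itertools

import torch
import torch.nn.functional as F


def precipitation_training(model, data_subsets, mid_high_schedule, lr=1e-5):
    # One entry of mid_high_schedule per precipitation training round; it gives how
    # many middle-high layers (counted upward from the underlying network) stay in
    # the training object, and it should decrease to zero over the rounds.
    for subset, n_mid in zip(data_subsets, mid_high_schedule):
        for p in model.parameters():          # freeze everything first
            p.requires_grad = False
        trainable = list(itertools.chain(
            model.underlying.parameters(),    # always part of the training object
            model.prediction.parameters(),    # always part of the training object
            *(layer.parameters() for layer in model.mid_high[:n_mid])))
        for p in trainable:
            p.requires_grad = True
        optimizer = torch.optim.Adam(trainable, lr=lr)
        for inputs, labels in subset:         # one precipitation training round
            optimizer.zero_grad()
            loss = F.cross_entropy(model(inputs), labels)
            loss.backward()
            optimizer.step()
    return model                              # candidate reinforced model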
And S502, taking at least two networks in the reinforced model as target networks.
Optionally, the process of selecting the target network from the reinforcement model is already described in the foregoing embodiment, and details are not described in this embodiment.
And S503, selecting a neural network model with a structure different from that of the reinforced model as a distillation model according to the target network.
The output layer network of the neural network model is consistent in type with the prediction layer network in the target network, and the non-output layer network of the neural network model is consistent in type with the feature recognition network in the target network. The type of the prediction layer network means that the network performs prediction, namely task prediction. The types of feature recognition networks include the underlying network, the middle layer network, the higher-layer network, and the like.
Optionally, because the distillation model constructed in this embodiment has a structure different from that of the reinforced model, a neural network model which has a simple structure and can realize the prediction task may be selected as the distillation model according to the requirement. A neural network model selectable as the distillation model is generally simpler in structure and has fewer layers, but its output layer must be consistent in type with the prediction layer network in the target network, and its non-output layer network must be consistent in type with the feature recognition network in the target network. That is, the output layer of the neural network model must be a network capable of performing task prediction, and its non-output layer must match the type of the feature recognition network of the target network: if the feature recognition network of the target network is an underlying network, the non-output layer of the neural network model should also be an underlying network; if the feature recognition network of the target network is an underlying network and a middle layer network, the non-output layer of the neural network model should also be an underlying network and a middle layer network.
The distillation model constructed in this step is simple in structural units and has a small number of layers, and is therefore usually in a heterogeneous relationship with the structurally complex reinforced model. For example, assuming that the reinforced model is a bert model, a CNN model may be selected as the distillation model.
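For instance, a heterogeneous distillation model of the kind just described could be sketched as a small text-CNN student for a bert-like reinforced model; the vocabulary size, channel widths, the projection to the teacher's hidden size and the number of labels below are illustrative placeholders rather than values taken from this embodiment:

import torch
import torch.nn as nn


class CnnDistillationModel(nn.Module):
    def __init__(self, vocab_size=30522, embed_dim=128,
                 teacher_hidden=768, num_labels=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Non-output layers: play the role of the feature recognition network.
        self.encoder = nn.Sequential(
            nn.Conv1d(embed_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1))
        # Projection so the student feature can be compared with the teacher's
        # first data feature representation during distillation (an implementation
        # convenience added here, not something required by the embodiment).
        self.feature_proj = nn.Linear(256, teacher_hidden)
        # Output layer: plays the role of the prediction layer network.
        self.classifier = nn.Linear(teacher_hidden, num_labels)

    def forward(self, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)      # (batch, embed_dim, seq_len)
        feat = self.feature_proj(self.encoder(x).squeeze(-1))
        return feat, self.classifier(feat)                 # feature / prediction outputs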
S504, extracting target knowledge of the training task data set through a target network of the reinforced model.
And S505, training the distillation model according to the target knowledge and the training task data set to obtain a target learning model.
The technical scheme of the embodiment of the application provides two specific execution modes, in the process of training the target learning model of the pre-training model based on the knowledge distillation technology, for constructing a distillation model with the same structure as, or a structure different from, the reinforced model according to the target network of the reinforced model obtained after precipitation training. If a distillation model with the same structure as the reinforced model is constructed, the network structure blocks of the reinforced model are retained in the distillation model, so that the homogeneous distillation model is easier to train by distillation to achieve the prediction effect of the reinforced model; if a distillation model with a structure different from that of the reinforced model is constructed, the heterogeneous distillation model can learn characteristics different from those of the reinforced model, and the generalization capability of the model is improved. The two modes can be selected according to actual requirements, so the flexibility is high.
FIG. 6A is a flow chart of another knowledge-distillation-based model training method provided according to an embodiment of the present application; fig. 6B is a schematic structural diagram of the distillation model training provided in an embodiment of the present application. This embodiment is further optimized on the basis of the above embodiments, and gives a specific description of training the distillation model according to the target knowledge and the training task data set. As shown in fig. 6A-6B, the method includes:
S601, performing at least two times of precipitation training on the underlying network of the pre-training model according to the training task data set to obtain the reinforced model of the pre-training model.
The training objects of each precipitation training at least comprise the underlying network and the prediction layer network, together with a successively decreasing number of middle-high layer networks; the pre-training model comprises, from bottom to top, the underlying network, at least one middle-high layer network and the prediction layer network.
S602, at least two networks in the reinforced model are used as target networks, and a distillation model is constructed according to the target networks.
The target network comprises a feature recognition network and a prediction layer network; the feature recognition network includes at least an underlying network.
Illustratively, it is assumed that the target network selected from the reinforced model shown in fig. 6B consists of the underlying network, the middle layer network and the prediction layer network, and the distillation model shown in fig. 6B is constructed from these three networks.
And S603, extracting target knowledge of the training task data set through a target network of the reinforced model.
For example, this operation may be inputting training data of a preset size in the training task data set, such as batch_size pieces of training data, into the reinforced model shown in fig. 6B, obtaining the feature representation (knowledge_seq_l) output by the underlying network of the reinforced model and the feature representation (knowledge_seq_m) output by the middle layer network as the first data feature representation (knowledge_seq), and obtaining the feature representation (knowledge_predict) output by the prediction layer network of the reinforced model as the first prediction probability representation. The first data feature representation and the first prediction probability representation obtained in this step are the extracted target knowledge.
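In code, this extraction step can be sketched as follows; the reinforced model is assumed (hypothetically) to return the underlying-network features, the middle-layer-network features and the prediction-layer logits for a batch, and applying softmax to obtain a probability representation is likewise an assumption rather than something the embodiment fixes:

import torch


@torch.no_grad()
def extract_target_knowledge(reinforced_model, batch_inputs):
    # Forward one batch of batch_size training examples through the reinforced model.
    knowledge_seq_l, knowledge_seq_m, logits = reinforced_model(batch_inputs)
    knowledge_seq = [knowledge_seq_l, knowledge_seq_m]     # first data feature representation
    knowledge_predict = torch.softmax(logits, dim=-1)      # first prediction probability representation
    return knowledge_seq, knowledge_predict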
S604, inputting the training task data set into a distillation model, and determining a soft supervision label and a hard supervision label according to the processing result and target knowledge of the distillation model on the training task data set.
Wherein the soft supervision label and the hard supervision label are two supervision signals in the process of training the distillation model: the soft supervision label is calculated based on the extracted target knowledge, and the hard supervision label is calculated based on the actual labels in the training task data set.
Optionally, in this embodiment, the training task data set may be input into the distillation model, and the distillation model processes the input training data to obtain the output result of each network layer of the distillation model. On the one hand, the output result is combined with the target knowledge to determine the soft supervision label; on the other hand, it is combined with information of the training task data set to compute the hard supervision label. The specific determination process comprises the following three substeps:
and S6041, inputting the training task data set into the distillation model to obtain a second data characteristic representation output by the characteristic identification network of the distillation model and a second prediction probability representation output by the prediction layer network of the distillation model.
Specifically, after training data of a preset size, such as batch_size pieces of training data, in the training task data set is input into the distillation model, the prediction result (namely, the feature representation) output by the prediction layer network of the distillation model is obtained as the second prediction probability representation. If the feature recognition network of the distillation model contains only the underlying network, the feature representation output by the underlying network is acquired as the second data feature representation; if the feature recognition network of the distillation model comprises part of the middle-high layer networks in addition to the underlying network, the feature representations output by the underlying network and by that part of the middle-high layer networks are acquired as the second data feature representation. Illustratively, as shown in fig. 6B, the training task data set is input into the distillation model; since the feature recognition network of the target network in fig. 6B includes an underlying network and a middle layer network, after the distillation model processes the training task data set, the feature representation (small_seq_l) output by the underlying network and the feature representation (small_seq_m) output by the middle layer network are taken as the second data feature representation (small_seq), and the feature representation (small_predict) output by the prediction layer network of the distillation model is taken as the second prediction probability representation.
And S6042, determining a soft surveillance tag according to the target knowledge, the second data characteristic representation and the second prediction probability representation.
Optionally, since the target knowledge is composed of the first data feature representation and the first prediction probability representation, this embodiment may process the first data feature representation, the first prediction probability representation, the second data feature representation and the second prediction probability representation according to a preset algorithm to obtain the soft supervision label; the specific calculation algorithm is not limited in this embodiment. For example, the mean squared error between the first data feature representation in the target knowledge and the second data feature representation may be taken as a data feature label; the mean squared error between the first prediction probability representation in the target knowledge and the second prediction probability representation may be taken as a probability prediction label; then, according to the weight value of the feature recognition network of the reinforced model, label fusion is performed on the data feature label and the probability prediction label to obtain the soft supervision label. In this embodiment, the soft supervision label is determined from the feature representations output by the reinforced model and the distillation model on the same training task data set, so the determined soft supervision label is more accurate, which further improves the accuracy of the subsequently trained target learning model.
Specifically, the data feature label may be calculated according to the following formula (1), and the probability prediction label may be calculated according to the following formula (2); and finally, calculating the soft supervision label according to the following formula (3).
loss_i=MSE(knowledge_seq,small_seq) (1)
loss_p=MSE(knowledge_predict,small_predict) (2)
loss_soft=W_i*loss_i+loss_p (3)
Wherein loss_i is the data feature label; MSE() is the mean squared error function; knowledge_seq is the first data feature representation; small_seq is the second data feature representation; loss_p is the probability prediction label; knowledge_predict is the first prediction probability representation; small_predict is the second prediction probability representation; loss_soft is the soft supervision label; W_i is the weight value of the feature recognition network.
Optionally, when the feature recognition network includes a plurality of networks (e.g., an underlying network and a middle layer network), the first data feature representation and the second data feature representation are both formed by the feature representations output by a plurality of network layers. In this case, a data feature label may be calculated according to formula (1) for the feature representation output by each network layer. For example, as shown in fig. 6B, the first data feature representation includes knowledge_seq_l and knowledge_seq_m, and the second data feature representation includes small_seq_l and small_seq_m. At this time, the data feature label loss_i_l of the underlying network may be calculated from knowledge_seq_l and small_seq_l, and the data feature label loss_i_m of the middle layer network may be calculated from knowledge_seq_m and small_seq_m. Correspondingly, when the soft supervision label is calculated, the products of the weight value of each network and its data feature label, together with the probability prediction label, may be summed to obtain the final soft supervision label. For example, for the scenario shown in fig. 6B, the calculation formula of the soft supervision label may be loss_soft = W_l*loss_i_l + W_m*loss_i_m + loss_p.
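A sketch of formulas (1)-(3) in this multi-network form follows; it assumes that the teacher and student feature representations passed in are shape-compatible tensors paired layer by layer (e.g. [knowledge_seq_l, knowledge_seq_m] with [small_seq_l, small_seq_m]) and that the weight values W_l, W_m are supplied by the caller:

import torch.nn.functional as F


def soft_supervision_label(knowledge_seqs, small_seqs,
                           knowledge_predict, small_predict, weights):
    loss_soft = F.mse_loss(small_predict, knowledge_predict)       # formula (2): loss_p
    for w_i, k_seq, s_seq in zip(weights, knowledge_seqs, small_seqs):
        loss_i = F.mse_loss(s_seq, k_seq)                          # formula (1), per network layer
        loss_soft = loss_soft + w_i * loss_i                       # formula (3): weighted fusion
    return loss_soft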
And S6043, determining a hard supervision label according to the second prediction probability representation and the training task data set information.
Wherein the training task data set information comprises: the number of training samples, the number of training labels, and the actual label value in the training task dataset.
Optionally, this sub-step may calculate the hard supervision label according to the following formula (4).
loss_hart=-(1/N)*∑_{i=1}^{N}∑_{c=1}^{M}y_ic*log(small_predict_ic) (4)
Wherein loss_hart is the hard supervision label; N is the number of training samples in the training task data set; M is the number of training labels; i denotes the i-th training sample; c denotes the c-th training label; y_ic is the actual label value indicating whether the i-th sample belongs to the c-th training label; small_predict_ic is the probability, output by the prediction layer network of the distillation model, that the i-th training sample belongs to the c-th training label. Optionally, the value of y_ic may be 0 or 1.
In this embodiment, S6041-S6043 determine the soft supervision label from the feature representations output by the reinforced model and the distillation model on the same training task data set, and determine the hard supervision label from the actual label values of the training task data and the prediction probability of the distillation model, thereby providing a new idea for determining the soft supervision label and the hard supervision label and improving their accuracy.
And S605, determining a target label according to the soft supervision label and the hard supervision label.
Wherein the target label is the label value finally used to supervise the training of the distillation model, determined by combining the characteristics of the soft supervision label and the hard supervision label. Optionally, this step determines the target label according to the following formula (5):
loss=alpha*loss_soft+(1-alpha)*loss_hart (5)
Wherein loss is the target label; alpha is a parameter variable; loss_soft is the soft supervision label; loss_hart is the hard supervision label.
The parameter variables in the above formula (5) may be constants set based on preset rules, or may be variables trained with the distillation model. This embodiment is not limited to this.
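The hard supervision label of formula (4) (as reconstructed above) and the target label of formula (5) can be sketched as follows; the small epsilon guard against log(0) is an implementation detail added here, not part of the formulas:

import torch


def hard_supervision_label(small_predict, y):
    # small_predict: (N, M) probabilities output by the distillation model's
    # prediction layer network; y: (N, M) actual label values y_ic in {0, 1}.
    eps = 1e-12
    return -(y * torch.log(small_predict + eps)).sum(dim=1).mean()   # formula (4)


def target_label(loss_soft, loss_hart, alpha):
    return alpha * loss_soft + (1 - alpha) * loss_hart               # formula (5)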
And S606, iteratively updating the parameters of the distillation model according to the target label to obtain a target learning model.
Optionally, in this embodiment, according to the target label determined in S605, the parameters of the distillation model are updated and adjusted according to a preset rule, such as a back-propagation (BP) algorithm, so as to complete one iterative update of the parameters of the distillation model. Then the next group of training data of the preset size, such as batch_size pieces of training data, is acquired from the training task data set and input into the distillation model, and the operations of S603-S606 are executed again to perform the next iterative update of the parameters of the distillation model, thereby completing the training of the distillation model. After the distillation model has been trained multiple times, the trained distillation model can be tested with the test task data set; if the training ending condition is met, the distillation model is considered trained and can be used as the target learning model.
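Putting the pieces together, an iteration loop over S603-S606 could look roughly like the following; it reuses the helper functions sketched above, assumes the distillation model returns its feature representations (aligned one-to-one with the teacher's) together with its logits, and the optimizer, learning rate and epoch count are illustrative choices rather than values from this embodiment:

import torch


def train_distillation_model(reinforced_model, distillation_model, loader,
                             weights, alpha=0.5, lr=1e-4, epochs=3):
    optimizer = torch.optim.Adam(distillation_model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch_inputs, y in loader:                    # batch_size examples per step
            # S603: extract target knowledge with the reinforced model (teacher).
            knowledge_seqs, knowledge_predict = extract_target_knowledge(
                reinforced_model, batch_inputs)
            # S604: run the distillation model (student) and build both labels.
            small_seqs, small_logits = distillation_model(batch_inputs)
            small_predict = torch.softmax(small_logits, dim=-1)
            loss_soft = soft_supervision_label(
                knowledge_seqs, small_seqs, knowledge_predict, small_predict, weights)
            loss_hart = hard_supervision_label(small_predict, y)
            # S605: combine the soft and hard supervision labels into the target label.
            loss = target_label(loss_soft, loss_hart, alpha)
            # S606: one iterative update of the distillation model parameters (BP).
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return distillation_model                             # candidate target learning model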
According to the technical scheme of the embodiment, a distillation model is constructed and target knowledge is extracted according to a reinforced model obtained by carrying out precipitation training on a bottom layer network of a pre-training model; and determining a soft supervision label and a hard supervision label according to the processing result of the distillation model on the task training data and the extracted target knowledge, and then determining the target label based on the soft supervision label and the hard supervision label to iteratively update the parameters of the distillation model to obtain a target learning model. In the embodiment, the distillation model is trained by combining the soft supervision label and the hard supervision label, so that the generalization capability of the distillation model is improved while the trained distillation model approaches the prediction effect of the pre-training model. Therefore, the real-time response requirement of the man-machine interaction equipment is better met.
FIG. 7 is a flow chart of a knowledge-based distillation model training method provided in accordance with an embodiment of the present application. The present embodiment provides a preferred example based on the above embodiments, and specifically, as shown in fig. 7, the method includes:
S701, obtaining a pre-training model.
Optionally, the pre-training model obtained in this step is a model that has been trained based on a large number of training samples, and the pre-training model can better complete an online prediction task.
And S702, performing field training on the pre-training model according to the training field data set, and updating the pre-training model.
And S703, successively performing precipitation training on the underlying network of the pre-training model according to the training task data set.
The training objects of each precipitation training at least comprise the underlying network and the prediction layer network, together with a successively decreasing number of middle-high layer networks; the pre-training model comprises, from bottom to top, the underlying network, at least one middle-high layer network and the prediction layer network.
And S704, testing the pre-training model after the precipitation training according to the test task data set.
And S705, judging whether the test result meets the precipitation finishing condition, if so, executing S706, otherwise, returning to execute S702.
Optionally, if the test result meets the precipitation ending condition, it indicates that the precipitation training has achieved the expected effect, and S706 may be executed to take the precipitation-trained pre-training model as the reinforced model; otherwise, it indicates that the precipitation training is insufficient, and it is necessary to return to S702 to update and adjust the parameters of the pre-training model based on the training field data set.
And S706, if the test result meets the precipitation finishing condition, taking the pre-training model after the precipitation training as a reinforced model.
And S707, taking at least two networks in the reinforced model as target networks, and constructing a distillation model according to the target networks.
And S708, extracting target knowledge of the training task data set through a target network of the reinforced model.
S709, training the distillation model according to the target knowledge and the training task data set.
And S710, testing the trained distillation model according to the test task data set.
And S711, judging whether the test result meets the training end condition, if so, executing S712, and if not, returning to executing S709.
And S712, if the test result meets the training end condition, taking the trained distillation model as a target learning model.
The technical scheme of the embodiment of the application provides a concrete implementation scheme for distilling the target learning model from within the pre-training model based on the knowledge distillation technology. The target learning model distilled by this scheme simplifies the network structure and improves the generalization capability of the model while retaining the accurate prediction capability of the pre-training model. Deploying the target learning model in the human-computer interaction equipment enables tasks to be executed quickly and accurately, meeting the real-time response requirement of the human-computer interaction equipment.
Fig. 8 is a flowchart of an intention identification method provided according to an embodiment of the present application. The present embodiment is applied to the case of performing intention recognition based on the target learning model trained in the above embodiments. The embodiment may be performed by an intention recognition apparatus configured in the electronic device, which may be implemented in software and/or hardware. Optionally, the electronic device may be a human-computer interaction device or a server that performs communication interaction with the human-computer interaction device. The human-computer interaction device can be an intelligent robot, an intelligent sound box, an intelligent mobile phone and the like. As shown in fig. 8, the method includes:
S801, acquiring user voice data collected by the human-computer interaction equipment.
Optionally, the human-computer interaction device in the embodiment of the present application may collect user voice data in the environment in real time through a voice collection device (e.g., a microphone) configured inside it. If the execution subject of this embodiment is the human-computer interaction device, the following operation of S802 may be performed directly after the human-computer interaction device collects the user voice data. If the execution subject of this embodiment is a server interacting with the human-computer interaction device, after the human-computer interaction device collects the user voice data, it transmits the user voice data to the server, and the server acquires the user voice data and then performs the following operation of S802.
S802, inputting the user voice data into the target learning model to obtain the user intention recognition result output by the target learning model.
The target learning model in this embodiment is determined based on training of the knowledge distillation-based model training method described in any one of the above embodiments. And the target learning model of the present embodiment is a model for performing intention recognition.
Optionally, after the human-computer interaction device, or the server in communication interaction with it, acquires the user voice data, the acquired user voice data is input into the target learning model. The target learning model then performs online analysis and prediction on the input user voice data by using the algorithm learned during training and outputs a user intention recognition result, which the human-computer interaction device or the server acquires.
And S803, determining a response result of the human-computer interaction device according to the user intention recognition result.
Optionally, the human-computer interaction device or a server in communication interaction with the human-computer interaction device may determine a target human-computer interaction response rule corresponding to the user intention recognition result based on the obtained user intention recognition result, determine the response result based on the target human-computer interaction response rule, and feed the response result back to the user, so as to implement human-computer interaction based on the user voice data.
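As a rough sketch of S801-S803 (not a prescribed implementation), the inference path on the human-computer interaction device or server could look like this; the asr_frontend preprocessing step and the response_rules mapping from intention ids to human-computer interaction response rules are hypothetical components, since the embodiment only specifies that the user voice data is fed to the target learning model and that the recognition result drives the response:

import torch


@torch.no_grad()
def respond_to_user(target_learning_model, asr_frontend, response_rules, user_voice_data):
    # Assumes a single utterance per call and a model that outputs intention scores.
    model_input = asr_frontend(user_voice_data)            # S801: collected user voice data, preprocessed
    intent_scores = target_learning_model(model_input)     # S802: user intention recognition result
    intent_id = int(torch.argmax(intent_scores, dim=-1))
    return response_rules[intent_id]                       # S803: response result fed back to the user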
According to the technical scheme of the embodiment of the application, the target learning model for intention recognition trained by the knowledge distillation-based model training method based on any embodiment is deployed to the human-computer interaction device or the server end in communication interaction with the human-computer interaction device, the human-computer interaction device or the server end in communication interaction with the human-computer interaction device can acquire user voice data and input the user voice data into the target learning model, and the response result is determined based on the user intention recognition result output by the target learning model. The target learning model deployed in the human-computer interaction equipment or the server side in communication interaction with the human-computer interaction equipment is obtained through training in a knowledge distillation mode, the network structure is simpler than a pre-training model, the prediction effect can approach to the complex pre-training model, and rapid and accurate intention recognition can be achieved to meet the requirement of real-time response of the human-computer interaction equipment.
Fig. 9 is a schematic structural diagram of a model training apparatus based on knowledge distillation according to an embodiment of the present application; the embodiment is suitable for the case where a pre-training model with a complex network structure is compressed and trained into a target learning model with a simple network structure based on the knowledge distillation technology. The apparatus can implement the knowledge-distillation-based model training method according to any embodiment of the present application, and the apparatus 900 specifically includes the following:
the precipitation training module 901 is used for carrying out at least two times of precipitation training on the underlying network of the pre-training model according to the training task data set to obtain a reinforced model of the pre-training model; the training objects of each precipitation training at least comprise the underlying network and the prediction layer network, together with a successively decreasing number of middle-high layer networks, and the pre-training model comprises, from bottom to top, the underlying network, at least one middle-high layer network and the prediction layer network;
a distillation model construction module 902, configured to use at least two networks in the reinforced model as target networks, and construct a distillation model according to the target networks, where the target networks include a feature recognition network and the prediction layer network; the feature recognition network comprises at least the underlying network;
a target knowledge extraction module 903, configured to extract target knowledge of the training task data set through a target network of the reinforced model;
and a distillation model training module 904, configured to train the distillation model according to the target knowledge and the training task data set, so as to obtain a target learning model.
Further, the underlying network and the middle and high network are used for feature recognition; the prediction layer network is configured to perform task prediction based on the identified characteristics.
Further, the precipitation training module 901 includes:
a data subset dividing unit, configured to divide the training task data set to determine multiple training data subsets;
the training object determining unit is used for determining training objects corresponding to each training data subset according to the set precipitation training times; the training objects corresponding to the training data subsets comprise a bottom layer network, a middle-high layer network and a prediction layer network of the pre-training model, and the number of layers of the middle-high layer network is in inverse proportion to the sequence of the precipitation training;
the precipitation training unit is used for carrying out precipitation training on the training object corresponding to the training data subset in the pre-training model according to each training data subset;
and the division number of the training data subsets is less than or equal to the total number of layers of the pre-training model.
Further, the middle-high layer networks included in each training object are network layers which are adjacent to the underlying network and continuous upwards; and as the number of precipitation trainings increases, the number of layers of the middle-high layer networks included in the training object decreases to zero.
Further, the precipitation training module 901 is specifically configured to:
successively carrying out precipitation training on the underlying network of the pre-training model according to the training task data set;
testing the pre-training model after the precipitation training according to the test task data set;
and if the test result meets the precipitation finishing condition, taking the pre-training model after the precipitation training as a reinforced model.
Further, the apparatus further comprises:
and the field training module is used for carrying out field training on the pre-training model according to the training field data set and updating the pre-training model before the at least two times of precipitation training are carried out on the underlying network of the pre-training model according to the training task data set.
Further, the distillation model construction module 902 is specifically configured to:
taking at least two networks in the reinforced model as target networks, and acquiring network structure blocks of the target networks;
and constructing a distillation model with the same structure as the reinforced model according to the obtained network structure block.
Further, the distillation model construction module 902 is further specifically configured to:
taking at least two networks in the reinforced model as target networks;
and selecting a neural network model with a structure different from that of the reinforced model as a distillation model according to the target network, wherein an output layer network of the neural network model is consistent with the type of a prediction layer network in the target network, and a non-output layer network of the neural network model is consistent with the type of a feature recognition network in the target network.
Further, the target knowledge extraction module 903 is specifically configured to:
taking the training task data set as the input of the reinforced model, and acquiring a first data characteristic representation output by a characteristic recognition network of the reinforced model and a first prediction probability representation output by a prediction layer network of the reinforced model;
and taking the obtained first data feature representation and the first prediction probability representation as target knowledge of the training task data set.
Further, the distillation model training module 904 comprises:
the supervision label determining unit is used for inputting the training task data set into the distillation model and determining a soft supervision label and a hard supervision label according to the processing result of the distillation model on the training task data set and the target knowledge;
the target label determining unit is used for determining a target label according to the soft supervision label and the hard supervision label;
and the model parameter updating unit is used for iteratively updating the parameters of the distillation model according to the target label.
Further, the supervision tag determination unit specifically includes:
the output acquisition subunit is used for inputting the training task data set into the distillation model to obtain a second data characteristic representation output by a characteristic recognition network of the distillation model and a second prediction probability representation output by a prediction layer network of the distillation model;
a soft label determination subunit, configured to determine a soft supervised label based on the target knowledge, the second data feature representation and the second predictive probability representation;
and the hard label determining subunit is used for determining a hard supervision label according to the second prediction probability representation and the training task data set information.
Further, the training task data set information includes: the number of training samples, the number of training labels, and the actual label value in the training task dataset.
Further, the soft label determination subunit is specifically configured to:
taking the mean squared error of the first data feature representation and the second data feature representation in the target knowledge as a data feature label;
taking the mean squared error of the first prediction probability representation and the second prediction probability representation in the target knowledge as a probability prediction label;
and performing label fusion on the data feature label and the probability prediction label according to the weight value of the feature recognition network of the reinforced model to obtain a soft supervision label.
Further, the distillation model training module 904 is specifically configured to:
training the distillation model according to the target knowledge and the training task data set;
testing the trained distillation model according to the test task data set;
and if the test result meets the training end condition, taking the trained distillation model as a target learning model.
Further, the pre-training model is a bert model.
Further, the pre-training model and the target learning model are models for performing intention recognition;
correspondingly, the device further comprises:
and the model deployment module is used for deploying the target learning model into the human-computer interaction equipment so as to identify the intention of the user voice data acquired by the human-computer interaction equipment in real time.
According to the technical scheme of the embodiment, according to a training task data set, a base network, a prediction layer network and a gradually decreased middle-high layer network are used as training objects, and at least two times of precipitation training are carried out on the base network of a pre-training model to obtain a reinforced model; and constructing a distillation model according to the target network determined from the reinforced model. Extracting target knowledge of the training task data set through a target network of the reinforced model; and training the distillation model based on the extracted target knowledge and the training task data set to obtain a target learning model. In the embodiment, the mode of gradually decreasing the middle-high network is adopted to perform multiple sedimentation training on the underlying network of the pre-training model, so that the parameters of the underlying network of the pre-training model are more accurate. And subsequently, a distillation model is constructed at least according to the settled accurate underlying network and the prediction layer network, and distillation training is carried out on the distillation model based on the extracted target knowledge, so that the target learning model distilled from the pre-training model simplifies the network structure, simultaneously retains the prediction accuracy of the pre-training model, and further realizes the improvement of the generalization capability of the model. The whole distillation process is not influenced by human factors, and the target learning model is deployed in the human-computer interaction equipment, so that the task can be quickly and accurately executed, and the real-time response requirement of the human-computer interaction equipment is met.
Fig. 10 is a schematic structural diagram of an intention recognition device according to an embodiment of the present application, and this embodiment is applicable to a case where intention recognition is performed based on a target learning model trained in the foregoing embodiments. The device can implement the intention identification method described in any embodiment of the present application, and the device 1000 specifically includes the following:
a voice data acquisition module 1001, configured to acquire user voice data acquired by a human-computer interaction device;
an intention recognition module 1002, configured to input the user voice data into a target learning model to obtain a user intention recognition result output by the target learning model; wherein the target learning model is determined based on training of the knowledge distillation-based model training method according to any one of the above embodiments;
a response result determining module 1003, configured to determine a response result of the human-computer interaction device according to the user intention recognition result.
Further, the device is configured in the human-computer interaction device or a server side for communication interaction with the human-computer interaction device.
According to the technical scheme of the embodiment of the application, the target learning model for intention recognition trained by the knowledge distillation-based model training method based on any embodiment is deployed to the human-computer interaction device or the server end in communication interaction with the human-computer interaction device, the human-computer interaction device or the server end in communication interaction with the human-computer interaction device can acquire user voice data and input the user voice data into the target learning model, and the response result is determined based on the user intention recognition result output by the target learning model. The target learning model deployed in the human-computer interaction equipment or the server side in communication interaction with the human-computer interaction equipment is obtained through training in a knowledge distillation mode, the network structure is simpler than a pre-training model, the prediction effect can approach to the complex pre-training model, and rapid and accurate intention recognition can be achieved to meet the requirement of real-time response of the human-computer interaction equipment.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 11 is a block diagram of an electronic device for a knowledge-based distillation model training method or an intention recognition method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 11, the electronic apparatus includes: one or more processors 1101, a memory 1102, and interfaces for connecting the various components, including a high speed interface and a low speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 11, a processor 1101 is taken as an example.
The memory 1102 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform a knowledge distillation based model training method or an intent recognition method as provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform a knowledge distillation based model training method or an intent recognition method provided herein.
The memory 1102, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the knowledge-based distillation model training method or the intention recognition method in the embodiments of the present application (e.g., the precipitation training module 901, the distillation model construction module 902, the target knowledge extraction module 903, and the distillation model training module 904 shown in fig. 9, or the speech data acquisition module 1001, the intention recognition module 1002, and the response result determination module 1003 shown in fig. 10). The processor 1101 executes various functional applications of the server and data processing, i.e., implementing the knowledge-distillation-based model training method or the intention recognition method in the above-described method embodiments, by executing non-transitory software programs, instructions, and modules stored in the memory 1102.
The memory 1102 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the stored data area may store data created according to use of an electronic device of a knowledge-based distillation model training method or an intention recognition method, or the like. Further, the memory 1102 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 1102 may optionally include a memory remotely located from the processor 1101, and these remote memories may be connected to the electronics of the knowledge-based distillation model training method or the intent recognition method through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the knowledge distillation-based model training method or the intention recognition method may further include: an input device 1103 and an output device 1104. The processor 1101, the memory 1102, the input device 1103 and the output device 1104 may be connected by a bus or other means, and are exemplified by being connected by a bus in fig. 11.
The input device 1103 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus based on the knowledge-based model training method or the intention recognition method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 1104 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the embodiment, according to a training task data set, a base network, a prediction layer network and a gradually decreased middle-high layer network are used as training objects, and at least two times of precipitation training are carried out on the base network of a pre-training model to obtain a reinforced model; and constructing a distillation model according to the target network determined from the reinforced model. Extracting target knowledge of the training task data set through a target network of the reinforced model; and training the distillation model based on the extracted target knowledge and the training task data set to obtain a target learning model. In the embodiment, the mode of gradually decreasing the middle-high network is adopted to perform multiple sedimentation training on the underlying network of the pre-training model, so that the parameters of the underlying network of the pre-training model are more accurate. And subsequently, a distillation model is constructed at least according to the settled accurate underlying network and the prediction layer network, and distillation training is carried out on the distillation model based on the extracted target knowledge, so that the target learning model distilled from the pre-training model simplifies the network structure, simultaneously retains the prediction accuracy of the pre-training model, and further realizes the improvement of the generalization capability of the model. The whole distillation process is not influenced by human factors, and the target learning model is deployed in the human-computer interaction equipment, so that the task can be quickly and accurately executed, and the real-time response requirement of the human-computer interaction equipment is met.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (38)

1. A knowledge-distillation-based model training method, the method comprising:
performing precipitation training at least twice on an underlying network of a pre-training model according to a training task data set to obtain a reinforced model of the pre-training model; the training objects of each precipitation training at least comprise the underlying network and a prediction layer network and comprise successively descending intermediate-level networks, and the pre-training model comprises the underlying network, at least one intermediate-level network and the prediction layer network from bottom to top;
taking at least two networks in the reinforced model as target networks, and constructing a distillation model according to the target networks, wherein the target networks comprise a feature recognition network and the prediction layer network; the feature recognition network comprises at least the underlying network;
extracting target knowledge of the training task data set through a target network of the reinforced model;
and training the distillation model according to the target knowledge and the training task data set to obtain a target learning model.
2. The method of claim 1, wherein the underlay network and the medium-high network are used for feature recognition; the prediction layer network is configured to perform task prediction based on the identified characteristics.
3. The method of claim 1, wherein performing precipitation training at least twice on the underlying network of the pre-training model according to the training task data set comprises:
dividing the training task data set to determine a plurality of training data subsets;
determining training objects corresponding to each training data subset according to the set precipitation training times; the training objects corresponding to the training data subsets comprise a bottom layer network, a middle-high layer network and a prediction layer network of the pre-training model, and the number of layers of the middle-high layer network is in inverse proportion to the sequence of the precipitation training;
performing one precipitation training on a training object corresponding to the training data subset in the pre-training model according to each training data subset;
and the division number of the training data subsets is less than or equal to the total number of layers of the pre-training model.
4. The method of claim 3, wherein the medium-high networks comprised in each of the training objects are upwardly continuous network layers adjacent to the underlying network; and as the number of precipitation trainings increases, the number of layers of the medium-high networks comprised in the training object decreases to zero.
5. The method of claim 1, wherein performing precipitation training at least twice on the underlying network of the pre-training model according to the training task data set to obtain the reinforced model of the pre-training model comprises:
successively carrying out precipitation training on the underlying network of the pre-training model according to the training task data set;
testing the pre-training model after the precipitation training according to the test task data set;
and if the test result meets the precipitation finishing condition, taking the pre-training model after the precipitation training as a reinforced model.
6. The method of claim 1, wherein before performing precipitation training at least twice on the underlying network of the pre-training model according to the training task data set, the method further comprises:
and performing field training on the pre-training model according to the training field data set, and updating the pre-training model.
7. The method of claim 1, wherein taking at least two networks of the augmented model as target networks and constructing a distillation model from the target networks comprises:
taking at least two networks in the reinforced model as target networks, and acquiring network structure blocks of the target networks;
and constructing a distillation model with the same structure as the reinforced model according to the obtained network structure block.
8. The method of claim 1, wherein taking at least two networks of the augmented model as target networks and constructing a distillation model from the target networks comprises:
taking at least two networks in the reinforced model as target networks;
and selecting a neural network model with a structure different from that of the reinforced model as a distillation model according to the target network, wherein an output layer network of the neural network model is consistent with the type of a prediction layer network in the target network, and a non-output layer network of the neural network model is consistent with the type of a feature recognition network in the target network.
9. The method of claim 1, wherein extracting target knowledge of the training task data set through the target network of the reinforced model comprises:
taking the training task data set as the input of the reinforced model, and acquiring a first data feature representation output by the feature recognition network of the reinforced model and a first prediction probability representation output by the prediction layer network of the reinforced model;
and taking the obtained first data feature representation and the first prediction probability representation as target knowledge of the training task data set.
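Extracting the target knowledge of claim 9 amounts to running the reinforced model over the training task data and recording two outputs per batch: the feature recognition network's representation and the prediction layer's probability distribution. The networks and tensor sizes below are illustrative stand-ins, not the patent's model.

```python
import torch
import torch.nn as nn

# Illustrative target networks of a reinforced model (sizes are assumptions).
feature_net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
prediction_net = nn.Linear(64, 5)

def extract_target_knowledge(batch):
    """Return the first data feature representation and the first prediction
    probability representation for one batch of the training task data set."""
    with torch.no_grad():                                        # the teacher is only read
        features = feature_net(batch)                            # feature recognition output
        probabilities = prediction_net(features).softmax(dim=-1) # prediction layer output
    return features, probabilities

first_features, first_probabilities = extract_target_knowledge(torch.randn(8, 32))
print(first_features.shape, first_probabilities.shape)
```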
10. The method of claim 1, wherein training the distillation model based on the target knowledge and the training task data set comprises:
inputting the training task data set into the distillation model, and determining a soft supervision label and a hard supervision label according to the processing result of the distillation model on the training task data set and the target knowledge;
determining a target label according to the soft supervision label and the hard supervision label;
and iteratively updating the parameters of the distillation model according to the target label.
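One way to read claim 10 is as the standard distillation loop below: a soft term compares the distillation model's outputs with the target knowledge, a hard term uses the real labels, and their weighted sum acts as the target label driving each parameter update. The specific losses, the 50/50 weighting and the optimiser choice are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from torch import nn, optim

# Illustrative distillation model (student); sizes and weights are assumptions.
student_features = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
student_predictor = nn.Linear(64, 5)
optimizer = optim.Adam(
    list(student_features.parameters()) + list(student_predictor.parameters()), lr=1e-3)

def distillation_step(batch, labels, teacher_features, teacher_probs, soft_weight=0.5):
    """One iterative parameter update of the distillation model."""
    feats = student_features(batch)                 # second data feature representation
    logits = student_predictor(feats)               # feeds the second prediction probability
    soft_label = (F.mse_loss(feats, teacher_features)
                  + F.mse_loss(logits.softmax(dim=-1), teacher_probs))
    hard_label = F.cross_entropy(logits, labels)    # supervision from the real labels
    target_label = soft_weight * soft_label + (1 - soft_weight) * hard_label
    optimizer.zero_grad()
    target_label.backward()
    optimizer.step()
    return target_label.item()

loss = distillation_step(torch.randn(8, 32), torch.randint(0, 5, (8,)),
                         torch.randn(8, 64), torch.randn(8, 5).softmax(dim=-1))
print(loss)
```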
11. The method of claim 10, wherein inputting the training task data set into the distillation model, and determining the soft supervision label and the hard supervision label according to the processing result of the distillation model on the training task data set and the target knowledge comprises:
inputting the training task data set into the distillation model to obtain a second data feature representation output by a feature recognition network of the distillation model and a second prediction probability representation output by a prediction layer network of the distillation model;
determining a soft supervision label according to the target knowledge, the second data feature representation and the second prediction probability representation;
and determining a hard supervision label according to the second prediction probability representation and training task data set information.
12. The method of claim 11, wherein the training task data set information comprises: the number of training samples, the number of training labels, and the actual label values in the training task data set.
13. The method of claim 11, wherein determining the soft supervision label according to the target knowledge, the second data feature representation and the second prediction probability representation comprises:
taking the mean square error between the first data feature representation in the target knowledge and the second data feature representation as a data feature label;
taking the mean square error between the first prediction probability representation in the target knowledge and the second prediction probability representation as a probability prediction label;
and performing label fusion on the data feature label and the probability prediction label according to the weight value of the feature recognition network of the reinforced model to obtain a soft supervision label.
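Reading claim 13's mean square error terms literally gives the small helper below: one MSE over feature representations, one over prediction probabilities, fused with a weight standing in for the weight value of the reinforced model's feature recognition network (the 0.6 is an assumed example value).

```python
import torch
import torch.nn.functional as F

def soft_supervision_label(first_feats, second_feats, first_probs, second_probs,
                           feature_net_weight=0.6):
    """Fuse the data feature label and the probability prediction label."""
    data_feature_label = F.mse_loss(second_feats, first_feats)
    probability_prediction_label = F.mse_loss(second_probs, first_probs)
    return (feature_net_weight * data_feature_label
            + (1.0 - feature_net_weight) * probability_prediction_label)

soft = soft_supervision_label(torch.randn(8, 64), torch.randn(8, 64),
                              torch.rand(8, 5), torch.rand(8, 5))
print(float(soft))
```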
14. The method of claim 1, wherein training the distillation model based on the target knowledge and the training task data set to obtain a target learning model comprises:
training the distillation model according to the target knowledge and the training task data set;
testing the trained distillation model according to the test task data set;
and if the test result meets the training end condition, taking the trained distillation model as a target learning model.
15. The method of any one of claims 1-14, wherein the pre-training model is a BERT model.
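Since claim 15 names BERT as the pre-training model, one possible (assumed) mapping of the claim-1 terminology onto a Hugging Face BERT is sketched below; the tiny random configuration is only so the example runs without downloading pretrained weights.

```python
from transformers import BertConfig, BertModel

# A small randomly initialised BERT stands in for the pre-training model.
config = BertConfig(hidden_size=128, num_hidden_layers=4,
                    num_attention_heads=4, intermediate_size=256)
bert = BertModel(config)

# Assumed mapping of the claim terminology onto the BERT stack:
underlying_network = bert.embeddings       # bottom of the model
middle_high_layers = bert.encoder.layer    # transformer blocks above it
prediction_layer = bert.pooler             # topmost layer feeding the task head

# Precipitation-style round where only the underlying network and the
# prediction layer remain trainable (middle-high layers frozen).
for block in middle_high_layers:
    for param in block.parameters():
        param.requires_grad = False
```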
16. The method of any one of claims 1-14, wherein the pre-training model and the target learning model are models for intention recognition;
correspondingly, after the distillation model is trained according to the target knowledge and the training task data set to obtain a target learning model, the method further includes:
and deploying the target learning model into a human-computer interaction device so as to perform intention recognition on user voice data acquired by the human-computer interaction device in real time.
17. An intention recognition method, the method comprising:
acquiring user voice data collected by a human-computer interaction device;
inputting the user voice data into a target learning model to obtain a user intention recognition result output by the target learning model; wherein the target learning model is obtained by training according to the knowledge distillation-based model training method of any one of claims 1-16;
and determining a response result of the human-computer interaction equipment according to the user intention identification result.
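The runtime flow of claim 17 (voice data in, intent out, response decided from the intent) can be sketched as follows. The acoustic feature size, the intent labels and the response table are all illustrative assumptions; in practice the target learning model would come from the training method above.

```python
import torch
from torch import nn

# Toy target learning model over 40-dimensional acoustic features (assumption).
target_learning_model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 3))
INTENTS = ["play_music", "set_alarm", "ask_weather"]
RESPONSES = {"play_music": "Playing your playlist.",
             "set_alarm": "Alarm set.",
             "ask_weather": "Here is today's forecast."}

def respond(voice_features):
    """User voice data -> target learning model -> intent -> device response."""
    with torch.no_grad():
        logits = target_learning_model(voice_features)
    intent = INTENTS[int(logits.argmax(dim=-1))]
    return intent, RESPONSES[intent]

print(respond(torch.randn(1, 40)))
```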
18. The method of claim 17, wherein an execution subject of the method is the human-computer interaction device or a server side in communication interaction with the human-computer interaction device.
19. A knowledge distillation-based model training apparatus, the apparatus comprising:
the precipitation training module is used for carrying out at least two rounds of precipitation training on the underlying network of the pre-training model according to the training task data set to obtain a reinforced model of the pre-training model; the training object of each round of precipitation training at least comprises the underlying network and a prediction layer network, and further comprises a middle-high layer network whose number of layers successively decreases; and the pre-training model comprises the underlying network, at least one middle-high layer network and the prediction layer network from bottom to top;
the distillation model construction module is used for taking at least two networks in the reinforced model as target networks and constructing a distillation model according to the target networks, wherein the target networks comprise a feature recognition network and the prediction layer network; the feature recognition network at least comprises the underlying network;
the target knowledge extraction module is used for extracting target knowledge of the training task data set through a target network of the reinforced model;
and the distillation model training module is used for training the distillation model according to the target knowledge and the training task data set to obtain a target learning model.
20. The apparatus of claim 19, wherein the underlying network and the middle-high layer network are used for feature recognition; the prediction layer network is configured to perform task prediction based on the recognized features.
21. The apparatus of claim 19, wherein the precipitation training module comprises:
a data subset dividing unit, configured to divide the training task data set to determine multiple training data subsets;
the training object determining unit is used for determining a training object corresponding to each training data subset according to the set number of precipitation training rounds; the training object corresponding to each training data subset comprises the underlying network, a middle-high layer network and the prediction layer network of the pre-training model, and the number of layers of the middle-high layer network decreases as the precipitation training proceeds;
the precipitation training unit is used for performing, according to each training data subset, one round of precipitation training on the training object corresponding to that training data subset in the pre-training model;
and the number of training data subsets obtained by the division is less than or equal to the total number of layers of the pre-training model.
22. The apparatus of claim 21, wherein the middle-high layer network included in each training object consists of network layers that are continuous upward from, and adjacent to, the underlying network; and as the number of precipitation training rounds increases, the number of layers of the middle-high layer network included in the training object decreases, down to zero.
23. The apparatus of claim 19, wherein the precipitation training module is specifically configured to:
successively carrying out precipitation training on the underlying network of the pre-training model according to the training task data set;
testing the pre-training model after the precipitation training according to the test task data set;
and if the test result meets the precipitation end condition, taking the pre-training model after the precipitation training as the reinforced model.
24. The apparatus of claim 19, further comprising:
and the domain training module is used for performing domain training on the pre-training model according to a training domain data set and updating the pre-training model before the at least two rounds of precipitation training are performed on the underlying network of the pre-training model according to the training task data set.
25. The apparatus of claim 19, wherein the distillation model construction module is specifically configured to:
taking at least two networks in the reinforced model as target networks, and acquiring network structure blocks of the target networks;
and constructing a distillation model with the same structure as the reinforced model according to the obtained network structure block.
26. The apparatus of claim 19, wherein the distillation model construction module is further specifically configured to:
taking at least two networks in the reinforced model as target networks;
and selecting a neural network model with a structure different from that of the reinforced model as a distillation model according to the target network, wherein an output layer network of the neural network model is consistent with the type of a prediction layer network in the target network, and a non-output layer network of the neural network model is consistent with the type of a feature recognition network in the target network.
27. The apparatus of claim 19, wherein the target knowledge extraction module is specifically configured to:
taking the training task data set as the input of the reinforced model, and acquiring a first data feature representation output by the feature recognition network of the reinforced model and a first prediction probability representation output by the prediction layer network of the reinforced model;
and taking the obtained first data feature representation and the first prediction probability representation as target knowledge of the training task data set.
28. The apparatus of claim 19, wherein the distillation model training module comprises:
the supervision label determining unit is used for inputting the training task data set into the distillation model and determining a soft supervision label and a hard supervision label according to the processing result of the distillation model on the training task data set and the target knowledge;
the target label determining unit is used for determining a target label according to the soft supervision label and the hard supervision label;
and the model parameter updating unit is used for iteratively updating the parameters of the distillation model according to the target label.
29. The apparatus of claim 28, wherein the supervision tag determination unit specifically comprises:
the output acquisition subunit is used for inputting the training task data set into the distillation model to obtain a second data feature representation output by the feature recognition network of the distillation model and a second prediction probability representation output by the prediction layer network of the distillation model;
the soft label determination subunit is used for determining a soft supervision label according to the target knowledge, the second data feature representation and the second prediction probability representation;
and the hard label determination subunit is used for determining a hard supervision label according to the second prediction probability representation and training task data set information.
30. The apparatus of claim 29, wherein the training task data set information comprises: the number of training samples, the number of training labels, and the actual label values in the training task data set.
31. The apparatus of claim 29, wherein the soft tag determination subunit is specifically configured to:
taking the mean square error between the first data feature representation in the target knowledge and the second data feature representation as a data feature label;
taking the mean square error between the first prediction probability representation in the target knowledge and the second prediction probability representation as a probability prediction label;
and performing label fusion on the data feature label and the probability prediction label according to the weight value of the feature recognition network of the reinforced model to obtain a soft supervision label.
32. The apparatus of claim 19, wherein the distillation model training module is further configured to:
training the distillation model according to the target knowledge and the training task data set;
testing the trained distillation model according to the test task data set;
and if the test result meets the training end condition, taking the trained distillation model as a target learning model.
33. The apparatus of any one of claims 19-32, wherein the pre-training model is a BERT model.
34. The apparatus of any one of claims 19-32, wherein the pre-training model and the target learning model are models for intention recognition;
correspondingly, the apparatus further comprises:
and the model deployment module is used for deploying the target learning model into a human-computer interaction device so as to perform intention recognition, in real time, on user voice data acquired by the human-computer interaction device.
35. An intention recognition device, the device comprising:
the voice data acquisition module is used for acquiring user voice data collected by the human-computer interaction device;
the intention recognition module is used for inputting the user voice data into a target learning model to obtain a user intention recognition result output by the target learning model; wherein the target learning model is obtained by training according to the knowledge distillation-based model training method of any one of claims 1-16;
and the response result determining module is used for determining a response result of the human-computer interaction device according to the user intention recognition result.
36. The apparatus of claim 35, wherein the apparatus is configured in the human-computer interaction device or a server side for communication interaction with the human-computer interaction device.
37. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the knowledge distillation-based model training method of any one of claims 1-16 or the intention recognition method of any one of claims 17-18.
38. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the knowledge distillation-based model training method of any one of claims 1-16 or the intention recognition method of any one of claims 17-18.
CN202010444204.XA 2020-05-22 2020-05-22 Model training and intention recognition method, device, equipment and storage medium Active CN111640425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010444204.XA CN111640425B (en) 2020-05-22 2020-05-22 Model training and intention recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010444204.XA CN111640425B (en) 2020-05-22 2020-05-22 Model training and intention recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111640425A true CN111640425A (en) 2020-09-08
CN111640425B CN111640425B (en) 2023-08-15

Family

ID=72333280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010444204.XA Active CN111640425B (en) 2020-05-22 2020-05-22 Model training and intention recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111640425B (en)


Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160350834A1 (en) * 2015-06-01 2016-12-01 Nara Logics, Inc. Systems and methods for constructing and applying synaptic networks
CN107247989A (en) * 2017-06-15 2017-10-13 北京图森未来科技有限公司 A kind of neural network training method and device
CN110832596A (en) * 2017-10-16 2020-02-21 因美纳有限公司 Deep convolutional neural network training method based on deep learning
WO2019143946A1 (en) * 2018-01-19 2019-07-25 Visa International Service Association System, method, and computer program product for compressing neural network models
CN110084368A (en) * 2018-04-20 2019-08-02 谷歌有限责任公司 System and method for regularization neural network
CN110837761A (en) * 2018-08-17 2020-02-25 北京市商汤科技开发有限公司 Multi-model knowledge distillation method and device, electronic equipment and storage medium
CN109543817A (en) * 2018-10-19 2019-03-29 北京陌上花科技有限公司 Model distillating method and device for convolutional neural networks
EP3648014A1 (en) * 2018-10-29 2020-05-06 Fujitsu Limited Model training method, data identification method and data identification device
CN109637546A (en) * 2018-12-29 2019-04-16 苏州思必驰信息科技有限公司 Knowledge distillating method and device
CN110162018A (en) * 2019-05-31 2019-08-23 天津开发区精诺瀚海数据科技有限公司 The increment type equipment fault diagnosis method that knowledge based distillation is shared with hidden layer
CN110807515A (en) * 2019-10-30 2020-02-18 北京百度网讯科技有限公司 Model generation method and device
CN110909775A (en) * 2019-11-08 2020-03-24 支付宝(杭州)信息技术有限公司 Data processing method and device and electronic equipment
CN111062495A (en) * 2019-11-28 2020-04-24 深圳市华尊科技股份有限公司 Machine learning method and related device
CN111079938A (en) * 2019-11-28 2020-04-28 百度在线网络技术(北京)有限公司 Question-answer reading understanding model obtaining method and device, electronic equipment and storage medium
CN111062951A (en) * 2019-12-11 2020-04-24 华中科技大学 Knowledge distillation method based on semantic segmentation intra-class feature difference

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YUUKI TACHIOKA: "Knowledge Distillation Using Soft and Hard Labels and Annealing for Acoustic Model Training" *
YU SHENGLONG: "Research and Implementation of a Dialect Speech System Based on Segment Information" *
MA ZHINAN: "Research on Computation Optimization Techniques Based on Deep Learning" *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022121515A1 (en) * 2020-12-11 2022-06-16 International Business Machines Corporation Mixup data augmentation for knowledge distillation framework
GB2617035A (en) * 2020-12-11 2023-09-27 Ibm Mixup data augmentation for knowledge distillation framework
US20220277143A1 (en) * 2021-02-27 2022-09-01 Walmart Apollo, Llc Methods and apparatus for natural language understanding in conversational systems using machine learning processes
US11960842B2 (en) 2021-02-27 2024-04-16 Walmart Apollo, Llc Methods and apparatus for natural language understanding in conversational systems using machine learning processes
CN113160801A (en) * 2021-03-10 2021-07-23 云从科技集团股份有限公司 Speech recognition method, apparatus and computer readable storage medium
CN113160801B (en) * 2021-03-10 2024-04-12 云从科技集团股份有限公司 Speech recognition method, device and computer readable storage medium
CN113157183A (en) * 2021-04-15 2021-07-23 成都新希望金融信息有限公司 Deep learning model construction method and device, electronic equipment and storage medium
CN113157183B (en) * 2021-04-15 2022-12-16 成都新希望金融信息有限公司 Deep learning model construction method and device, electronic equipment and storage medium
CN113204614A (en) * 2021-04-29 2021-08-03 北京百度网讯科技有限公司 Model training method, method and device for optimizing training data set
CN113204614B (en) * 2021-04-29 2023-10-17 北京百度网讯科技有限公司 Model training method, method for optimizing training data set and device thereof
CN113239272A (en) * 2021-05-12 2021-08-10 烽火通信科技股份有限公司 Intention prediction method and intention prediction device of network management and control system

Also Published As

Publication number Publication date
CN111640425B (en) 2023-08-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant