CN114118061A - Lightweight intention recognition model training method, device, equipment and storage medium


Info

Publication number: CN114118061A
Authority: CN (China)
Prior art keywords: session, model, student, vector, training
Legal status: Pending (the listed status is an assumption, not a legal conclusion)
Application number: CN202111449931.6A
Original language: Chinese (zh)
Inventors: 蒋志燕, 曾航, 程刚, 廖晨
Current Assignee: Shenzhen Raisound Technology Co ltd
Original Assignee: Shenzhen Raisound Technology Co ltd
Application filed by Shenzhen Raisound Technology Co ltd
Priority to CN202111449931.6A
Publication of CN114118061A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses a lightweight intention recognition model training method, which includes the following steps: obtaining an original session training set; performing session role characterization and session environment characterization on the original session training set with a pre-constructed teacher model to obtain a standard session training set; constructing an interactive distillation network from the teacher model and a pre-constructed lightweight neural network; and interactively training the interactive distillation network on the standard session training set to obtain a standard student model. The invention also provides a lightweight intention recognition model training device, an electronic device, and a computer-readable storage medium. The invention can solve the problem that speech intention recognition by speech recognition models on mobile smart devices is inaccurate.

Description

Lightweight intention recognition model training method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a lightweight intention recognition model training method and device, electronic equipment and a computer readable storage medium.
Background
With the rapid development of artificial intelligence, intelligent speech recognition technology is widely applied, and model training for mobile smart devices such as smartphones and wristbands is increasingly important. However, in speech recognition scenarios, mobile smart devices are constrained by size, memory, and the like, and cannot deploy speech recognition models with complex processing logic.
In the prior art, models are deployed on mobile smart devices in the following ways: 1. model compression through pruning and quantizing the speech recognition model; however, pruning and quantization reduce both the parameter count and the computation of the model, which lowers speech recognition accuracy; 2. knowledge distillation, in which student models are trained on labels output by a teacher model so that the student's performance approaches the teacher's; however, the interaction between the student and teacher models is low, and because manual labeling is inefficient, large amounts of labeled data are lacking, so the trained model's recognition accuracy is low. Therefore, the speech intention recognition accuracy of speech recognition models on existing mobile smart devices needs improvement.
Disclosure of Invention
The application provides a lightweight intention recognition model training method and device, electronic equipment and a storage medium, and aims to solve the problem that speech intention recognition of a speech recognition model in mobile intelligent equipment is inaccurate.
In a first aspect, the present application provides a lightweight intent recognition model training method, including:
acquiring an original session training set, and performing session role representation and session environment representation on the original session training set by using a pre-constructed teacher model to obtain a standard session training set;
constructing an interactive distillation network by using the teacher model and a pre-constructed lightweight neural network;
and performing interactive training on the interactive distillation network by using the standard session training set to obtain a standard student model.
In detail, performing session role representation and session environment representation on the original session training set by using the pre-constructed teacher model to obtain the standard session training set includes:
performing text conversion on the conversation voice in the original conversation training set by using a voice recognition layer in the teacher model to obtain a conversation text;
vector conversion is carried out on the session text by utilizing a vector conversion layer in the teacher model, and role representation is carried out on the converted vector to obtain a session semantic representation vector;
performing semantic environment representation on the session semantic representation vector by using a vector representation layer in the teacher model to obtain a session environment representation vector;
and outputting an initial intention recognition result in the session environment characterization vector by using an intention recognition layer in the teacher model, and adding the initial intention recognition result as a real label to the original session training set to obtain the standard session training set.
In detail, performing vector conversion on the session text by using the vector conversion layer in the teacher model and performing role representation on the converted vectors to obtain the session semantic representation vector includes:
vector conversion is carried out on all sentences in the conversation text by utilizing the vector conversion layer to obtain sentence representation vectors;
constructing a role representation vector according to the speaker of the statement representation vector;
and splicing the statement characterization vector and the role characterization vector to obtain the session semantic characterization vector.
In detail, the outputting, by an intention recognition layer in the teacher model, an initial intention recognition result in the session environment characterization vector includes:
obtaining the contribution degree of the session environment characterization vector by utilizing an attention layer in the intention identification layer;
accumulating the session environment characterization vector and the contribution degree to obtain a session characterization vector;
and outputting an initial intention identification result of the session characterization vector by utilizing a classification function in the intention identification layer.
In detail, the method for constructing the interactive distillation network by using the teacher model and the pre-constructed lightweight neural network comprises the following steps:
taking each layer of network in the teacher model as a teacher module, and taking each layer of network in the pre-constructed lightweight neural network as a student module;
and matching the teacher module with the corresponding student module, and connecting the successfully matched teacher model with the lightweight neural network in parallel to obtain the interactive distillation network.
In detail, the interactive training of the interactive distillation network by using the standard session training set to obtain a standard student model includes:
selecting one student module from the interactive distillation network, and replacing the corresponding teacher module with the selected student module to obtain a mixed network;
training the mixed network by using the standard session training set, and adjusting parameters of student modules in the mixed network until the parameters are converged to obtain trained student modules;
returning to the step of selecting one student module from the interactive distillation network and replacing the corresponding teacher module with the selected student module to obtain a mixed network until all trained student modules are obtained;
and performing iterative training on all the trained student modules to obtain the standard student model.
In detail, the iteratively training all the trained student modules to obtain the standard student model includes:
splicing all the trained student modules according to the sequence of each layer of network in the pre-constructed lightweight neural network to obtain an original student model;
outputting a prediction label of the session in the standard session training set by using the original student model, and calculating a loss value according to the prediction label and the real label;
and when the loss value is greater than or equal to a preset loss threshold value, adjusting parameters of a module in the original student model, returning to the step of outputting the prediction labels of the sessions in the standard session training set by using the original student model, and stopping training until the loss value is less than the loss threshold value to obtain the standard student model.
In a second aspect, the present application provides a lightweight intent recognition model training apparatus, the apparatus comprising:
the session training set construction module is used for acquiring an original session training set, and performing session role representation and session environment representation on the original session training set by using a pre-constructed teacher model to obtain a standard session training set;
the interactive distillation network construction module is used for constructing an interactive distillation network by utilizing the teacher model and a pre-constructed lightweight neural network;
and the interactive training module is used for performing interactive training on the interactive distillation network by using the standard session training set to obtain a standard student model.
In a third aspect, an intention recognition device is provided, which comprises a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the steps of the lightweight intent recognition model training method according to any embodiment of the first aspect when executing a program stored in a memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the lightweight intent recognition model training method according to any of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
according to the invention, the pre-constructed teacher model is used for performing session role representation and session environment representation on the original session training set, and because the semantics represented by different role speeches are considered and the contribution of sentences to intention judgment in the session context environment is different, the labels in the standard session training set are more accurate, so that a large amount of accurate training data can be obtained, and the accuracy of model training is improved. Meanwhile, interactive training is carried out by constructing an interactive distillation network, the interactivity of knowledge distillation is improved, the standard student model cannot be compressed, the parameter quantity of the model cannot be reduced, and the accuracy of model intention identification is further improved. Therefore, the lightweight intention recognition model training method, the lightweight intention recognition model training device, the electronic equipment and the computer readable storage medium can solve the problem that the voice intention recognition of the voice recognition model in the mobile intelligent equipment is inaccurate.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flowchart of a lightweight intent recognition model training method according to an embodiment of the present disclosure;
FIG. 2 is a block diagram of a lightweight intent recognition model training apparatus according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device of a lightweight intent recognition model training method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow chart of a lightweight intent recognition model training method according to an embodiment of the present disclosure. In this embodiment, the lightweight intent recognition model training method includes:
and S1, acquiring an original session training set, and performing session role representation and session environment representation on the original session training set by using a pre-constructed teacher model to obtain a standard session training set.
In the embodiment of the invention, the original session training set can come from different fields and may consist of voice conversations between an intelligent customer service agent and a user. For example, in the banking field, the original session training set may be the conversation speech of loan collection sessions.
In an optional embodiment of the present invention, the pre-constructed teacher model includes a speech recognition layer, a vector conversion layer, a vector characterization layer, and an intention recognition layer. The speech recognition layer can be a traditional acoustic model plus a language model, such as an HMM, or an end-to-end structure, such as an Encoder-Decoder model; the vector conversion layer can be a BERT model or the like; the vector characterization layer can be a Bi-LSTM model or the like; the intention recognition layer can be an attention layer plus a softmax classification function.
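A rough sketch of how the four teacher layers chain together; the function names, signatures, and trivial stand-in callables below are invented placeholders, not the actual models:

```python
# Invented placeholder pipeline showing the teacher model's layer order:
# speech recognition -> vector conversion -> vector characterization ->
# intention recognition. Each stage is a stand-in callable.
def make_teacher(asr, to_vectors, characterize, recognize_intent):
    def teacher(session_audio):
        text = asr(session_audio)          # speech recognition layer
        vecs = to_vectors(text)            # vector conversion (e.g. BERT)
        ctx = characterize(vecs)           # vector characterization (Bi-LSTM)
        return recognize_intent(ctx)       # attention + softmax
    return teacher

# trivial stand-ins just to exercise the wiring
teacher = make_teacher(
    asr=lambda audio: audio.upper(),
    to_vectors=lambda text: [len(w) for w in text.split()],
    characterize=lambda vecs: sum(vecs),
    recognize_intent=lambda ctx: "repay" if ctx % 2 == 0 else "no_repay",
)
result = teacher("hello world")
```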
In detail, performing session role representation and session environment representation on the original session training set by using the pre-constructed teacher model to obtain the standard session training set includes:
performing text conversion on the conversation voice in the original conversation training set by using a voice recognition layer in the teacher model to obtain a conversation text;
vector conversion is carried out on the session text by utilizing a vector conversion layer in the teacher model, and role representation is carried out on the converted vector to obtain a session semantic representation vector;
performing semantic environment representation on the session semantic representation vector by using a vector representation layer in the teacher model to obtain a session environment representation vector;
and outputting an initial intention recognition result in the session environment characterization vector by using an intention recognition layer in the teacher model, and adding the initial intention recognition result as a real label to the original session training set to obtain the standard session training set.
For example, in an optional embodiment of the present invention: the speech recognition layer performs speech recognition on one session's speech to obtain the session text; all sentences of the session text are input into a BERT model (the vector conversion layer), and the [CLS] vector output by BERT is taken as the sentence characterization vector sentence-vector_i of each sentence; role-embedding characterization vectors are constructed for the different session roles (the same sentence spoken by the user carries different semantics than when spoken by the customer service agent, so different roles need different embedding vectors). Then sentence-vector_i and role-embedding_i are spliced to obtain the session semantic characterization vector rs-vector_i of the sentence as spoken by that role. Next, a Bi-LSTM (the vector characterization layer) learns the temporal semantic information among the vectors rs-vector_i to obtain the session environment characterization vectors cs-vector_i in the session context. Because each utterance by the user or the customer service agent contributes differently to the session intention, the embodiment passes the vectors cs-vector_i through an attention layer to learn the contribution degree e_i of each utterance, and accumulates e_i with cs-vector_i to form the session characterization vector session-vector of the whole session text. Finally, a softmax classification function (the intention recognition layer) outputs the initial intention recognition result of the whole session speech, which is added to the session text as a label to form the standard session training set.
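The splicing of sentence vectors and role embeddings can be sketched as follows; this is a toy illustration in which dimensions, values, and role names are made up (a real implementation would use the 768-dimensional BERT [CLS] output and learned role embeddings):

```python
# Toy illustration of role characterization: the sentence vector (e.g.
# the BERT [CLS] output) is spliced with an embedding for the speaker's
# role, so the same sentence spoken by the user vs. the customer-service
# agent yields different session semantic vectors.
role_embedding = {
    "user": [0.3, -1.2],     # role-embedding for the user
    "agent": [-0.7, 0.5],    # role-embedding for the customer service
}

def characterize(sentence_vector, role):
    """Splice sentence-vector_i with role-embedding_i -> rs-vector_i."""
    return sentence_vector + role_embedding[role]

sent = [0.1, 0.4, -0.2, 0.9]             # stand-in for a [CLS] vector
rs_user = characterize(sent, "user")
rs_agent = characterize(sent, "agent")
assert rs_user == [0.1, 0.4, -0.2, 0.9, 0.3, -1.2]
assert rs_user != rs_agent               # same sentence, different roles
```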
In detail, performing vector conversion on the session text by using the vector conversion layer in the teacher model and performing role representation on the converted vectors to obtain the session semantic representation vector includes:
vector conversion is carried out on all sentences in the conversation text by utilizing the vector conversion layer to obtain sentence representation vectors;
constructing a role representation vector according to the speaker of the statement representation vector;
and splicing the statement characterization vector and the role characterization vector to obtain the session semantic characterization vector.
Specifically, the outputting, by an intention recognition layer in the teacher model, an initial intention recognition result in the session environment characterization vector includes:
obtaining the contribution degree of the session environment characterization vector by utilizing an attention layer in the intention identification layer;
accumulating the session environment characterization vector and the contribution degree to obtain a session characterization vector;
and outputting an initial intention identification result of the session characterization vector by utilizing a classification function in the intention identification layer.
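The attention-based contribution weighting and softmax classification above can be sketched roughly like this; toy dimensions and hand-picked weights are assumptions, and "accumulating" the contribution degree with the vectors is interpreted here as a weighted sum:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# one session-environment vector cs-vector_i per utterance (toy values)
cs_vectors = [[0.2, 0.8], [1.0, -0.3], [0.5, 0.5]]
w_attn = [0.6, -0.4]                       # attention scoring weights

scores = [dot(v, w_attn) for v in cs_vectors]
e = softmax(scores)                        # contribution degree e_i
# accumulate e_i with cs-vector_i (read here as a weighted sum) to get
# the session characterization vector of the whole conversation
session_vector = [sum(e_i * v[d] for e_i, v in zip(e, cs_vectors))
                  for d in range(2)]

w_cls = [[1.0, -1.0], [-0.5, 1.5]]         # 2 toy intent classes
logits = [dot(session_vector, row) for row in w_cls]
probs = softmax(logits)
intent = probs.index(max(probs))           # initial intention result
assert abs(sum(probs) - 1.0) < 1e-9
```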
In the embodiment of the invention, the teacher model considers the semantics expressed by different roles' speech and also that sentences contribute differently to intention judgment in the session context, so the model better matches real situations and the accuracy of intention recognition is improved. For example, the intention recognition results in the loan collection scenario include "the user will repay", "the user will not repay", and the like. Meanwhile, by using the intention recognition results as labels, a large amount of accurately labeled training data can be obtained without extensive manual annotation, greatly improving the efficiency and accuracy of model training.
And S2, constructing an interactive distillation network by using the teacher model and the pre-constructed lightweight neural network.
In the embodiment of the invention, the pre-constructed lightweight neural network comprises a lightweight speech recognition layer, a lightweight vector conversion layer, a lightweight vector characterization layer and a lightweight intention recognition layer. The lightweight speech recognition layer can be a third-party speech recognition engine, the lightweight vector conversion layer can be an ALBert model and the like, the lightweight vector characterization layer can be a GRU network, and the lightweight intention recognition layer can be an ECANet network + softmax classification function.
Specifically, the method for constructing the interactive distillation network by using the teacher model and the pre-constructed lightweight neural network comprises the following steps:
taking each layer of network in the teacher model as a teacher module, and taking each layer of network in the pre-constructed lightweight neural network as a student module;
and matching the teacher module with the corresponding student module, and connecting the successfully matched teacher model with the lightweight neural network in parallel to obtain the interactive distillation network.
In the embodiment of the invention, the teacher modules comprise a speech recognition module, a vector conversion module, a vector characterization module, and an intention recognition module. The student modules comprise a lightweight speech recognition module, a lightweight vector conversion module, a lightweight vector characterization module, and a lightweight intention recognition module. The interactive distillation network is obtained by matching the corresponding modules, for example matching the speech recognition module among the teacher modules with the lightweight speech recognition module among the student modules, matching the vector conversion module with the lightweight vector conversion module, and so on, and then connecting the teacher model and the pre-constructed lightweight neural network in parallel, which can improve the efficiency of model training.
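A minimal sketch of the module-matching step, with the four layers named in the text represented as placeholder strings (the function name and data layout are assumptions, not part of the patent):

```python
# Hypothetical sketch of module matching: each layer of the teacher is
# paired with the corresponding layer of the lightweight student, and
# the two matched stacks then run side by side (in parallel).
TEACHER = ["speech_recognition", "vector_conversion",
           "vector_characterization", "intent_recognition"]
STUDENT = ["lite_speech_recognition", "lite_vector_conversion",
           "lite_vector_characterization", "lite_intent_recognition"]

def build_interactive_distillation_network(teacher, student):
    if len(teacher) != len(student):
        raise ValueError("every teacher module needs a student match")
    # matched pairs; the full network keeps both stacks in parallel
    return list(zip(teacher, student))

network = build_interactive_distillation_network(TEACHER, STUDENT)
assert len(network) == 4
assert network[0] == ("speech_recognition", "lite_speech_recognition")
```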
And S3, performing interactive training on the interactive distillation network by using the standard session training set to obtain a standard student model.
In the embodiment of the invention, interactive training sequentially replaces each teacher module with its corresponding student module to obtain a new hybrid neural network, adjusts the parameters of that student module until convergence, and generates the student model from all converged student modules. For example, suppose the interactive distillation network consists of two parallel networks: teacher module 1, teacher module 2, teacher module 3 and student module 1, student module 2, student module 3. Replacing teacher module 1 with student module 1 yields the hybrid network: student module 1, teacher module 2, teacher module 3.
In detail, the interactive training of the interactive distillation network by using the standard session training set to obtain a standard student model includes:
selecting one student module from the interactive distillation network, and replacing the corresponding teacher module with the selected student module to obtain a mixed network;
training the mixed network by using the standard session training set, and adjusting parameters of student modules in the mixed network until the parameters are converged to obtain trained student modules;
returning to the step of selecting one student module from the interactive distillation network and replacing the corresponding teacher module with the selected student module to obtain a mixed network until all trained student modules are obtained;
and performing iterative training on all the trained student modules to obtain the standard student model.
In the embodiment of the invention, during training of the hybrid network the parameters of the teacher modules are fixed and only the parameters of the student module are updated; the teacher modules serve as a reference for the student module, so each training pass updates only the smaller set of student parameters, which accelerates model convergence and speeds up training.
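The one-module-at-a-time replacement can be sketched as follows; module names are placeholders, and a real implementation would also freeze the teacher parameters while training each hybrid:

```python
# Sketch of interactive training: replace one teacher module at a time
# with its student counterpart, train only that student module (teacher
# parameters stay frozen), then move on to the next module.
def hybrid_networks(pairs):
    """Yield one hybrid network per student module to be trained."""
    teacher_stack = [t for t, _ in pairs]
    for i, (_, student) in enumerate(pairs):
        hybrid = teacher_stack.copy()
        hybrid[i] = student          # only this slot is trainable
        yield i, hybrid

pairs = [("T1", "S1"), ("T2", "S2"), ("T3", "S3")]
hybrids = list(hybrid_networks(pairs))
assert hybrids[0][1] == ["S1", "T2", "T3"]
assert hybrids[2][1] == ["T1", "T2", "S3"]
```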
Specifically, the iteratively training all the trained student modules to obtain the standard student model includes:
splicing all the trained student modules according to the sequence of each layer of network in the pre-constructed lightweight neural network to obtain an original student model;
outputting a prediction label of the session in the standard session training set by using the original student model, and calculating a loss value according to the prediction label and the real label;
and when the loss value is greater than or equal to a preset loss threshold value, adjusting parameters of a module in the original student model, returning to the step of outputting the prediction labels of the sessions in the standard session training set by using the original student model, and stopping training until the loss value is less than the loss threshold value to obtain the standard student model.
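The threshold-controlled training loop of the steps above can be sketched with a toy one-parameter model; the loss function, update step, and threshold are illustrative stand-ins, not the actual student model:

```python
# Toy sketch of the iterative-training loop: keep evaluating the loss of
# the spliced student model and updating its parameters until the loss
# falls below the preset loss threshold.
def train_until_converged(loss_fn, step_fn, params, threshold, max_iters=1000):
    loss = loss_fn(params)
    for _ in range(max_iters):
        if loss < threshold:
            break
        params = step_fn(params)       # "adjust parameters of a module"
        loss = loss_fn(params)         # re-output predictions, re-score
    return params, loss

# toy 1-parameter "model": loss is squared error to the optimum 3.0
loss_fn = lambda p: (p - 3.0) ** 2
step_fn = lambda p: p - 0.1 * 2.0 * (p - 3.0)   # one gradient-descent step
p, final = train_until_converged(loss_fn, step_fn, 0.0, 1e-4)
assert final < 1e-4
```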
In the embodiment of the invention, since each student module converged only within its hybrid network, all converged student modules need to be trained again together to improve the accuracy of model recognition.
In an optional embodiment of the present invention, the calculating a loss value according to the prediction tag and the real tag includes:
calculating loss values of the prediction label and the real label by using the following loss functions:
Loss = l_i log(1 - pred_i) + (1 - l_i) log(pred_i)
where pred_i is the prediction label, l_i is the true label, and Loss is the loss value.
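The formula can be transcribed directly as a function; note that, as printed, the formula pairs the true label l_i with log(1 - pred_i), the reverse of the conventional binary cross-entropy pairing, and it is reproduced here exactly as the text states it:

```python
import math

# Direct transcription of the loss formula above (a sketch; the small
# eps term is added only to guard against log(0) and is not in the
# original formula).
def distillation_loss(pred, label):
    """Loss = l_i * log(1 - pred_i) + (1 - l_i) * log(pred_i)."""
    eps = 1e-12
    return (label * math.log(1.0 - pred + eps)
            + (1.0 - label) * math.log(pred + eps))

loss = distillation_loss(0.5, 1)      # equals log(0.5) for this input
```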
In the embodiment of the invention, the trained student modules are obtained through interactive distillation, and all the trained student modules are retrained by utilizing iterative training to obtain the standard student model, so that the accuracy of model identification is further improved. The standard student model is a lightweight network, so the standard student model can be directly deployed in mobile intelligent devices such as mobile phones, tablet computers and vehicle-mounted systems.
According to the embodiment of the invention, the student models deployed in mobile intelligent devices such as mobile phones, tablet computers and vehicle-mounted systems can be used for performing intention recognition on the voice to be recognized to obtain the intention recognition result.
In the embodiment of the invention, the speech to be recognized can be conversation speech from scenarios such as manual or intelligent outbound calls in e-commerce, and the user's intention is recognized, for example in a bank loan collection setting, where the intention recognition results include "the user will repay", "the user will not repay", and the like.
According to the invention, the pre-constructed teacher model performs session role representation and session environment representation on the original session training set. Because the semantics expressed by different roles' speech are considered, and sentences contribute differently to intention judgment in the session context, the labels in the standard session training set are more accurate, so a large amount of accurate training data can be obtained and the accuracy of model training is improved. Meanwhile, interactive training through the interactive distillation network improves the interactivity of knowledge distillation, so the standard student model does not need to be compressed and its parameter count does not need to be cut, further improving the accuracy of intention recognition. Therefore, the lightweight intention recognition model training method provided by the invention can solve the problem that speech intention recognition by speech recognition models on mobile smart devices is inaccurate.
As shown in fig. 2, an embodiment of the present application provides a module schematic diagram of a lightweight intent recognition model training apparatus 10, where the lightweight intent recognition model training apparatus 10 includes: the session training set building module 11, the interactive distillation network building module 12 and the interactive training module 13.
The session training set constructing module 11 is configured to obtain an original session training set, and perform session role representation and session environment representation on the original session training set by using a pre-constructed teacher model to obtain a standard session training set;
the interactive distillation network construction module 12 is configured to construct an interactive distillation network by using the teacher model and a pre-constructed lightweight neural network;
and the interactive training module 13 is configured to perform interactive training on the interactive distillation network by using the standard session training set to obtain a standard student model.
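The cooperation of the three modules above can be sketched as a toy pipeline. This is a minimal illustration only; the stand-in "teacher" function, the string "layers", and all function names are assumptions of this sketch, not the patent's actual models:

```python
# Toy sketch of the three modules above. The stand-in "teacher" and the
# string "layers" are illustrative assumptions, not the patent's models.

def build_standard_set(raw_sessions, teacher):
    # Session training set construction module 11: the teacher model
    # labels every raw session with an intent tag (the "real label").
    return [(text, teacher(text)) for text in raw_sessions]

def build_distillation_network(teacher_layers, student_layers):
    # Interactive distillation network construction module 12: pair each
    # teacher layer with its lightweight student counterpart.
    return list(zip(teacher_layers, student_layers))

def interactive_train(network, standard_set):
    # Interactive training module 13: in the full method each student
    # module is trained inside the teacher; here we simply collect them.
    return [student for _teacher, student in network]

teacher = lambda text: "will_repay" if "yes" in text else "will_not_repay"
standard_set = build_standard_set(["yes I will", "no way"], teacher)
network = build_distillation_network(["t1", "t2"], ["s1", "s2"])
student = interactive_train(network, standard_set)
```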
In detail, when the modules in the lightweight intent recognition model training device 10 in the embodiment of the present application are used, the same technical means as the lightweight intent recognition model training method described in fig. 1 above are adopted, and the same technical effects can be produced, which is not described herein again.
As shown in fig. 3, an electronic device provided in the embodiment of the present application includes a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 communicate with one another through the communication bus 114;
a memory 113 for storing a computer program;
in an embodiment of the present application, the processor 111, when configured to execute the program stored in the memory 113, implements the lightweight intent recognition model training method provided in any of the foregoing method embodiments, including:
acquiring an original session training set, and performing session role representation and session environment representation on the original session training set by using a pre-constructed teacher model to obtain a standard session training set;
constructing an interactive distillation network by using the teacher model and a pre-constructed lightweight neural network;
and performing interactive training on the interactive distillation network by using the standard session training set to obtain a standard student model.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the lightweight intent recognition model training method provided in any of the foregoing method embodiments.
It is noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between those entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A lightweight intent recognition model training method, the method comprising:
acquiring an original session training set, and performing session role representation and session environment representation on the original session training set by using a pre-constructed teacher model to obtain a standard session training set;
constructing an interactive distillation network by using the teacher model and a pre-constructed lightweight neural network;
and performing interactive training on the interactive distillation network by using the standard session training set to obtain a standard student model.
2. The lightweight intent recognition model training method according to claim 1, wherein the performing session role representation and session environment representation on the original session training set by using a pre-constructed teacher model to obtain a standard session training set comprises:
performing text conversion on the session speech in the original session training set by using a speech recognition layer in the teacher model to obtain a session text;
performing vector conversion on the session text by using a vector conversion layer in the teacher model, and performing role characterization on the converted vector to obtain a session semantic characterization vector;
performing semantic environment characterization on the session semantic characterization vector by using a vector representation layer in the teacher model to obtain a session environment characterization vector;
and outputting an initial intention recognition result from the session environment characterization vector by using an intention recognition layer in the teacher model, and adding the initial intention recognition result as a real label to the original session training set to obtain the standard session training set.
3. The lightweight intent recognition model training method according to claim 2, wherein the performing vector conversion on the session text and role characterization on the converted vector by using the vector conversion layer in the teacher model to obtain the session semantic characterization vector comprises:
performing vector conversion on each sentence in the session text by using the vector conversion layer to obtain sentence characterization vectors;
constructing a role characterization vector according to the speaker of each sentence characterization vector;
and splicing the sentence characterization vectors and the role characterization vectors to obtain the session semantic characterization vector.
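The splicing step of claim 3 can be illustrated with a minimal sketch. The two-dimensional role embeddings and the `ROLE_VECTORS` table below are toy assumptions of this illustration, not from the patent:

```python
# Minimal sketch of claim 3: concatenate ("splice") a sentence
# characterization vector with a role characterization vector. The
# fixed two-dimensional role embeddings below are toy assumptions.

ROLE_VECTORS = {"agent": [1.0, 0.0], "customer": [0.0, 1.0]}

def session_semantic_vector(sentence_vector, speaker):
    # Splice the sentence vector with the speaker's role vector so
    # downstream layers can tell which role uttered the sentence.
    return sentence_vector + ROLE_VECTORS[speaker]

v = session_semantic_vector([0.2, 0.5, 0.1], "customer")
```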
4. The lightweight intent recognition model training method according to claim 2, wherein the outputting an initial intention recognition result from the session environment characterization vector by using an intention recognition layer in the teacher model comprises:
obtaining contribution degrees of the session environment characterization vectors by using an attention layer in the intention recognition layer;
weighting the session environment characterization vectors by the contribution degrees and accumulating the results to obtain a session characterization vector;
and outputting an initial intention recognition result of the session characterization vector by using a classification function in the intention recognition layer.
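A minimal sketch of the attention-based intention layer in claim 4, assuming dot-product attention and a softmax classification function; the query vector and class weights below are toy assumptions, not parameters from the patent:

```python
# Sketch of claim 4: attention weights ("contribution degrees") over
# the session environment vectors, a weighted accumulation into one
# session characterization vector, and a softmax classification.
import math

def attention_weights(vectors, query):
    # Dot-product attention scores, normalized with softmax.
    scores = [sum(a * b for a, b in zip(v, query)) for v in vectors]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def session_vector(vectors, weights):
    # Accumulate the environment vectors weighted by contribution.
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def classify(vec, class_weights):
    # Softmax over per-class scores yields intent probabilities.
    scores = [sum(a * b for a, b in zip(vec, w)) for w in class_weights]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

env = [[1.0, 0.0], [0.0, 1.0]]
w = attention_weights(env, query=[1.0, 0.0])
s = session_vector(env, w)
probs = classify(s, [[2.0, 0.0], [0.0, 2.0]])
```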
5. The lightweight intent recognition model training method according to claim 1, wherein the constructing an interactive distillation network by using the teacher model and a pre-constructed lightweight neural network comprises:
taking each layer of network in the teacher model as a teacher module, and taking each layer of network in the pre-constructed lightweight neural network as a student module;
and matching the teacher module with the corresponding student module, and connecting the successfully matched teacher model with the lightweight neural network in parallel to obtain the interactive distillation network.
6. The lightweight intent recognition model training method according to claim 5, wherein the performing interactive training on the interactive distillation network by using the standard session training set to obtain a standard student model comprises:
selecting one student module from the interactive distillation network, and replacing the corresponding teacher module with the selected student module to obtain a mixed network;
training the mixed network by using the standard session training set, and adjusting parameters of student modules in the mixed network until the parameters are converged to obtain trained student modules;
returning to the step of selecting one student module from the interactive distillation network and replacing the corresponding teacher module with the selected student module to obtain a mixed network until all trained student modules are obtained;
and performing iterative training on all the trained student modules to obtain the standard student model.
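The module-swapping procedure of claim 6 can be sketched as follows. Real gradient training is replaced here by a trivial stand-in (recording which teacher module the student was trained against), purely for illustration:

```python
# Sketch of claim 6: each student module in turn replaces its teacher
# counterpart to form a mixed network and is trained in that position.

def interactive_train(teacher_modules, student_modules):
    trained = []
    for i in range(len(student_modules)):
        # Mixed network: teacher modules everywhere except position i,
        # where the selected student module is swapped in.
        mixed = list(teacher_modules)
        mixed[i] = student_modules[i]
        # Toy "training until convergence" of the swapped-in module.
        trained.append({"layer": i, "learned_from": teacher_modules[i]})
    return trained

modules = interactive_train(["T0", "T1", "T2"], ["S0", "S1", "S2"])
```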
7. The lightweight intent recognition model training method according to claim 6, wherein the performing iterative training on all the trained student modules to obtain the standard student model comprises:
splicing all the trained student modules according to the sequence of the layers in the pre-constructed lightweight neural network to obtain an original student model;
outputting prediction labels of the sessions in the standard session training set by using the original student model, and calculating a loss value according to the prediction labels and the real labels;
and when the loss value is greater than or equal to a preset loss threshold value, adjusting parameters of a module in the original student model, returning to the step of outputting the prediction labels of the sessions in the standard session training set by using the original student model, and stopping training until the loss value is less than the loss threshold value to obtain the standard student model.
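The loss-threshold loop of claim 7 can be sketched with a one-parameter toy student model and a squared-error loss; both are assumptions of this illustration, since the patent does not specify the model form or the loss function:

```python
# Sketch of claim 7: adjust the assembled student model's parameters,
# recompute the loss, and stop once it falls below a preset threshold.

def train_until_threshold(samples, threshold=0.01, lr=0.1, max_steps=1000):
    weight = 0.0  # single toy parameter of the assembled student model
    loss = float("inf")
    for _ in range(max_steps):
        # Loss between predicted labels and real (teacher-given) labels.
        loss = sum((weight * x - y) ** 2 for x, y in samples) / len(samples)
        if loss < threshold:  # stop when below the loss threshold
            break
        # Otherwise adjust the parameters and return to prediction.
        grad = sum(2 * (weight * x - y) * x for x, y in samples) / len(samples)
        weight -= lr * grad
    return weight, loss

weight, loss = train_until_threshold([(1.0, 2.0), (2.0, 4.0)])
```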
8. A lightweight intent recognition model training apparatus, the apparatus comprising:
the session training set construction module is used for acquiring an original session training set, and performing session role representation and session environment representation on the original session training set by using a pre-constructed teacher model to obtain a standard session training set;
the interactive distillation network construction module is used for constructing an interactive distillation network by utilizing the teacher model and a pre-constructed lightweight neural network;
and the interactive training module is used for performing interactive training on the interactive distillation network by using the standard session training set to obtain a standard student model.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the lightweight intent recognition model training method of any of claims 1-7 when executing a program stored in a memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the lightweight intent recognition model training method according to any of claims 1-7.
CN202111449931.6A 2021-11-30 2021-11-30 Lightweight intention recognition model training method, device, equipment and storage medium Pending CN114118061A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111449931.6A CN114118061A (en) 2021-11-30 2021-11-30 Lightweight intention recognition model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114118061A true CN114118061A (en) 2022-03-01

Family

ID=80368877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111449931.6A Pending CN114118061A (en) 2021-11-30 2021-11-30 Lightweight intention recognition model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114118061A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3812969A1 (en) * 2019-10-24 2021-04-28 Beijing Xiaomi Intelligent Technology Co., Ltd. Neural network model compression method, corpus translation method and device
CN112749800A (en) * 2021-01-04 2021-05-04 清华大学 Neural network model training method, device and storage medium
CN112949572A (en) * 2021-03-26 2021-06-11 重庆邮电大学 Slim-YOLOv 3-based mask wearing condition detection method
CN113516968A (en) * 2021-06-07 2021-10-19 北京邮电大学 End-to-end long-term speech recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BAI MOYU; LIU HAO; CHEN HAOCHUAN; ZHANG ZHENHUA: "Deep Neural Network Beamforming Algorithm Using Knowledge Distillation", Telemetry & Remote Control (遥测遥控), no. 01, 15 January 2020 (2020-01-15) *

Similar Documents

Publication Publication Date Title
CN110046221B (en) Machine dialogue method, device, computer equipment and storage medium
US10679613B2 (en) Spoken language understanding system and method using recurrent neural networks
CN111198937B (en) Dialog generation device, dialog generation program, dialog generation apparatus, computer-readable storage medium, and electronic apparatus
WO2022057712A1 (en) Electronic device and semantic parsing method therefor, medium, and human-machine dialog system
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN109785833A (en) Human-computer interaction audio recognition method and system for smart machine
CN111429923B (en) Training method and device of speaker information extraction model and computer equipment
CN110297909A (en) A kind of classification method and device of no label corpus
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium
CN112420028A (en) System and method for performing semantic recognition on voice signal
CN115238045B (en) Method, system and storage medium for extracting generation type event argument
CN115953645A (en) Model training method and device, electronic equipment and storage medium
CN111209297A (en) Data query method and device, electronic equipment and storage medium
CN113326367B (en) Task type dialogue method and system based on end-to-end text generation
CN114003700A (en) Method and system for processing session information, electronic device and storage medium
WO2024041350A1 (en) Intention recognition method and apparatus, electronic device, and storage medium
CN112860873A (en) Intelligent response method, device and storage medium
CN117524202A (en) Voice data retrieval method and system for IP telephone
CN113393841A (en) Training method, device and equipment of speech recognition model and storage medium
CN117150338A (en) Task processing, automatic question and answer and multimedia data identification model training method
CN115687934A (en) Intention recognition method and device, computer equipment and storage medium
CN114118061A (en) Lightweight intention recognition model training method, device, equipment and storage medium
CN115273828A (en) Training method and device of voice intention recognition model and electronic equipment
US11941508B2 (en) Dialog system with adaptive recurrent hopping and dual context encoding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination