CN114781365A

CN114781365A - End-to-end model training method, semantic understanding method, device, equipment and medium

Info

Publication number: CN114781365A
Application number: CN202210408734.8A
Authority: CN
Inventors: 张桐桐; 殷腾龙
Original assignee: Hisense Visual Technology Co Ltd
Current assignee: Hisense Visual Technology Co Ltd
Priority date: 2022-04-19
Filing date: 2022-04-19
Publication date: 2022-07-22

Abstract

The present disclosure relates to an end-to-end model training method, semantic understanding method, apparatus, device and medium; wherein, the method comprises the following steps: acquiring a training sample, wherein the training sample comprises a natural language text, a keyword set corresponding to the natural language text and a label information set corresponding to the keywords; defining a frame of an end-to-end semantic understanding model, and generating a corresponding semantic understanding result based on the frame of the end-to-end semantic understanding model and a training sample, wherein the semantic understanding result comprises an intention recognition result, keywords and label information corresponding to each keyword; and training the framework of the end-to-end semantic understanding model according to the training samples based on a preset loss function to obtain the end-to-end semantic understanding model. The method and the device have the advantages that the training samples are used for end-to-end training of the framework of the end-to-end semantic understanding model to obtain the end-to-end semantic understanding model, so that the semantic understanding is more accurate, the accumulation of errors is reduced, and the accuracy of the field dialogue understanding is improved.

Description

End-to-end model training method, semantic understanding method, device, equipment and medium

Technical Field

The present disclosure relates to the field of computers and natural language processing, and in particular, to an end-to-end model training method, semantic understanding method, apparatus, device, and medium.

Background

Natural language processing is an important direction in the fields of computer science and artificial intelligence, and can realize effective communication between people and computers by using natural language. Semantic understanding is an important task in natural language processing and is also the core for realizing intelligent interaction. The method mainly aims to identify and mark intentions corresponding to the audio data of the user and slot position information of key words through a natural language processing related technology. However, due to the diversity of the Chinese semantics, the flexibility of the words and phrases, the complexity of the Chinese language, and the like, the semantic understanding in the prior art is not accurate enough, so that the user experience is poor.

Disclosure of Invention

In order to solve the technical problem, the present disclosure provides an end-to-end model training method, a semantic understanding method, an apparatus, a device, and a medium.

In a first aspect, the present disclosure provides an end-to-end semantic understanding model training method, including:

acquiring a training sample, wherein the training sample comprises a natural language text, a keyword set corresponding to the natural language text and a label information set corresponding to the keywords;

defining a frame of an end-to-end semantic understanding model, and generating a corresponding semantic understanding result based on the frame of the end-to-end semantic understanding model and the training sample, wherein the semantic understanding result comprises an intention recognition result, keywords and label information corresponding to each keyword;

and training the framework of the end-to-end semantic understanding model according to the training sample based on a preset loss function to obtain the end-to-end semantic understanding model.

As an optional implementation manner of the embodiment of the present disclosure, the framework of the end-to-end semantic understanding model includes a semantic feature extraction unit, a full connection layer, and a target result score calculation unit;

the semantic feature extraction unit is used for generating a corresponding semantic vector based on the natural language text;

the full connection layer is used for carrying out fusion processing on the semantic vectors to obtain an intention recognition prediction vector and a keyword prediction vector;

the target result score calculating unit is used for respectively obtaining corresponding prediction scores based on the intention recognition prediction vector and the keyword prediction vector, and determining a semantic understanding result corresponding to the natural language text based on the prediction scores.

As an optional implementation manner of the embodiment of the present disclosure, the obtaining, based on the intent recognition prediction vector and the keyword prediction vector, corresponding prediction scores respectively, and determining, based on the prediction scores, a semantic understanding result corresponding to the natural language text includes:

identifying a prediction vector based on the intention, and determining a corresponding first prediction score according to a first parameter matrix;

determining a corresponding second prediction score according to a second parameter matrix based on the keyword prediction vector;

determining a corresponding third prediction score according to a third parameter matrix based on the intention recognition prediction vector and the keyword prediction vector;

and determining a semantic understanding result corresponding to the natural language text based on the first prediction score, the second prediction score and the third prediction score.

As an optional implementation manner of the embodiment of the present disclosure, the semantic feature extraction unit includes: a semantic representation layer and an encoding layer;

the semantic representation layer is used for generating a corresponding representation vector based on the natural language text;

and the coding layer is used for extracting the characteristics of the expression vector to obtain a semantic vector corresponding to the natural language text. As an optional implementation manner of the embodiment of the present disclosure, the training the frame of the end-to-end semantic understanding model according to the training sample based on the preset loss function to obtain the end-to-end semantic understanding model includes:

determining a loss value corresponding to the preset loss function according to the first prediction score, the second prediction score, the third prediction score and the label information set;

and adjusting parameters of a frame of the end-to-end semantic understanding model according to the loss value until the frame of the end-to-end semantic understanding model is converged to obtain the end-to-end semantic understanding model.

In a second aspect, the present disclosure provides a semantic understanding method, comprising:

acquiring a text to be predicted;

inputting the text to be predicted into an end-to-end semantic understanding model to obtain a target semantic understanding result corresponding to the text to be predicted;

wherein the end-to-end semantic understanding model is trained based on the method according to any one of the first aspect.

In a third aspect, the present disclosure provides an end-to-end semantic understanding model training apparatus, including:

the system comprises a sample acquisition module, a training module and a label analysis module, wherein the sample acquisition module is used for acquiring a training sample, and the training sample comprises a natural language text, a keyword set corresponding to the natural language text and a label information set corresponding to the keywords;

the system comprises a frame determining module, a training sample generating module and a semantic understanding module, wherein the frame determining module is used for defining a frame of an end-to-end semantic understanding model, and generating a corresponding semantic understanding result based on the frame of the end-to-end semantic understanding model and the training sample, and the semantic understanding result comprises an intention recognition result, keywords and label information corresponding to each keyword;

and the model determining module is used for training the frame of the end-to-end semantic understanding model according to the training sample based on a preset loss function to obtain the end-to-end semantic understanding model.

As an optional implementation manner of the embodiment of the present disclosure, the target result score calculating unit is specifically configured to:

and the coding layer is used for extracting the characteristics of the expression vector to obtain a semantic vector corresponding to the natural language text.

As an optional implementation manner of the embodiment of the present disclosure, the model determining module is specifically configured to:

and adjusting parameters of the frame of the end-to-end semantic understanding model according to the loss value until the frame of the end-to-end semantic understanding model converges to obtain the end-to-end semantic understanding model. In a fourth aspect, the present disclosure provides a semantic understanding apparatus, comprising:

the text acquisition module is used for acquiring a text to be predicted;

the result determining module is used for inputting the text to be predicted into an end-to-end semantic understanding model to obtain a target semantic understanding result corresponding to the text to be predicted;

In a fifth aspect, the present disclosure also provides a computer device, including:

one or more processors;

a storage device to store one or more programs,

when executed by the one or more processors, cause the one or more processors to implement an end-to-end semantic understanding model training method as described in any one of the first aspects, or a semantic understanding method as described in the second aspect.

In a sixth aspect, the present disclosure also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the end-to-end semantic understanding model training method according to any one of the first aspect, or the semantic understanding method according to the second aspect.

Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages: firstly, a training sample is obtained, the training sample comprises a natural language text, a keyword set corresponding to the natural language text and a label information set corresponding to the keywords, then a frame of an end-to-end semantic understanding model is defined, a corresponding semantic understanding result is generated based on the frame of the end-to-end semantic understanding model and the training sample, the semantic understanding result comprises an intention recognition result, the keywords and label information corresponding to each keyword, finally, the frame of the end-to-end semantic understanding model is trained according to the training sample based on a preset loss function, the end-to-end semantic understanding model is obtained, the frame of the end-to-end semantic understanding model is trained end-to-end through the training sample, and the end-to-end semantic understanding model is obtained, so that the semantic understanding is more accurate, the accumulation of errors is reduced, and the accuracy of field dialogue understanding is improved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.

In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the embodiments or technical solutions in the prior art description will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.

FIG. 1A is a diagram illustrating a semantic understanding process in the prior art;

FIG. 1B is a schematic diagram of an application scenario of a semantic understanding process according to an embodiment of the disclosure;

FIG. 2A is a block diagram of a hardware configuration of a computer device according to one or more embodiments of the present disclosure;

FIG. 2B is a software configuration diagram of a computer device according to one or more embodiments of the present disclosure;

fig. 2C is a schematic illustration of an icon control interface display of an application program included in a smart device according to one or more embodiments of the present disclosure;

FIG. 3A is a schematic flow chart diagram of an end-to-end semantic understanding model training method provided by the embodiment of the present disclosure;

FIG. 3B is a schematic diagram illustrating an end-to-end semantic understanding model training method according to an embodiment of the present disclosure;

FIG. 4A is a schematic flow chart diagram of another end-to-end semantic understanding model training method provided by the embodiments of the present disclosure;

FIG. 4B is a schematic diagram illustrating another end-to-end semantic understanding model training method provided by the embodiment of the disclosure;

fig. 4C is a schematic structural diagram of a semantic feature extraction unit in a framework of an end-to-end semantic understanding model provided by the embodiment of the present disclosure;

FIG. 4D is a structural schematic diagram of a framework of an end-to-end semantic understanding model of an exemplary embodiment of the present disclosure;

fig. 4E is a schematic structural diagram of a target result score calculating unit in the framework of the end-to-end semantic understanding model provided by the embodiment of the present disclosure;

FIG. 5 is an overall framework diagram of an end-to-end semantic understanding model provided by embodiments of the present disclosure;

fig. 6A is a schematic flowchart of a semantic understanding method provided by the embodiment of the present disclosure;

FIG. 6B is a schematic diagram illustrating a semantic understanding method provided by an embodiment of the disclosure;

fig. 7A is a schematic structural diagram of an end-to-end semantic understanding model training apparatus provided in the embodiment of the present disclosure;

FIG. 7B is a schematic structural diagram of a framework determination module in the end-to-end semantic understanding model training apparatus according to the embodiment of the present disclosure;

fig. 8 is a schematic structural diagram of a semantic understanding apparatus according to an embodiment of the present disclosure;

fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.

Detailed Description

In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments of the present disclosure may be combined with each other.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.

The terms "first" and "second," etc. in this disclosure are used to distinguish between different objects, rather than to describe a particular order of objects. For example, the first prediction score and the second prediction score, etc. are used to distinguish between different prediction scores, rather than to describe a particular order of prediction scores.

With the rapid development of intelligent technology and the increasing popularization of intelligent terminals, multi-modal interaction such as voice and semantics becomes an increasingly important mode. Semantic Understanding (Semantic Understanding) is an important task in Natural language processing and is also a core for realizing intelligent interaction, and the main purpose of the Semantic Understanding is to identify and mark out a corresponding intention of audio data of a user and slot position information of a key word through a Natural Language Processing (NLP) related technology, and the slot position information can be understood as tag information to which the key word belongs. In natural language processing, many tasks rely on semantic understanding, such as: intelligent question answering, intelligent home furnishing, voice assistant and the like all need to identify and mark semantic intentions and corresponding keyword information of audio data of users, and then perform other business processing. However, due to the diversity of the chinese semantics, the flexibility of the words and phrases, and the complexity of the chinese language, how to improve the accuracy of semantic understanding has always been a core issue concerned in the field of natural language processing.

The current vertical domain semantic understanding task generally adopts a pipeline mode and is divided into an Intention Recognition (intent Recognition) part and a Slot Extraction (Slot Extraction) part. Wherein, the intention identification refers to identifying a specific intention of the audio data (such as the query of the user) of the user in the field, such as the intention of "movie search" and "phase search" respectively different from that of "movie search" and "phase search" in the movie and television vertical field, "i want to watch the movie of XX" and "i want to listen to the phase of XX". The common implementation method of the intention recognition mainly comprises rule classification based on a dictionary template, intention classification based on a classification model and the like; slot extraction is to extract specific concepts, attributes, attribute values and the like which are clearly defined from audio data and label slot information thereof, such as that "i want to see DEF movie" needs to identify actor entity (actor) - > "DEF"; for the information of key slot positions such as the video attribute (video topic) - > "movie", etc., the slot position extraction is generally solved by adopting the sequence marking problem. The vertical domain semantic understanding task is solved by serially connecting an intention identification task and a slot position extraction task, the intention identification task and the slot position extraction task generate different errors, and the accuracy of the whole vertical domain semantic understanding task is the accuracy of the two subtasks which are multiplied together.

It should be noted that: the intention identification is equivalent to determining a corresponding classification of the audio data, and the slot extraction is equivalent to determining more detailed keyword information under the classification, and the two are combined to carry out deeper semantic understanding on the audio data.

Fig. 1A is a schematic diagram of a semantic understanding process in the prior art. As shown in fig. 1, the main implementation flow is as follows: the natural language text is a text content obtained by identifying the audio data a of the user, and may also be other text contents, which is not limited in this embodiment. Firstly, natural language texts are identified, then slot position extraction is carried out, and finally semantic understanding results corresponding to the natural language texts are obtained. However, in this method, since the intention recognition and the slot extraction are two tasks, errors are likely to accumulate, and the obtained semantic understanding result is not accurate enough.

Aiming at the defects in the method, the embodiment of the disclosure firstly obtains a training sample, the training sample comprises a natural language text corresponding to audio data, a keyword set corresponding to the natural language text and a label information set corresponding to the keywords, then defines a frame of an end-to-end semantic understanding model, generates a corresponding semantic understanding result based on the frame of the end-to-end semantic understanding model and the training sample, the semantic understanding result comprises an intention recognition result, the keywords and label information corresponding to each keyword, finally trains the frame of the end-to-end semantic understanding model according to the training sample based on a preset loss function to obtain the end-to-end semantic understanding model, and trains the frame of the end-to-end semantic understanding model end to end-to-end through the training sample, so as to obtain the end-to-end semantic understanding model, reduce the complexity of the model and enable the semantic understanding to be more accurate, the accumulation of errors is reduced, and the accuracy of domain dialogue understanding is improved.

Fig. 1B is a schematic view of an application scenario of a semantic understanding process in the embodiment of the present disclosure. As shown in fig. 1B, the semantic understanding process may be used in a voice interaction scenario between a user and a smart home, assuming that the smart device in the scenario includes a smart device 100 (i.e., a smart refrigerator), a smart device 101 (i.e., a smart washing machine), and a smart device 102 (i.e., a smart display device), when the user wants to control the smart device in the scenario, the user needs to send a voice instruction first, and when the smart device receives the voice instruction, the smart device needs to perform semantic understanding on the voice instruction, and determine a semantic understanding result corresponding to the voice instruction, so that the subsequent smart device executes a corresponding control instruction according to the semantic understanding result, thereby satisfying the user requirement.

The end-to-end semantic understanding model training method and the semantic understanding method provided by the embodiment of the disclosure can be implemented based on computer equipment, or a functional module or a functional entity in the computer equipment.

The computer device may be a Personal Computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a mainframe computer, and the like, which is not specifically limited in this disclosure.

Illustratively, fig. 2A is a block diagram of a hardware configuration of a computer device according to one or more embodiments of the present disclosure. As shown in fig. 2A, the computer apparatus includes: at least one of a tuner demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280. The controller 250 includes a central processing unit, a video processor, an audio processor, a graphic processor, a RAM, a ROM, a first interface to an nth interface for input/output, among others. The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device and a projection screen. The tuner demodulator 210 receives broadcast television signals through wired or wireless reception, and demodulates audio and video signals, such as EPG audio and video data signals, from a plurality of wireless or wired broadcast television signals. The communicator 220 is a component for communicating with an external device or a server according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, and other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The computer device may establish transmission and reception of control signals and data signals with the server or the local control device through the communicator 220. The detector 230 is used to collect signals of the external environment or interaction with the outside. The controller 250 and the modem 210 may be located in different separate devices, that is, the modem 210 may also be located in an external device of the main device where the controller 250 is located, such as an external set-top box.

In some embodiments, controller 250 controls the operation of the computing device and responds to user actions through various software control programs stored in memory. The controller 250 controls the overall operation of the computer device. The user may input a user command through a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input a user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.

Fig. 2B is a schematic software configuration diagram of a computer device according to one or more embodiments of the present disclosure, and as shown in fig. 2B, the system is divided into four layers, which are, from top to bottom, an Application (Applications) layer (referred to as an "Application layer"), an Application Framework (Application Framework) layer (referred to as a "Framework layer"), an Android runtime (Android runtime) and system library layer (referred to as a "system runtime library layer"), and a kernel layer.

Fig. 2C is a schematic diagram illustrating an interface display of an icon control of an application program included in an intelligent device (mainly, an intelligent playback device, such as an intelligent television, a digital cinema system, or a video server), according to one or more embodiments of the present disclosure, as shown in fig. 2C, an application layer includes at least one application program that can display a corresponding icon control in a display, for example: the system comprises a live television application icon control, a video on demand VOD application icon control, a media center application icon control, an application center icon control, a game application icon control and the like. The live television application program can provide live television through different signal sources. A video on demand VOD application may provide video from different storage sources. Unlike live television applications, video on demand provides a video display from some storage source. The media center application program can provide various applications for playing multimedia contents. The application program center can provide and store various application programs.

For a more detailed description of the end-to-end semantic understanding model training scheme, the following description is made with reference to fig. 3A by way of example, and it is understood that the steps involved in fig. 3A may include more steps or fewer steps in actual implementation, and the order between the steps may also be different, so as to enable the end-to-end semantic understanding model training method provided in the embodiment of the present application.

FIG. 3A is a schematic flow chart diagram of an end-to-end semantic understanding model training method provided by the embodiment of the present disclosure; fig. 3B is a schematic diagram illustrating a principle of an end-to-end semantic understanding model training method according to an embodiment of the present disclosure. The embodiment can be applied to the condition that the end-to-end semantic understanding model is obtained by training the framework of the end-to-end semantic understanding model. The method of the embodiment may be executed by an end-to-end semantic understanding model training apparatus, which may be implemented in hardware and/or software and may be configured in a computer device.

As shown in fig. 3A, the method specifically includes the following steps:

s310, obtaining a training sample, wherein the training sample comprises a natural language text, a keyword set corresponding to the natural language text and a label information set corresponding to the keywords.

Wherein the training samples may be training samples determined from a predetermined set of training data. The training data set may include text content corresponding to each audio data determined based on a plurality of types of audio data of a plurality of different users, or other collected text content, a set formed by keywords included in the text content, and a set formed by tag information corresponding to each keyword. Keywords may be understood as important words contained in the text content. The tag information may be understood as attributes corresponding to each keyword, such as a person attribute: actors, singers, and athletes; the film and television attributes are as follows: drama, movie, and art programs. The natural language text is: text content determined from the training data set.

In the embodiment of the present disclosure, before training the model, a large number of training samples need to be collected, where the large number of training samples may include different types of natural language texts, sets of keywords corresponding to the natural language texts, and sets of label information corresponding to the keywords.

S320, defining a frame of the end-to-end semantic understanding model, and generating a corresponding semantic understanding result based on the frame of the end-to-end semantic understanding model and the training sample, wherein the semantic understanding result comprises an intention recognition result, keywords and label information corresponding to each keyword.

Wherein, the intention recognition result can be understood as the main intention contained in the natural language text, namely the main event. For example, natural language text is: "i want to see the movie of XX", the corresponding intention recognition result is: the "movie search" intent.

In the embodiment of the disclosure, by defining a framework of an end-to-end semantic understanding model and recognizing a natural language text through the framework, an intention recognition result corresponding to the natural language text, keywords included in the natural language text, and tag information corresponding to each keyword are obtained. Namely: an intention recognition task and a slot extraction task are fused in the framework of the end-to-end semantic understanding model, and the intention recognition task and the slot extraction task can be simultaneously carried out on the natural language text.

For example, assume that the natural language text is: "see XX movies", the intent recognition results need to be identified by the framework of the end-to-end semantic understanding model: searching for a movie; key words: XX and movies; label information corresponding to each keyword: and the label information corresponding to XX is the actor, and the label information corresponding to the movie is the movie program.

S330, training the framework of the end-to-end semantic understanding model according to the training samples based on a preset loss function to obtain the end-to-end semantic understanding model.

The preset loss function may be a Connection Timing Classification (CTC) loss function, a multi-class cross entropy loss function, a mean square loss function, and the like, and may be specifically determined according to actual use requirements, and may also be set by a user through user definition, which is not limited in the embodiment of the present disclosure.

In the embodiment of the disclosure, the preset loss function is a measurement standard for determining whether the semantic understanding model is qualified or not, so that the semantic understanding model obtained by training has a high-precision recognition result. The similarity between the semantic understanding result generated based on the framework of the end-to-end semantic understanding model and the actual semantic understanding result corresponding to the natural language text can be calculated through a preset loss function, so that the recognition precision of the end-to-end semantic understanding model can be verified, and the end-to-end semantic understanding model with high accuracy can be trained.

In this disclosure, a large number of training samples obtained in S310 may be divided into a training set and a verification set, the training set is used to train a frame of an end-to-end semantic understanding model to obtain an end-to-end semantic understanding model, the verification set is used to verify the obtained end-to-end semantic understanding model, and if the verification is qualified, the trained end-to-end semantic understanding model is obtained.

In the embodiment of the disclosure, a training sample is first obtained, the training sample includes a natural language text, a keyword set corresponding to the natural language text, and a tag information set corresponding to the keywords, then defining a frame of an end-to-end semantic understanding model, generating a corresponding semantic understanding result based on the frame of the end-to-end semantic understanding model and a training sample, wherein the semantic understanding result comprises an intention recognition result, keywords and label information corresponding to each keyword, and finally based on a preset loss function, training the frame of the end-to-end semantic understanding model according to the training samples to obtain the end-to-end semantic understanding model, end-to-end training is carried out on the frame of the end-to-end semantic understanding model through the training samples to obtain the end-to-end semantic understanding model, the semantic understanding is more accurate, the accumulation of errors is reduced, and the accuracy of the domain dialogue understanding is improved.

FIG. 4A is a schematic flow chart diagram of another end-to-end semantic understanding model training method provided by the present disclosure; fig. 4B is a schematic diagram of another end-to-end semantic understanding model training method provided in the embodiment of the present disclosure. The embodiment is further expanded and optimized on the basis of the embodiment. Optionally, this embodiment mainly explains a structure of a framework of the end-to-end semantic understanding model.

S410, obtaining training samples, wherein the training samples comprise natural language texts corresponding to the audio data, keyword sets corresponding to the natural language texts and label information sets corresponding to the keywords.

And S420, defining a framework of an end-to-end semantic understanding model, wherein the framework of the end-to-end semantic understanding model comprises a semantic feature extraction unit, a full connection layer and a target result score calculation unit.

The semantic feature extraction unit is used for generating a corresponding semantic vector based on the natural language text; the full connection layer is used for carrying out fusion processing on the semantic vectors to finally obtain an intention identification prediction vector and a keyword prediction vector; the target result score calculating unit is used for respectively obtaining corresponding prediction scores based on the intention recognition prediction vector and the keyword prediction vector, and determining a semantic understanding result corresponding to the natural language text based on the prediction scores. The semantic vector is typically a high-dimensional semantic vector.

The input of the semantic feature extraction unit is a natural language text, each character in the natural language text is converted into a corresponding expression vector through the semantic feature extraction unit, and then feature extraction is performed on each expression vector to generate a high-dimensional semantic vector corresponding to each character. The full connection layer is connected with the semantic feature extraction unit and mainly integrates and reduces dimensions of high-dimensional semantic vectors generated by the semantic feature extraction unit, wherein the full connection layer comprises two independent networks, one full connection network integrates and outputs the high-dimensional semantic vectors corresponding to the characters, and then the vectors after the characters are integrated and output are spliced to obtain an intention recognition prediction vector; and the other full-connection network integrates and reduces the dimension of high-dimensional semantic vectors corresponding to the characters of the keywords to be predicted, and then combines the integrated and output vectors of the characters corresponding to the keywords to be predicted to obtain keyword prediction vectors. The target result score calculating unit is used for identifying the prediction vector and the keyword prediction vector based on the intention, obtaining a corresponding probability prediction score according to the corresponding parameter matrix, and determining a semantic understanding result corresponding to the natural language text based on the probability prediction score.

In some embodiments, the semantic feature extraction unit comprises: a semantic representation layer and an encoding layer;

The semantic representation layer may adopt a static vector file, a dynamic vector model, a pre-training language model, or a combination of the static vector file and the dynamic vector model, and this embodiment is not particularly limited. The static word vector file may be a pre-trained model file containing a plurality of Chinese characters and corresponding characterization vectors. The dynamic vector model may be a bert (bidirectional Encoder reconstruction from transformations) pre-training model, a generative pre-training model, such as a GPT model, and the like, and this embodiment is not limited specifically. The coding layer may adopt a transform-Encoder structure, or may also adopt a model such as a Long Short-Term Memory Network (LSTM) or a Recurrent Neural Network (RNN), which is not limited in this embodiment. Representing a vector may be understood as representing a character as a vector resulting from a vector. A semantic vector may be understood as a vector containing semantic information.

In this embodiment, the natural language text can be converted into corresponding expression vectors through the semantic expression layer, and feature extraction is performed on the expression vectors through the coding layer to obtain semantic vectors corresponding to the natural language text, so that features of a deeper level in the natural language text can be extracted, and the extracted features are richer and the efficiency is faster.

In some embodiments, the generating a corresponding representation vector based on the natural language text may specifically include:

determining a first expression vector corresponding to the natural language text according to the static vector file;

generating a second expression vector corresponding to the natural language text according to a dynamic vector model;

and splicing the first expression vector and the second expression vector to obtain the expression vector.

Specifically, a first expression vector corresponding to each character in the natural language text can be determined according to the static vector model, a second expression vector corresponding to each character in the natural language text can be generated according to the dynamic vector model, and the context relationship between the second expression vectors is closer. For each character in the natural language text, the first expression vector and the second expression vector corresponding to the current character are spliced to obtain the expression vector corresponding to the current character, and so on, the expression vector corresponding to each character in the natural language text can be obtained.

In the embodiment, the expression vector is obtained by combining the static vector file and the dynamic vector model, so that the semantic features of the natural language text can be better expressed.

For example, the above process of obtaining the expression vector by stitching can be represented by the following formula:

x_i＝[a_i-char,a_i-Bert] (1)

wherein x is_iRepresenting a representation vector corresponding to the ith character in the natural language text; a is a_i-charRepresenting a first representation vector corresponding to an ith character generated from the static vector file; a is_i-BertRepresenting a second representation vector corresponding to an ith character generated according to the dynamic vector model; i is more than or equal to 1 and less than or equal to n, (n is a positive integer more than 1), and n is shown in the tableThe total number of characters contained in the natural language text.

In some embodiments, the representation vector obtained by the semantic representation layer is assumed to be x_iThe encoding layer adopts a transform-Encoder structure, and the process of obtaining the semantic vector corresponding to the natural language text by the encoding layer can be represented by the following formula:

wherein the content of the first and second substances,

denotes x_iThe corresponding semantic vector, c, represents the number of layers corresponding to the Transformer.

Correspondingly, the formula for obtaining the intention recognition predicted vector and the keyword predicted vector by fusing the semantic vectors through the full connection layer is as follows:

g_all＝[g₁,…,g_d,…,g_n] (5)

h_c-span＝[h_c-START,h_c-END,h_c-ATTN] (6)

wherein, g_iRepresent

Generating a first semantic vector corresponding to the ith character under the action of a first fully-connected network; h is a total of_iRepresent

Generating a second semantic vector corresponding to the ith character under the action of a second fully-connected network; ReLU denotes activation function, MLP₁Representing a first fully connected network; MLP₂Representing a second fully connected network; g is a radical of formula_allExpressing an intention recognition prediction vector corresponding to the natural language text, wherein d is more than 1 and less than n; h is_c-spanRepresenting a prediction vector corresponding to the c-th key word in the natural language text, wherein c is more than or equal to 1 and is less than or equal to m, and m represents the total number of the key words contained in the natural language text; h is a total of_c-STARTSemantic vector, h, representing the beginning (first) character corresponding to the c-th keyword_c-ENDSemantic vector representing the last character corresponding to the c-th keyword, h_c-STARTAnd h_c-ENDCan pass through h_iDetermining; h is a total of_c-ATTNRepresenting the sum of the expression vectors obtained by the Attention (Attention) mechanism of the c key word; q_c-spanSum of semantic vectors representing the c-th keyword, k ∈ [ START, END]；v_c-tRepresenting the weight corresponding to the c-th keyword, t is the position of a character in the c-th keyword, t belongs to [ START, END ]]。

In some embodiments, the obtaining the corresponding prediction scores based on the intent recognition prediction vector and the keyword prediction vector, and determining the semantic understanding result corresponding to the natural language text based on the prediction scores may specifically include:

The first parameter matrix is a parameter matrix corresponding to the intention identification prediction vector. The second parameter matrix is a parameter matrix corresponding to the keyword prediction vector. The third parameter matrix is a parameter matrix corresponding to a vector formed by combining the intention identification vector predictor and the keyword vector predictor.

In some embodiments, in conjunction with equations (1) - (9) above, the first prediction score, the second prediction score, and the third prediction score may be represented by the following equations:

S₁＝W₁*MLP₁(g_all) (10)

S₂＝W₂*MLP₂(h_c-span) (11)

S₃＝[g_all,h_c-span]^T*W₃+b (12)

wherein S is₁Representing a first prediction score, W₁Representing a first parameter matrix, S₂Second prediction score, W₂Second parameter matrix, S₃Third prediction score, W₃A third parameter matrix, b represents a bias parameter.

After the first prediction score, the second prediction score and the third prediction score are respectively determined according to the formulas (10) to (12), whether the obtained multiple semantic understanding results are accurate or not can be determined according to a specific numerical value of the sum of the three scores of the first prediction score, the second prediction score and the third prediction score, and the larger the sum of the three scores is, the more accurate the corresponding semantic understanding result is.

And S430, training the frame of the end-to-end semantic understanding model according to the training samples based on a preset loss function to obtain the end-to-end semantic understanding model.

In the embodiment, a training sample is firstly obtained, the training sample comprises a natural language text corresponding to audio data, a keyword set corresponding to the natural language text and a label information set corresponding to the keywords, then a frame of an end-to-end semantic understanding model is defined, the frame of the end-to-end semantic understanding model comprises a semantic feature extraction unit, a full connection layer and a target result score calculation unit, and finally the frame of the end-to-end semantic understanding model is trained according to the training sample based on a preset loss function to obtain the end-to-end semantic understanding model.

In some embodiments, the training the frame of the end-to-end semantic understanding model according to the training sample based on the preset loss function to obtain the end-to-end semantic understanding model may specifically include:

and adjusting parameters of the frame of the end-to-end semantic understanding model according to the loss value until the frame of the end-to-end semantic understanding model converges to obtain the end-to-end semantic understanding model.

In some embodiments, the formula for the predetermined loss function is as follows:

S(b,p,r)＝S₁+S₂+S₃ (14)

Loss＝-logp_θ(y|s) (15)

wherein s is a natural languageHere, the model parameter is θ, Y represents a triplet of all combinations, and B ═ B₁,…,b_f]B is the intention set, P ═ P₁,…,p_e]P is a set of keywords, R ═ R₁,…,r_d]R is a label information set of a keyword, f, e and d are positive integers, S (b, p, R) is the sum of scores obtained based on a first prediction score, a second prediction score and a third prediction score, a Loss function is represented by a Loss function, and mu represents all possible label values, so that the goal of the framework of the end-to-end semantic understanding model is as follows: triple y having s in label information of intention-keyword_(b,p,r)The probability of e.g. Y is maximum and the loss function value is minimum.

Specifically, according to the first prediction score, the second prediction score, the third prediction score and the label information set, a loss value corresponding to a preset loss function can be determined, and according to the loss value, parameters of a frame of the end-to-end semantic understanding model are adjusted until the frame of the end-to-end semantic understanding model converges, so that the end-to-end semantic understanding model is obtained.

In the embodiment, the end-to-end semantic understanding model obtained by the method is simple and efficient, and is beneficial to improving the accuracy of domain dialogue understanding.

Fig. 4C is a schematic structural diagram of a semantic feature extraction unit in a framework of an end-to-end semantic understanding model provided by the embodiment of the present disclosure, as shown in fig. 4C: the semantic feature extraction unit may include a semantic representation layer and an encoding layer. The functions of the semantic representation layer and the coding layer have been described in the above embodiments, and are not described herein again to avoid repetition.

Fig. 4D is a structural diagram of a framework of an end-to-end semantic understanding model of an exemplary embodiment of the present disclosure, as shown in fig. 4D: the framework of the end-to-end semantic understanding model can comprise a Bert pre-training model, a Transformer-Encoder, a full connection layer and a target result score calculation unit. The roles of the Bert pre-training model, the Transformer-Encoder, the full link layer, and the target result score calculating unit have been described in the above embodiments, and are not repeated herein to avoid redundancy.

Fig. 4E is a schematic structural diagram of a target result score calculating unit in the framework of the end-to-end semantic understanding model provided by the embodiment of the present disclosure, as shown in fig. 4E: the target result score calculating unit may include a first score determining sub-unit, a second score determining sub-unit, a third score determining sub-unit, and a result determining sub-unit. Wherein the first score determining subunit is specifically configured to: identifying a prediction vector based on the intention, and determining a corresponding first prediction score according to the first parameter matrix; the second score determining subunit is specifically configured to: determining a corresponding second prediction score according to the second parameter matrix based on the keyword prediction vector; the third fraction determining subunit is specifically configured to: identifying a prediction vector and a keyword prediction vector based on the intention, and determining a corresponding third prediction score according to a third parameter matrix; the result determination subunit is specifically configured to: and determining a semantic understanding result corresponding to the natural language text based on the first prediction score, the second prediction score and the third prediction score.

Fig. 5 is an overall framework diagram of an end-to-end semantic understanding model provided by an embodiment of the present disclosure, as shown in fig. 5: supposing that a natural language text is 'a film for watching BC', the text has six characters, and firstly, a representation vector corresponding to each character is obtained through a semantic representation layer; and then, obtaining a semantic vector corresponding to each character through an encoding layer, and then processing the semantic vectors through a full-connection layer to obtain an intention identification prediction vector and a keyword prediction vector. Specifically, the method is obtained through a first full-connection network of a full-connection layer: the first semantic vector corresponding to the "see" word is: g is a radical of formula₁And the first semantic vector corresponding to the B word is as follows: g is a radical of formula₂…, the first semantic vector corresponding to the "shadow" word is: g₆(ii) a Obtaining, over a second fully connected network of fully connected layers: the second semantic vector corresponding to the "see" word is: h is a total of₁And the second semantic vector corresponding to the B word is as follows: h is a total of₂…, the second semantic vector corresponding to the "shadow" word is: h is a total of₆(ii) a Then g is transmitted through the first fully-connected network₁、g₂、…、g₆Splicing together to obtain the predicted vector g of intention recognition_allH is transmitted through a second fully-connected network of fully-connected layers₂And h₃Splicing to obtain h_1-spanH is to be₅And h₆Splicing to obtain h_2-span(ii) a Finally based on g_all、h_1-spanAnd h_2-spanDetermination of S₁、S₂And S₃And is based on S₁、S₂And S₃And determining a semantic understanding result corresponding to the natural language text.

It should be noted that: the above process for determining the semantic understanding result corresponding to the natural language text has been described in the above embodiments, and details of the process are not described herein.

Fig. 6A is a schematic flow chart diagram of a semantic understanding method provided by the embodiment of the present disclosure; fig. 6B is a schematic diagram of a semantic understanding method provided in an embodiment of the present disclosure. The embodiment can be applied to the situation that the text to be predicted is predicted to obtain the corresponding target semantic understanding result. The method of the embodiment may be executed by a semantic understanding apparatus, which may be implemented in hardware and/or software and may be configured in a computer device.

As shown in fig. 6A, the method specifically includes the following steps:

s610, obtaining the text to be predicted.

The text to be predicted may be audio data of a user, for example, voice data in a voice interaction process between the user and the intelligent device, or may be an input text, which is not limited in this embodiment.

S620, inputting the text to be predicted into the end-to-end semantic understanding model to obtain a target semantic understanding result corresponding to the text to be predicted.

The end-to-end semantic understanding model is obtained by training based on the end-to-end semantic understanding model training method in any embodiment. The target semantic understanding result is an output result of the end-to-end semantic understanding model. The target semantic understanding result may include an intention recognition result corresponding to the text to be predicted, keywords, and tag information corresponding to each keyword.

And inputting the text to be predicted into the end-to-end semantic understanding model to obtain a target semantic understanding result corresponding to the text to be predicted.

In the embodiment, the semantic understanding process can be quickly and accurately realized by the semantic understanding method, and the accuracy of domain dialogue understanding is improved.

Fig. 7A is a schematic structural diagram of an end-to-end semantic understanding model training apparatus according to an embodiment of the present disclosure. The device is configured in computer equipment, and can realize the end-to-end semantic understanding model training method in any embodiment of the application. The device specifically comprises the following steps:

a sample obtaining module 701, configured to obtain a training sample, where the training sample includes a natural language text, a keyword set corresponding to the natural language text, and a tag information set corresponding to the keyword;

a frame determining module 702, configured to define a frame of an end-to-end semantic understanding model, and generate a corresponding semantic understanding result based on the frame of the end-to-end semantic understanding model and the training sample, where the semantic understanding result includes an intention recognition result, keywords, and tag information corresponding to each keyword;

the model determining module 703 is configured to train the frame of the end-to-end semantic understanding model according to the training sample based on a preset loss function, so as to obtain the end-to-end semantic understanding model.

As an optional implementation manner of the embodiment of the present disclosure, fig. 7B is a schematic structural diagram of a frame determining module 702 in an end-to-end semantic understanding model training apparatus according to the embodiment of the present disclosure, and as shown in fig. 7B, the frame determining module 702 includes:

a semantic feature extraction unit 7021, configured to generate a corresponding semantic vector based on the natural language text;

the full connection layer 7022 is configured to perform fusion processing on the semantic vectors to obtain an intention recognition prediction vector and a keyword prediction vector;

the target result score calculating unit 7023 is configured to obtain corresponding prediction scores based on the intent recognition prediction vector and the keyword prediction vector, respectively, and determine a semantic understanding result corresponding to the natural language text based on the prediction scores. As an optional implementation manner of the embodiment of the present disclosure, the target result score calculating unit 7023 is specifically configured to:

As an optional implementation manner of the embodiment of the present disclosure, the semantic feature extracting unit 7021 includes: a semantic representation layer and an encoding layer;

As an optional implementation manner of the embodiment of the present disclosure, the model determining module 703 is specifically configured to:

and adjusting parameters of the frame of the end-to-end semantic understanding model according to the loss value until the frame of the end-to-end semantic understanding model converges to obtain the end-to-end semantic understanding model. The end-to-end semantic understanding model training device provided by the embodiment of the disclosure can execute the end-to-end semantic understanding model training method provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method.

Fig. 8 is a schematic structural diagram of a semantic understanding apparatus according to an embodiment of the present disclosure. The device is configured in computer equipment and can realize the semantic understanding method in any embodiment of the application. The device specifically comprises the following steps:

a text obtaining module 801, configured to obtain a text to be predicted;

a result determining module 802, configured to input the text to be predicted into an end-to-end semantic understanding model, to obtain a target semantic understanding result corresponding to the text to be predicted;

the end-to-end semantic understanding model is obtained by training based on the end-to-end semantic understanding model training method in any embodiment.

The semantic understanding apparatus provided in the embodiments of the present disclosure may execute the semantic understanding method provided in any embodiment of the present disclosure, and has the corresponding functional modules and beneficial effects of the execution method, and in order to avoid repetition, details are not repeated here.

An embodiment of the present disclosure provides a computer device, including: one or more processors; a storage device, configured to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the end-to-end semantic understanding model training method of any of the embodiments of the present disclosure, or the semantic understanding method of any of the embodiments of the present disclosure.

Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. As shown in fig. 9, the computer apparatus includes a processor 910 and a storage 920; the number of the processors 910 in the computer device may be one or more, and fig. 9 illustrates one processor 910 as an example; the processor 910 and the storage 920 in the computer device may be connected by a bus or other means, and fig. 9 illustrates the connection by the bus as an example.

Storage 920 is provided as a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the end-to-end semantic understanding model training method in the embodiments of the present disclosure; or program instructions/modules corresponding to the semantic understanding method in the embodiments of the present disclosure. The processor 910 executes software programs, instructions and modules stored in the storage 920 to execute various functional applications and data processing of a computer device, that is, to implement the end-to-end semantic understanding model training method or the semantic understanding method provided by the embodiments of the present disclosure.

The storage device 920 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Additionally, the storage 920 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the storage 920 may further include memory located remotely from the processor 910, which may be connected to a computer device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The computer device provided by the embodiment can be used for executing the method provided by any embodiment, and has corresponding functions and beneficial effects.

The embodiments of the present disclosure further provide a storage medium containing computer-executable instructions, where the computer-executable instructions, when executed by a computer processor, implement the processes performed by the methods provided in any of the embodiments, and can achieve the same technical effects, and are not described herein again to avoid repetition.

The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the foregoing discussion in some embodiments is not intended to be exhaustive or to limit the implementations to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims

1. A method for training an end-to-end semantic understanding model, the method comprising:

2. The method according to claim 1, wherein the framework of the end-to-end semantic understanding model comprises a semantic feature extraction unit, a full connection layer and a target result score calculation unit;

3. The method of claim 2, wherein the identifying the prediction vector based on the intention and the keyword prediction vector respectively obtain corresponding prediction scores, and determining the semantic understanding result corresponding to the natural language text based on the prediction scores comprises:

4. The method of claim 2, wherein the semantic feature extraction unit comprises: a semantic representation layer and an encoding layer;

5. The method according to claim 3, wherein the training the frame of the end-to-end semantic understanding model according to the training samples based on the preset loss function to obtain the end-to-end semantic understanding model comprises:

6. A method of semantic understanding, the method comprising:

acquiring a text to be predicted;

wherein the end-to-end semantic understanding model is trained based on the method of any one of claims 1 to 5.

7. An end-to-end semantic understanding model training apparatus, the apparatus comprising:

8. A semantic understanding apparatus, characterized in that the apparatus comprises:

the text acquisition module is used for acquiring a text to be predicted;

9. A computer device, comprising:

one or more processors;

a storage device to store one or more programs,

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-6.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.