CN112543932A - Semantic analysis method, device, equipment and storage medium - Google Patents

Semantic analysis method, device, equipment and storage medium

Info

Publication number
CN112543932A
CN112543932A (application CN202080004415.XA)
Authority
CN
China
Prior art keywords
entity
text
attention
semantic
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080004415.XA
Other languages
Chinese (zh)
Inventor
李宏广 (Li Hongguang)
聂为然 (Nie Weiran)
高益 (Gao Yi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN112543932A publication Critical patent/CN112543932A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

A semantic analysis method, apparatus, device, and storage medium, relating to the field of artificial intelligence and in particular to natural language understanding. The method comprises the following steps: extracting a structured entity vector from a text to be analyzed, the structured entity vector indicating the identity of an entity and an attribute of the entity; performing feature extraction on the structured entity vector to obtain entity features; fusing the entity features with the lexical features and syntactic features of the text to obtain semantic features of the text; and decoding the semantic features to obtain semantic information of the text. The method uses the attributes of entities to enhance semantic understanding capability.

Description

Semantic analysis method, device, equipment and storage medium
Technical Field
The present application relates to the field of natural language understanding technologies, and in particular, to a semantic analysis method, apparatus, device, and storage medium.
Background
Natural Language Understanding (NLU) is a technology by which a computer analyzes the semantics of text in natural language form; its goal is to enable a machine to understand the meaning of natural language, so that users can communicate with computers in natural language. NLU technology is widely used in many scenarios. For example, in the field of vehicle-mounted devices, after a driver speaks in natural language, the vehicle-mounted terminal can convert the speech into text, perform semantic analysis on the text to obtain its semantic information, and execute a corresponding instruction according to that semantic information, thereby implementing a voice interaction function.
In the related art, the text to be analyzed is segmented to obtain the words it contains; each word is input into a word2vector model (a model for converting words into vectors) and represented as a vector, and the semantic information of the text is analyzed according to the vector corresponding to each word.
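As a concrete illustration, the related-art pipeline above can be sketched as follows; the whitespace segmentation and the toy embedding table are illustrative stand-ins for a real segmenter and a trained word2vector model.

```python
import numpy as np

# Toy embedding table standing in for a trained word2vector model
# (illustrative values; a real model would be learned from a corpus).
EMBEDDINGS = {
    "play":  np.array([0.9, 0.1]),
    "some":  np.array([0.2, 0.2]),
    "music": np.array([0.8, 0.3]),
}

def text_to_vectors(text):
    """Segment the text into words and map each word to its vector."""
    words = text.split()  # stand-in for a real word segmenter
    return [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]

vectors = text_to_vectors("play some music")
```

Semantic analysis then operates on these per-word vectors, which is exactly where the entity-recognition weakness described next arises.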
However, text often contains specific entities, such as song titles and place names, that strongly influence its semantics. The method above recognizes such entities poorly, so the computer's semantic understanding capability is insufficient.
Disclosure of Invention
The present application provides a semantic analysis method, apparatus, device, and storage medium, which can improve the semantic understanding capability of a computer.
In a first aspect, a semantic analysis method is provided, in which an entity in a text to be analyzed is obtained; acquiring a structured entity vector corresponding to the entity according to the entity in the text to be analyzed, wherein the structured entity vector is used for indicating the identity of the entity and the attribute of the entity; carrying out feature extraction on the structured entity vector to obtain entity features; and fusing the entity characteristics, the lexical characteristics of the text and the syntactic characteristics of the text to obtain the semantic characteristics of the text, wherein the semantic characteristics are used for acquiring semantic information of the text.
In the method, a structured entity vector is constructed so that the identity of an entity and the attributes of the entity are represented in vector form. Entity features are extracted from the structured entity vector and fused with lexical and syntactic features to obtain semantic features containing all three kinds of features; semantic information is obtained after the semantic features are decoded.
Optionally, the structured entity vector may be obtained as follows: according to the entity in the text to be analyzed, the structured entity vector is looked up in an entity construction table, which stores the mapping relationship between entities and structured entity vectors. In this way, the vector of an entity can represent both the entity and its attributes, so the entity is vectorized effectively and embedded properly; when a subsequent pre-training model performs recognition based on the structured entity vector, the pre-training model's vehicle-mounted semantic intent understanding and semantic slot extraction capabilities are enhanced.
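A minimal sketch of the table lookup described above; the table contents, vector dimensions, and attribute encoding are all illustrative assumptions, not taken from the patent.

```python
import numpy as np

# Hypothetical entity construction table: each entry maps an entity name to a
# structured entity vector encoding the entity's identity (first two dims)
# and its attributes (last two dims: [is_place, is_song]). Illustrative only.
ENTITY_TABLE = {
    "flowers of the world": np.array([0.12, 0.80, 1.0, 0.0]),
    "yesterday once more":  np.array([0.55, 0.31, 0.0, 1.0]),
}

def lookup_structured_entity_vectors(entities):
    """Return the structured entity vector for each recognized entity,
    falling back to a zero vector for out-of-table entities."""
    dim = len(next(iter(ENTITY_TABLE.values())))
    return [ENTITY_TABLE.get(e, np.zeros(dim)) for e in entities]

vecs = lookup_structured_entity_vectors(["flowers of the world", "unknown entity"])
```

The attribute dimensions are what let downstream feature extraction distinguish, say, a place name from a song title that shares the same surface string.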
Optionally, the entity construction table includes entities associated with the vehicle-mounted domain, and the text is obtained by recognizing speech acquired by the vehicle-mounted terminal. This facilitates constructing structured knowledge entities for the vehicle-mounted domain.
Optionally, the entity construction table includes at least one of: entities with irregular names, entities whose names exceed a threshold number of characters, and entities whose names have a word frequency below a threshold. Such names are prone to ambiguity or polysemy, making the correct semantics hard for a machine to grasp. By storing vector representations of these entities in the entity construction table in advance, the machine can obtain accurate vector representations by table lookup and blend entity features into the semantic understanding process, improving the accuracy of semantic understanding.
Optionally, the entity feature, the lexical feature, and the syntactic feature are fused as follows: the three features are weighted and summed to obtain a fusion feature, and the fusion feature is passed through an activation function (a nonlinear transformation) to obtain the semantic feature. Because the lexical, syntactic, and entity features lie in different vector spaces, that is, they are heterogeneous information, weighted summation can fuse the three features together and thereby realize heterogeneous information fusion.
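The fusion step can be sketched as follows; the weight values and the choice of tanh as the activation function are illustrative assumptions, since the text fixes neither.

```python
import numpy as np

def fuse_features(entity_feat, lexical_feat, syntactic_feat,
                  w_e=0.3, w_l=0.4, w_s=0.3):
    """Weighted summation of three heterogeneous feature vectors, followed by
    a nonlinear transformation (tanh here, as an assumed activation)."""
    fused = w_e * entity_feat + w_l * lexical_feat + w_s * syntactic_feat
    return np.tanh(fused)  # nonlinear transformation -> semantic feature

semantic = fuse_features(np.ones(4), np.zeros(4), np.full(4, 2.0))
```

In practice the weights could themselves be learned parameters; the key point is that summation requires all three features to share one dimensionality.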
Optionally, the lexical and syntactic features of the text are extracted as follows: the text is input into a semantic understanding model, which is obtained by transfer training (fine-tuning) a pre-training model on a first sample comprising text annotated with semantic information; the pre-training model itself is trained on a second sample comprising masked text. The semantic understanding model then extracts the lexical features and the syntactic features from the text. Training the pre-training model with a masking strategy gives it basic natural language processing capability. On that basis, fine-tuning the pre-training model with semantically annotated text toward the goal of semantic understanding lets the model learn the association between text and semantic information during fine-tuning, so that it acquires the ability to extract lexical, syntactic, and semantic features. At the model application stage, the semantic understanding model can therefore extract accurate lexical, syntactic, and semantic features.
Optionally, the manner of extracting the lexical features and the syntactic features by the semantic understanding model may include: performing attention operation on the text to obtain a first output result, wherein the first output result is used for indicating the dependency relationship between words in the text; normalizing the first output result to obtain a second output result; performing linear transformation and nonlinear transformation on the second output result to obtain a third output result; and normalizing the third output result to obtain the lexical feature and the syntactic feature.
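The four-stage pipeline above (attention, normalization, linear plus nonlinear transformation, normalization) can be sketched as follows; the residual connections and the specific layer-normalization formula are assumptions borrowed from standard Transformer practice, since the text only names the normalization steps.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward(x, W1, W2):
    # Linear transformation, ReLU nonlinearity, linear transformation.
    return np.maximum(0.0, x @ W1) @ W2

def encoder_sublayers(x, attention_fn, W1, W2):
    """Attention (first output) -> normalization (second output) ->
    linear + nonlinear transformation (third output) -> normalization,
    yielding the lexical/syntactic features. Residual connections are an
    assumption from standard Transformer practice."""
    first = attention_fn(x)          # dependency between words
    second = layer_norm(x + first)
    third = feed_forward(second, W1, W2)
    return layer_norm(second + third)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 4))
features = encoder_sublayers(x, lambda z: z, W1, W2)  # identity "attention" for illustration
```

The `attention_fn` slot is where the multi-head attention model described next would plug in.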
Optionally, the semantic understanding model comprises a first multi-head attention model, and accordingly, the manner of attention operation comprises: inputting the text into the first multi-headed attention model; performing attention operation on the text through each attention module in the first multi-head attention model to obtain an output result of each attention module; splicing the output results of each attention module to obtain a splicing result; and carrying out linear transformation on the splicing result to obtain the first output result. By the method, a multi-head attention mechanism can be utilized, long-distance features in the text can be captured, abundant context semantic representation information can be extracted, and the extraction capability of lexical features and syntactic features is enhanced.
Optionally, the manner of extracting the entity features includes: inputting the structured entity vector into a second multi-headed attention model; performing attention operation on the structured entity vector through each attention module in the second multi-head attention model to obtain an output result of each attention module; splicing the output results of each attention module to obtain a splicing result; and carrying out linear transformation on the splicing result to obtain the entity characteristics. Through the mode, the multi-head attention mechanism is utilized, the correlation between words in the structured entity vector can be captured, long-distance features can be captured, extracted entity features can accurately express semantics, and therefore the entity features are more accurate.
In a second aspect, a semantic analysis device is provided, which has a function of performing semantic analysis according to the first aspect or any one of the optional aspects of the first aspect. The semantic analysis device comprises at least one module, and the at least one module is used for implementing the semantic analysis method provided by the first aspect or any one of the optional manners of the first aspect. For specific details of the semantic analysis device provided by the second aspect, reference may be made to the first aspect or any optional manner of the first aspect, and details are not described here.
In a third aspect, an execution device is provided, where the execution device includes a processor, and the processor is configured to execute instructions, so that the execution device performs the semantic analysis method provided in the first aspect or any one of the alternatives of the first aspect. For specific details of the execution device provided in the third aspect, reference may be made to the first aspect or any optional manner of the first aspect, and details are not described here again.
In a fourth aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the instruction is read by a processor to enable an execution device to execute the semantic analysis method provided in the first aspect or any one of the alternatives of the first aspect.
In a fifth aspect, a computer program product is provided, which, when run on an execution device, causes the execution device to perform the semantic analysis method provided in the first aspect or any of the alternatives of the first aspect.
A sixth aspect provides a chip, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to execute the semantic analysis method provided in the first aspect or any one of the optional manners of the first aspect.
Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, and when the instructions are executed, the processor is configured to execute the semantic analysis method provided by the first aspect or any one of the optional manners of the first aspect.
Drawings
Fig. 1 is a schematic structural diagram of a system architecture according to an embodiment of the present application;
FIG. 2 is a diagram illustrating lexical and syntactic features extraction according to a semantic understanding model provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a training method of a semantic understanding model provided by an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram of a semantic analysis method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of extracting a structured entity vector according to an embodiment of the present application;
FIG. 6 is a diagram illustrating a fusion of entity, lexical, and syntactic features provided in an embodiment of the present application;
FIG. 7 is a schematic flow chart diagram of a method for vehicle-mounted voice interaction based on a semantic understanding model and a structured entity vector according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of semantic intent understanding and semantic slot extraction provided by an embodiment of the present application;
fig. 9 is a schematic structural diagram of a semantic analysis apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a training apparatus for semantic understanding models according to an embodiment of the present disclosure;
fig. 11 is a hardware structure diagram of a semantic analysis apparatus provided in an embodiment of the present application;
fig. 12 is a hardware configuration diagram of a training apparatus for a semantic understanding model according to an embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
The semantic analysis method provided by the embodiments of the present application can be applied to human-computer interaction scenarios and other scenarios in which a computer needs to understand natural language. Specifically, it can be applied to voice interaction scenarios, for example vehicle-mounted voice interaction; voice interaction and vehicle-mounted voice interaction are briefly introduced below.
Voice interaction refers to the transfer of information between a human and a device by natural speech. The vehicle-mounted voice interaction scenario is one in which a user performs voice interaction with a vehicle-mounted terminal installed in an automobile. For example, while the vehicle is being driven, the user can issue speech containing an instruction; the vehicle-mounted terminal converts the speech into an instruction the machine can understand and executes it, enabling intelligent functions such as placing voice calls, opening and closing the vehicle-mounted air conditioner, automatically adjusting seat height and temperature, and playing music. Through this mode of human-computer interaction, the user's hands and eyes are left free for other things: for example, when the user wants to listen to music, a song can be requested by voice, so that the hands and eyes stay focused on driving, greatly improving driving safety and convenience in the vehicle-mounted scenario.
In the application scenario of voice interaction, Natural Language Understanding (NLU) is a key technology for implementing a vehicle-mounted voice interaction system. Natural language understanding is part of Natural Language Processing (NLP); it is the core of NLP and also its difficulty. In general, natural language understanding aims to give a machine a human-like ability to understand natural language and to output correct semantic information (such as the correct semantic intent and semantic slots) given an input text. Natural language is the way people ordinarily express themselves in daily life; for example, the characteristic of a hunched back can be expressed in natural language as "I am a little hunchbacked", and in a non-natural-language form as "my back is curved".
However, natural language understanding currently falls short in some respects, especially in the vehicle-mounted voice interaction scenario: the vehicle-mounted terminal often lacks sufficient capability to understand semantic intent, and it cannot understand certain structured knowledge entities and abstract semantic representations. For example, basic entities such as song names with irregular grammar, long place names, or place names containing low-frequency characters are difficult for the vehicle-mounted terminal to identify, and this insufficient entity recognition capability greatly reduces the accuracy of semantic understanding. Suppose a user wants to go to a holiday plaza in Beijing named "Flowers of the World" and says "search Flowers of the World" to the vehicle-mounted terminal. The intent expressed in this sentence is navigation, with "Flowers of the World" as the destination. But when the vehicle-mounted terminal recognizes the phrase "Flowers of the World", it can easily interpret it as a song title and misunderstand the user's intent as listening to a song named "Flowers of the World". As a result, although the terminal should execute the navigation service, it executes the music playback service instead because the intent was misunderstood, and the service executed fails to match the feedback the user expects.
Therefore, in a vehicle-mounted voice interaction scene, how to improve semantic comprehension capability is crucial, and the method is a popular research direction in the vehicle-mounted field in the future.
In view of this, some embodiments of the present application provide a semantic understanding method that combines a pre-training model with structured entity vectors. First, random multivariate dynamic mask training on large-scale corpora produces a pre-training model, which is then fine-tuned into a semantic understanding model capable of extracting lexical, syntactic, and semantic features; the pre-training and fine-tuning processes improve the understanding of semantic intent and the extraction of semantic slots, particularly the extraction of lexical, syntactic, and semantic features in the vehicle-mounted domain, giving the model very strong semantic intent understanding capability. Second, entity representation is realized by constructing structured entity vectors, and the attributes of entities enhance the semantic intent understanding capability of the semantic understanding model; in particular, an entity construction table for the vehicle-mounted domain helps the vehicle-mounted terminal identify basic structured entity vectors and improves semantic intent understanding and semantic slot extraction. Third, fusing the entity, lexical, and syntactic features realizes the fusion of heterogeneous information: semantic information from three different vector spaces is combined to identify semantics, improving the accuracy of semantic understanding.
The method provided by the application is described from the model training side and the model application side as follows:
The training method for the semantic understanding model provided by the embodiments of the present application involves natural language understanding and can in particular be applied to data processing methods such as data training, machine learning, and deep learning; it performs symbolized and formalized intelligent information modeling, extraction, preprocessing, pre-training, and model fine-tuning on training data (such as masked text, or text annotated with semantic information such as semantic intents and semantic slots in the present application), finally obtaining a trained semantic understanding model. In addition, the semantic analysis method provided by the embodiments of the present application can use the trained semantic understanding model: input data (such as the text to be analyzed in the embodiments of the present application) is fed into the trained model to obtain output data (such as semantic information like the semantic intent and semantic slots in the embodiments of the present application). It should be noted that the training method and the semantic analysis method are based on the same inventive concept and can also be understood as two parts of one system, or two stages of an overall process, such as a model training stage and a model application stage.
Since the semantic understanding model of the present application relates to the application of the attention mechanism to natural language understanding, for the convenience of understanding, the related concepts in the attention mechanism related to the embodiments of the present application will be described first.
(1) The self-attention mechanism.
The self-attention mechanism is a refinement of the attention mechanism that reduces reliance on external information and is better at capturing internal correlations of data or features. Its essence is to compute attention of a sequence with respect to itself: in self-attention, the target sequence is the same as the source sequence. Applied to NLP, self-attention can extract dependencies between the words of a sentence, such as phrase structure and pronoun reference. When a sentence is input, the machine, while encoding each word, attends not only to the word being encoded but also to the other words of the input sentence; by computing attention between each word and all words in the sentence, it learns the word dependencies inside the sentence and thereby captures its internal structure. The attention operation can be encapsulated in an attention function, denoted Attention(X, X); after obtaining the input text sequence, the machine can call this function with the text sequence as X to perform the self-attention operation. The self-attention mechanism has many advantages. For example, from the perspective of learning long-distance dependencies, since self-attention computes attention between each word and all words, the maximum path length between any two words is 1 regardless of their distance, so dependency relationships can be computed between arbitrarily distant words and the internal structure of a sentence can be learned.
The following first describes how to implement the self-attention operation with vectors, and then how to implement it with matrices.
The process of implementing the self-attention operation using the vector may include the following steps S10 to S14:
step S10, for each word in the input sequence, generates three vectors including a query vector, a key vector, and a value vector. Typically, these three vectors are created by multiplying the word embedding of the word with three weight matrices. For example, if the input sentence is a Thinking machine, the first word in the sentence is "think" (Thinking), "think" word is embedded as X1, X1 is multiplied by the WQ weight matrix to obtain q1, and q1 is the query vector associated with the word.
Step S11: calculate scores. Suppose the self-attention of the first word "Thinking" in this example is being computed; every word of the input sentence scores "Thinking", producing a score for each word. The score determines how much attention is placed on the other parts of the sentence while encoding "Thinking". Each score is obtained as the dot product of the key vector of the scoring word (each word of the input sentence in turn) with the query vector of "Thinking". For example, in a two-word sentence, let the first word have embedding x1, query vector q1, key vector k1, and value vector v1, and the second word have embedding x2, query vector q2, key vector k2, and value vector v2. When processing the self-attention of the first word, the first score is the dot product of q1 and k1, and the second score is the dot product of q1 and k2.
Step S12: process the scores. Each score is divided by a default value (typically the square root of the dimension of the key vectors), and the division results are passed through the softmax function to obtain the softmax score of each word. Dividing by the default value shrinks the scores into a smaller range, which prevents the softmax scores from saturating near 0 or 1. The softmax function normalizes the scores of all words so that each softmax score is positive and the softmax scores of all words in the sentence sum to 1. The softmax score determines the contribution of each word (e.g., of "Thinking" and "Machines") to the encoding of the current word.
Step S13, multiply each value vector by the softmax score.
Step S14, summing the weighted value vectors, and obtaining the output from the attention layer at that position (e.g., the output of the first word "Thinking").
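Steps S10 to S14 can be sketched for a single position as follows, assuming the usual Transformer choice of the square root of the key dimension as the "default value" of step S12.

```python
import numpy as np

def self_attention_for_position(x, WQ, WK, WV, i):
    """Steps S10-S14 for the word at position i of the embedded sentence x
    (one row per word)."""
    Q, K, V = x @ WQ, x @ WK, x @ WV          # S10: q, k, v for every word
    scores = K @ Q[i]                          # S11: dot products q_i . k_j
    scores = scores / np.sqrt(K.shape[-1])     # S12: divide by the default value
    weights = np.exp(scores) / np.exp(scores).sum()  # S12: softmax
    return weights @ V                         # S13 + S14: weight and sum values

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 3))                    # two words, embedding dim 3
WQ, WK, WV = (rng.normal(size=(3, 3)) for _ in range(3))
z1 = self_attention_for_position(x, WQ, WK, WV, 0)  # output for the first word
```

The returned vector is the attention-layer output for that position, ready to be passed to the feedforward network.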
By performing steps S10 to S14 above, the self-attention computation is completed, and the resulting vector can be passed to the feedforward neural network. In practice, steps S10 to S14 can be carried out in matrix form for faster computation. For example, the self-attention computation can be realized with matrices by performing the following steps S20 to S21.
Step S20, calculate the query matrix, the key matrix, and the value matrix. Specifically, the word vectors of the words in the input sentence are packed into a matrix X, and X is multiplied by the query weight matrix WQ, the key weight matrix WK, and the value weight matrix WV respectively, obtaining the query matrix Q, the key matrix K, and the value matrix V. The query matrix Q can be calculated with formula (1), the key matrix K with formula (2), and the value matrix V with formula (3):
Q = X WQ    (1)
K = X WK    (2)
V = X WV    (3)
Wherein each row in the matrix X corresponds to a word in the input sentence, each row in the matrix X is a word vector of a word, the matrix Q represents a Query (Queries) matrix of the input sentence, each row in the matrix Q is a Query vector of a word, the matrix K represents a Key (Key) matrix of the input sentence, each row in the matrix K is a Key vector of a word, the matrix V represents a Value (Value) matrix of the input sentence, and each row in the matrix V is a Value vector of a word.
Step S21 can be expressed by the following formula (4), and the following formula (4) is a combination of the above-described steps S11 to S14.
Z = softmax( (Q × K^T) / √d_k ) × V    (4)

where d_k is the dimension of the key vectors.
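The matrix calculation of steps S20 to S21 (formula (4)) can be sketched as follows. The names, dimensions, and random data are illustrative assumptions; since each row of X is a word vector here, the multiplication is written X·W, which computes the same projections as formulas (1) to (3).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Matrix form of formula (4): Z = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 8))                  # 3 words, one 8-dim word vector per row of X
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv             # query, key, and value matrices (formulas (1)-(3))
Z = attention(Q, K, V)                        # step S21 in one matrix operation
```

Each row of the softmax weight matrix sums to 1, matching the per-word softmax scores of step S12.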
(2) The multi-head attention (Multi-Head Attention) model.
The multi-head attention model is called "multi-head" because it includes h attention modules, each of which implements the self-attention mechanism described in (1), so h attention operations are performed by the h attention modules; h is a positive integer greater than 1, for example h may be 8. Each attention module maintains its own independent query weight matrix, key weight matrix, and value weight matrix. Therefore, after each attention module operates on the input matrix X with its query weight matrix W^Q, key weight matrix W^K, and value weight matrix W^V, h query matrices Q, h key matrices K, and h value matrices V are generated, and h output matrices Z are further produced, namely matrix Z_0, matrix Z_1, up to matrix Z_{h-1}. However, the network behind the multi-head attention model (for example, a feedforward network) generally does not take h matrices as input; it requires a single matrix composed of the representative vector of each word. Thus, the h matrices Z can be compressed into one matrix. One way to achieve this compression is to splice the h matrices (matrix Z_0, matrix Z_1 to matrix Z_{h-1}) together and then multiply the splicing result by an additional weight matrix W^O, obtaining a matrix Z that fuses the information of all the attention modules; this matrix Z is used for subsequent operations, such as being sent to the feedforward network. Optionally, the number of dimensions of the splicing output equals the sum of the numbers of dimensions of the splicing inputs, and the number of rows of the splicing output equals the number of rows of the splicing inputs.
For example, after the h matrices (matrix Z_0, matrix Z_1 to matrix Z_{h-1}) are spliced, the output result is a large matrix containing the h matrices; the number of dimensions of the large matrix is the sum of the numbers of dimensions of the h matrices, and the number of rows of the large matrix equals the number of rows of each of the h matrices.
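The splice-and-fuse procedure can be sketched as follows. The choice of h = 8 modules, the dimensions, and the random weights are illustrative assumptions; each module keeps its own W^Q, W^K, W^V, and the spliced result is multiplied by W^O.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, Wo):
    """Run h attention modules, splice their Z matrices, fuse them with W^O."""
    Zs = []
    for Wq, Wk, Wv in heads:                        # each module has independent weight matrices
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        Zs.append(softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V)
    concat = np.concatenate(Zs, axis=-1)            # splice the h matrices side by side
    return concat @ Wo                              # fuse into one matrix Z for the next network

rng = np.random.default_rng(2)
h, d_model, d_head = 8, 16, 2                       # h = 8 attention modules
X = rng.normal(size=(5, d_model))                   # 5 words in the input sentence
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(h)]
Wo = rng.normal(size=(h * d_head, d_model))         # additional weight matrix W^O
Z = multi_head_attention(X, heads, Wo)
```

The spliced matrix has h × d_head columns but the same number of rows as each module's output, matching the dimension and row-count statements above.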
The multi-head attention model has several advantages.
From the perspective of semantic feature extraction capability, the multi-head attention model has a strong ability to extract semantic features because multiple attention modules are used. The weight matrices of each attention module are initialized randomly, and after training, each weight matrix projects the input word embedding (or the vector from a lower encoder/decoder) into a different representation subspace, which allows the model to learn relevant information in different representation subspaces.
From the perspective of long-distance feature capture capability, the multi-head attention model is based on the self-attention mechanism, so it inherits the advantages of that mechanism and can learn the internal structure of a sentence. On this basis, the use of multiple attention modules expands the model's ability to focus on different positions, further enhancing its long-distance feature capture capability.
From the perspective of comprehensive task feature extraction capability, the multi-head attention model performs excellently in lexical, syntactic, and semantic processing, context handling, long-distance feature capture, and the like, so its comprehensive feature extraction capability is very strong.
From the viewpoint of parallel computing capability, the multi-head attention model does not depend on the computation of the previous time step, so it can be run in parallel.
The above describes the self-attention mechanism related to the semantic understanding model of the embodiment of the present application, and the semantic understanding model of the embodiment of the present application also relates to some concepts in the AI field, which are described below for the convenience of understanding.
(3) Activation function (activation function): a function used to perform a non-linear transformation.
(4) The Gaussian error linear unit (GELU) is a high-performance activation function. The non-linear transformation of the GELU function is a stochastic regularization transformation that conforms to expectation, so it performs excellently in the NLP field, and best in self-attention models in particular; it can also avoid the vanishing gradient problem.
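A minimal sketch of the GELU function is shown below. The exact form x·Φ(x) (with Φ the standard normal CDF) and the tanh approximation with its constants are the commonly published formulations, not values given in this document.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Common tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```

For large positive inputs GELU approaches the identity, and for large negative inputs it approaches zero, giving a smooth non-linear transformation.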
(5) Loss function
In the process of training a model, because the output of the model is expected to be as close as possible to the value that is really desired, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the truly desired target value (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the model). For example, if the predicted value of the network is too high, the weight vectors are adjusted to lower it, and the adjustment continues until the model can predict the truly desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the purpose of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function, the larger the difference, so training the model becomes the process of reducing this loss as much as possible.
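A loss function of the kind described above can be illustrated with mean squared error; the particular loss and the numbers below are made up for illustration and are not prescribed by this document.

```python
def mse_loss(predicted, target):
    """Mean squared error: measures the difference between predicted and target values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

target = [1.0, 0.0, 1.0]
far = mse_loss([0.2, 0.9, 0.1], target)   # poor prediction -> larger loss
near = mse_loss([0.9, 0.1, 0.8], target)  # close prediction -> smaller loss
```

Training drives the parameters so that the loss moves from values like `far` toward values like `near`.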
(6) Back propagation algorithm
During training, the model can adopt the back propagation (BP) algorithm to correct the values of the parameters in the initial model so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters of the initial model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a backward-propagation process dominated by the error loss, aimed at obtaining the optimal parameters of the model, such as the weight matrices.
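The forward-then-backward update can be sketched for a one-parameter linear model; the learning rate, data point, and step count below are illustrative assumptions.

```python
def train_step(w, x, y, lr):
    """One forward pass plus one back-propagated parameter update for y_hat = w*x."""
    y_hat = w * x                 # forward: the input signal propagates to the output
    loss = (y_hat - y) ** 2       # error loss between prediction and target
    grad = 2 * (y_hat - y) * x    # backward: d(loss)/dw propagated from the error
    return w - lr * grad, loss    # update the parameter against the gradient

w, losses = 0.0, []
for _ in range(50):               # repeat until the error loss converges
    w, loss = train_step(w, x=2.0, y=6.0, lr=0.05)
    losses.append(loss)
```

With x = 2 and target y = 6, the updates converge to the optimal parameter w = 3 and the loss shrinks toward zero, which is exactly the convergence behavior described above.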
The above describes the self-attention mechanism involved in the semantic understanding model of the embodiment of the present application, and the semantic understanding model of the embodiment of the present application also relates to some concepts in the field of knowledge graph technology, which are described below for the convenience of understanding.
(7) An entity (entity) refers to something that is distinguishable and exists independently. An entity may be a concrete object, such as a person, a city, a plant, or a commodity. An entity may also be an abstract event, such as a book loan or a ball game. Everything in the world is made up of concrete things, and all such things can be referred to as entities.
(8) Entity extraction refers to extracting the entities in a text, such as person names, organization/institution names, geographic locations, events/dates, character values, and monetary values. Entity extraction includes detecting (find) and classifying (classify) entities. In general, entity extraction finds an entity in a sentence and tags the entity.
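A minimal dictionary-based sketch of the detect-and-classify idea is given below. The entity dictionary, labels, and example sentence are invented for illustration; the patent does not specify the entity extraction module at this level of detail.

```python
# Hypothetical entity dictionary mapping surface forms to entity classes.
ENTITY_DICT = {
    "Jay Chou": "singer",
    "Shanghai": "city",
    "air conditioner": "device",
}

def extract_entities(text):
    """Find dictionary entities in the text and tag each with its class."""
    found = []
    for surface, label in ENTITY_DICT.items():
        pos = text.find(surface)
        if pos >= 0:                              # detect: locate the entity in the sentence
            found.append((surface, label, pos))   # classify: attach its type tag
    return sorted(found, key=lambda e: e[2])

entities = extract_entities("navigate to Shanghai and turn on the air conditioner")
```

Real systems typically use trained sequence-labeling models rather than a fixed dictionary, but the input/output shape (spans plus tags) is the same.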
(9) Attributes: an entity has a number of properties, and each property is called an attribute. Each attribute has a value range, whose type may be integer, real, or string. For example, a student (entity) has attributes such as student number, name, age, and gender, whose corresponding value ranges are of string, string, integer, and string type.
The system architecture provided by the embodiments of the present application is described below.
Referring to fig. 1, a system architecture 100 is provided in an embodiment of the present application. As shown in the system architecture 100, the data acquisition device 16 is configured to acquire training data, which in this embodiment of the present application includes: text tagged with semantic information, such as text tagged with semantic intent and semantic slots. Optionally, the training data further includes masked text, for example, samples processed by the random multivariate masking strategy; the data acquisition device 16 stores the training data in the database 13. Training device 12 trains resulting semantic understanding model 200 based on training data maintained in database 13. In the first embodiment, how the training device 12 obtains the semantic understanding model 200 based on the training data will be described in more detail below, where the semantic understanding model 200 can be used to implement the function of extracting the lexical feature and the syntactic feature in the embodiment of the present application, that is, the lexical feature and the syntactic feature can be obtained by inputting the text to be analyzed into the semantic understanding model 200 after relevant preprocessing.
The semantic understanding model 200 in the embodiment of the present application may specifically be an attention-based model, and in some embodiments of the present application, the semantic understanding model 200 is obtained by performing model fine-tuning on a pre-training model (such as a multi-head attention model and some weight matrices). In practical applications, the training data maintained in the database 13 is not necessarily acquired by the data acquisition device 16, and may be received from other devices. It should be noted that, the training device 12 does not necessarily perform training of the semantic understanding model 200 based on the training data maintained by the database 13, and may also obtain the training data from a cloud or other places for performing model training.
The semantic understanding model 200 obtained by training according to the training device 12 may be applied to different systems or devices, for example, the semantic understanding model 200 is applied to the execution device 11 shown in fig. 1, where the execution device 11 may be a terminal, such as a vehicle-mounted terminal, a mobile phone terminal, a tablet computer, a laptop computer, an AR/VR, and may also be a server or a cloud. In fig. 1, the execution apparatus 11 is configured with an I/O interface 112 for data interaction with an external apparatus.
The system architecture shown in fig. 1 may be applied to a voice interaction scenario, a product form of the voice interaction scheme provided in the embodiment of the present application may be a voice personalized adaptive algorithm module of a voice interaction software system, and an implementation form of the product is a computer program running on various terminal devices. For example, when the method is applied to a scene of vehicle-mounted voice interaction, the semantic intention of a vehicle-mounted user control instruction can be understood through the voice interaction product provided by the embodiment of the application, and the function of a corresponding vehicle-mounted module is realized.
The functions of the various modules in the system architecture are illustrated below.
A user may input speech to I/O interface 112 through audio capture device 14. The audio capture device 14 may include a distributed microphone array for capturing the user's voice control commands, and in addition, the audio capture device 14 may perform some audio signal pre-processing operations such as sound source localization, echo cancellation, and signal enhancement.
The speech recognition module 113 is configured to perform speech recognition according to input data (such as the speech signal) received by the I/O interface 112, so as to obtain a text to be analyzed. In this way, the input data is converted from a speech signal into a text signal and output to the semantic understanding module 111.
The semantic understanding module 111 is used to understand semantics, such as extracting semantic intent and semantic slots of a user. The semantic understanding module 111 may include a semantic understanding model 200, an entity extraction module 210, an entity construction module 220, a heterogeneous information fusion module 230, and a semantic decoding module 240. The specific functions of each module are as follows:
the semantic understanding model 200 is obtained after migration training is performed according to the pre-training model, and the semantic understanding model 200 is responsible for extracting lexical and syntactic semantic features of text input and realizing preliminary semantic intention understanding of user commands.
The entity extraction module 210 is used for extracting entities from the text input to obtain valid entities.
The entity construction module 220 is configured to perform vectorization representation on the entity to obtain the representation of the entity and the attribute.
The heterogeneous information fusion module 230 fuses lexical features, syntactic features and entity features of the text input to obtain semantic features, which can enhance the comprehension ability of semantic intentions and the extraction ability of semantic slots due to the combination of effective information in different vector spaces.
The semantic decoding module 240 is configured to decode the semantic features to obtain semantic information, such as semantic intent understanding and semantic slot extraction of user command input, and output a control command.
In the process that the execution device 11 preprocesses the input data or in the process that the semantic understanding module 111 of the execution device 11 performs the calculation or other related processing, the execution device 11 may call the data, the code, and the like in the data storage system 15 for corresponding processing, and may store the data, the instruction, and the like obtained by corresponding processing into the data storage system 15. In addition, after the execution device 11 determines the semantic intent and the semantic slot of the user, it issues a control command to the I/O interface 112.
Finally, the I/O interface 112 returns the control command to the vehicle-mounted execution system 18, and the vehicle-mounted execution system 18 executes the corresponding control command, such as listening to songs, performing voice navigation, answering incoming calls, controlling the vehicle temperature, and the like, so as to support an intelligent vehicle-mounted scene.
It is worth mentioning that the scenario of vehicle-mounted voice interaction is only illustrative. Training device 12 may also generate corresponding semantic understanding models 200 for different tasks based on different training data, and such corresponding semantic understanding models 200 may be used to achieve the above goals or accomplish the above tasks, thereby providing the user with desired results.
For example, the system architecture described above may also be applied to a machine translation scenario or a robot question and answer scenario, and the audio capture device 14 shown in fig. 1 may also be replaced by a mobile phone, a personal computer, or other user devices. The user may manually give input data, which may be manipulated through an interface provided by the I/O interface 112. Alternatively, the user device may automatically send the input data to the I/O interface 112, and if the user device is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the user device. The user can view the result output by the execution device 11 at the user device, and the specific presentation form may be a display, a sound, an action, and the like. The user device may also be used as a data acquisition terminal, and acquires input data inputted to the I/O interface 112 and an output result outputted from the I/O interface 112 as new sample data, and stores the new sample data in the database 13. Of course, the input data inputted into the I/O interface 112 and the output result outputted from the I/O interface 112 as shown in the figure may be directly stored in the database 13 as new sample data by the I/O interface 112 without being collected by the user equipment.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 1, the data storage system 15 is an external memory with respect to the execution device 11, and in other cases, the data storage system 15 may also be disposed in the execution device 11.
As shown in fig. 2, the semantic understanding model 200 is obtained by training the training device 12, and the semantic understanding model 200 provided in the embodiment of the present application may include: a first multi-head attention model 201, a first vector normalization layer 202, a forward pass layer 203, and a second vector normalization layer 204.
The first multi-head attention model 201 is used for receiving an input text, performing attention operations on the text, and sending the output result to the first vector normalization layer 202. The first multi-head attention model 201 comprises a plurality of attention modules, each of which may also be referred to as an attention head. For example, in fig. 2, the first multi-head attention model 201 includes attention module 0, attention module 1, attention module 2, attention module 3, attention module 4, attention module 5, attention module 6, and attention module 7. For the technical details of the first multi-head attention model 201 as a whole, refer to the description of (2) above. Each attention module implements the attention operation; for the technical details of each attention module's operation, refer to the description of (1) above.
The first vector normalization layer 202 is configured to receive the output of the first multi-head attention model 201, perform a normalization calculation on it, and send the result to the forward transfer layer 203. The normalization calculation normalizes the mean and variance of the samples, thereby reducing the overall learning difficulty.
The forward transfer layer 203 is configured to receive the output of the first vector normalization layer 202, perform a forward transfer calculation on it, and send the result to the second vector normalization layer 204. Through the forward transfer calculation, the forward transfer layer 203 can implement a linear transformation and a non-linear transformation, mapping the input of the first vector normalization layer 202 into a high-dimensional vector space.
The second vector normalization layer 204 is configured to receive the output of the forward transfer layer 203, perform a normalization calculation on it, and output the result. This normalization calculation likewise normalizes the mean and variance of the samples, thereby reducing the overall learning difficulty.
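The four-stage structure of the semantic understanding model 200 (attention, normalization, forward transfer, normalization) can be sketched as follows. A single-head attention is used for brevity, and all names, dimensions, and random weights are illustrative assumptions rather than the model's actual configuration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the mean and variance of each row (vector normalization layer)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_block(X, Wq, Wk, Wv, W1, W2):
    """Attention -> normalization -> forward transfer -> normalization, as in Fig. 2."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V   # (simplified single-head) attention
    H = layer_norm(A)                                  # first vector normalization layer
    F = np.maximum(0.0, H @ W1) @ W2                   # forward transfer: linear + non-linear + linear
    return layer_norm(F)                               # second vector normalization layer

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 8))                            # 4 words, 8-dim vectors
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
W1, W2 = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))  # map to a higher-dimensional space and back
out = encoder_block(X, Wq, Wk, Wv, W1, W2)
```

The residual connections used in standard transformer blocks are omitted here, since the description above does not mention them.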
The first embodiment is as follows:
fig. 3 shows a semantic understanding model training method provided in an embodiment of the present application. This embodiment may be performed by the training device 12 shown in fig. 1 and involves a pre-training process and a model fine-tuning (fine-tuning) process, where the samples used in the pre-training process may differ from the samples used in the model fine-tuning process. To distinguish them in the description, this embodiment refers to the samples used in the model fine-tuning process as first samples and the samples used in the pre-training process as second samples. The first samples and the second samples may be training data maintained in the database 13 shown in fig. 1. Optionally, S301 and S302 of the first embodiment may be executed on the training device 12, or may be executed in advance by other functional modules before the training device 12; for example, a cloud device preprocesses the second samples received or acquired from the database 13, that is, performs the pre-training process of S301 and S302 to obtain a pre-training model. The pre-training model and the first samples then serve as input to the training device 12, and the training device 12 executes S303 to S304.
Exemplarily, the first embodiment includes the following S301 to S304:
s301, the training equipment acquires a second sample.
The second sample is text processed based on a mask (Mask) policy and includes the masked text. The second sample may be labeled with the lexeme corresponding to the mask, that is, the label of the second sample is the position in the sentence of the word replaced by the mask.
In a possible implementation, a large-scale corpus may be obtained, the mask policy may be applied to the large-scale corpus, and the processed corpus may be labeled to obtain the second samples. The mask policy may include at least one of a random mask policy and an N-gram mask (multiple mask) policy. Training the model with such a mask policy may be referred to as random multivariate dynamic mask training.
For example, if the original text is "turn on the air conditioner in the car", the second sample obtained after it is processed based on the mask policy is "turn [mask] in the car".
For another example, if the original text is "navigate to remove the great east of Purper", the second sample obtained is "navigate to [mask] [mask] great east".
For another example, if the original text is "qilixiang i want to hear zhou jiron", the second sample obtained is "qilixiang i want to hear [mask] [mask]".
For another example, if the original text is "i want to make a call to me home", the second sample obtained is "i want to make a call to me home [mask] [mask]".
For another example, if the original text is "heat the passenger seat", the second sample obtained is "heat the [mask] seat [mask] for [mask] driving".
And S302, performing model training by the training equipment according to the second sample to obtain a pre-training model.
The model training can be realized by a loss function and a back propagation algorithm, and the specific details thereof can be referred to the descriptions of (5) and (6) above.
S303, the training equipment acquires a first sample.
The first sample includes text annotated with semantic information. For example, the first sample is annotated with a semantic intent and a semantic slot. Alternatively, in the vehicle-mounted domain, the first sample may be a text in the vehicle-mounted domain, for example, a corpus in a vehicle-mounted voice interaction scene.
S304, the training equipment performs transfer training on the pre-training model according to the first sample to obtain a semantic understanding model.
In S304, the migration training may be model fine-tuning. Model fine-tuning is conceptually different from model training: model training generally means that, before training, the parameters of the model are randomly initialized and a new network is trained from scratch based on those randomly initialized parameters. Model fine-tuning means that, starting from the pre-training model, the parameters of the model are finely adjusted for a specific task. Because fine-tuning reuses the trained parameters of the pre-training model, it saves a large amount of computing resources and computing time compared with training from scratch, and improves computing efficiency and accuracy. Model fine-tuning can be implemented with a loss function and a back propagation algorithm; for details, refer to the descriptions of (5) and (6) above.
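The difference between training from scratch and fine-tuning can be sketched as follows: the encoder parameters are copied from the pre-training phase, and only a new task-specific head is randomly initialized. All names, dimensions, and the number of intents are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in for pre-trained encoder weights (in practice loaded from the pre-training phase).
pretrained = {"encoder": rng.normal(size=(8, 8))}

def build_finetune_model(pretrained):
    """Start from pre-trained parameters; only the task head is randomly initialized."""
    return {
        "encoder": pretrained["encoder"].copy(),  # reused, not re-trained from scratch
        "head": rng.normal(size=(8, 3)) * 0.01,   # new task layer (e.g. 3 semantic intents)
    }

params = build_finetune_model(pretrained)
```

During fine-tuning all parameters (or sometimes only the head) are then updated with the labeled first samples, which is far cheaper than random initialization of the whole network.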
Of course, the above-described manner of obtaining the semantic understanding model is only an example, and the semantic understanding model may also be other large-scale pre-training language models based on pre-training and fine-tuning paradigms.
This embodiment provides a model training method for realizing the semantic understanding function. The pre-training model is trained using a mask policy, so that it acquires basic natural language processing capability. On this basis, text labeled with semantic information is used to fine-tune the pre-training model toward the goal of semantic understanding, so that the model learns the association between text and semantic information during fine-tuning and gains the ability to extract lexical, syntactic, and semantic features. In the model application stage, the semantic understanding model can then be used to extract accurate lexical, syntactic, and semantic features.
Example two:
fig. 4 shows a semantic analysis method provided in the second embodiment of the present application. The second embodiment may be executed by the execution device 11 shown in fig. 1. The text to be analyzed in the second embodiment may be obtained by converting speech captured by the audio acquisition device 14 shown in fig. 1; the speech recognition module 113 in the execution device 11 may be used to execute S401 of the second embodiment, and the semantic understanding module 111 in the execution device 11 may be used to execute S402 to S407.
Optionally, the second embodiment may be processed by a Central Processing Unit (CPU), or may be processed by a CPU and a Graphics Processing Unit (GPU), or may use other processors suitable for neural network computing instead of the GPU, and the application is not limited thereto. This embodiment two includes S401 to S407.
S401, the execution equipment acquires a text to be analyzed.
For example, when the method is applied to the field of voice interaction, after a user speaks, the execution device collects a voice signal and performs voice recognition on the voice signal to obtain a text. The voice signal contains a control command for the vehicle-mounted terminal, and the text can be in the form of a text signal.
Referring to the system architecture shown in fig. 1, S401 may include the following steps a to B.
Step A, during vehicle start-up or driving, an audio acquisition device (such as a distributed microphone array) acquires a voice signal T = (t1, t2, ..., tn1), and then transmits the voice signal T = (t1, t2, ..., tn1) to the ASR system of the vehicle-mounted terminal. Here, n1 represents the length of the user's voice control command.
Step B, the automatic speech recognition (ASR) system of the vehicle-mounted terminal receives the voice signal T = (t1, t2, ..., tn1) collected by the audio device, performs speech recognition on the voice signal T = (t1, t2, ..., tn1) to obtain a text signal X = (x1, x2, ..., xn2), and passes X = (x1, x2, ..., xn2) on to the semantic understanding module. Here, n2 represents the length of the text input, and n2 and n1 may be equal or unequal.
S402, the execution device extracts lexical features and syntactic features from the text.
For example, the execution apparatus extracts the lexical feature and the syntactic feature by performing the following steps one to two.
Step one, the execution equipment inputs the text into a semantic understanding model.
Alternatively, the text may be input into the semantic understanding model in the form of a vector or a matrix. For example, the execution device may extract a character word vector, a relative position word vector, and a character type word vector of the text, and input a matrix composed of the character word vector, the relative position word vector, and the character type word vector into the semantic understanding model. For example, referring to fig. 2, if the text is "I want to hear Qilixiang", the input sequence is ([CLS] I want to hear Qi li xiang [SEP] pad pad pad). The character word vector is (E_[CLS], E_I, E_want, E_hear, E_Qi, E_li, E_xiang, E_[SEP], E_[pad], E_[pad], E_[pad]). The relative position word vector is (E_0, E_1, E_2, E_3, E_4, E_5, E_6, E_7, E_8, E_9, E_10). The type word vector is (E_1, E_1, E_1, E_1, E_1, E_1, E_1, E_1, E_0, E_0, E_0). Among these parameters, E is an abbreviation of embedding and represents a word vector; [CLS] and [SEP] are separators; pad is a fill element used to pad the input text to the same length.
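One common way (e.g. BERT-style) to combine the three kinds of word vectors is element-wise summation. The patent only states that a matrix composed of the three vectors is input, so the summation and the dimensions below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
seq_len, d = 11, 4       # 11 positions: [CLS] + 6 characters + [SEP] + 3 pads

char_emb = rng.normal(size=(seq_len, d))   # character word vectors E_[CLS], E_I, ..., E_[pad]
pos_emb = rng.normal(size=(seq_len, d))    # relative position word vectors E_0 .. E_10
type_emb = np.vstack([np.ones((8, d)),     # type 1 for the 8 real tokens
                      np.zeros((3, d))])   # type 0 for the 3 pad positions

X = char_emb + pos_emb + type_emb          # combined input matrix for the model
```

Each row of X then corresponds to one input position, matching the per-word rows of the input matrix described earlier.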
And step two, the execution equipment extracts lexical characteristics and syntactic characteristics from the text through a semantic understanding model.
For example, the execution apparatus performs the following steps 2.1 to 2.4.
And 2.1, performing attention operation on the text by the execution equipment to obtain a first output result, wherein the first output result indicates the dependency relationship between words in the text.
The technical details of the attention calculation can be referred to the descriptions of (1) to (2) in the above-described conceptual introduction.
Optionally, the performing device implements step 2.1 using a multi-head attention mechanism. For example, the execution apparatus performs the following steps 2.1.1 to 2.1.4.
Step 2.1.1, the performing device enters the text into the first multi-headed attention model 201.
In this embodiment, a multi-head attention model may be set in the pre-training model, and the entity feature extraction stage may also utilize the multi-head attention model, in order to distinguish descriptions, the multi-head attention model included in the pre-training model is referred to as a first multi-head attention model, and the multi-head attention model used in the entity feature extraction stage is referred to as a second multi-head attention model.
For example, the first multi-head attention model 201 includes m layers of transformer (transformer) units, each layer implementing the multi-head attention mechanism and comprising h self-attention modules. For example, referring to fig. 2, the first multi-head attention model 201 includes attention module 0, attention module 1, attention module 2, attention module 3, attention module 4, attention module 5, attention module 6, and attention module 7.
For example, referring to fig. 2, a matrix composed of a character word vector, a relative position word vector, and a character type word vector may be used as the input matrix X of the first multi-head attention model 201, and the input matrix X may be input to the attention module 0, the attention module 1, the attention module 2, the attention module 3, the attention module 4, the attention module 5, the attention module 6, and the attention module 7, respectively.
Step 2.1.2: the execution device performs the attention operation on the text through each attention module in the first multi-head attention model 201 to obtain one output result per attention module.
For example, referring to fig. 2, attention modules 0 through 7 each perform the attention calculation on the input matrix X, yielding the output results of attention modules 0 through 7 respectively.
Each attention module may perform the attention calculation using the following equations (5) to (7), and its output result may be expressed by equation (8), in which Attention denotes the attention operation.
Q = W^Q X1 (5)
K = W^K X1 (6)
V = W^V X1 (7)
head(i) = Attention(Q, K, V) (8)
Here X1 is the input text signal. In equation (5), W^Q is the query weight matrix of one attention module in the first multi-head attention model 201, and Q is the query matrix of that attention module. In equation (6), W^K is the key weight matrix of one attention module in the first multi-head attention model 201, and K is the key matrix of that attention module. In equation (7), W^V is the value weight matrix of one attention module in the first multi-head attention model 201, and V is the value matrix of that attention module. In equation (8), head(i) is the output matrix of the current self-attention mechanism; each row of head(i) is the self-attention vector of one word, which represents the contribution degree (or score) of each word in the sentence (the current word itself and the other words) to the current word. i denotes the i-th attention module, where i is a positive integer and i ≤ h, and the number of columns of head(i) equals the number of columns of the value matrix V. dk is the corresponding hidden neural unit dimension. Attention denotes the attention calculation.
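As a concrete illustration, equations (5) to (8) can be sketched in a few lines of NumPy. The matrix sizes, the random weights, and the row-per-word orientation (the transpose of the W^Q X1 form, which is equivalent) are assumptions made for the sketch, not values taken from the model; the attention score itself uses the standard scaled softmax(QK^T/√dk)V form.

```python
import numpy as np

# A minimal sketch of equations (5)-(8) for one self-attention module.
# W_q, W_k, W_v and the input X1 are random stand-ins; in the model they
# are learned weights and the embedded text matrix.
rng = np.random.default_rng(0)
n_words, d_model, d_k = 4, 8, 8   # hypothetical sizes

X1 = rng.standard_normal((n_words, d_model))
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q = X1 @ W_q          # equation (5)
K = X1 @ W_k          # equation (6)
V = X1 @ W_v          # equation (7)

def attention(Q, K, V, d_k):
    # softmax(Q K^T / sqrt(d_k)) V -- each output row is the
    # self-attention vector of one word (equation (8)).
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

head_i = attention(Q, K, V, d_k)
print(head_i.shape)   # (4, 8): one row per word, columns match V
```

With all-zero queries and keys, the softmax weights become uniform and each output row is simply the mean of the value rows, which is a quick way to sanity-check the implementation.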
Step 2.1.3: the execution device splices the output results of the attention modules to obtain a splicing result.
Optionally, the output result of each attention module is a matrix and the splicing result is also a matrix, whose number of dimensions equals the sum of the numbers of dimensions of the individual output results. The splicing may be horizontal and may be implemented by calling a concat function. It should be understood that horizontal splicing is merely illustrative; other splicing manners are possible. For example, with vertical splicing, the number of rows of the splicing result equals the sum of the numbers of rows of the individual output results. This embodiment does not specifically limit how the splicing is performed.
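The difference between the two splicing manners can be shown with NumPy's `concatenate`; the (3, 4) shapes are arbitrary stand-ins for per-module outputs.

```python
import numpy as np

# Horizontal vs. vertical splicing of two hypothetical attention-module
# outputs, each a (3, 4) matrix.
out0 = np.ones((3, 4))
out1 = np.zeros((3, 4))

horizontal = np.concatenate([out0, out1], axis=1)  # column counts add: (3, 8)
vertical = np.concatenate([out0, out1], axis=0)    # row counts add: (6, 4)

print(horizontal.shape, vertical.shape)
```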
Step 2.1.4: the execution device performs a linear transformation on the splicing result to obtain the first output result.
The linear transformation may be a multiplication by a weight matrix; that is, in step 2.1.4 the execution device multiplies the splicing result by the weight matrix and uses the product as the first output result. Alternatively, the linear transformation may take other forms than multiplication by a weight matrix, for example multiplying the splicing result by a constant, or adding a constant to it; this embodiment does not limit the form of the linear transformation.
Illustratively, steps 2.1.3 and 2.1.4 can be represented by the following equations (9-1) and (9-2): the splicing in step 2.1.3 is the Concat operation in equation (9-1), and the linear transformation in step 2.1.4 is the multiplication by W^O in equation (9-1).
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O (9-1)
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) (9-2)
Here W^O is a weight matrix obtained by joint training in the first multi-head attention model, and Concat denotes the splicing operation. MultiHead is the output of the first multi-head attention model: a matrix that fuses the h self-attention matrices. h is the number of attention modules in the first multi-head attention model and is a positive integer greater than 1; head_1 denotes the output of attention module 1 and head_h the output of attention module h, so "head_1, ..., head_h" denotes the h attention modules from attention module 1 through attention module h. h × dk is the overall dimension of the multi-head attention mechanism of the current transformer unit. Attention denotes the attention calculation.
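Equations (9-1) and (9-2) can be put together in a short NumPy sketch; h = 2 heads and all the random weight matrices are illustrative assumptions, and the per-head projections W_i^Q, W_i^K, W_i^V are applied to the shared input as in equation (9-2).

```python
import numpy as np

# Sketch of equations (9-1)/(9-2): run h heads, splice, project with W^O.
rng = np.random.default_rng(1)
n_words, d_model, h = 4, 16, 2
d_k = d_model // h

def attention(Q, K, V):
    # scaled dot-product attention over rows
    s = Q @ K.T / np.sqrt(Q.shape[1])
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

X = rng.standard_normal((n_words, d_model))
heads = []
for i in range(h):  # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ Wq, X @ Wk, X @ Wv))

W_O = rng.standard_normal((h * d_k, d_model))
multi_head = np.concatenate(heads, axis=1) @ W_O   # Concat(...) W^O
print(multi_head.shape)   # (4, 16)
```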
In this way, the multi-head attention mechanism can capture long-distance features in the text and extract rich contextual semantic representation information, enhancing the extraction of lexical features and syntactic features.
Step 2.2: the execution device normalizes the first output result to obtain a second output result.
For example, the execution device operates using the following equation (10), in which the normalization is performed by the LayerNorm function. Of course, LayerNorm is only an exemplary implementation; the execution device may also normalize in other ways, and this embodiment does not specifically limit how the normalization is performed.
x = LayerNorm(MultiHead(Q, K, V) + Sublayer(MultiHead(Q, K, V))) (10)
In equation (10), x is the second output result, LayerNorm denotes the normalization operation, MultiHead(Q, K, V) is the first output result (the output of the multi-head attention mechanism, i.e. the result of equation (9-1)), and Sublayer denotes the residual calculation operation.
Step 2.2 achieves vector normalization, which normalizes the mean and variance of the samples and thus simplifies learning.
Step 2.3: the execution device performs a linear transformation and a nonlinear transformation on the second output result to obtain a third output result.
For example, referring to fig. 2, the output result of the first vector normalization layer 202 may be input into the forward transfer layer 203, and the linear transformation and the nonlinear transformation are performed by the forward transfer layer 203 to obtain the third output result. In this way, after a standardized output result is obtained, forward transfer calculation is used to realize a high-dimensional mapping of the vector space and extract lexical features, syntactic features and semantic features.
The linear transformation may include a multiplication by a matrix and an addition of a bias, and the nonlinear transformation may be implemented by a nonlinear function, for example a maximum operation. For example, the execution device may operate using the following equation (11), in which the linear transformation is performed by multiplying by W1 and adding b1, and the nonlinear transformation is realized by the max function. The max function is only an exemplary implementation of the nonlinear transformation; the execution device may also perform the nonlinear transformation in other ways, for example by an activation function. Likewise, multiplying by W1 and adding b1 is only an exemplary linear transformation, and this embodiment does not specifically limit how the linear transformation is performed.
FFN(x) = max(0, x W1 + b1) W2 + b2 (11)
Here FFN denotes a feed-forward neural network, max denotes the maximum operation, W1 and W2 each denote a weight matrix of the forward transfer, b1 and b2 each denote a bias parameter of the corresponding weight matrix, and x denotes the output of the vector normalization, i.e. the result of equation (10) (the second output result).
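Equation (11) is compact enough to transcribe directly; the sizes d_model and d_ff and the zero-initialized biases below are illustrative placeholders, not the model's actual parameters.

```python
import numpy as np

# Equation (11): FFN(x) = max(0, x W1 + b1) W2 + b2 -- a linear map,
# a ReLU-style max(0, .) nonlinearity, then a second linear map.
rng = np.random.default_rng(2)
d_model, d_ff = 8, 32

W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((3, d_model))   # stand-in second output result, 3 words
print(ffn(x).shape)                     # (3, 8): the model dimension is preserved
```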
Step 2.4: the execution device normalizes the third output result to obtain the lexical features and syntactic features.
For example, referring to fig. 2, the output result of the forward transfer layer 203 may be input into the second vector normalization layer 204, which performs the normalization to obtain the lexical and syntactic features. Through step 2.4, mean-variance normalization of the samples is realized, simplifying the overall learning difficulty.
For example, the execution device operates using the following equation (12), in which the normalization is performed by the LayerNorm function. Of course, LayerNorm is only an exemplary implementation; the execution device may also normalize the third output result in other ways, and this embodiment does not specifically limit how the third output result is normalized.
V = LayerNorm(FFN(x) + Sublayer(FFN(x))) (12)
Here LayerNorm denotes the normalization operation, FFN(x) is the output of the forward transfer, Sublayer denotes the residual calculation operation, and V is the output matrix of the transformer unit; the dimension of V is the total number of dimensions of the lexical features and syntactic features as a whole.
In some embodiments, during execution the device determines whether all m layers of transformer units have completed their calculation; if not, the calculation continues until the m-th transformer layer finishes, and the tensor structure of the final pre-trained language model is output.
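Steps 2.1 through 2.4, repeated over the m transformer layers, can be sketched end to end. The single-matrix "attention" placeholder below stands in for the real multi-head attention of step 2.1, and all sizes and weights are assumptions for the sketch.

```python
import numpy as np

# One pass through an m-layer stack: attention -> normalize (eq. 10)
# -> feed-forward (eq. 11) -> normalize (eq. 12).
rng = np.random.default_rng(3)
m, n_words, d_model, d_ff = 2, 4, 8, 16

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def transformer_unit(x):
    Wa = rng.standard_normal((d_model, d_model))
    attn = x @ Wa                              # placeholder for MultiHead(Q, K, V)
    x = layer_norm(x + attn)                   # step 2.2, equation (10)
    W1 = rng.standard_normal((d_model, d_ff))
    W2 = rng.standard_normal((d_ff, d_model))
    ffn = np.maximum(0.0, x @ W1) @ W2         # step 2.3, equation (11)
    return layer_norm(x + ffn)                 # step 2.4, equation (12)

V = rng.standard_normal((n_words, d_model))
for _ in range(m):                             # continue until all m layers finish
    V = transformer_unit(V)
print(V.shape)                                 # (4, 8)
```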
In this way, the semantic understanding model obtained by fine-tuning the pre-training model extracts the lexical features and syntactic features contained in the input text. Because the model has undergone both the pre-training process and the model fine-tuning process, it has very strong overall semantic understanding capability, such as semantic intention understanding and semantic slot information extraction. In particular, fine-tuning the model with texts from the vehicle-mounted field gives it strong semantic intention understanding capability in that field. In addition, when the semantic understanding model is realized with a self-attention mechanism, the attention operation can capture the correlations between words in the text, including long-distance features, so the extracted syntactic features are more accurate.
S403: the execution device acquires the entities in the text to be analyzed.
For example, the execution device performs entity extraction on the text to obtain the entities in it. Given an input text X = x1 x2 ... xn, the execution device performs an Extract operation (i.e., an entity extraction operation) on X to obtain the entities (e1, ..., ej). For example, referring to fig. 5, (E[CLS], E我, E想, E听, E七, E里, E香, E[SEP], E[pad], E[pad], E[pad]) — the character embeddings of "我想听七里香" ("I want to listen to Qilixiang") — may be used as the input of the entity extraction module, which performs entity extraction on it to obtain the extracted entity (E七, E里, E香), i.e. "Qilixiang". Here ej denotes the j-th entity in the text, and j is a positive integer.
S404: the execution device acquires, according to the entities in the text to be analyzed, the structured entity vector corresponding to each entity, where the structured entity vector indicates the identification of the entity and the attribute of the entity.
A structured entity vector is a vector representation of an entity. Because the data takes the form of a vector, the data structure is regular and complete. For example, the structured entity vector has 100 dimensions, although it may of course have a different dimensionality; this embodiment does not limit the specific dimension of the structured entity vector. For example, "default" is the identification of an entity whose attribute is song name, and the structured entity vector of "default" is (-0.0369, -0.1494, 0.0732, 0.0774, 0.0518, 0.0518, ...), where the ellipsis represents the 94 dimensions not shown and -0.0369, -0.1494, 0.0732, 0.0774, 0.0518, 0.0518 are the values of the first 6 dimensions. This vector indicates both the entity "default" and the attribute song name.
In some embodiments, the execution device obtains the structured entity vector from the entity build table based on the entities in the text. For example, referring to fig. 5, based on the entity "Qilixiang" (E七, E里, E香), the execution device obtains the structured entity vector of Qilixiang, (-0.7563, -0.6532, 0.2182, 0.3914, 0.3628, 0.5528, ...), from the entity build table.
The entity build table stores the mapping relationship between entities and structured entity vectors. It is also called a knowledge entity mapping table: it maps an entity to a structured entity vector, realizing the representation of the entity. Optionally, the entity build table is pre-stored in the execution device. Optionally, the execution device queries the entity build table using the entity as the index to obtain the structured entity vector, thereby mapping the entity to a vector representation. Optionally, the entity build table is constructed from experience. For example, each word in a Chinese word stock is input into a word embedding model in advance; the word embedding model processes each word and outputs its word vector. Based on experience, a user selects the entities among the words of the Chinese word stock, selects the word vectors representing those entities from all the word vectors output by the word embedding model, uses the selected word vectors as structured entity vectors, and stores them in the entity build table. The word embedding model may be a neural network model.
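The entity-as-index lookup can be sketched as a plain dictionary; the truncated 6-of-100-dimension vector for Qilixiang is taken from the table below, and the function name `lookup` is ours.

```python
# A sketch of the entity build table as a mapping from entity string to
# its structured entity vector (truncated to 6 of the 100 dimensions
# for readability).
entity_build_table = {
    "Qilixiang": [-0.7563, -0.6532, 0.2182, 0.3914, 0.3628, 0.5528],
    # ... remaining entities ...
}

def lookup(entity):
    # the entity string itself is used as the index into the table
    return entity_build_table.get(entity)

print(lookup("Qilixiang"))
```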
Illustratively, referring to fig. 5, the entity build table may be as shown in table 1 below. Its meaning is: "default" is an entity, and the structured entity vector of "default" is (-0.0369, -0.1494, 0.0732, 0.0774, 0.0518, 0.0518, ...); "Laojiumen" is an entity, and the structured entity vector of "Laojiumen" is (-0.0154, -0.2385, 0.1943, 0.4892, 0.7531, 0.902, ...); "Bird in forest" is an entity, and the structured entity vector of "Bird in forest" is (-0.1692, -0.4494, 0.7911, 0.9651, 0.7226, 0.3128, ...); "Qilixiang" is an entity, and the structured entity vector of "Qilixiang" is (-0.7563, -0.6532, 0.2182, 0.3914, 0.3628, 0.5528, ...). In fig. 5 and table 1, each structured entity vector is a 100-dimensional vector; the ellipses represent the 94 dimensions not shown, and the last row of table 1 represents the other entities included in the entity build table but not shown.
TABLE 1
Entity            Structured entity vector
default           (-0.0369, -0.1494, 0.0732, 0.0774, 0.0518, 0.0518, ...)
Laojiumen         (-0.0154, -0.2385, 0.1943, 0.4892, 0.7531, 0.902, ...)
Bird in forest    (-0.1692, -0.4494, 0.7911, 0.9651, 0.7226, 0.3128, ...)
Qilixiang         (-0.7563, -0.6532, 0.2182, 0.3914, 0.3628, 0.5528, ...)
...               ...
In some embodiments, the application is in the vehicle-mounted domain, and the entity build table includes entities associated with that domain. For example, the vehicle-mounted field includes the navigation, music playing, radio station, communication, short message, instant messaging, schedule query, news push, intelligent question-and-answer, air conditioner control, vehicle control, and maintenance service fields, and the entity build table includes entities related to these service fields. Navigation and song-listening scenarios are especially common in the vehicle-mounted field, so the entity build table may include places and songs. This facilitates constructing the vehicle-mounted-domain structured knowledge entities.
For example, if the text to be analyzed is "play 'default' by Naying", the execution device performs entity extraction on the text to obtain the entity "default", whose attribute is song name. The execution device queries table 1 above with "default" to obtain the structured entity vector (-0.0369, -0.1494, 0.0732, 0.0774, 0.0518, 0.0518, ...), which represents the entity "default" and the attribute song name, and then determines from this vector that the intention is "listen to a song". For another example, if the text to be analyzed is "songs of Laojiumen", the execution device extracts the entity "Laojiumen", whose attribute is singer name; querying table 1 with "Laojiumen" yields the structured entity vector (-0.0154, -0.2385, 0.1943, 0.4892, 0.7531, 0.902, ...), which represents the entity "Laojiumen" and the attribute singer name. For another example, if the text to be analyzed is "help me find 'Bird in forest'", the execution device extracts the entity "Bird in forest", whose attribute is song name; querying table 1 with "Bird in forest" yields the structured entity vector (-0.1692, -0.4494, 0.7911, 0.9651, 0.7226, 0.3128, ...), which represents the entity "Bird in forest" and the attribute song name.
In some embodiments, the entity build table includes at least one of: entities with irregular names, entities whose names exceed a threshold number of characters, and entities whose names have a word frequency below a threshold. An entity with an irregular name is, for example, a grammatically irregular song title; an entity whose name exceeds the character threshold is, for example, a long place name; an entity whose name has a low word frequency is, for example, a rarely used place name. Such entities easily cause ambiguity or carry multiple meanings because of their names, making it hard for a machine to understand the correct semantics. By storing the vector representations of these entities in the entity build table in advance, the machine can obtain the accurate vector representation by table lookup and blend the entity features into the semantic understanding process, improving semantic understanding accuracy.
For example, when semantic understanding is performed on the sentence "search 'Flowers of the World'", the phrase "Flowers of the World" is easily recognized as a song name, and the semantic intention of the sentence is then erroneously judged to be "listening to a song". Instead, a structured entity vector is constructed for "Flowers of the World" in advance, expressing with one vector both the entity and its attribute of place name, and stored in the entity build table. If a user says "search Flowers of the World", the execution device takes this sentence as the text to be recognized, extracts the entity "Flowers of the World", and queries the entity build table to obtain the vector representation (i.e., the structured entity vector) constructed in advance. Since this vector indicates that the attribute of "Flowers of the World" is place name rather than song name, after the execution device performs semantic analysis based on the vector, the semantic intention of the sentence is judged to be "navigation" rather than "listening to a song", improving the accuracy of semantic intention identification.
Combining S403 and S404 above, the execution device may, for example, use the following equation (13) to realize the extraction of the structured entity vectors: acquiring the entities in the text to be analyzed is the Extract operation in equation (13), and acquiring the structured entity vectors is the mapping F in equation (13).
E1 = {e1, ..., ej} = F(Extract{x1, ..., xn}) (13)
Here x1, ..., xn is the text to be analyzed, where x1 is the 1st word in the text and xn the n-th; Extract denotes the entity extraction operation; F denotes the mapping function used to construct the entities; and E1 denotes the structured entity vectors, where e1, ..., ej are the vector representations of the extracted entities.
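The composition in equation (13) can be sketched as follows. The tiny one-entry table, the substring-matching stand-in for Extract, and the 6-dimensional vector are illustrative assumptions; the real Extract operation is the model-based entity extraction described in S403.

```python
# Equation (13) as a sketch: E1 = F(Extract{x1, ..., xn}).
entity_build_table = {
    "Qilixiang": [-0.7563, -0.6532, 0.2182, 0.3914, 0.3628, 0.5528],
}

def extract(text):
    # naive stand-in for the Extract operation: match known entities
    return [e for e in entity_build_table if e in text]

def F(entities):
    # map each extracted entity to its structured entity vector
    return [entity_build_table[e] for e in entities]

E1 = F(extract("I want to listen to Qilixiang"))
print(len(E1))   # one structured entity vector extracted
```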
In this way, the entities in the input text are extracted and structured entity vectors are constructed, giving the entities a vectorized representation. Because an entity's vector represents both the entity and its attributes, the vectorized representation works well and realizes effective embedding of the entities; when the pre-training model later performs further recognition based on the structured entity vectors, its vehicle-mounted semantic intention understanding capability and semantic slot extraction capability are enhanced.
It should be understood that this embodiment does not limit the execution order of S402 and S403. In some embodiments, S402 and S403 are performed sequentially: S402 first and then S403, or S403 first and then S402. In other embodiments, S402 and S403 are executed in parallel, that is, simultaneously.
S405: the execution device performs feature extraction on the structured entity vectors to obtain entity features.
Optionally, the execution device performs an attention operation on the structured entity vectors to obtain the entity features, so that the entity features capture the structures and dependencies inside the structured entity vectors. Illustratively, the execution device performs the following steps (1) to (4) using a multi-head attention model to extract features from the structured entity vectors.
Step (1): the execution device inputs the structured entity vectors into the second multi-head attention model.
For example, the second multi-head attention model includes m layers of transformer units, each of which implements the multi-head attention mechanism and comprises h self-attention modules. For example, referring to fig. 5, the second multi-head attention model includes attention modules 0 through 7.
For example, referring to fig. 5, the structured entity vector of Qilixiang, (-0.7563, -0.6532, 0.2182, 0.3914, 0.3628, 0.5528, ...), may be used as the input matrix X of the second multi-head attention model, and the input matrix X may be input into each of attention modules 0 through 7.
Step (2): the execution device performs the attention operation on the structured entity vectors through each attention module in the second multi-head attention model to obtain one output result per attention module.
For example, referring to fig. 5, attention modules 0 through 7 each perform the attention calculation on the input matrix X, yielding the output results of attention modules 0 through 7 respectively.
Each attention module may perform the attention calculation using the following equations (14) to (17), and the output result of the attention module may be represented by equation (18).
Q = W^Q X2 (14)
K = W^K X2 (15)
V = W^V X2 (16)
Attention(Q, K, V) = softmax(Q K^T / √dk) V (17)
head(i) = Attention(Q, K, V) (18)
Here X2 is the input structured entity vector. In equation (14), W^Q is the query weight matrix of one attention module in the second multi-head attention model, and Q is the query matrix of that attention module. In equation (15), W^K is the key weight matrix of one attention module in the second multi-head attention model, and K is its key matrix. In equation (16), W^V is the value weight matrix of one attention module in the second multi-head attention model, and V is its value matrix. head(i) is the output matrix of the current self-attention mechanism, and the number of columns of head(i) equals the number of columns of the value matrix V. dk is the corresponding hidden neural unit dimension. Attention denotes the attention operation, and softmax denotes the operation of the softmax function.
Step (3): the execution device splices the output results of the attention modules to obtain a splicing result.
Optionally, the output result of each attention module is a matrix and the splicing result is also a matrix, whose number of dimensions equals the sum of the numbers of dimensions of the individual output results. The splicing may be horizontal and may be implemented by calling a concat function. It should be understood that horizontal splicing is merely illustrative; other splicing manners are possible. For example, with vertical splicing, the number of rows of the splicing result equals the sum of the numbers of rows of the individual output results. This embodiment does not specifically limit how the splicing is performed.
For example, if the multi-head attention model has 12 attention modules and the output result of each module is a matrix with 10 rows and 64 columns, then the splicing result is a matrix with 10 rows and 768 columns: columns 1 to 64 of the splicing result are the output result of the 1st attention module, columns 65 to 128 are the output result of the 2nd attention module, columns 129 to 192 are the output result of the 3rd attention module, and so on, until columns 705 to 768 are the output result of the 12th attention module. For example, referring to equations (19) and (20) below, the output result of each attention module is head_i in equation (20), and the output results of the h attention modules are head_1, ..., head_h in equation (19), where head_1 is the output result of attention module 1, head_h is the output result of attention module h, and the ellipsis represents the output results of the attention modules not shown. The splicing may be the operation performed by the Concat function in equation (19).
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O (19)
head_i = Attention(Q_i W_i^Q, K_i W_i^K, V_i W_i^V) (20)
Here Concat in equation (19) denotes the splicing operation; h is the number of attention modules and is a positive integer greater than 1; W^O is a weight matrix obtained by joint training in the second multi-head attention model; MultiHead is the output of the second multi-head attention model; and Q_i, K_i and V_i are the Q, K and V matrices corresponding to attention module head_i.
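The column layout of the splicing can be checked with a small NumPy sketch; the 12 modules with 10×64 outputs match the example above, and the constant fill values are just markers so each module's block is visible. Each module occupies 64 consecutive columns of the result.

```python
import numpy as np

# 12 hypothetical module outputs, each (10, 64), spliced to (10, 768).
# Module i (0-based) is filled with the value i as a marker.
outputs = [np.full((10, 64), i) for i in range(12)]
spliced = np.concatenate(outputs, axis=1)

print(spliced.shape)                    # (10, 768)
print(spliced[0, 0], spliced[0, 767])   # marker of module 1, marker of module 12
```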
Step (4): the execution device performs a linear transformation on the splicing result to obtain the entity features.
Optionally, the linear transformation is a multiplication by a weight matrix; that is, in step (4) the execution device multiplies the splicing result by the weight matrix and uses the product as the entity features. For example, referring to equation (19) above, the weight matrix used in the linear transformation is W^O, so step (4) may specifically be: multiply Concat(head_1, ..., head_h) by W^O, and the product, MultiHead(Q, K, V), is the entity features. Alternatively, the linear transformation may take other forms than multiplication by a weight matrix, for example multiplying the splicing result by a constant, or adding a constant to it; this embodiment does not limit this.
Illustratively, steps (3) and (4) may be represented by equations (19) and (20) above together with the following equation (21).
E2=MultiHead(Q,K,V) (21)
E2 denotes the entity features extracted from the structured entity vectors of the text. Optionally, E2 is a matrix in which each row corresponds to the structured entity vector of one entity in the text, and the number of dimensions of E2 equals the number of dimensions of a structured entity vector. For example, if the text to be analyzed contains N entities in total, E2 is a matrix with N rows: row 1 of E2 corresponds to the 1st entity in the text, row 2 to the 2nd entity, and so on; if a structured entity vector has 100 dimensions, the number of dimensions of E2 equals 100. N is a positive integer.
In this way, by using the multi-head attention mechanism, both the correlations between words in the structured entity vectors and long-distance features can be captured, so the extracted entity features express the semantics accurately, and the entity features are therefore more accurate.
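The computation of equations (19) to (21) over the structured entity vectors can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the dimensions, the function name, and the use of scaled dot-product attention inside each head are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_entity_features(E, Wq, Wk, Wv, Wo):
    """E: (N, d) matrix of structured entity vectors, one row per entity.
    Wq/Wk/Wv: lists of h per-head projection matrices; Wo: (h*dk, d) weight matrix."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q, K, V = E @ Wq_i, E @ Wk_i, E @ Wv_i        # per-head Q, K, V matrices
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # attention weights of head_i
        heads.append(A @ V)                           # output of head_i
    concat = np.concatenate(heads, axis=-1)           # Concat(head_1, ..., head_h), eq. (19)
    return concat @ Wo                                # linear transform by Wo -> E2, eqs. (20)-(21)
```

The returned matrix plays the role of E2: one row of entity features per entity in the text.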
S406, the execution equipment fuses the entity features of the text, the lexical features of the text and the syntactic features of the text to obtain semantic features of the text.
The execution device extracts lexical features, syntactic features and entity features from the text, thereby achieving a preliminary understanding of the semantic intent of the text. The execution device then fuses the lexical features, the syntactic features and the entity features, so that the three kinds of features are combined. The semantic features obtained through the fusion include the entity features, the lexical features and the syntactic features, and therefore contain rich semantics-related information. As a result, the semantic features can be used to obtain the semantic information of the text, and using the fused semantic features can further enhance the vehicle-mounted semantic intent understanding capability and the semantic slot extraction capability of the pre-training model.
For example, referring to fig. 6, the output of the semantic understanding model is (w1 w2 w3 w4 w5 w6 w7 w8 w9). This output contains the lexical features and the syntactic features of the text, i.e., it is a fusion of the two, the fusion being performed during the internal calculation of the semantic understanding model. Further, the entity features obtained by 504 are (e5 e6 e7). Then (w1 w2 w3 w4 w5 w6 w7 w8 w9) and (e5 e6 e7) can be fused, and the fusion result is taken as the semantic features, where e5, e6 and e7 are each a vector, each being the entity features of one structured entity vector. Since (w1 w2 w3 w4 w5 w6 w7 w8 w9) already contains the lexical and syntactic features, after it is fused with the entity features, the semantic features contain the lexical features, the syntactic features and the entity features.
For example, the execution device may perform feature fusion through the following steps one to two.
Step one, the execution equipment carries out weighted summation on the entity characteristics of the text, the lexical characteristics of the text and the syntactic characteristics of the text to obtain fusion characteristics.
Since the lexical features, the syntactic features and the entity features are features in different vector spaces, in other words heterogeneous information, performing a weighted summation on the entity features, the lexical features and the syntactic features fuses the three kinds of features together, thereby realizing heterogeneous information fusion.
And step two, the execution equipment carries out nonlinear transformation on the fusion features through the activation function to obtain the semantic features.
The activation function may be a GELU function. For example, the execution device may perform the operation using the following equations (22) and (23), which may serve as the heterogeneous information fusion policy.
h=GELU(Wt*wi+We*ei+b) (22)
GELU(x)=xP(X<=x)=xφ(x), X~N(0,1) (23)
Wherein GELU denotes the activation function, Wt and We denote weight matrices, b denotes a bias parameter, and wi denotes the output of the semantic understanding model 200; wi may be in the form of a text sequence. For example, V derived by LayerNorm in equation (12) above may be in the form of a matrix, and wi in equation (22) is one row of that V. ei denotes the output result of the entity building block; the form of ei may be a knowledge sequence, i.e., a structured entity vector, and ei may be a row of the matrix E2 derived from equation (21). φ(x) denotes the cumulative distribution function of the standard normal distribution N(0, 1).
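The fusion of equations (22) and (23) can be sketched as follows; the shapes of the weight matrices and the exact-erf form of GELU are illustrative assumptions, not the patent's specification.

```python
import numpy as np
from math import erf, sqrt

def gelu(x):
    # exact GELU: x * Phi(x), where Phi is the CDF of the standard normal N(0, 1), eq. (23)
    phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
    return x * phi(x)

def fuse(wi, ei, Wt, We, b):
    # eq. (22): weighted sum of the heterogeneous features wi (text) and ei (entity),
    # then a nonlinear transform by the activation function
    return gelu(Wt @ wi + We @ ei + b)
```

Each call fuses one row wi of the semantic understanding model's output with the corresponding entity row ei into one row h of the semantic features.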
S407, the executive equipment decodes the semantic features to obtain the semantic information of the text.
S407 is an optional step, and this embodiment does not limit whether to execute S407.
For example, the semantic information includes at least one of a semantic intent and a semantic slot. The execution device may calculate a probability distribution of the semantic intent to obtain the current semantic intent and the semantic slot. For example, the semantic understanding encoder of the execution device may process the text signal sequence X = x1 x2 ... xn to generate a new sequence Z = z1 z2 ... zn, where n represents the length of the text input; the semantic understanding decoder then continues to process the sequence Z, resulting in the final output sequence Y = y1 y2 ... yn+1, where y1 represents the semantic intent and y2 ... yn+1 represent the semantic slot information of the text signal. For example, the execution device performs the calculation using the following equations (24) and (25).
y1=F(Wh1*hi+b1) (24)
yi=F(Wh2*hi+b2) (25)
In formula (24), y1 represents semantic intention, Wh1 represents weight matrix, b1 represents bias parameter, and F represents function for decoding. In formula (25), yi represents a semantic slot, Wh2 represents a weight matrix, and b2 represents a bias parameter.
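The decoding of equations (24) and (25) can be sketched as follows. The text does not specify the decoding function F, so softmax over the intent and slot label sets is an assumption here, as are all dimensions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode(H, Wh1, b1, Wh2, b2):
    """H: (n, d) matrix of semantic feature vectors h_i.
    Position 0 decodes to the intent y1 (eq. 24); the remaining positions
    decode to the semantic slots yi (eq. 25)."""
    intent_dist = softmax(Wh1 @ H[0] + b1)                 # F assumed to be softmax
    slot_dists = [softmax(Wh2 @ h_i + b2) for h_i in H[1:]]
    return int(intent_dist.argmax()), [int(s.argmax()) for s in slot_dists]
```

The returned pair corresponds to the output sequence Y: the intent label y1 followed by one slot label per remaining position.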
Optionally, after the execution device understands the semantic information, corresponding operations are executed according to the semantic information. For example, the method is applied to the field of vehicle-mounted devices, the execution device is a vehicle-mounted terminal, and the vehicle-mounted terminal controls a vehicle-mounted execution system to operate according to semantic information so as to perform vehicle-mounted voice interaction. After that, the execution device may wait, and if a new voice signal arrives, the execution device re-executes the above process to understand the semantics of the new voice signal.
In the method provided by this embodiment, a structured entity vector is constructed to represent the identifier of the entity and the attributes of the entity in vector form; entity features are extracted from the structured entity vector and fused with the lexical features and the syntactic features to obtain semantic features that include the entity features, the lexical features and the syntactic features; and semantic information is obtained after the semantic features are decoded.
The following uses a third embodiment to illustrate the second embodiment by example. In the method shown in the third embodiment, the execution device is a vehicle-mounted terminal, and the text to be recognized is obtained by recognizing speech collected by the vehicle-mounted terminal. In other words, the third embodiment relates to how the vehicle-mounted terminal performs voice interaction with the user by using the second embodiment. It should be understood that for the steps in the third embodiment that are similar to those in the second embodiment, reference is made to the second embodiment; they are not described in detail again in the third embodiment.
EXAMPLE III
Fig. 7 shows the third embodiment of the present application: vehicle-mounted voice interaction based on a semantic understanding model and structured entity vectors. The third embodiment may specifically be executed by a vehicle-mounted terminal and includes S701 to S704.
S701, voice input by a user is collected by audio equipment of the vehicle-mounted terminal, the voice is a control command signal, and the audio equipment is a distributed microphone array.
And S702, converting the voice signal into a text signal by the voice recognition module of the vehicle-mounted terminal, and inputting the text signal into the semantic understanding module of the vehicle-mounted terminal.
S703, referring to fig. 8, the steps corresponding to the semantic understanding module include S7031 to S7039.
S7031, the vehicle-mounted terminal performs attention operation on the text signal through the plurality of attention modules based on a multi-head attention mechanism to obtain an output result of each attention module, and a first output result is obtained after splicing and linear transformation.
S7032, the vehicle-mounted terminal performs a vector normalization operation on the first output result, so that the first output result is normalized to a second output result.
S7033, the vehicle-mounted terminal conducts forward transmission operation on the second output result, and the second output result is converted into a third output result after linear transformation and nonlinear transformation.
S7034, the vehicle-mounted terminal performs vector standardization operation on the third output result, and the third output result is normalized into syntactic characteristics and lexical characteristics.
Through S7031 to S7034, the extraction of lexical features, syntactic features and semantic features of text input is realized, and preliminary semantic intention understanding of a user command is realized.
S7035, extracting the entity in the text input by a knowledge entity extraction module of the vehicle-mounted terminal to obtain an effective entity.
S7036, the knowledge entity construction module of the vehicle-mounted terminal carries out vectorization representation on the entity to obtain the representation of the attribute of the entity.
S7037, the vehicle-mounted terminal performs attention operation on the attribute characterization of the entity through a plurality of attention modules based on a multi-head attention mechanism to obtain an output result of each attention module, and entity features are obtained after splicing and linear transformation.
S7038, the heterogeneous information fusion module of the vehicle-mounted terminal enables syntactic characteristics, lexical characteristics and entity characteristics of text input to be effectively fused in different vector spaces.
S7039, the vehicle-mounted terminal calculates semantic intention probability distribution through a semantic decoder to obtain the current semantic intention and the semantic slot position of the user.
And S704, the vehicle-mounted function module receives the control command signal and executes operation according to the control command signal.
This embodiment provides a vehicle-mounted voice interaction method, for the vehicle-mounted field, based on a semantic understanding model and structured entity vectors. The method uses a semantic understanding model that has undergone pre-training and model fine-tuning, extracts entity features based on the structured entity vectors, and fuses the entity features with the lexical features and the syntactic features. In the scene of vehicle-mounted voice interaction, this addresses the problems of insufficient semantic intent understanding capability and the inability to completely identify basic structured knowledge entities, further enhancing the semantic intent understanding capability and the semantic slot information extraction capability in the vehicle-mounted field.
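The semantic understanding flow of S7031 to S7039 above can be sketched as the following composition of stages. Everything here is an illustrative assumption: ReLU stands in for the unnamed nonlinearity of S7033, tanh stands in for the fusion activation, and the entity features are assumed row-aligned with the text output.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # S7032 / S7034: vector normalization per row
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def understand(text_out, entity_out, W1, W2, Wt, We, b):
    """text_out: (n, d) first output result of the attention modules (S7031);
    entity_out: (n, d) entity features (S7035-S7037)."""
    x = layer_norm(text_out)                                 # S7032: normalize to second output result
    x = np.maximum(x @ W1, 0.0) @ W2                         # S7033: linear + nonlinear transform
    feats = layer_norm(x)                                    # S7034: syntactic and lexical features
    fused = np.tanh(feats @ Wt.T + entity_out @ We.T + b)    # S7038: heterogeneous information fusion
    return fused                                             # fed to the semantic decoder (S7039)
```

The fused matrix is what the semantic decoder of S7039 would consume to produce the intent and slot distributions.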
It can be understood that the first embodiment is the training stage of the semantic understanding model (e.g., the stage performed by the training device 12 shown in fig. 1), with the specific training performed using the pre-training model provided in the first embodiment or any of its possible implementations; the second embodiment can be understood as the application stage of the semantic understanding model (e.g., the stage executed by the execution device 11 shown in fig. 1), embodied as using the semantic understanding model obtained by the training of the first embodiment to obtain output semantic information from the speech or text input by the user; and the third embodiment is an example included in the second embodiment.
The semantic analysis method according to the embodiment of the present application is described above, and the semantic analysis device according to the embodiment of the present application is described below, it being understood that the semantic analysis device has any function of the execution device in the method described above.
Fig. 9 is a schematic structural diagram of a semantic analysis apparatus according to an embodiment of the present application, and as shown in fig. 9, the semantic analysis apparatus 900 includes: an obtaining module 901 configured to execute S403 to S404; an extraction module 902 configured to perform S405; a fusion module 903, configured to execute S406.
Optionally, the fusion module 903 includes: a weighted summation sub-module, configured to perform step one in S406; and a transformation submodule for executing the step two in S406.
Optionally, the extracting module 902 includes: an attention submodule for performing step 2.1 in S402; a normalization submodule, configured to perform step 2.2 in S402; a transformation submodule for performing step 2.3 in S402; the normalization submodule is further configured to perform step 2.4 in S402.
Optionally, the attention submodule is configured to perform step 2.1.1 to step 2.1.4 in S402.
Optionally, the extracting module 902 includes: an input sub-module for performing step (1) in S405; an attention submodule for performing step (2) in S405; a splicing submodule for executing the step (3) in S405; and a transformation submodule for performing step (4) in S405.
It should be understood that the semantic analysis apparatus 900 provided in the embodiment of fig. 9 corresponds to the execution device in the foregoing method embodiment, and each module and the other operations and/or functions in the semantic analysis apparatus 900 are respectively for implementing various steps and methods implemented by the execution device in the method embodiment, and specific details may be referred to in the foregoing method embodiment and are not described herein again for brevity.
It should be understood that, when analyzing semantics, the semantic analysis apparatus provided in the embodiment of fig. 9 is only illustrated by the above-mentioned division of the functional modules, and in practical applications, the above-mentioned function allocation may be completed by different functional modules according to needs, that is, the internal structure of the semantic analysis apparatus is divided into different functional modules to complete all or part of the above-described functions. In addition, the semantic analysis device provided in the above embodiment and the second embodiment belong to the same concept, and the specific implementation process thereof is described in the method embodiment, which is not described herein again.
Fig. 10 is a schematic structural diagram of a training apparatus for a semantic understanding model according to an embodiment of the present application, and as shown in fig. 10, the training apparatus 1000 for a semantic understanding model includes: an obtaining module 1001 configured to execute S301; a training module 1002 for performing S302; the obtaining module 1001 is further configured to perform S303, and the training module 1002 is further configured to perform S304.
It should be understood that the training apparatus 1000 of the semantic understanding model provided in the embodiment of fig. 10 corresponds to the training device in the foregoing method embodiment, and each module and the above-mentioned other operations and/or functions in the training apparatus 1000 of the semantic understanding model are respectively for implementing various steps and methods implemented by the training device in the method embodiment, and specific details may be referred to the foregoing method embodiment and are not described herein again for brevity.
It should be understood that, when the training device of the semantic understanding model provided in the embodiment of fig. 10 trains the semantic understanding model, only the division of the above functional modules is used for illustration, and in practical applications, the above functions may be distributed by different functional modules according to needs, that is, the internal structure of the training device of the semantic understanding model is divided into different functional modules to complete all or part of the above described functions. In addition, the training device of the semantic understanding model provided in the above embodiment belongs to the same concept as that of the above embodiment, and specific implementation processes thereof are described in the method embodiment and are not described herein again.
Fig. 11 is a hardware configuration diagram of a semantic analysis apparatus according to an embodiment of the present application. The semantic analysis apparatus 1100 shown in fig. 11 (the apparatus 1100 may be a computer device) includes a memory 1101, a processor 1102, a communication interface 1103, and a bus 1104. The memory 1101, the processor 1102 and the communication interface 1103 are communicatively connected to each other through a bus 1104.
The Memory 1101 may be a Read Only Memory (ROM), a static Memory device, a dynamic Memory device, or a Random Access Memory (RAM). The memory 1101 may store a program, and when the program stored in the memory 1101 is executed by the processor 1102, the processor 1102 and the communication interface 1103 are used for executing the steps of the semantic analysis method according to the embodiment of the present application.
The processor 1102 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more Integrated circuits, and is configured to execute related programs to implement functions required to be executed by units in the semantic analysis apparatus according to the embodiment of the present disclosure, or to execute the semantic analysis method according to the embodiment of the present disclosure.
The processor 1102 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the semantic analysis method of the present application may be implemented by integrated logic circuits of hardware or by instructions in the form of software in the processor 1102. The processor 1102 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM or an EPROM, or a register. The storage medium is located in the memory 1101; the processor 1102 reads the information in the memory 1101 and, in combination with its hardware, completes the functions required to be executed by the units included in the semantic analysis device according to the embodiment of the present application, or executes the semantic analysis method according to the method embodiment of the present application.
The communication interface 1103 enables communication between the apparatus 1100 and other devices or communication networks using transceiver means, such as, but not limited to, a transceiver. For example, text (text to be analyzed in the second embodiment of the present application) can be acquired through the communication interface 1103.
Bus 1104 may include a path that conveys information between various components of apparatus 1100 (e.g., memory 1101, processor 1102, communication interface 1103).
It is to be understood that the extraction module 902, the fusion module 903, and the decoding module 903 in the semantic analysis device 900 may correspond to the processor 1102.
Fig. 12 is a hardware configuration diagram of a training apparatus for a semantic understanding model according to an embodiment of the present application. The training apparatus 1200 of the semantic understanding model shown in fig. 12 (the apparatus 1200 may be specifically a computer device) includes a memory 1201, a processor 1202, a communication interface 1203, and a bus 1204. The memory 1201, the processor 1202, and the communication interface 1203 are communicatively connected to each other through a bus 1204.
The Memory 1201 may be a Read Only Memory (ROM), a static Memory device, a dynamic Memory device, or a Random Access Memory (RAM). The memory 1201 may store a program; when the program stored in the memory 1201 is executed by the processor 1202, the processor 1202 and the communication interface 1203 are used to perform the steps of the training method of the semantic understanding model according to the embodiments of the present application.
The processor 1202 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more Integrated circuits, and is configured to execute related programs to implement functions required to be executed by units in the training apparatus of the semantic understanding model according to the embodiment of the present Application, or to execute the training method of the semantic understanding model according to the embodiment of the present Application.
The processor 1202 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the training method of the semantic understanding model of the present application may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 1202. The processor 1202 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM or an EPROM, or a register. The storage medium is located in the memory 1201; the processor 1202 reads the information in the memory 1201 and, in combination with its hardware, completes the functions required to be executed by the units included in the training apparatus for the semantic understanding model according to the embodiment of the present application, or executes the training method of the semantic understanding model according to the embodiments of the present application.
The communication interface 1203 enables communication between the apparatus 1200 and other devices or communication networks using transceiver means such as, but not limited to, a transceiver. For example, the training data (e.g., the text masked or the text labeled with semantic information such as semantic intent and semantic slot in the first embodiment of the present application) may be obtained through the communication interface 1203.
The bus 1204 may include pathways to transfer information between various components of the apparatus 1200, such as the memory 1201, the processor 1202, and the communication interface 1203.
It should be understood that the obtaining module 1001 in the training apparatus 1000 of the semantic understanding model corresponds to the communication interface 1203 in the training apparatus 1200 of the semantic understanding model, and the training module 1002 may correspond to the processor 1202.
It should be noted that although the apparatuses 1200 and 1100 shown in fig. 12 and 11 only show memories, processors, and communication interfaces, in a specific implementation, those skilled in the art will understand that the apparatuses 1200 and 1100 also include other devices necessary for normal operation. Also, those skilled in the art will appreciate that the apparatus 1200 and 1100 may also include hardware components to implement other additional functions, according to particular needs. Further, those skilled in the art will appreciate that the apparatus 1200 and 1100 may also include only those components necessary to implement the embodiments of the present application, and not necessarily all of the components shown in fig. 12 or 11.
It is understood that the apparatus 1200 corresponds to the training device 12 of fig. 1 and the apparatus 1100 corresponds to the performing device 11 of fig. 1. Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the unit is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
This functionality, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. A method of semantic analysis, the method comprising:
acquiring an entity in a text to be analyzed;
acquiring a structured entity vector corresponding to the entity according to the entity in the text to be analyzed, wherein the structured entity vector is used for indicating the identity of the entity and the attribute of the entity;
carrying out feature extraction on the structured entity vector to obtain entity features;
and fusing the entity characteristics, the lexical characteristics of the text and the syntactic characteristics of the text to obtain the semantic characteristics of the text, wherein the semantic characteristics are used for acquiring semantic information of the text.
2. The method according to claim 1, wherein the obtaining a structured entity vector corresponding to the entity according to the entity in the text to be analyzed comprises:
and acquiring the structured entity vector from an entity construction table according to the entity in the text to be analyzed, wherein the entity construction table is used for storing the mapping relation between the entity and the structured entity vector.
3. The method according to claim 1, wherein said fusing the entity features, the lexical features of the text, and the syntactic features of the text to obtain the semantic features of the text comprises:
carrying out weighted summation on the entity characteristics, the lexical characteristics and the syntactic characteristics to obtain fusion characteristics;
and carrying out nonlinear transformation on the fusion features through an activation function to obtain the semantic features.
4. The method of claim 1, wherein prior to fusing the entity features, the lexical features of the text, and the syntactic features of the text, the method further comprises:
inputting the text into a semantic understanding model, wherein the semantic understanding model is obtained by performing transfer training on a pre-trained model using a first sample, the first sample comprises text annotated with semantic information, the pre-trained model is obtained by training on a second sample, and the second sample comprises masked text;
and extracting the lexical features and the syntactic features from the text through the semantic understanding model.
5. The method of claim 4, wherein said extracting said lexical features and said syntactic features from said text by said semantic understanding model comprises:
performing an attention operation on the text to obtain a first output result, wherein the first output result is used to indicate dependency relationships between words in the text;
normalizing the first output result to obtain a second output result;
performing a linear transformation and a nonlinear transformation on the second output result to obtain a third output result;
and normalizing the third output result to obtain the lexical features and the syntactic features.
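The four steps of claim 5 mirror a standard Transformer encoder layer, and the sketch below follows that reading. The residual connections, layer normalization, and ReLU nonlinearity are assumptions drawn from common Transformer practice, not from the claim itself:

```python
import numpy as np

def _softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def _layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(x, w1, w2):
    """x: (seq_len, d) token representations of the text; w1, w2: FFN weights."""
    d = x.shape[-1]
    # Attention operation -> first output result (word-to-word dependencies).
    scores = x @ x.T / np.sqrt(d)
    first = _softmax(scores) @ x
    # Normalization -> second output result (residual connection assumed).
    second = _layer_norm(x + first)
    # Linear + nonlinear transformation -> third output result (ReLU assumed).
    third = np.maximum(0.0, second @ w1) @ w2
    # Normalization -> the lexical/syntactic features.
    return _layer_norm(second + third)
```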
6. The method of claim 5, wherein the semantic understanding model comprises a first multi-head attention model, and wherein performing the attention operation on the text to obtain a first output result comprises:
inputting the text into the first multi-head attention model;
performing an attention operation on the text through each attention module in the first multi-head attention model to obtain an output result of each attention module;
concatenating the output results of the attention modules to obtain a concatenated result;
and performing a linear transformation on the concatenated result to obtain the first output result.
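The per-module attention, splicing, and linear transformation of claim 6 can be sketched as standard multi-head self-attention, where each head plays the role of one attention module. The projection-matrix shapes and the head-splitting scheme below are assumptions:

```python
import numpy as np

def _softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, n_heads):
    """x: (seq_len, d); wq/wk/wv/wo: (d, d); d must divide evenly by n_heads."""
    seq_len, d = x.shape
    dh = d // n_heads
    q, k, v = x @ wq, x @ wk, x @ wv
    head_outputs = []
    for h in range(n_heads):                 # each head = one attention module
        s = slice(h * dh, (h + 1) * dh)
        scores = q[:, s] @ k[:, s].T / np.sqrt(dh)
        head_outputs.append(_softmax(scores) @ v[:, s])
    concat = np.concatenate(head_outputs, axis=-1)   # splice per-head results
    return concat @ wo                       # linear transform -> first output
```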
7. The method of claim 1, wherein the performing feature extraction on the structured entity vector to obtain entity features comprises:
inputting the structured entity vector into a second multi-head attention model;
performing an attention operation on the structured entity vector through each attention module in the second multi-head attention model to obtain an output result of each attention module;
concatenating the output results of the attention modules to obtain a concatenated result;
and performing a linear transformation on the concatenated result to obtain the entity features.
8. A semantic analysis apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire an entity in a text to be analyzed and acquire a structured entity vector corresponding to the entity, wherein the structured entity vector is used to indicate an identity of the entity and an attribute of the entity;
an extraction module, configured to perform feature extraction on the structured entity vector to obtain entity features;
and a fusion module, configured to fuse the entity features, lexical features of the text, and syntactic features of the text to obtain semantic features of the text, wherein the semantic features are used to acquire semantic information of the text.
9. The apparatus of claim 8, wherein the acquisition module is configured to acquire the structured entity vector from an entity construction table according to the entity in the text to be analyzed, and the entity construction table stores a mapping relationship between entities and structured entity vectors.
10. The apparatus of claim 8, wherein the fusion module comprises:
a weighted summation submodule, configured to perform weighted summation on the entity features, the lexical features, and the syntactic features to obtain a fused feature;
and a transformation submodule, configured to perform a nonlinear transformation on the fused feature through an activation function to obtain the semantic features.
11. The apparatus of claim 8, further comprising:
an input module, configured to input the text into a semantic understanding model, wherein the semantic understanding model is obtained by performing transfer training on a pre-trained model using a first sample, the first sample comprises text annotated with semantic information, the pre-trained model is obtained by training on a second sample, and the second sample comprises masked text;
and the extraction module is further configured to extract the lexical features and the syntactic features from the text through the semantic understanding model.
12. The apparatus of claim 11, wherein the extraction module comprises:
an attention submodule, configured to perform an attention operation on the text to obtain a first output result, wherein the first output result is used to indicate dependency relationships between words in the text;
a normalization submodule, configured to normalize the first output result to obtain a second output result;
a transformation submodule, configured to perform a linear transformation and a nonlinear transformation on the second output result to obtain a third output result;
and the normalization submodule is further configured to normalize the third output result to obtain the lexical features and the syntactic features.
13. The apparatus of claim 12, wherein the semantic understanding model comprises a first multi-head attention model, and the attention submodule is configured to: input the text into the first multi-head attention model; perform an attention operation on the text through each attention module in the first multi-head attention model to obtain an output result of each attention module; concatenate the output results of the attention modules to obtain a concatenated result; and perform a linear transformation on the concatenated result to obtain the first output result.
14. The apparatus of claim 8, wherein the extraction module comprises:
an input submodule, configured to input the structured entity vector into a second multi-head attention model;
an attention submodule, configured to perform an attention operation on the structured entity vector through each attention module in the second multi-head attention model to obtain an output result of each attention module;
a concatenation submodule, configured to concatenate the output results of the attention modules to obtain a concatenated result;
and a transformation submodule, configured to perform a linear transformation on the concatenated result to obtain the entity features.
15. An execution device, wherein the execution device comprises a processor configured to execute instructions that cause the execution device to perform the method of any one of claims 1 to 7.
16. A computer-readable storage medium having stored therein at least one instruction, wherein the instruction is read by a processor to cause an execution device to perform the method of any one of claims 1 to 7.
CN202080004415.XA 2020-01-22 2020-01-22 Semantic analysis method, device, equipment and storage medium Pending CN112543932A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/073914 WO2021147041A1 (en) 2020-01-22 2020-01-22 Semantic analysis method and apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
CN112543932A true CN112543932A (en) 2021-03-23

Family

ID=75017367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080004415.XA Pending CN112543932A (en) 2020-01-22 2020-01-22 Semantic analysis method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112543932A (en)
WO (1) WO2021147041A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484459A (en) * 2014-12-29 2015-04-01 北京奇虎科技有限公司 Method and device for combining entities in knowledge map
CN109388793A (en) * 2017-08-03 2019-02-26 阿里巴巴集团控股有限公司 Entity mask method, intension recognizing method and corresponding intrument, computer storage medium
CN109918647A (en) * 2019-01-30 2019-06-21 中国科学院信息工程研究所 A kind of security fields name entity recognition method and neural network model
CN110209817A (en) * 2019-05-31 2019-09-06 安徽省泰岳祥升软件有限公司 Training method, device and the text handling method of text-processing model
CN110705299A (en) * 2019-09-26 2020-01-17 北京明略软件系统有限公司 Entity and relation combined extraction method, model, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133345B (en) * 2017-05-22 2020-11-06 北京百度网讯科技有限公司 Interaction method and device based on artificial intelligence
CN110309277B (en) * 2018-03-28 2023-08-18 蔚来控股有限公司 Man-machine conversation semantic analysis method and system, vehicle-mounted man-machine conversation method and system, controller and storage medium
CN110175334B (en) * 2019-06-05 2023-06-27 苏州派维斯信息科技有限公司 Text knowledge extraction system and method based on custom knowledge slot structure
CN110413992A (en) * 2019-06-26 2019-11-05 重庆兆光科技股份有限公司 A kind of semantic analysis recognition methods, system, medium and equipment
CN110457689B (en) * 2019-07-26 2023-08-01 科大讯飞(苏州)科技有限公司 Semantic processing method and related device


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883741A (en) * 2021-04-29 2021-06-01 华南师范大学 Specific target emotion classification method based on dual-channel graph neural network
CN113434699A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Pre-training method of BERT model, computer device and storage medium
CN113468307A (en) * 2021-06-30 2021-10-01 网易(杭州)网络有限公司 Text processing method and device, electronic equipment and storage medium
CN113468307B (en) * 2021-06-30 2023-06-30 网易(杭州)网络有限公司 Text processing method, device, electronic equipment and storage medium
CN113434699B (en) * 2021-06-30 2023-07-18 平安科技(深圳)有限公司 Pre-training method, computer device and storage medium for BERT model for text matching
CN114328909A (en) * 2021-11-12 2022-04-12 腾讯科技(深圳)有限公司 Text processing method, related device, storage medium and computer program product
CN114301630A (en) * 2021-11-30 2022-04-08 北京六方云信息技术有限公司 Network attack detection method, device, terminal equipment and storage medium
CN114638231A (en) * 2022-03-21 2022-06-17 马上消费金融股份有限公司 Entity linking method and device and electronic equipment
CN114638231B (en) * 2022-03-21 2023-07-28 马上消费金融股份有限公司 Entity linking method and device and electronic equipment
CN115545018A (en) * 2022-10-14 2022-12-30 人民网股份有限公司 Multi-mode multi-granularity entity recognition system and entity recognition method
CN115545018B (en) * 2022-10-14 2023-07-28 人民网股份有限公司 Multi-mode multi-granularity entity identification system and entity identification method

Also Published As

Publication number Publication date
WO2021147041A1 (en) 2021-07-29

Similar Documents

Publication Publication Date Title
CN112543932A (en) Semantic analysis method, device, equipment and storage medium
WO2020140487A1 (en) Speech recognition method for human-machine interaction of smart apparatus, and system
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN111191016A (en) Multi-turn conversation processing method and device and computing equipment
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
KR20170022445A (en) Apparatus and method for speech recognition based on unified model
KR20170034227A (en) Apparatus and method for speech recognition, apparatus and method for learning transformation parameter
CN110472548B (en) Video continuous sign language recognition method and system based on grammar classifier
US11961515B2 (en) Contrastive Siamese network for semi-supervised speech recognition
CN114254660A (en) Multi-modal translation method and device, electronic equipment and computer-readable storage medium
CN111859954A (en) Target object identification method, device, equipment and computer readable storage medium
Zhao et al. End-to-end-based Tibetan multitask speech recognition
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN113392265A (en) Multimedia processing method, device and equipment
CN109933773A (en) A kind of multiple semantic sentence analysis system and method
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
WO2023272616A1 (en) Text understanding method and system, terminal device, and storage medium
CN113393841A (en) Training method, device and equipment of speech recognition model and storage medium
CN111694936A (en) Method and device for identifying AI intelligent interview, computer equipment and storage medium
WO2023279921A1 (en) Neural network model training method, data processing method, and apparatuses
CN115858756A (en) Shared emotion man-machine conversation system based on perception emotional tendency
CN116226338A (en) Multi-round dialogue system and method based on searching and generating fusion
CN115132170A (en) Language classification method and device and computer readable storage medium
CN114333768A (en) Voice detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210323