CN115062620A - Semantic understanding method and device - Google Patents

Semantic understanding method and device

Info

Publication number
CN115062620A
Authority
CN
China
Prior art keywords
semantic
encoder
data
text data
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110221290.2A
Other languages
Chinese (zh)
Inventor
李宏广
聂为然
李宏言
高信龙一
黄明烈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Huawei Technologies Co Ltd
Original Assignee
Tsinghua University
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Huawei Technologies Co Ltd
Priority to CN202110221290.2A
Priority to PCT/CN2021/131803 (published as WO2022179206A1)
Publication of CN115062620A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems


Abstract

The embodiments of this application provide a semantic understanding method and apparatus in the field of artificial intelligence. The method includes: acquiring text data obtained by converting recorded speech; obtaining, through an encoder, first semantic feature data corresponding to the text data, and determining, through a semantic intent decoder, the semantic intent corresponding to the first semantic feature data, thereby performing intent recognition on the text data; generating fused data from the text data and the semantic intent, and obtaining, based on the same encoder, second semantic feature data corresponding to the fused data; and determining, based on a semantic entity decoder, the entity corresponding to the second semantic feature data, thereby recognizing the entities in the text data that are associated with the semantic intent. The method provided by this application can improve semantic understanding accuracy.

Description

Semantic understanding method and device
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a semantic understanding method and apparatus.
Background
With the development of artificial intelligence and Internet of Vehicles technologies, a safer and more intelligent era of travel is approaching. The traditional touch-screen interaction mode, in which a user taps function buttons on the display of an in-vehicle terminal to control in-vehicle functions such as navigation, air conditioning, and music, no longer meets user needs, and how to improve the interaction experience of users in the vehicle cabin has become one of the most widely discussed topics. Natural language understanding (NLU) is a key technology in the field of artificial intelligence and is central to the voice interaction experience of future intelligent vehicles.
In NLU processing, a word vector is first obtained for each of the characters that make up the sentence to be understood; the input matrix formed by these word vectors is then fed separately into an intent classification model and an entity recognition model to obtain, respectively, the intent classification result and the entity recognition result for the sentence. However, the semantic understanding methods in the related art achieve limited semantic understanding accuracy.
Disclosure of Invention
The present application provides a semantic understanding method and apparatus that can improve the accuracy of semantic understanding and has broad applicability.
It should be understood that the methods provided in the embodiments of this application may be performed by a semantic understanding apparatus. The semantic understanding apparatus may be a terminal device or a component of a terminal device (for example, a chip in the terminal device), or it may be a server (for example, a local or cloud server) or a component of a server (for example, a chip in the server); this is not limited here.
In a first aspect, the present application provides a semantic understanding method, including: acquiring text data obtained by converting recorded speech. The recorded speech may be captured by a microphone and converted into the corresponding text data by a speech recognition system. First semantic feature data corresponding to the text data is then obtained through an encoder, and the semantic intent corresponding to the first semantic feature data is determined through a semantic intent decoder, thereby performing intent recognition on the text data. Fused data is generated from the text data and the semantic intent, and second semantic feature data corresponding to the fused data is obtained based on the encoder. Finally, the entity corresponding to the second semantic feature data is determined based on a semantic entity decoder, thereby recognizing the entities in the text data that are associated with the semantic intent.
In this application, the text data is first input into an encoder to obtain first semantic feature data, which is passed through a semantic intent decoder to output the semantic intent. The semantic intent is then fused with the original text data, the fusion result is input into the encoder again to obtain second semantic feature data, and the second semantic feature data is finally passed through a semantic entity decoder to output the entity. Because the semantic intent obtained from the encoder and the semantic intent decoder is fused with the text data as prior information, the fused data is encoded by the same encoder, and the entity is then determined by the semantic entity decoder, conflicts between semantic intent information and entity information are reduced, which improves the accuracy of semantic understanding in voice interaction.
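The two-pass flow described above can be sketched in a few lines of Python. The rule-based stand-ins below (keyword matching for the intent decoder, picking the token after "to" for the entity decoder) replace the neural encoder and decoders purely for illustration; every function name and label is hypothetical, not from the application.

```python
# Toy stand-ins for the shared encoder and the two decoders.
def encode(text):
    # Stand-in for the shared encoder: just the token list here.
    return text.split()

def decode_intent(features):
    # Stand-in for the semantic intent decoder.
    return "NAVIGATE" if "navigate" in features else "PLAY_MUSIC"

def decode_entity(features):
    # Stand-in for the semantic entity decoder: token after "to", if any.
    return features[features.index("to") + 1] if "to" in features else None

def understand(text):
    # Pass 1: encode the raw text and decode the semantic intent.
    intent = decode_intent(encode(text))
    # Fuse the intent (as prior information) with the original text,
    # then re-encode the fused data with the SAME encoder.
    fused = intent + " " + text
    # Pass 2: decode the entity from the fused representation.
    entity = decode_entity(encode(fused))
    return intent, entity

print(understand("navigate to Haidian"))  # ('NAVIGATE', 'Haidian')
```

The point of the sketch is the control flow: one encoder used twice, with the first pass's intent spliced into the second pass's input.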
With reference to the first aspect, in a first possible implementation, before the first semantic feature data corresponding to the text data is obtained through the encoder, the method further includes: determining a first word vector matrix corresponding to the text data. Obtaining the first semantic feature data corresponding to the text data through the encoder then includes: inputting the first word vector matrix into the encoder and taking the semantic feature vector output by the encoder as the first semantic feature data corresponding to the text data.
In this application, the first word vector matrix corresponding to the text data is determined and input into the encoder for encoding, and the semantic feature vector output by the encoder is taken as the first semantic feature data corresponding to the text data. This is easy to implement and widely applicable.
With reference to the first possible implementation of the first aspect, in a second possible implementation, determining the first word vector matrix corresponding to the text data includes: splitting the text data into the characters it comprises; obtaining, for each of the characters, a character word vector, a position word vector, and a character type word vector, where the position word vector indicates the position of the character in the text data; summing the character word vector, the position word vector, and the character type word vector of each character to obtain the word vector for that character; and generating the first word vector matrix corresponding to the text data from the word vectors of the characters.
In this application, the vector sum of each character's character word vector, position word vector, and character type word vector is taken as that character's word vector, and the first word vector matrix is then generated for processing. The position of each character in the text (that is, the order information of the text data) and the character type information of each character are thus fully taken into account, which yields a better encoding and improves the accuracy of semantic understanding.
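The construction of the first word vector matrix can be sketched with NumPy. The embedding dimension, the example text, and the random initial values are all assumptions made for illustration; in practice the three kinds of vectors would be learned parameters.

```python
import numpy as np

# Each character's word vector = character word vector + position word vector
# + character type word vector (element-wise sum).
d = 8                                   # embedding dimension (assumed)
chars = list("打开空调")                 # characters after splitting the text
rng = np.random.default_rng(0)

char_vecs = rng.normal(size=(len(chars), d))   # one character word vector each
pos_vecs = rng.normal(size=(len(chars), d))    # one position word vector each
type_vecs = np.tile(rng.normal(size=(1, d)),
                    (len(chars), 1))           # all characters non-padded here

first_word_vector_matrix = char_vecs + pos_vecs + type_vecs
print(first_word_vector_matrix.shape)  # (4, 8)
```

Each row of the resulting matrix is the word vector of one character, so the matrix has one row per character and one column per embedding dimension.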
With reference to the second possible implementation of the first aspect, in a third possible implementation, obtaining the character word vector, position word vector, and character type word vector for each of the characters includes: obtaining a character word vector lookup table, a position word vector lookup table, and a character type word vector lookup table. The character word vector lookup table includes n character word vectors corresponding to n characters; the position word vector lookup table includes m position word vectors corresponding to m positions; the character type word vector lookup table includes a character type word vector for non-padded characters and a character type word vector for padded characters; n and m are both integers greater than 0. The character word vector for each character is obtained from the character word vector lookup table, the position word vector for each character's position in the text data is obtained from the position word vector lookup table, and the character type word vector for non-padded characters is obtained from the character type word vector lookup table as the character type word vector for each character.
In this application, the character word vector, position word vector, and character type word vector of each character are obtained by querying preset lookup tables, which is simple to implement, improves data processing efficiency, and is widely applicable.
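The three lookup tables can be represented as plain arrays indexed by id. In this sketch the table sizes n and m, the dimension, and the character ids are invented; row 0 of the type table stands for non-padded characters and row 1 for padded ones.

```python
import numpy as np

n, m, d = 5000, 128, 8                 # table sizes and dimension (assumed)
rng = np.random.default_rng(1)
char_table = rng.normal(size=(n, d))   # character word vector lookup table
pos_table = rng.normal(size=(m, d))    # position word vector lookup table
type_table = rng.normal(size=(2, d))   # character type word vector lookup table

char_ids = [17, 42, 99]                # ids of the characters in the text
word_vectors = [char_table[c]          # look up by character id...
                + pos_table[i]         # ...by position in the text...
                + type_table[0]        # ...and as a non-padded character
                for i, c in enumerate(char_ids)]
print(len(word_vectors), word_vectors[0].shape)  # 3 (8,)
```

Each lookup is a single array index, which is why the table-based scheme is cheap at inference time.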
With reference to any one of the first to third possible implementations of the first aspect, in a fourth possible implementation, the encoder includes j layers of encoding units: the input of the first-layer encoding unit is the first word vector matrix, the input of each encoding unit after the first layer is the output of the preceding encoding unit, and the output of the j-th (last) encoding unit is the first semantic feature data, where j is an integer greater than 0.
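The layer chaining can be sketched as follows. A single matrix multiply plus tanh stands in for whatever encoding unit is actually used (for instance, a Transformer block); j, the dimension, and the weights are invented for the sketch.

```python
import numpy as np

j, d = 4, 8                            # number of layers and dimension (assumed)
rng = np.random.default_rng(2)
weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(j)]

def encoder(x):
    # Layer 1 consumes the word vector matrix; each later layer consumes
    # the previous layer's output.
    for w in weights:
        x = np.tanh(x @ w)
    return x                           # layer j's output = first semantic feature data

features = encoder(rng.normal(size=(6, d)))  # 6 characters, dimension 8
print(features.shape)  # (6, 8)
```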
With reference to any one of the first aspect to the fourth possible implementation of the first aspect, in a fifth possible implementation, generating the fused data from the text data and the semantic intent includes: splicing the text data and the semantic intent to obtain the fused data corresponding to them.
In this application, the determined semantic intent is spliced with the text data as prior information to obtain the fused data, which is then processed. This improves the recognition accuracy for entities associated with the semantic intent and thereby the accuracy of semantic understanding.
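The splicing step itself is simple concatenation. The separator token and the intent label format below are assumptions; the application only specifies that the intent and the text are spliced.

```python
# Splice the predicted semantic intent (prior information) onto the text.
def fuse(text, intent):
    return intent + "[SEP]" + text  # "[SEP]" is an assumed separator token

fused = fuse("navigate to the nearest gas station", "NAVIGATE")
print(fused)  # NAVIGATE[SEP]navigate to the nearest gas station
```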
With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner, before the obtaining, based on the encoder, the second semantic feature data corresponding to the fused data, the method further includes: and acquiring a word vector corresponding to each character in a plurality of characters forming the fused data. And generating a second word vector matrix according to a plurality of word vectors corresponding to a plurality of characters forming the fused data. The obtaining of the second semantic feature data corresponding to the fusion data based on the encoder includes: and inputting the second word vector matrix into the encoder, and acquiring the semantic feature vector output by the encoder as second semantic feature data corresponding to the fusion data.
With reference to any one of the first aspect to the sixth possible implementation of the first aspect, in a seventh possible implementation, the text data is converted from an input voice, and the input voice is a vehicle control voice. The method further includes: controlling the vehicle to perform the target function according to the semantic intent and the entity.
With reference to any one of the first to seventh possible implementation manners of the first aspect, in an eighth possible implementation manner, the encoder is trained according to a first training sample and a second training sample, the semantic intent decoder is trained according to the first training sample, the semantic entity decoder is trained according to the second training sample, the first training sample includes sample text data and an intent type corresponding to the pre-labeled sample text data, the second training sample includes sample fusion data and an entity corresponding to the pre-labeled sample fusion data, and the sample fusion data includes the sample text data and the intent type corresponding to the sample text data.
With reference to the eighth possible implementation manner of the first aspect, in a ninth possible implementation manner, the encoder, the semantic intent decoder, and the semantic entity decoder are trained through the following steps:
acquiring third semantic feature data corresponding to the sample text data through an initial encoder, and predicting semantic intents corresponding to the third semantic feature data through an initial semantic intent decoder;
adjusting the weight parameters of the initial encoder and the initial semantic intention decoder according to the predicted semantic intention and the intention type corresponding to the pre-labeled sample text data so as to train the initial encoder and the initial semantic intention decoder to obtain a first encoder and the semantic intention decoder;
acquiring fourth semantic feature data corresponding to the sample fusion data through the first encoder, and predicting an entity corresponding to the fourth semantic feature data based on an initial semantic entity decoder;
and adjusting the weight parameters of the first encoder and the initial semantic entity decoder according to the entity obtained by prediction and the entity corresponding to the pre-labeled sample fusion data so as to train the first encoder and the initial semantic entity decoder to obtain the encoder and the semantic entity decoder.
In this application, training the encoder in a decoupled pre-training manner helps improve its encoding quality and, in turn, the accuracy of semantic understanding in subsequent use. Decoupled pre-training can be understood as follows: the first encoder and the semantic intent decoder are obtained by training on the sample text data and the pre-labelled intent categories of that data; the final encoder and the semantic entity decoder are then obtained by training the first encoder and the initial semantic entity decoder on the sample fused data and the pre-labelled entities of that data. In other words, given 100,000 samples, the encoder learns from 200,000 data items under decoupled pre-training (the 100,000 text samples plus the 100,000 fused samples), so its learning effect is better.
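A runnable toy version of the two training stages is sketched below, with linear layers and a squared loss standing in for the real networks; all shapes, data, and the learning rate are invented. What it preserves from the scheme above is the ordering: stage 1 trains the initial encoder with the intent decoder, and stage 2 keeps training the same encoder with a fresh entity decoder.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
enc = rng.normal(scale=0.1, size=(d, d))         # shared encoder (trained twice)
dec_intent = rng.normal(scale=0.1, size=(d, 1))  # initial semantic intent decoder
dec_entity = rng.normal(scale=0.1, size=(d, 1))  # initial semantic entity decoder

X_text = rng.normal(size=(32, d))    # sample text data (as toy vectors)
y_intent = rng.normal(size=(32, 1))  # pre-labelled intent targets (toy)
X_fused = rng.normal(size=(32, d))   # sample fused data (toy vectors)
y_entity = rng.normal(size=(32, 1))  # pre-labelled entity targets (toy)

def step(X, y, enc, dec, lr=0.01):
    """One gradient step on mean squared error; updates enc and dec in place."""
    h = X @ enc                              # semantic feature data
    err = h @ dec - y                        # prediction error
    grad_enc = X.T @ (err @ dec.T) / len(X)  # gradient w.r.t. encoder weights
    grad_dec = h.T @ err / len(X)            # gradient w.r.t. decoder weights
    enc -= lr * grad_enc
    dec -= lr * grad_dec
    return float((err ** 2).mean())

# Stage 1: train initial encoder + intent decoder -> the "first encoder".
losses1 = [step(X_text, y_intent, enc, dec_intent) for _ in range(200)]
# Stage 2: keep training the SAME encoder with the fresh entity decoder.
losses2 = [step(X_fused, y_entity, enc, dec_entity) for _ in range(200)]
print(losses1[-1] < losses1[0], losses2[-1] < losses2[0])
```

Because `enc` is shared across both stages, it sees both the text samples and the fused samples, mirroring the "200,000 data items from 100,000 samples" point above.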
In a second aspect, the present application provides a semantic understanding method, including: the method comprises the steps of obtaining a first training sample and a second training sample, wherein the first training sample comprises sample text data and intention categories corresponding to the sample text data, the second training sample comprises sample fusion data and entities corresponding to the sample fusion data labeled in advance, and the sample fusion data comprises the sample text data and the intention categories corresponding to the sample text data labeled in advance. And acquiring third semantic feature data corresponding to the sample text data through an initial encoder, and predicting semantic intents corresponding to the third semantic feature data through an initial semantic intent decoder. And adjusting the weight parameters of the initial encoder and the initial semantic intention decoder according to the predicted semantic intention and the intention type corresponding to the pre-labeled sample text data so as to train the initial encoder and the initial semantic intention decoder to obtain a first encoder and a semantic intention decoder. And acquiring fourth semantic feature data corresponding to the sample fusion data through the first encoder, and predicting an entity corresponding to the fourth semantic feature data based on an initial semantic entity decoder. And adjusting the weight parameters of the first encoder and the initial semantic entity decoder according to the entity obtained by prediction and the entity corresponding to the pre-labeled sample fusion data so as to train the first encoder and the initial semantic entity decoder to obtain an encoder and a semantic entity decoder.
In this application, training the encoder in a decoupled pre-training manner helps improve its encoding quality and, in turn, the accuracy of semantic understanding in subsequent use. Decoupled pre-training can be understood as follows: the first encoder and the semantic intent decoder are obtained by training on the sample text data and the pre-labelled intent categories of that data; the final encoder and the semantic entity decoder are then obtained by training the first encoder and the initial semantic entity decoder on the sample fused data and the pre-labelled entities of that data.
With reference to the second aspect, in a first possible implementation manner, the adjusting the weight parameters of the initial encoder and the initial semantic intention decoder according to the predicted semantic intention and an intention type corresponding to the pre-labeled sample text data includes: and determining a first loss according to the predicted semantic intention and the intention type corresponding to the pre-labeled sample text data. And adjusting the weight parameters of the initial encoder and the initial semantic intent decoder according to the first loss.
With reference to any one of the second aspect to the first possible implementation of the second aspect, in a second possible implementation, adjusting the weight parameters of the first encoder and the initial semantic entity decoder according to the predicted entity and the entity corresponding to the pre-labelled sample fusion data includes: determining a second loss from the predicted entity and the entity corresponding to the pre-labelled sample fusion data, and adjusting the weight parameters of the first encoder and the initial semantic entity decoder according to the second loss.
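The application does not name the first and second loss functions; cross-entropy over the predicted category distribution is a common choice for such classification-style outputs and stands in here. The logit values and the label are illustrative.

```python
import numpy as np

def cross_entropy(logits, label):
    # Softmax over category scores, then negative log-likelihood of the
    # pre-labelled category (a stand-in for the first/second loss).
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-np.log(p[label]))

logits = np.array([2.0, 0.5, -1.0])    # scores over intent categories (toy)
loss = cross_entropy(logits, label=0)  # label 0 is the annotated category
print(loss > 0)  # True
```

The loss is smaller when the annotated category already has the highest score, which is exactly the signal used to adjust the encoder and decoder weights.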
With reference to any one of the second aspect to the second possible implementation manner of the second aspect, in a third possible implementation manner, the method includes: and acquiring text data obtained based on the recorded voice conversion. The recorded voice can be acquired based on a microphone, and the recorded voice is converted into corresponding text data through a voice recognition system. Then, first semantic feature data corresponding to the text data is obtained through an encoder, and a semantic intention corresponding to the first semantic feature data is determined through a semantic intention decoder, so that intention identification of the text data is achieved. And generating fusion data according to the text data and the semantic intention, and acquiring second semantic feature data corresponding to the fusion data based on the encoder. And determining an entity corresponding to the second semantic feature data based on a semantic entity decoder so as to realize entity identification associated with the semantic intention in the text data.
In this application, after the semantic intent corresponding to the text data is obtained from the encoder and the semantic intent decoder, the semantic intent is fused with the text data as prior information, the resulting fused data is encoded by the same encoder, and the entity is then determined based on the semantic entity decoder. This reduces conflicts between semantic intent information and entity information and improves the accuracy of semantic understanding in voice interaction.
In a third aspect, the present application provides a semantic understanding apparatus, comprising: and the transceiving unit is used for acquiring the text data. And the processing unit is used for acquiring first semantic feature data corresponding to the text data through an encoder and determining a semantic intention corresponding to the first semantic feature data through a semantic intention decoder so as to realize intention identification on the text data. The processing unit is further configured to generate fusion data according to the text data and the semantic intent, and obtain second semantic feature data corresponding to the fusion data based on the encoder. The processing unit is further configured to determine an entity corresponding to the second semantic feature data based on a semantic entity decoder, so as to implement entity identification associated with the semantic intent in the text data.
With reference to the third aspect, in a first possible implementation manner, the processing unit is further configured to: and determining a first word vector matrix corresponding to the text data. And inputting the first word vector matrix into an encoder, and acquiring the semantic feature vector output by the encoder as first semantic feature data corresponding to the text data.
With reference to the first possible implementation manner of the third aspect, in a second possible implementation manner, the processing unit is further configured to: and performing character splitting processing on the text data to obtain a plurality of characters included in the text data. And acquiring a character word vector, a position word vector and a character type word vector corresponding to each character in the plurality of characters, wherein the position word vector is used for indicating the position of the character in the text data. And summing the character word vector, the position word vector and the character type word vector corresponding to each character to obtain the word vector corresponding to each character. And generating a first word vector matrix corresponding to the text data according to a plurality of word vectors corresponding to the characters.
With reference to the second possible implementation manner of the third aspect, in a third possible implementation manner, the processing unit is further configured to: and acquiring a character word vector query table, a position word vector query table and a character type word vector query table. The character word vector lookup table comprises n character word vectors corresponding to n characters. The position word vector lookup table comprises m position word vectors corresponding to m positions. The character type word vector lookup table comprises character type word vectors corresponding to non-filled characters and character type word vectors corresponding to filled characters. n and m are both integers greater than 0. And acquiring a character word vector corresponding to each character in a plurality of characters from the character word vector lookup table, acquiring a position word vector corresponding to the position of each character in the text data from the position word vector lookup table, and acquiring a character type word vector corresponding to the non-filled character from the character type word vector lookup table as the character type word vector corresponding to each character.
With reference to any one of the first to third possible implementations of the third aspect, in a fourth possible implementation, the encoder includes j layers of encoding units: the input of the first-layer encoding unit is the first word vector matrix, the input of each encoding unit after the first layer is the output of the preceding encoding unit, and the output of the j-th (last) encoding unit is the first semantic feature data, where j is an integer greater than 0.
With reference to any one of the third to fourth possible implementation manners of the third aspect, in a fifth possible implementation manner, the processing unit is further configured to: and splicing the text data and the semantic intention to obtain fusion data corresponding to the text data and the semantic intention.
With reference to the fifth possible implementation manner of the third aspect, in a sixth possible implementation manner, the processing unit is further configured to: and acquiring a word vector corresponding to each character in a plurality of characters forming the fused data. And generating a second word vector matrix according to a plurality of word vectors corresponding to a plurality of characters forming the fused data. And inputting the second word vector matrix into the encoder, and acquiring the semantic feature vector output by the encoder as second semantic feature data corresponding to the fusion data.
With reference to any one of the third to sixth possible embodiments of the third aspect, in a seventh possible embodiment, the recorded voice is a vehicle control voice; the processing unit is further configured to: and controlling the vehicle to execute the target function according to the semantic intention and the entity.
With reference to any one of the third to the seventh possible implementation manners of the third aspect, in an eighth possible implementation manner, the encoder is trained according to a first training sample and a second training sample, the semantic intent decoder is trained according to the first training sample, the semantic entity decoder is trained according to the second training sample, the first training sample includes sample text data and an intent type corresponding to the pre-labeled sample text data, the second training sample includes sample fusion data and an entity corresponding to the pre-labeled sample fusion data, and the sample fusion data includes the sample text data and the intent type corresponding to the sample text data.
With reference to the eighth possible implementation manner of the third aspect, in a ninth possible implementation manner, the processing unit is further configured to train to obtain the encoder, the semantic intent decoder, and the semantic entity decoder by:
acquiring third semantic feature data corresponding to the sample text data through an initial encoder, and predicting semantic intents corresponding to the third semantic feature data through an initial semantic intent decoder;
adjusting the weight parameters of the initial encoder and the initial semantic intention decoder according to the predicted semantic intention and the intention type corresponding to the pre-labeled sample text data so as to train the initial encoder and the initial semantic intention decoder to obtain a first encoder and the semantic intention decoder;
acquiring fourth semantic feature data corresponding to the sample fusion data through the first encoder, and predicting an entity corresponding to the fourth semantic feature data based on an initial semantic entity decoder;
and adjusting the weight parameters of the first encoder and the initial semantic entity decoder according to the entity obtained by prediction and the entity corresponding to the pre-labeled sample fusion data so as to train the first encoder and the initial semantic entity decoder to obtain the encoder and the semantic entity decoder.
In a fourth aspect, the present application provides a semantic understanding apparatus, comprising: the system comprises a receiving and sending unit, a processing unit and a processing unit, wherein the receiving and sending unit is used for obtaining a first training sample and a second training sample, the first training sample comprises sample text data and an intention type corresponding to the sample text data, the second training sample comprises sample fusion data and an entity corresponding to the pre-labeled sample fusion data, and the sample fusion data comprises the sample text data and the intention type corresponding to the pre-labeled sample text data. And the processing unit is used for acquiring third semantic feature data corresponding to the sample text data through an initial encoder and predicting a semantic intention corresponding to the third semantic feature data through an initial semantic intention decoder. The processing unit is further configured to adjust weight parameters of the initial encoder and the initial semantic intent decoder according to the predicted semantic intent and an intent type corresponding to the pre-labeled sample text data, so as to train the initial encoder and the initial semantic intent decoder, thereby obtaining a first encoder and a semantic intent decoder. The processing unit is further configured to obtain fourth semantic feature data corresponding to the sample fusion data through the first encoder, and predict an entity corresponding to the fourth semantic feature data based on an initial semantic entity decoder. The processing unit is further configured to adjust weight parameters of the first encoder and the initial semantic entity decoder according to the entity obtained by prediction and an entity corresponding to the pre-labeled sample fusion data, so as to train the first encoder and the initial semantic entity decoder, thereby obtaining an encoder and a semantic entity decoder.
With reference to the fourth aspect, in a first possible implementation manner, the processing unit is specifically configured to:
determining a first loss according to the predicted semantic intention and an intention category corresponding to the pre-labeled sample text data;
and adjusting the weight parameters of the initial encoder and the initial semantic intent decoder according to the first loss.
With reference to any one of the fourth aspect to the first possible implementation manner of the fourth aspect, in a second possible implementation manner, the processing unit is specifically configured to:
determining a second loss according to the entity obtained by prediction and the entity corresponding to the pre-labeled sample fusion data;
and adjusting the weight parameters of the first encoder and the initial semantic entity decoder according to the second loss.
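The two training losses described in these implementations (a first loss for training the intention decoder, a second loss for training the entity decoder) are standard supervised-learning quantities. As a minimal sketch, assuming cross-entropy as the loss function (the application does not name one), the two losses might be computed as follows; the probability values, class layout, and per-character tag scheme are illustrative assumptions only:

```python
import math

def cross_entropy(predicted_probs, true_index):
    """Negative log-likelihood of the labelled class."""
    return -math.log(predicted_probs[true_index])

# Stage 1: the first loss compares the predicted semantic intention with the
# intention category of the pre-labeled sample text data (class 0 = "Music").
intent_probs = [0.7, 0.2, 0.1]                 # assumed decoder output
first_loss = cross_entropy(intent_probs, 0)

# Stage 2: the second loss compares the predicted entities with the entities of
# the pre-labeled sample fusion data (here: per-character tag probabilities,
# averaged over the sequence; tag 1 = "part of an entity", tag 0 = "other").
entity_tag_probs = [[0.9, 0.1], [0.2, 0.8]]    # assumed decoder output
entity_labels = [0, 1]                          # assumed gold tags
second_loss = sum(cross_entropy(p, y)
                  for p, y in zip(entity_tag_probs, entity_labels)) / len(entity_labels)
```

Minimizing the first loss adjusts the initial encoder and intention decoder; minimizing the second loss then adjusts the first encoder and entity decoder, as the implementations above describe.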
With reference to any one of the fourth aspect to the second possible implementation manner of the fourth aspect, in a third possible implementation manner, the transceiver unit is further configured to acquire text data; the processing unit is further configured to obtain first semantic feature data corresponding to the text data through an encoder, and determine a semantic intention corresponding to the first semantic feature data through a semantic intention decoder, so as to implement intention identification on the text data; the processing unit is further configured to generate fusion data according to the text data and the semantic intent, and obtain second semantic feature data corresponding to the fusion data based on the encoder; the processing unit is further configured to determine an entity corresponding to the second semantic feature data based on a semantic entity decoder, so as to implement entity identification associated with the semantic intent in the text data.
In a fifth aspect, an embodiment of the present application provides a terminal device. The terminal device comprises a memory, a transceiver, and a processor, wherein the memory, the transceiver, and the processor are connected by a communication bus, or the processor and the transceiver are coupled with the memory. The memory is used for storing a set of program codes, and the processor is used for calling the program codes stored in the memory to execute the semantic understanding method provided by the first aspect and/or any one of the possible implementations of the first aspect, so as to realize the beneficial effects of the method provided by the first aspect; or the processor is used for calling the program codes stored in the memory to execute the semantic understanding method provided by the second aspect and/or any one of the possible implementations of the second aspect, so as to realize the beneficial effects of the method provided by the second aspect.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores instructions, and when the instructions are executed on a terminal, the terminal is enabled to execute the semantic understanding method provided in any one of the above first aspect and/or any one of the above possible implementations of the first aspect, and also to achieve the beneficial effects of the method provided in the first aspect, or the terminal is enabled to execute the semantic understanding method provided in any one of the above second aspect and/or any one of the above possible implementations of the second aspect, and also to achieve the beneficial effects of the method provided in the second aspect.
In a seventh aspect, an embodiment of the present application provides a communication apparatus. The communication apparatus may be a chip or multiple cooperating chips, and includes an input device coupled to the communication apparatus (e.g., the chip), for executing the technical solution provided in the first aspect or the second aspect of the embodiment of the present application. It should be understood that "coupled" herein means that two components are joined to each other directly or indirectly. Such a joining may be fixed or movable, and may allow fluid, electrical, or other types of signals to be communicated between the two components.
In an eighth aspect, an embodiment of the present application provides a computer program product including instructions, which, when the computer program product runs on a terminal, enables the terminal to perform the semantic understanding method provided in the first aspect or the second aspect, and also can achieve the beneficial effects of the method provided in the first aspect or the second aspect.
In the semantic understanding method provided by the application, text data is acquired. The method comprises the steps of obtaining first semantic feature data corresponding to text data through an encoder, and determining semantic intents corresponding to the first semantic feature data through a semantic intent decoder to achieve intent recognition of the text data. And generating fusion data according to the text data and the semantic intention, and acquiring second semantic feature data corresponding to the fusion data based on an encoder. And determining an entity corresponding to the second semantic feature data based on the semantic entity decoder to realize entity identification associated with the semantic intention in the text data. By adopting the method provided by the application, the semantic understanding accuracy can be improved.
Drawings
FIG. 1 is a flow chart of a semantic understanding method;
FIG. 2 is a flow chart of a semantic understanding method provided by an embodiment of the present application;
FIG. 3 is another flow chart diagram of a semantic understanding method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an application scenario of semantic intent recognition provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an encoder according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a multi-head attention mechanism layer provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of another structure of an encoder according to an embodiment of the present application;
FIG. 8 is a schematic view of an application scenario of entity identification provided in an embodiment of the present application;
FIG. 9 is another flow chart diagram of a semantic understanding method provided by an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a semantic understanding apparatus provided in an embodiment of the present application;
FIG. 11 is another schematic structural diagram of a semantic understanding apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
From the feature phone era to the smartphone era, the way in which people interact with machines is constantly changing. Especially in recent years, voice interaction has been a focus of research. The semantic understanding method can be widely applied to industries and scenarios such as smart home, automobiles, and intelligent customer service. In particular, with the rise of the Internet of Vehicles and intelligent automobiles, more and more functions are carried on the vehicle, so voice interaction in the intelligent automobile cabin becomes more important. For example, in aspects of vehicle control, intelligent navigation, multimedia entertainment, and the like, a user can control the vehicle to perform corresponding functions through voice. In the aspect of vehicle control, a user may perform window adjustment or in-vehicle temperature adjustment through instructions such as "open the right front window" or "too hot, please help me open the maximum cooling mode", or the user may also adjust a rearview mirror or change gears through an instruction, which is determined according to the actual application scenario and is not limited herein. In the aspect of intelligent navigation, a user can control the vehicle navigation service through a voice command such as "I want to go to Great East Street". In the aspect of multimedia entertainment, a user can control playing, pausing, song switching, and the like of music through voice instructions such as "I want to listen to Qilixiang", "I do not want to listen to songs", or "I do not like this song, change to Qilixiang", without limitation. Obviously, in the process of voice interaction, if accurate voice control is to be realized, accurate understanding by the terminal device of the real meaning in the user's voice is key.
For convenience of description, the following description of the present application takes in-vehicle semantic understanding as an example, that is, voice interaction in the intelligent automobile cabin is taken as an example for description.
The semantic understanding method provided by the embodiment of the present application can be applied to various terminal devices. For example, the terminal device may be a smart car head unit (or referred to as a vehicle-mounted terminal) installed in a vehicle cockpit, or the terminal device may also be a smartphone, a tablet computer, a notebook computer, a desktop computer, and the like, which is not limited herein. Optionally, the semantic understanding method provided in the embodiment of the present application may also be executed by a chip, for example, a vehicle-mounted chip, which is not limited herein. Optionally, the semantic understanding method provided in this embodiment of the present application may also be executed by a server or a chip in the server, where the server may be a local server or a cloud server, which is not limited herein. For convenience of description, the following description takes a terminal device as an example. Specifically, with the semantic understanding method provided by the present application, the terminal device can perform intention recognition on the user voice and perform entity recognition on the entities contained in the user voice that are associated with the user intention, so that the corresponding vehicle function can be accurately controlled according to the recognized semantic intention and entity. The entities in the present application include names of people, places, organizations, dates, currencies, percentages, and other customized entities. For example, other customized entities may be a role name, a dish name, and the like, which are determined according to the actual application scenario and are not limited herein. For example, for the text data 1 "I want to hear Qilixiang", the semantic intention is to play music, and the entity associated with the semantic intention is "Qilixiang".
For another example, for the text data 2 "navigate to Zhuhai Bridge", the semantic intention is to start navigation, and the entity associated with the semantic intention is "Zhuhai Bridge".
Specifically, when intention recognition and entity recognition are performed on user voice, the user voice can be converted into corresponding text data, and then the text data obtained through conversion is processed. Referring to fig. 1, fig. 1 is a flow chart of a semantic understanding method according to an embodiment of the present disclosure. As shown in fig. 1, after text data obtained by voice conversion of a user is acquired, the text data may be input to an encoder, so as to implement semantic feature extraction on the text data through the encoder. Then, the semantic feature data corresponding to the text data output by the encoder is input to a semantic intention decoder to acquire the semantic intention output by the semantic intention decoder. In order to improve the recognition accuracy of the entity associated with the semantic intention, the semantic intention output by the semantic intention decoder can be used as prior information to be fused with original text data (namely the text data corresponding to the user voice), and then the semantic feature extraction of the fused data is realized based on the same encoder. And finally, inputting the semantic feature data corresponding to the fused data output by the encoder into a semantic entity decoder so as to output an entity recognition result through the semantic entity decoder. And finally, performing corresponding function control according to the acquired semantic intention and the entity.
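The flow described above can be sketched as a small two-pass routine. This is an illustrative outline only, with toy stand-in models: the real encoder and decoders in this application are trained neural networks, and the bracket-style fusion of intention and text is merely an assumption of how the fusion data might be formed:

```python
def semantic_understanding(text, encoder, intent_decoder, entity_decoder):
    """Two passes through the SAME encoder: intention first, then entities."""
    # Pass 1: extract semantic features from the raw text, decode the intention.
    intent_features = encoder(text)
    intent = intent_decoder(intent_features)
    # Fuse the recognised intention with the original text as prior information.
    fused = "[" + intent + "] " + text
    # Pass 2: re-encode the fusion data with the same encoder, decode entities.
    entity_features = encoder(fused)
    entities = entity_decoder(entity_features)
    return intent, entities

# Toy stand-ins for the trained models (assumptions, not the patented networks):
encoder = lambda s: s.lower()
intent_decoder = lambda feats: "Music" if "hear" in feats else "Other"
entity_decoder = lambda feats: ["Qilixiang"] if "qilixiang" in feats else []

intent, entities = semantic_understanding(
    "I want to hear Qilixiang", encoder, intent_decoder, entity_decoder)
```

The essential point carried over from the text is that the second pass reuses the same `encoder` on the fused input, rather than a separate second encoder.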
By contrast, current natural language understanding (NLU) processing typically accomplishes intention recognition and entity recognition as two separate tasks. For example, please refer to fig. 2, which is a flow chart of such a semantic understanding method. As shown in fig. 2, after the text data corresponding to the user voice is obtained, the text data may be input into a first encoder and a second encoder respectively for encoding, so as to obtain the semantic feature data corresponding to the text data output by the first encoder and the semantic feature data corresponding to the text data output by the second encoder. Then, the semantic feature data output by the first encoder is input into a semantic intention decoder to obtain the semantic intention output by the semantic intention decoder. Meanwhile, the semantic feature data output by the second encoder is input into a semantic entity decoder to obtain the entity output by the semantic entity decoder. Finally, corresponding function control is performed according to the acquired semantic intention and entity.
Obviously, compared with the prior art in which the first encoder and the second encoder separately extract the semantic feature data of the text data, the present application extracts both the semantic features of the text data and the semantic features of the fusion data (comprising the text data and the semantic intention) with the same encoder, so that more semantic features can be learned by that encoder, and the semantic understanding accuracy is improved.
The technical solution in the present application will be described in detail with reference to fig. 3 to 10. Referring to fig. 3, fig. 3 is another schematic flow chart of the semantic understanding method according to the embodiment of the present application. It can be understood that the semantic understanding method provided in the present application can be executed by a terminal device, or a chip in the terminal device, or a chip in a server or a server, etc., and is not limited herein. For convenience of description, the following embodiments of the present application take a terminal device as an example for description. As shown in fig. 3, the semantic understanding method may include the following steps:
S101, acquiring text data.
In some possible embodiments, the speech signal of the user may be collected by a microphone or a microphone array (i.e., a plurality of microphones arranged in an array) of the terminal device, and the speech signal is then input into an Automatic Speech Recognition (ASR) system to convert it into text data. For example, during the starting or driving of the automobile, the voice signal T of the user can be collected through a microphone of the terminal device (i.e., the vehicle-mounted terminal) installed on the automobile, and the collected voice signal T is then transmitted to the vehicle-mounted voice recognition system. After the vehicle-mounted voice recognition system receives the voice signal T, it can convert the voice signal T into text data X, so that semantic understanding of the voice input by the user can be achieved by analyzing the text data.
In some feasible embodiments, after the terminal device collects the voice signal, the collected voice signal may be subjected to denoising processing such as echo cancellation and crosstalk cancellation, and then the denoised voice signal is input into the ASR system for text conversion.
Optionally, in some possible embodiments, if the terminal device does not have a voice collecting or voice recognition function, the terminal device may further receive text data corresponding to the voice of the user from another terminal device having the voice collecting and voice recognition functions through a wired or wireless communication manner.
Optionally, in some possible embodiments, the terminal device may further obtain text data from a local storage or a cloud storage, or may further receive pre-stored text data from another terminal device for semantic understanding through a wired or wireless communication manner, or may further obtain text data input by a user on a display interface of the terminal device for semantic understanding, and the like, which is not limited herein.
S102, first semantic feature data corresponding to the text data are obtained through an encoder, and a semantic intention corresponding to the first semantic feature data is determined through a semantic intention decoder.
In some possible embodiments, semantic feature data corresponding to the text data, that is, the first semantic feature data, may be obtained by the encoder, and then the semantic intent corresponding to the first semantic feature data is determined by the semantic intent decoder. That is, the encoder can be used to extract semantic features such as lexical, syntactic, etc. of the input text (i.e., text data) to provide information input for the semantic intent decoder. Specifically, when performing intent recognition on text data, a first word vector matrix corresponding to the text data may be determined first. And then inputting the first word vector matrix into an encoder to obtain a semantic feature vector output by the encoder as the first semantic feature data. The first word vector matrix comprises a word vector corresponding to each character in a plurality of characters forming the text data. It should be understood that the semantic feature data referred to in this application may be lexical, syntactic, etc. features in the textual information.
The character splitting processing is performed on the text data, so that the plurality of characters included in the text data can be obtained. The character splitting processing on the text data can be understood as dividing the text data character by character. For example, by performing character splitting on the text data 1 "I want to hear Qilixiang", the 6 characters "I", "want", "hear", "seven", "li", and "xiang" can be obtained. For another example, by performing character splitting on the text data 2 "navigate to Zhuhai Bridge", the 7 characters "lead", "navigate", "go", "bead", "sea", "big", and "bridge" can be obtained. Further, by obtaining the character word vector, the position word vector, and the character type word vector corresponding to each character in the plurality of characters constituting the text data, the obtained character word vector, position word vector, and character type word vector corresponding to each character may be summed to obtain the word vector corresponding to that character. Therefore, a first word vector matrix corresponding to the text data can be generated from the plurality of word vectors corresponding to the plurality of characters. It should be understood that the character word vector corresponding to each character is a vectorized representation of that character, the position word vector corresponding to each character indicates the position of the character in the text data, and the character type word vector indicates the character type to which the character belongs. The vector dimensions of the character word vector, the position word vector, and the character type word vector corresponding to each character are the same.
For example, the vector dimensions of the character word vector, the position word vector, and the character type word vector in the embodiment of the present application may be 768 or 312, which is determined according to the actual application scenario and is not limited herein.
Generally, to improve processing efficiency, a character word vector lookup table, a position word vector lookup table, and a character type word vector lookup table may be preset. The character word vector lookup table includes n character word vectors corresponding to n characters. The position word vector lookup table includes m position word vectors corresponding to m positions, where n and m are integers greater than 0. The character type word vector lookup table includes a character type word vector corresponding to non-padding characters and a character type word vector corresponding to padding characters. That is, the present application may include 2 types of characters, namely padding characters and non-padding characters, where each character constituting the text data belongs to the non-padding character type. Therefore, after the plurality of characters constituting the text data are determined, the character word vector corresponding to each of these characters can be obtained from the character word vector lookup table, the position word vector corresponding to the position of each character in the text data can be obtained from the position word vector lookup table, and the character type word vector corresponding to non-padding characters can be obtained from the character type word vector lookup table as the character type word vector corresponding to each character. The character word vector, the position word vector, and the character type word vector corresponding to each character are then summed to obtain the word vector corresponding to that character.
For example, taking the character "I" in the text data 1 "I want to hear Qilixiang" as an example, assuming that the character word vector [1,2,…,6], the position word vector [3,4,…,1], and the character type word vector [1,1,…,1] corresponding to "I" are respectively obtained from the character word vector lookup table, the position word vector lookup table, and the character type word vector lookup table, the word vector [5,7,…,8] corresponding to the character "I" can be obtained by summing the above 3 vectors (i.e., the character word vector, the position word vector, and the character type word vector). For another example, taking the character "want" in the same text data as an example, assuming that the character word vector [0,1,…,3], the position word vector [2,7,…,9], and the character type word vector [1,1,…,1] corresponding to "want" are respectively obtained from the three lookup tables, the word vector [3,9,…,13] corresponding to the character "want" can be obtained by summing the 3 vectors. By analogy, the word vector corresponding to each of the remaining characters "hear", "seven", "li", and "xiang" in the text data 1 can be obtained respectively.
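The lookup-and-sum construction of a word vector described above can be sketched as follows. The 3-dimensional toy tables are assumptions (the application suggests 768 or 312 dimensions), but the arithmetic reproduces the numeric example in the text:

```python
def word_vector(char, position, char_type, char_table, pos_table, type_table):
    """Word vector = character word vector + position word vector + type word vector."""
    return [c + p + t for c, p, t in zip(char_table[char],
                                         pos_table[position],
                                         type_table[char_type])]

# Toy 3-dimensional lookup tables (assumed values); the sums reproduce the
# example word vectors [5, 7, ..., 8] and [3, 9, ..., 13] from the text.
char_table = {"I": [1, 2, 6], "want": [0, 1, 3]}
pos_table = {0: [3, 4, 1], 1: [2, 7, 9]}
type_table = {"non_pad": [1, 1, 1], "pad": [0, 0, 0]}

v_i = word_vector("I", 0, "non_pad", char_table, pos_table, type_table)
v_want = word_vector("want", 1, "non_pad", char_table, pos_table, type_table)
```

Stacking one such vector per character row by row yields the first word vector matrix.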
After the plurality of word vectors corresponding to the plurality of characters constituting the text data are obtained, the first word vector matrix corresponding to the text data can be generated from these word vectors. In general, the length of the character sequence input to the encoder is fixed, i.e., the matrix size of the word vector matrix input to the encoder is fixed. Therefore, when the length of the text data (i.e., the number of characters constituting it) is smaller than the prescribed character sequence length, the character sequence can be padded up to that length. The character sequence length input to the encoder in the embodiment of the present application may be 512, 256, and the like, which is determined according to the actual application scenario and is not limited herein.
For example, please refer to fig. 4, which is a schematic diagram of an application scenario of semantic intent recognition provided in an embodiment of the present application. As shown in fig. 4, assume that the length of the character sequence that the encoder can process is 10. For the acquired text data 1 "I want to hear Qilixiang" corresponding to the user's input voice, a character [CLS] is added at the beginning of the text data to indicate its start, and a character [SEP] is added at the end of the text data to indicate its end. Based on the length 6 of the text data itself (i.e., the 6 characters included in text data 1) plus the [CLS] character and the [SEP] character, it can be determined that the input length of the text data (here, 8 characters) is smaller than the encoder's set processing length of 10. Therefore, the text data can be length-padded with the padding character [PAD]. That is, when the input length of the text data is smaller than the character sequence length preset by the encoder, [PAD] characters can be filled in to pad the text data, so that the input satisfies the settings of the encoder. As shown in fig. 4, the [CLS] character before the first character "I" of the text data indicates the beginning of the text data, and the [SEP] character after the last character indicates its end. In the scenario shown in fig. 4, 2 [PAD] characters can be appended after the [SEP] character to pad the text data to 10 characters.
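The [CLS]/[SEP] wrapping and [PAD] completion described above can be sketched as follows; the function name and the English character glosses are illustrative assumptions:

```python
def build_input(chars, seq_len=10):
    """Wrap the characters in [CLS] ... [SEP] and pad with [PAD] to seq_len."""
    seq = ["[CLS]"] + list(chars) + ["[SEP]"]
    if len(seq) > seq_len:
        raise ValueError("text exceeds the encoder's fixed sequence length")
    return seq + ["[PAD]"] * (seq_len - len(seq))

# Text data 1 has 6 characters, so with [CLS] and [SEP] the input length is 8
# and 2 [PAD] characters are appended, matching the Fig. 4 scenario.
tokens = build_input(["I", "want", "hear", "seven", "li", "xiang"], seq_len=10)
```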
Further, by obtaining the character word vector, the position word vector, and the character type word vector corresponding to each of the 10 characters, the three vectors corresponding to each character may be summed, and the summed vector determined as the word vector corresponding to that character. It should be understood that since the [CLS] character, the [SEP] character, and each character constituting the text data are non-padding characters, their character type word vectors are all the character type word vector E1 corresponding to non-padding characters. Since the [PAD] character is a padding character, its character type word vector is the character type word vector E0 corresponding to padding characters. Further, assuming that the vector dimension of the word vector corresponding to each character is 768, the word vectors of the above 10 characters form a first word vector matrix of 10 × 768, where each row in the first word vector matrix represents the word vector corresponding to one character. It should be understood that by inputting the first word vector matrix into a pre-trained encoder, the semantic feature vector output by the encoder can be obtained as the first semantic feature data corresponding to the text data. Further, by inputting the first semantic feature data into the semantic intention decoder, the semantic intention output by the semantic intention decoder is obtained. As shown in fig. 4, the semantic intention of the text data 1 "I want to hear Qilixiang" is "Music", i.e., playing music.
The encoder in the embodiment of the present application may be composed of j layers of coding units, where j is an integer greater than 0; for example, j may be equal to 6. It should be understood that the j layers of coding units are connected in series: the input of the first-layer coding unit is the first word vector matrix corresponding to the text data, and the input of every coding unit after the first layer is the output of the coding unit in the layer above it. For example, please refer to fig. 5, which is a schematic structural diagram of an encoder according to an embodiment of the present disclosure. As shown in fig. 5, assume that the encoder includes 4 layers of coding units (i.e., j = 4), namely a first-layer coding unit, a second-layer coding unit, a third-layer coding unit, and a fourth-layer coding unit. The input of the first-layer coding unit is the first word vector matrix corresponding to the text data, the input of the second-layer coding unit is the output of the first-layer coding unit, the input of the third-layer coding unit is the output of the second-layer coding unit, and so on: the input of the fourth-layer coding unit is the output of the third-layer coding unit. Finally, the output of the fourth-layer coding unit may be taken as the output of the entire encoder.
It should be understood that for the j layers of coding units included in the encoder, each layer of coding unit may be composed of a multi-head attention mechanism layer and a forward pass layer, connected in series. For example, referring to fig. 5 and taking the first-layer coding unit as an example, the forward pass layer included in the first-layer coding unit is connected after its multi-head attention mechanism layer. Therefore, for the first-layer coding unit, the first word vector matrix is input into the multi-head attention mechanism layer of the first-layer coding unit, and the output data obtained through the multi-head attention mechanism layer can then be input into the forward pass layer of the first-layer coding unit. Furthermore, the output data obtained after passing through the forward pass layer of the first-layer coding unit may be input, as the output data of the first-layer coding unit, into the next-layer coding unit (i.e., the second-layer coding unit), and so on, until the output of the forward pass layer of the fourth-layer coding unit is obtained as the output of the encoder. That is, the semantic feature vector output by the fourth-layer coding unit may be taken as the first semantic feature data corresponding to the text data. Finally, the semantic intention corresponding to the first semantic feature data can be determined through the semantic intention decoder, so as to implement intention identification on the text data.
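The serial connection of coding units described above amounts to folding the input through the layers in order; a minimal sketch, with toy layers (each merely adds 1 to every element) standing in for real attention/forward-pass blocks:

```python
def encoder_forward(x, layers):
    """Serially-connected coding units: each layer consumes the previous output."""
    for layer in layers:
        x = layer(x)
    return x

# Four toy coding units (j = 4); real ones would be multi-head attention plus
# forward-pass sublayers as described in the text.
layers = [lambda v: [e + 1 for e in v] for _ in range(4)]
out = encoder_forward([0, 0], layers)
```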
It should be understood that the multi-head attention mechanism layer is the core level in each layer of coding unit. The multi-head attention mechanism layer may learn a weight for each character in the input word vector matrix. Referring to fig. 6, fig. 6 is a schematic structural diagram of a multi-head attention mechanism layer according to an embodiment of the present disclosure. For the word vector corresponding to each character included in the input word vector matrix (i.e., the first word vector matrix), a Query vector (for convenience of description, hereinafter referred to as a Q vector), a Key vector (hereinafter referred to as a K vector), and a Value vector (hereinafter referred to as a V vector) may be generated. That is, by multiplying the first word vector matrix by three different weight matrices W^Q, W^K, and W^V, respectively, the Q vector matrix, the K vector matrix, and the V vector matrix corresponding to the first word vector matrix can be obtained; then, the Attention calculation is performed on the Q vector matrix, the K vector matrix, and the V vector matrix, so that the output result of each of the plurality of heads can be obtained. As shown in fig. 6, assume that the multi-head attention mechanism layer includes a total of h heads, where h is an integer greater than 1. The Attention calculation mode is the same for each of the h heads; only the weight matrices W^Q, W^K, and W^V used for the linear transformation in each head differ. For convenience of description, the following embodiments of the present application are explained by taking the Attention calculation of any one head i of the h heads as an example.
Specifically, let the parameters of the linear transformation performed by any head i be W_i^Q, W_i^K, and W_i^V. The first word vector matrix X is linearly transformed to obtain the corresponding Q_i vector matrix, K_i vector matrix, and V_i vector matrix, and the Attention calculation is then performed on the Q_i, K_i, and V_i vector matrices to obtain the scaled dot-product Attention result head_i output by head i. Wherein:

Q_i = W_i^Q X;

K_i = W_i^K X;

V_i = W_i^V X;

Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i;

head_i = Attention(Q_i, K_i, V_i)
wherein d_k is the hidden neuron dimension of the multi-head attention mechanism layer; generally, d_k is 64.
It should be understood that after each of the h heads has gone through the above calculation process, h scaled dot-product Attention results can be obtained. Suppose the h scaled dot-product Attention results are head_1, head_2, …, head_h respectively. The h results can then be spliced together, and the splicing result subjected to a linear transformation, so that the output MultiHead(Q, K, V) of the multi-head attention mechanism layer can be obtained, wherein:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

wherein W^O is the weight matrix used for the linear transformation.
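The per-head scaled dot-product Attention and the final splicing can be sketched as follows. This is a minimal numpy illustration under the row-vector convention (X @ W rather than the patent's W X); all matrix shapes are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (seq_len, d_model); Wq/Wk/Wv: lists of h per-head projection
    matrices of shape (d_model, d_k); Wo: (h*d_k, d_model)."""
    heads = []
    for i in range(h):
        Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]   # per-head linear maps
        d_k = Q.shape[-1]
        attn = softmax(Q @ K.T / np.sqrt(d_k))      # scaled dot-product
        heads.append(attn @ V)                      # head_i
    return np.concatenate(heads, axis=-1) @ Wo      # Concat(head_1..head_h) W^O

rng = np.random.default_rng(0)
seq, d_model, h, d_k = 5, 16, 2, 8
Wq = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
Wk = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
Wv = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
Wo = rng.standard_normal((h * d_k, d_model))
X = rng.standard_normal((seq, d_model))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
assert out.shape == (seq, d_model)
```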
Optionally, in some possible embodiments, each of the j layers of coding units may further include a first vector normalization layer and a second vector normalization layer in addition to the multi-head attention mechanism layer and the forward pass layer. It should be appreciated that a vector normalization layer may be used to normalize the output vector, easing the difficulty of learning. For example, please refer to fig. 7, which is another structural schematic diagram of the encoder according to an embodiment of the present application. Let j be 4, that is, the encoder includes 4 layers of coding units: a first layer coding unit, a second layer coding unit, a third layer coding unit, and a fourth layer coding unit. As shown in fig. 7, taking the first layer coding unit as an example, the first layer coding unit may include a multi-head attention mechanism layer, a first vector normalization layer, a forward pass layer, and a second vector normalization layer. The multi-head attention mechanism layer is connected to the forward pass layer through the first vector normalization layer, and the output of the forward pass layer is connected to the second vector normalization layer. Therefore, for the first layer coding unit, the first word vector matrix is input into the multi-head attention mechanism layer of the first layer coding unit, and the output data obtained through that multi-head attention mechanism layer is input into the first vector normalization layer of the first layer coding unit for normalization processing.
The normalized output data is then input into the forward pass layer of the first layer coding unit. After being processed sequentially by the forward pass layer and the second vector normalization layer, the output normalized by the second vector normalization layer serves as the output data of the first layer coding unit and is input into the next layer coding unit (i.e., the second layer coding unit), and so on, until the semantic feature vector output by the second vector normalization layer of the fourth layer coding unit is obtained as the first semantic feature data corresponding to the text data. Finally, by inputting this semantic feature vector into the semantic intention decoder, the semantic intention can be determined, so as to implement intention recognition on the text data. That is, for a given layer coding unit, after the output MultiHead(Q, K, V) of its multi-head attention mechanism layer is obtained, this output may be input into the first vector normalization layer, where the first vector normalization layer may satisfy:
x=LayerNorm(MultiHead(Q,K,V)+Sublayer(MultiHead(Q,K,V)))
where x is the output of the first vector normalization layer, LayerNorm denotes the normalization calculation operation, MultiHead (Q, K, V) is the output of the multi-head attention mechanism layer, and Sublayer denotes the residual calculation operation.
After the output result x normalized by the first vector normalization layer is obtained, a high-dimensional mapping of the vector space can be realized through the forward pass layer, extracting abstract high-dimensional semantic information such as lexical and syntactic features. Specifically, the forward pass layer may satisfy:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

wherein FFN(x) represents the output of the forward pass layer, max represents the maximum value calculation operation, W_1 and W_2 represent weight matrices in the forward pass layer, b_1 and b_2 are the bias parameters of the respective weight matrices, and x is the output of the first vector normalization layer.
Further, after the output result FFN(x) of the forward pass layer is obtained, it may be input into a second vector normalization layer, where the second vector normalization layer satisfies:

V=LayerNorm(FFN(x)+Sublayer(FFN(x)))

where V denotes the output result of the second vector normalization layer, LayerNorm denotes the normalization calculation operation, FFN(x) denotes the output of the forward pass layer, and Sublayer denotes the residual calculation operation.
It should be understood that after the normalized output of the second vector normalization layer of a given layer coding unit is obtained, it serves as the output result of that layer coding unit and is input to the next layer coding unit, and so on, until the output result of the second vector normalization layer of the j-th layer coding unit is obtained as the first semantic feature data corresponding to the text data.
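One coding unit of fig. 7 can be sketched as follows, interpreting Sublayer(·) in the formulas above as the residual connection (an assumption; the patent does not spell it out). The attention sublayer is passed in as a function so the sketch stays self-contained.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def coding_unit(x, attn_sublayer, W1, b1, W2, b2):
    """Multi-head attention -> first vector normalization layer -> forward
    pass layer -> second vector normalization layer, with residuals."""
    a = layer_norm(x + attn_sublayer(x))            # first vector normalization
    ffn = np.maximum(0, a @ W1 + b1) @ W2 + b2      # FFN(x) = max(0, xW1+b1)W2+b2
    return layer_norm(a + ffn)                      # second vector normalization

rng = np.random.default_rng(1)
d, d_ff = 8, 32
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)
x = rng.standard_normal((5, d))
identity_attn = lambda t: t   # stand-in for the multi-head attention sublayer
v = coding_unit(x, identity_attn, W1, b1, W2, b2)
assert v.shape == x.shape
```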
In some possible embodiments, after obtaining an output result of the encoder, that is, an output result of the j-th layer encoding unit, a semantic intention corresponding to the first semantic feature data may be determined according to the output result of the encoder (that is, the first semantic feature data) and the semantic intention decoder, so as to implement intention recognition on the text data. Wherein, the semantic intention decoder can satisfy:
y_1 = F(W_3 · v_1 + b_3)

wherein y_1 represents the semantic intention, W_3 represents the weight matrix of the semantic intention decoder, b_3 represents the bias parameter, and v_1 represents the first row vector of the output matrix V of the encoder.
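The semantic intention decoder above can be sketched as a linear map over the first row vector of the encoder output. F is assumed here to be softmax (the patent leaves F unspecified); all dimensions are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def intent_decoder(V, W3, b3):
    """y_1 = F(W_3 * v_1 + b_3): classify the semantic intention from the
    first row vector v_1 of the encoder output matrix V."""
    v1 = V[0]                       # first row vector
    return softmax(W3 @ v1 + b3)    # distribution over intention categories

rng = np.random.default_rng(2)
num_intents, d = 3, 8               # e.g. Music / Navigation / Phone (assumed)
V = rng.standard_normal((5, d))     # first semantic feature data
W3, b3 = rng.standard_normal((num_intents, d)), np.zeros(num_intents)
probs = intent_decoder(V, W3, b3)
intent = int(np.argmax(probs))      # index of the decoded semantic intention
assert probs.shape == (num_intents,)
```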
S103, generating fusion data according to the text data and the semantic intention, and acquiring second semantic feature data corresponding to the fusion data based on the encoder.
In some possible embodiments, after the semantic intention is determined based on the semantic intention decoder, fusion data may be generated according to the text data and the obtained semantic intention, and second semantic feature data corresponding to the fusion data may be acquired based on the encoder. It should be understood that by splicing the text data and the semantic intention, the fusion data corresponding to the text data and the semantic intention can be obtained. Further, the word vector corresponding to each character constituting the fusion data is acquired, and a second word vector matrix is generated according to these word vectors. The word vector corresponding to each character of the fusion data is acquired in the same way as for the text data: the character word vector, position word vector, and character type word vector corresponding to each character are respectively obtained by querying the character word vector lookup table, the position word vector lookup table, and the character type word vector lookup table, and are then summed to obtain the word vector corresponding to the character. Finally, the second word vector matrix corresponding to the fusion data is generated from the word vectors of its characters; for the specific implementation, reference may be made to the description of generating the first word vector matrix corresponding to the text data, which is not repeated here.
It is understood that after the second word vector matrix corresponding to the fusion data is determined, the semantic feature vector output by the encoder can be obtained as the second semantic feature data corresponding to the fusion data by inputting the second word vector matrix into the encoder.
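The splicing step can be sketched as follows. The separator token is an assumption for illustration; the patent only states that the text data and the semantic intention are spliced before being re-embedded and re-encoded.

```python
def build_fusion_data(text, intent_label, sep="[SEP]"):
    """Splice the text data with the decoded semantic intention. The spliced
    string is then embedded into the second word vector matrix in the same
    way as the text data and fed back through the same encoder."""
    return text + sep + intent_label

fused = build_fusion_data("我想听七里香", "Music")
assert fused == "我想听七里香[SEP]Music"
```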
And S104, determining an entity corresponding to the second semantic feature data based on the semantic entity decoder so as to realize entity identification associated with the semantic intention in the text data.
In some possible embodiments, after the second semantic feature data corresponding to the fusion data is determined, the entity associated with the semantic intent included in the text data can be determined according to the semantic entity decoder by inputting the second semantic feature data into the semantic entity decoder. Wherein, the semantic entity decoder can satisfy:
y_i = F(W_4 · T + b_4)

wherein y_i represents the output result of the semantic entity decoder, W_4 represents the weight matrix of the semantic entity decoder, b_4 represents the bias parameter, and T represents the second semantic feature data corresponding to the fusion data output by the encoder.
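Since y_i is computed per position, the semantic entity decoder can be read as a per-character tagger over the second semantic feature data. The sketch below assumes F is a row-wise softmax over an assumed BIO-style tag set; both are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entity_decoder(T, W4, b4):
    """y = F(W_4 * T + b_4): one label distribution per character of the
    fusion data, computed from the second semantic feature data T."""
    return softmax(T @ W4 + b4)     # (seq_len, num_tags)

rng = np.random.default_rng(3)
seq, d, num_tags = 6, 8, 5          # e.g. O, B-Song, I-Song, B-Singer, I-Singer
T = rng.standard_normal((seq, d))   # second semantic feature data
W4, b4 = rng.standard_normal((d, num_tags)), np.zeros(num_tags)
y = entity_decoder(T, W4, b4)
tags = y.argmax(axis=-1)            # one predicted entity tag per character
assert y.shape == (seq, num_tags)
```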
It should be understood that after the semantic intention corresponding to the text data and the entity associated with the semantic intention are determined based on the above steps, the vehicle may be controlled to execute the target function according to the semantic intention and the entity. For example, please refer to fig. 8, which is a schematic view of an application scenario of entity recognition according to an embodiment of the present application. As shown in fig. 8, assume that the text data 1 corresponding to the speech entered by the user is "I want to listen to Qilixiang". After the first word vector matrix corresponding to text data 1 is determined, it is input to the encoder 210, so as to obtain the semantic feature vector output by the encoder 210 as the first semantic feature data corresponding to text data 1. By inputting this first semantic feature data into the semantic intention decoder 211, it can be determined from the output of the semantic intention decoder 211 that the semantic intention of the user is "Music", i.e., playing music. Further, by splicing text data 1 with the semantic intention, the fusion data can be obtained. The second word vector matrix corresponding to the fusion data is then determined and input to the encoder 210, and the encoder 210 determines the second semantic feature data corresponding to the fusion data. Finally, the second semantic feature data is input into the semantic entity decoder 212, and the entity recognition result output by the semantic entity decoder 212 can be acquired as "Qilixiang". Therefore, according to the determined semantic intention and the entity associated with it, the vehicle-mounted terminal can play music for the user, the played song being "Qilixiang".
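The two-pass flow of fig. 8 can be summarized in one function. The component implementations here are toy stand-ins (assumptions, not the trained models); the point is the structure: the same encoder is reused for both the text data and the fusion data, and only the decoders differ.

```python
def semantic_understanding(text, encode, decode_intent, decode_entity, fuse):
    """Pass 1: encode text, decode the semantic intention.
    Pass 2: splice text with the intention, re-encode with the SAME
    encoder, and decode the entity associated with the intention."""
    first_features = encode(text)
    intent = decode_intent(first_features)      # e.g. "Music"
    fused = fuse(text, intent)
    second_features = encode(fused)
    entities = decode_entity(second_features)   # e.g. the song title
    return intent, entities

intent, entities = semantic_understanding(
    "I want to listen to Qilixiang",
    encode=lambda s: s,                          # stand-in encoder
    decode_intent=lambda f: "Music",             # stand-in intention decoder
    fuse=lambda t, i: t + "|" + i,               # stand-in splicing
    decode_entity=lambda f: {"Song": "Qilixiang"},  # stand-in entity decoder
)
assert intent == "Music"
```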
In the method and the device, after the text data obtained by converting the input voice is obtained, the first semantic feature data corresponding to the text data can be obtained through the encoder, and then the semantic intention corresponding to the first semantic feature data is determined according to the semantic intention decoder, so that intention identification of the text data is realized. And generating fusion data according to the text data and the semantic intention, acquiring second semantic feature data corresponding to the fusion data based on the encoder, and determining an entity corresponding to the second semantic feature data based on a semantic entity decoder so as to realize entity identification associated with the semantic intention in the text data.
It should be understood that the semantic feature extraction of the text data and that of the fusion data are realized by the same encoder, which can improve the accuracy of semantic understanding. The encoder is obtained by training in a decoupled pre-training mode. Specifically, the decoupled pre-training mode may be understood as follows: the encoder in the present application is trained according to a first training sample and a second training sample; the semantic intention decoder is trained according to the first training sample; and the semantic entity decoder is trained according to the second training sample. The first training sample includes sample text data and the intention category corresponding to the pre-labeled sample text data; the second training sample includes sample fusion data and the entity corresponding to the pre-labeled sample fusion data, where the sample fusion data includes the sample text data and the intention category corresponding to the sample text data. For ease of understanding, the training process of the encoder, the semantic intention decoder, and the semantic entity decoder provided in the present application is exemplified below.
Referring to fig. 9 together, fig. 9 is another flowchart of a semantic understanding method provided in the embodiment of the present application. As shown in fig. 9, the semantic understanding method may include the following steps:
s201, obtaining a first training sample and a second training sample.
In some possible embodiments, when training the initial encoder, the initial semantic intent decoder, and the initial semantic entity decoder, a training sample set may be obtained first, where the training sample set includes a first training sample and a second training sample. The first training sample comprises sample text data and intention categories corresponding to the pre-marked sample text data. The second training sample comprises sample fusion data and entities corresponding to the pre-labeled sample fusion data. It can be understood that the sample fusion data is determined according to the sample text data included in the first training sample and the intention category corresponding to the pre-labeled sample text data. Generally, the entity corresponding to the pre-labeled sample fusion data is an entity associated with the intent category included in the sample text data.
S202, third semantic feature data corresponding to the sample text data are obtained through the initial encoder, and semantic intentions corresponding to the third semantic feature data are predicted through the initial semantic intention decoder.
In some possible embodiments, when the initial encoder and the initial semantic intent decoder are trained according to the first training sample, the weight parameters of the initial encoder and the weight parameters of the initial semantic intent decoder may be adjusted, so as to obtain the first encoder and the semantic intent decoder. When the initial encoder and the initial semantic entity decoder are trained according to the second training sample, the further adjustment of the weight parameters of the adjusted initial encoder (i.e. the adjustment of the weight parameters of the first encoder) and the adjustment of the weight parameters of the initial semantic entity decoder can be realized. That is, third semantic feature data corresponding to the sample text data may be obtained by the initial encoder, and semantic intents corresponding to the third semantic feature data may be predicted by the initial semantic intent decoder. Specifically, the word vector matrix corresponding to the sample text data included in the first training sample is input into the initial encoder, and after the initial encoder performs encoding, the semantic feature data output by the initial encoder can be input into the initial semantic intention decoder, so that the semantic intention output by the initial semantic intention decoder is obtained. Therefore, the weighting parameters of the initial encoder and the initial semantic intent decoder can be adjusted according to the predicted semantic intent and the intent categories corresponding to the pre-labeled sample text data, so as to train the initial encoder and the initial semantic intent decoder.
S203, adjusting weight parameters of an initial encoder and an initial semantic intention decoder according to the semantic intention obtained through prediction and intention categories corresponding to the pre-labeled sample text data so as to train the initial encoder and the initial semantic intention decoder, and obtaining a first encoder and a semantic intention decoder.
In some possible embodiments, the weight parameters of the initial encoder and the initial semantic intent decoder may be adjusted according to the predicted semantic intent and the intent category corresponding to the pre-labeled sample text data, so as to train the initial encoder and the initial semantic intent decoder, and obtain the first encoder and the semantic intent decoder. Specifically, the first encoder and the semantic intention decoder may be obtained by calculating a first loss of the semantic intention output by the initial semantic intention decoder and an intention category corresponding to the pre-labeled sample text data, and then adjusting the weight parameters of the initial encoder and the initial semantic intention decoder according to the calculated first loss.
S204, fourth semantic feature data corresponding to the sample fusion data are obtained through the first encoder, and an entity corresponding to the fourth semantic feature data is predicted based on the initial semantic entity decoder.
In some possible embodiments, fourth semantic feature data corresponding to the sample fusion data may be obtained by the first encoder, and an entity corresponding to the fourth semantic feature data may be predicted based on the initial semantic entity decoder. Therefore, the weight parameters of the first encoder and the initial semantic entity decoder can be adjusted according to the entity obtained by prediction and the entity corresponding to the pre-labeled sample fusion data, so as to train the first encoder and the initial semantic entity decoder. Specifically, the word vector matrix corresponding to the sample fusion data composed of the sample text data and the intention category corresponding to the pre-labeled sample text data included in the second training sample may be input to the encoder trained by the first training sample, so as to obtain semantic feature data corresponding to the sample fusion data output by the encoder. By inputting the semantic feature data corresponding to the sample fusion data into the initial semantic entity decoder, the entity output by the initial semantic entity decoder can be obtained. Therefore, the first encoder and the initial semantic entity decoder may be trained based on the entities obtained from the prediction and the entities corresponding to the pre-labeled sample fusion data.
S205, adjusting the weight parameters of the first encoder and the initial semantic entity decoder according to the entity obtained by prediction and the entity corresponding to the pre-labeled sample fusion data, so as to train the first encoder and the initial semantic entity decoder, and obtain the encoder and the semantic entity decoder.
In some possible embodiments, the weight parameters of the first encoder and the initial semantic entity decoder are adjusted according to the predicted entity and the entity corresponding to the pre-labeled sample fusion data, so as to train the first encoder and the initial semantic entity decoder and obtain the encoder and the semantic entity decoder. Specifically, a second loss between the entity output by the initial semantic entity decoder and the pre-labeled entity is calculated, and the weight parameters of the first encoder and the initial semantic entity decoder are adjusted according to the second loss. The training process ends when it is determined, based on a test sample set, that the encoder trained by the first and second training samples, the semantic intention decoder trained by the first training sample, and the semantic entity decoder trained by the second training sample satisfy the target convergence condition.
Specifically, when determining whether the adjusted encoder, semantic intention decoder, and semantic entity decoder satisfy the target convergence condition, a test sample set may first be obtained. The test sample set includes a plurality of pieces of test text data, the pre-labeled intention category corresponding to each piece of test text data, and the entities included in each piece of test text data. Each piece of test text data in the test sample set can be encoded by the adjusted encoder to obtain its corresponding semantic feature data, and intention recognition is then performed on that semantic feature data by the adjusted semantic intention decoder to obtain the user intention corresponding to each piece of test text data. Further, the fusion data composed of each piece of test text data and its output user intention is encoded by the adjusted encoder to obtain the semantic feature data corresponding to each piece of fusion data, and entity recognition is then performed on that semantic feature data by the adjusted semantic entity decoder to obtain the entities included in each piece of test text data.
If the intention recognition accuracy, determined from the user intentions output by the semantic intention decoder for the test texts and the pre-labeled intention categories, is not less than a first accuracy threshold, and the entity recognition accuracy, determined from the entities output by the semantic entity decoder for the test texts and the pre-labeled entities, is not less than a second accuracy threshold, it can be determined that the adjusted encoder, semantic intention decoder, and semantic entity decoder satisfy the target convergence condition. The training can then end, and the encoder, semantic intention decoder, and semantic entity decoder obtained by the training can be used as those used in the steps of fig. 3. Correspondingly, if it is determined based on the test sample set that the adjusted encoder, semantic intention decoder, and semantic entity decoder do not satisfy the target convergence condition, the training process continues until the target convergence condition is satisfied. For the subsequent process of processing the text data based on the trained encoder, semantic intention decoder, and semantic entity decoder, reference may be made to the implementation process described in each step in fig. 3, which is not described herein again.
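The decoupled pre-training loop of steps S201 to S205 can be sketched structurally as follows. The step functions and convergence check are injected callables (assumptions standing in for real gradient updates and the test-set accuracy check); the sketch shows only which components are updated in each phase.

```python
def decoupled_pretraining(train_intent_step, train_entity_step,
                          first_samples, second_samples, converged):
    """Phase 1: each first-sample step adjusts the encoder and the
    (initial) semantic intention decoder. Phase 2: each second-sample step
    adjusts the encoder further along with the semantic entity decoder.
    Repeat until the test-set convergence condition is satisfied."""
    while not converged():
        for text, intent_label in first_samples:
            train_intent_step(text, intent_label)
        for fused, entity_label in second_samples:
            train_entity_step(fused, entity_label)

calls = []
decoupled_pretraining(
    train_intent_step=lambda t, y: calls.append(("intent", t, y)),
    train_entity_step=lambda f, y: calls.append(("entity", f, y)),
    first_samples=[("play music", "Music")],
    second_samples=[("play music|Music", "O O")],
    converged=lambda c=iter([False, True]): next(c),  # converges after 1 pass
)
assert [c[0] for c in calls] == ["intent", "entity"]
```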
Understandably, the encoder obtained by the decoupling pre-training mode is beneficial to improving the encoding effect of the encoder, and is further beneficial to improving the precision of semantic understanding during subsequent use.
The semantic understanding apparatus in the present application will be explained below.
In the case of using an integrated unit, referring to fig. 10, fig. 10 is a schematic structural diagram of a semantic understanding apparatus provided in an embodiment of the present application. The semantic understanding device may be a terminal device or a chip in the terminal device, such as a vehicle-mounted chip. Optionally, the semantic understanding apparatus may also be a server or a chip in the server, for example, the server may be a cloud server, and the like, which is not limited herein. As shown in fig. 10, the semantic understanding apparatus includes a processing unit 1001 and a transceiving unit 1002. The transceiver 1002 may be a transceiver or a communication interface, and the processing unit 1001 may be one or more processors. The semantic understanding apparatus can be used to implement the functions of the terminal device or chip or server involved in the above method embodiments.
Illustratively, the semantic understanding apparatus may be a terminal device. The end device may be a network element in a hardware device, a software function running on dedicated hardware, or a virtualization function instantiated on a platform (e.g., a cloud platform). Optionally, the semantic understanding apparatus may further include a storage unit (not shown in the figure) for storing program codes and data of the semantic understanding apparatus.
For example, when the semantic understanding apparatus is a chip, the transceiving unit 1002 may be an interface, a pin, a circuit, or the like. The interface can be used to input data to be processed to the processor and to output the processing result of the processor. In a specific implementation, the interface may be a general-purpose input/output (GPIO) interface, and may be connected to a plurality of peripheral devices (e.g., a display (LCD), a camera, a radio frequency (RF) module, an antenna, and the like). The interface is connected to the processor through a bus.
The processing unit 1001 may be a processor that may execute computer-executable instructions stored by the memory unit to cause the chip to perform the method according to the embodiment of fig. 3.
Further, the processor may include a controller, an arithmetic unit, and registers. Illustratively, the controller is mainly responsible for instruction decoding and for sending out control signals for the operations corresponding to the instructions. The arithmetic unit is mainly responsible for executing fixed-point or floating-point arithmetic operations, shift operations, logic operations, and the like, and can also execute address calculation and conversion. The registers are mainly responsible for temporarily storing register operands, intermediate operation results, and the like during instruction execution. In a specific implementation, the hardware architecture of the processor may be an application-specific integrated circuit (ASIC) architecture, a microprocessor without interlocked pipeline stages (MIPS) architecture, an advanced RISC machine (ARM) architecture, or a network processor (NP) architecture. The processor may be single-core or multi-core.
The memory unit may be a memory unit in the chip, such as a register, a cache, etc. The storage unit may also be a storage unit located outside the chip, such as a Read Only Memory (ROM) or other types of static storage devices that can store static information and instructions, a Random Access Memory (RAM), and the like.
It should be noted that the functions corresponding to the processor and the interface may be implemented by hardware design, software design, or a combination of hardware and software, which is not limited herein.
Specifically, in one design, the semantic understanding apparatus may be configured to process the acquired text data based on a pre-trained encoder, semantic intention decoder, and semantic entity decoder, specifically:
a transceiving unit 1002, configured to acquire text data;
a processing unit 1001, configured to obtain, by an encoder, first semantic feature data corresponding to the text data, and determine, by a semantic intent decoder, a semantic intent corresponding to the first semantic feature data, so as to implement intent recognition on the text data;
the processing unit 1001 is further configured to generate fusion data according to the text data and the semantic intent, and obtain second semantic feature data corresponding to the fusion data based on the encoder;
the processing unit 1001 is further configured to determine an entity corresponding to the second semantic feature data based on a semantic entity decoder, so as to implement entity identification associated with the semantic intent in the text data.
Optionally, the processing unit 1001 is further configured to:
determining a first word vector matrix corresponding to the text data;
and inputting the first word vector matrix into an encoder, and acquiring the semantic feature vector output by the encoder as first semantic feature data corresponding to the text data.
Optionally, the processing unit 1001 is further configured to:
performing character splitting processing on the text data to obtain a plurality of characters included in the text data;
acquiring a character word vector, a position word vector and a character type word vector corresponding to each character in the plurality of characters, wherein the position word vector is used for indicating the position of the character in the text data;
summing the character word vector, the position word vector and the character type word vector corresponding to each character to obtain a word vector corresponding to each character;
and generating a first word vector matrix corresponding to the text data according to a plurality of word vectors corresponding to the characters.
Optionally, the processing unit 1001 is further configured to:
acquiring a character word vector query table, a position word vector query table and a character type word vector query table, wherein the character word vector query table comprises n character word vectors corresponding to n characters, the position word vector query table comprises m position word vectors corresponding to m positions, the character type word vector query table comprises character type word vectors corresponding to non-filled characters and character type word vectors corresponding to filled characters, and n and m are integers larger than 0;
and acquiring a character word vector corresponding to each character in a plurality of characters from the character word vector lookup table, acquiring a position word vector corresponding to the position of each character in the text data from the position word vector lookup table, and acquiring a character type word vector corresponding to the non-filled character from the character type word vector lookup table as the character type word vector corresponding to each character.
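The lookup-and-sum procedure above can be sketched as follows. The table contents and the embedding dimension are toy assumptions; the point is that each character's word vector is the sum of its character word vector, position word vector, and character type word vector.

```python
import numpy as np

def word_vector(ch, pos, char_table, pos_table, type_table, is_pad=False):
    """Sum the character word vector, position word vector, and character
    type word vector, each fetched from its own lookup table."""
    c = char_table[ch]                              # character word vector
    p = pos_table[pos]                              # position word vector
    t = type_table["pad" if is_pad else "non_pad"]  # character type word vector
    return c + p + t

d = 4
chars = list("我想听")
char_table = {ch: np.full(d, float(i)) for i, ch in enumerate(chars)}
pos_table = {i: np.full(d, 0.1 * i) for i in range(8)}
type_table = {"non_pad": np.zeros(d), "pad": np.ones(d)}

# First word vector matrix: one summed word vector per character.
matrix = np.stack([word_vector(ch, i, char_table, pos_table, type_table)
                   for i, ch in enumerate(chars)])
assert matrix.shape == (3, d)
```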
Optionally, the encoder includes j layers of coding units, wherein the input of the first layer coding unit among the j layers of coding units is the first word vector matrix, the input of any layer coding unit after the first layer coding unit is the output of the previous layer coding unit, the output of the j-th layer coding unit among the j layers of coding units is the first semantic feature data, and j is an integer greater than 0.
Optionally, the processing unit 1001 is further configured to:
and splicing the text data and the semantic intention to obtain fusion data corresponding to the text data and the semantic intention.
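The splicing step can be sketched as a simple concatenation. The `[SEP]` delimiter below is an assumption added for illustration; the embodiment only states that the text data and the semantic intention are spliced into fusion data.

```python
def fuse(text, intent, sep="[SEP]"):
    """Splice the text data with its predicted semantic intention to
    form the fusion data. The separator token is hypothetical."""
    return f"{text} {sep} {intent}"

fused = fuse("navigate to the airport", "NAVIGATION")
```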
Optionally, the processing unit 1001 is further configured to:
acquiring a word vector corresponding to each character in a plurality of characters forming the fused data;
generating a second word vector matrix according to a plurality of word vectors corresponding to a plurality of characters forming the fused data;
and inputting the second word vector matrix into the encoder, and acquiring the semantic feature vector output by the encoder as second semantic feature data corresponding to the fusion data.
Optionally, the text data is obtained by converting an input voice, and the input voice is a vehicle control voice; the processing unit 1001 is further configured to:
and controlling the vehicle to execute the target function according to the semantic intention and the entity.
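As an illustration of executing a target function from the recognized intention and entity, a minimal dispatch table is sketched below. The intention names, entities, and vehicle functions are all hypothetical; the embodiment does not enumerate them.

```python
# Hypothetical dispatch table: the embodiment states only that the vehicle
# executes a target function according to the semantic intention and the
# entity; the keys and actions here are illustrative.
ACTIONS = {
    ("OPEN_WINDOW", "left rear window"): "open_left_rear_window",
    ("SET_TEMPERATURE", "22 degrees"): "set_cabin_temperature(22)",
}

def execute(intent, entity):
    """Map an (intention, entity) pair to a vehicle target function."""
    action = ACTIONS.get((intent, entity))
    return action if action is not None else "no matching vehicle function"

cmd = execute("OPEN_WINDOW", "left rear window")
```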
Optionally, the encoder is trained according to a first training sample and a second training sample, the semantic intention decoder is trained according to the first training sample, and the semantic entity decoder is trained according to the second training sample, wherein the first training sample includes sample text data and an intention category corresponding to the pre-labeled sample text data, the second training sample includes sample fusion data and an entity corresponding to the pre-labeled sample fusion data, and the sample fusion data includes the sample text data and the intention category corresponding to the sample text data.
Optionally, the processing unit 1001 is further configured to train to obtain the encoder, the semantic intent decoder, and the semantic entity decoder by:
acquiring third semantic feature data corresponding to the sample text data through an initial encoder, and predicting a semantic intention corresponding to the third semantic feature data through an initial semantic intent decoder;
adjusting the weight parameters of the initial encoder and the initial semantic intention decoder according to the predicted semantic intention and the intention category corresponding to the pre-labeled sample text data, so as to train the initial encoder and the initial semantic intention decoder to obtain a first encoder and the semantic intention decoder;
acquiring fourth semantic feature data corresponding to the sample fusion data through the first encoder, and predicting an entity corresponding to the fourth semantic feature data based on an initial semantic entity decoder;
and adjusting the weight parameters of the first encoder and the initial semantic entity decoder according to the entity obtained by prediction and the entity corresponding to the pre-labeled sample fusion data so as to train the first encoder and the initial semantic entity decoder to obtain the encoder and the semantic entity decoder.
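The two-stage training schedule above (first the encoder with the semantic intention decoder, then the resulting first encoder with the semantic entity decoder) can be sketched with toy linear stand-ins. The single-matrix encoder and decoders, the mean-squared first and second losses, and the plain gradient updates are all simplifying assumptions; only the staging mirrors the description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: the encoder and both decoders are single linear maps so
# that the two-stage schedule is runnable; the real model is far deeper.
enc = rng.normal(size=(4, 4)) * 0.1        # initial encoder
dec_intent = rng.normal(size=(4, 2)) * 0.1 # initial semantic intention decoder
dec_entity = rng.normal(size=(4, 3)) * 0.1 # initial semantic entity decoder

x_text = rng.normal(size=(8, 4))           # sample text data (as features)
y_intent = rng.normal(size=(8, 2))         # pre-labeled intention categories
x_fused = rng.normal(size=(8, 4))          # sample fusion data (as features)
y_entity = rng.normal(size=(8, 3))         # pre-labeled entities

def mse(pred, y):
    return float(((pred - y) ** 2).mean())

lr = 0.05

# Stage 1: adjust the initial encoder and intention decoder by the first loss.
init_loss1 = mse(x_text @ enc @ dec_intent, y_intent)
for _ in range(300):
    h = x_text @ enc                                   # third semantic feature data
    g = 2 * (h @ dec_intent - y_intent) / y_intent.size
    g_dec = h.T @ g                                    # gradient for the decoder
    g_enc = x_text.T @ (g @ dec_intent.T)              # gradient for the encoder
    dec_intent -= lr * g_dec
    enc -= lr * g_enc
loss1 = mse(x_text @ enc @ dec_intent, y_intent)       # enc is now the "first encoder"

# Stage 2: adjust the first encoder and entity decoder by the second loss.
init_loss2 = mse(x_fused @ enc @ dec_entity, y_entity)
for _ in range(300):
    h = x_fused @ enc                                  # fourth semantic feature data
    g = 2 * (h @ dec_entity - y_entity) / y_entity.size
    g_dec = h.T @ g
    g_enc = x_fused.T @ (g @ dec_entity.T)
    dec_entity -= lr * g_dec
    enc -= lr * g_enc
loss2 = mse(x_fused @ enc @ dec_entity, y_entity)
```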
In another design, still referring to fig. 10, the semantic understanding apparatus may also be used to train an encoder, a semantic intent decoder, and a semantic entity decoder. It is understood that the apparatus used for training the encoder, the semantic intent decoder, and the semantic entity decoder may be the same apparatus as the one used for processing the text data, or may be a different apparatus, which is not limited here. When the semantic understanding apparatus is used to train the encoder, the semantic intent decoder, and the semantic entity decoder, the semantic understanding apparatus includes:
a transceiver 1002, configured to obtain a first training sample and a second training sample, where the first training sample includes sample text data and an intention category corresponding to the sample text data, the second training sample includes sample fusion data and a pre-labeled entity corresponding to the sample fusion data, and the sample fusion data includes the sample text data and the pre-labeled intention category corresponding to the sample text data;
a processing unit 1001, configured to obtain, by an initial encoder, third semantic feature data corresponding to the sample text data, and predict, by an initial semantic intent decoder, a semantic intent corresponding to the third semantic feature data;
the processing unit 1001 is further configured to adjust weight parameters of the initial encoder and the initial semantic intent decoder according to the predicted semantic intent and the intention category corresponding to the pre-labeled sample text data, so as to train the initial encoder and the initial semantic intent decoder, thereby obtaining a first encoder and a semantic intent decoder;
the processing unit 1001 is further configured to obtain, by the first encoder, fourth semantic feature data corresponding to the sample fusion data, and predict an entity corresponding to the fourth semantic feature data based on an initial semantic entity decoder;
the processing unit 1001 is further configured to adjust weight parameters of the first encoder and the initial semantic entity decoder according to the entity obtained by prediction and an entity corresponding to the pre-labeled sample fusion data, so as to train the first encoder and the initial semantic entity decoder, thereby obtaining an encoder and a semantic entity decoder.
Optionally, the processing unit 1001 is specifically configured to:
determining a first loss according to the predicted semantic intention and an intention category corresponding to the pre-labeled sample text data;
and adjusting the weight parameters of the initial encoder and the initial semantic intent decoder according to the first loss.
Optionally, the processing unit 1001 is specifically configured to:
determining a second loss according to the entity obtained by prediction and the entity corresponding to the pre-labeled sample fusion data;
and adjusting the weight parameters of the first encoder and the initial semantic entity decoder according to the second loss.
Optionally, the transceiver 1002 is further configured to acquire text data;
the processing unit 1001 is further configured to obtain, by an encoder, first semantic feature data corresponding to the text data, and determine, by a semantic intent decoder, a semantic intent corresponding to the first semantic feature data, so as to implement intent recognition on the text data;
the processing unit 1001 is further configured to generate fusion data according to the text data and the semantic intent, and obtain second semantic feature data corresponding to the fusion data based on the encoder;
the processing unit 1001 is further configured to determine an entity corresponding to the second semantic feature data based on a semantic entity decoder, so as to implement entity identification associated with the semantic intention in the text data.
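The two-pass inference flow (intent recognition on the text, then entity recognition on the fusion of text and intention, reusing the same encoder) can be sketched with string-based stubs. The keyword rules, intention names, and `[SEP]` delimiter below are placeholders for the trained encoder and decoders.

```python
# Stubs standing in for the trained encoder and decoders; only the
# two-pass wiring reflects the described method.
def encoder(s):
    return s.lower()                        # stand-in for semantic feature data

def intent_decoder(features):
    return "NAVIGATION" if "navigate" in features else "UNKNOWN"

def entity_decoder(features):
    return "the airport" if "airport" in features else None

text = "Navigate to the airport"
first_features = encoder(text)              # first semantic feature data
intent = intent_decoder(first_features)     # semantic intention
fused = f"{text} [SEP] {intent}"            # fusion data (delimiter assumed)
second_features = encoder(fused)            # second semantic feature data
entity = entity_decoder(second_features)    # entity associated with the intention
```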
It should be understood that the semantic understanding apparatus may perform the steps of the foregoing method embodiments, and that the operations or functions of the units in the semantic understanding apparatus implement the corresponding operations performed by the terminal device in the foregoing method embodiments. For the corresponding beneficial effects, reference may be made to the method embodiments, which are not repeated here for brevity.
The semantic understanding apparatus according to the embodiment of the present application is described above, and possible product forms of the semantic understanding apparatus are described below. It should be understood that any type of product having the functions of the semantic understanding apparatus described above with reference to fig. 10 falls within the scope of the embodiments of the present application. It should be further understood that the following description is only by way of example, and the product form of the semantic understanding apparatus according to the embodiments of the present application is not limited thereto.
As one possible product form, the semantic understanding apparatus according to the embodiments of the present application may be implemented by a general bus architecture.
For convenience of explanation, referring to fig. 11, fig. 11 is another schematic structural diagram of the semantic understanding apparatus provided in the embodiment of the present application. The semantic understanding device can be a terminal device, or a chip in the terminal device, or a server, or a chip in the server, etc. Fig. 11 shows only the main components of the semantic understanding apparatus. In addition to the processor 1101 and the transceiver 1102, the semantic understanding apparatus may further include a memory 1103 and an input/output device (not shown).
The processor 1101 is mainly used for processing a communication protocol and communication data, controlling the entire semantic understanding apparatus, executing a software program, and processing data of the software program. The memory 1103 is mainly used for storing software programs and data. The transceiver 1102 may include control circuitry and an antenna, with the control circuitry being primarily used for conversion of baseband signals to radio frequency signals and processing of the radio frequency signals. The antenna is mainly used for receiving and transmitting radio frequency signals in the form of electromagnetic waves. Input and output devices, such as touch screens, display screens, keyboards, etc., are used primarily for receiving data input by a user and for outputting data to the user.
When the semantic understanding apparatus is powered on, the processor 1101 may read the software program in the memory 1103, interpret and execute the instructions of the software program, and process the data of the software program. When data needs to be sent wirelessly, the processor 1101 performs baseband processing on the data to be sent, and outputs a baseband signal to the radio frequency circuit, and the radio frequency circuit performs radio frequency processing on the baseband signal and sends the radio frequency signal to the outside in the form of electromagnetic waves through the antenna. When there is data to be sent to the semantic understanding apparatus, the radio frequency circuit receives a radio frequency signal through the antenna, converts the radio frequency signal into a baseband signal, and outputs the baseband signal to the processor 1101, and the processor 1101 converts the baseband signal into data and processes the data.
In another implementation, the radio frequency circuit and the antenna described above may be provided independently of the processor that performs the baseband processing; for example, in a distributed scenario, the radio frequency circuit and the antenna may be arranged remotely, independent of the semantic understanding apparatus.
The processor 1101, the transceiver 1102, and the memory 1103 may be connected by a communication bus.
In one design, the semantic understanding apparatus may be configured to perform the functions of the terminal device in the foregoing method embodiment: processor 1101 may be used to perform steps S102-S104 in fig. 3, and/or to perform steps S202-S205 in fig. 9, and/or to perform other processes for the techniques described herein; the transceiver 1102 may be configured to perform step S101 in fig. 3, and/or to perform step S201 in fig. 9, and/or other processes for the techniques described herein.
In any of the designs described above, a transceiver may be included in the processor 1101 for implementing the receive and transmit functions. The transceiver may be, for example, a transceiver circuit or an interface circuit. The transceiver circuit or interface circuit used to implement the receive and transmit functions may be separate or integrated. The transceiver circuit or interface circuit may be used for reading and writing code/data, or for transmitting or transferring signals.
In any of the above designs, the processor 1101 may store instructions, which may be a computer program that runs on the processor 1101 and causes the semantic understanding apparatus to perform the method described in any of the above method embodiments. The computer program may be built into the processor 1101, in which case the processor 1101 may be implemented by hardware.
In one implementation, the semantic understanding apparatus may include a circuit, which may implement the functions of transmitting or receiving or acquiring in the foregoing method embodiments. The processors and transceivers described herein may be implemented on Integrated Circuits (ICs), analog ICs, Radio Frequency Integrated Circuits (RFICs), mixed signal ICs, Application Specific Integrated Circuits (ASICs), Printed Circuit Boards (PCBs), electronic devices, and the like. The processor and transceiver may also be fabricated using various IC process technologies, such as Complementary Metal Oxide Semiconductor (CMOS), N-type metal oxide semiconductor (NMOS), P-type metal oxide semiconductor (PMOS), Bipolar Junction Transistor (BJT), bipolar CMOS (bicmos), silicon germanium (SiGe), gallium arsenide (GaAs), and the like.
The scope of the semantic understanding apparatus described in the present application is not limited thereto, and the structure of the semantic understanding apparatus may not be limited by fig. 11. The semantic understanding means may be a stand-alone device or may be part of a larger device. For example, the semantic understanding means may be:
(1) a stand-alone integrated circuit IC, or chip, or system-on-chip or subsystem;
(2) a set of one or more ICs, which may optionally also include storage means for storing data, computer programs;
(3) an ASIC, such as a Modem (Modem);
(4) a module that may be embedded within other devices;
(5) receivers, terminals, smart terminals, cellular phones, wireless devices, handsets, mobile units, in-vehicle devices, network devices, cloud devices, artificial intelligence devices, and the like;
(6) others, and so forth.
As a possible product form, the terminal device according to the embodiment of the present application may be implemented by a general-purpose processor.
The general-purpose processor implementing the terminal device includes a processing circuit and an input/output interface that is internally connected to and communicates with the processing circuit.
In one design, the general purpose processor may be configured to perform the functions of the terminal device in the foregoing method embodiments. Specifically, the processing circuit is configured to perform steps S102-S104 in fig. 3, and/or to perform steps S202-S205 in fig. 9, and/or to perform other processes of the techniques described herein; the input-output interface is used to perform step S101 in fig. 3, and/or to perform step S201 in fig. 9, and/or other processes for the techniques described herein.
It should be understood that, the semantic understanding apparatus for various product forms, which has any function of the terminal device in the foregoing method embodiments, may correspondingly implement the steps in the foregoing method embodiments and obtain corresponding technical effects, and for brevity, no further description is provided here.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program code is stored, and when the processor executes the computer program code, the computer program code is configured to perform the method in each step in fig. 3 or fig. 9 in the foregoing embodiments.
The embodiments of the present application also provide a computer program product, which when running on a computer, causes the computer to execute the method of each step in fig. 3 or fig. 9 in the foregoing embodiments.
The embodiment of the present application further provides a semantic understanding apparatus, which may exist in the product form of a chip. The structure of the apparatus includes a processor and an interface circuit, where the processor is configured to communicate with other apparatuses through the interface circuit, so that the apparatus executes the method in each step in fig. 3 or fig. 9 in the foregoing embodiments.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or in software instructions executed by a processor. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in a core network interface device. Of course, the processor and the storage medium may also reside as discrete components in a core network interface device.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer-readable storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
The objects, technical solutions, and advantages of the present application are described in further detail in the above embodiments. It should be understood that the above embodiments are only examples of the present application and are not intended to limit its scope; any modifications, equivalent substitutions, improvements, and the like made on the basis of the technical solutions of the present application shall be included in the scope of the present application.

Claims (31)

1. A method of semantic understanding, the method comprising:
acquiring text data;
acquiring first semantic feature data corresponding to the text data through an encoder, and determining a semantic intention corresponding to the first semantic feature data through a semantic intention decoder to realize intention identification on the text data;
generating fusion data according to the text data and the semantic intention, and acquiring second semantic feature data corresponding to the fusion data based on the encoder;
and determining an entity corresponding to the second semantic feature data based on a semantic entity decoder to realize entity identification associated with the semantic intention in the text data.
2. The method of claim 1, wherein before the obtaining, by the encoder, the first semantic feature data corresponding to the text data, the method further comprises:
determining a first word vector matrix corresponding to the text data;
the encoding the text data through the encoder to obtain first semantic feature data corresponding to the text data includes:
and inputting the first word vector matrix into an encoder, and acquiring a semantic feature vector output by the encoder as first semantic feature data corresponding to the text data.
3. The method of claim 2, wherein determining the first word vector matrix to which the text data corresponds comprises:
performing word splitting processing on the text data to obtain a plurality of characters included in the text data;
acquiring a character word vector, a position word vector and a character type word vector corresponding to each character in the plurality of characters, wherein the position word vector is used for representing the position of the character in the text data;
summing the character word vector, the position word vector and the character type word vector corresponding to each character to obtain a word vector corresponding to each character;
and generating a first word vector matrix corresponding to the text data according to a plurality of word vectors corresponding to the characters.
4. The method according to claim 3, wherein the obtaining a character word vector, a position word vector, and a character type word vector corresponding to each of the plurality of characters comprises:
obtaining a character word vector query table, a position word vector query table and a character type word vector query table, wherein the character word vector query table comprises n character word vectors corresponding to n characters, the position word vector query table comprises m position word vectors corresponding to m positions, the character type word vector query table comprises character type word vectors corresponding to non-filled characters and character type word vectors corresponding to filled characters, and n and m are integers larger than 0;
and acquiring a character word vector corresponding to each character in a plurality of characters from the character word vector query table, acquiring a position word vector corresponding to the position of each character in the text data from the position word vector query table, and acquiring a character type word vector corresponding to the non-filled character from the character type word vector query table as the character type word vector corresponding to each character.
5. The method according to any one of claims 2-4, wherein the encoder comprises j layers of coding units, an input of a first layer of coding units in the j layers of coding units is the first word vector matrix, an input of any layer of coding units after the first layer of coding units is an output of the coding unit in the layer above that layer of coding units, an output of a j-th layer of coding units in the j layers of coding units is the first semantic feature data, and j is an integer greater than 0.
6. The method according to any one of claims 1-5, wherein the generating fused data from the textual data and the semantic intent comprises:
and splicing the text data and the semantic intention to obtain fusion data corresponding to the text data and the semantic intention.
7. The method according to claim 6, wherein before the encoder acquires the second semantic feature data corresponding to the fused data, the method further comprises:
acquiring a word vector corresponding to each character in a plurality of characters forming the fusion data;
generating a second word vector matrix according to a plurality of word vectors corresponding to a plurality of characters forming the fusion data;
the obtaining of the second semantic feature data corresponding to the fusion data based on the encoder includes:
and inputting the second word vector matrix into the encoder, and acquiring the semantic feature vector output by the encoder as second semantic feature data corresponding to the fusion data.
8. The method according to any one of claims 1 to 7, characterized in that the text data is converted based on a recorded voice, the recorded voice being a vehicle control voice; the method further comprises the following steps:
and controlling the vehicle to execute a target function according to the semantic intention and the entity.
9. The method according to any one of claims 1-8, wherein the encoder is trained from a first training sample and a second training sample, the semantic intent decoder is trained from the first training sample, the semantic entity decoder is trained from the second training sample, the first training sample comprises sample text data and an intention category corresponding to the pre-labeled sample text data, the second training sample comprises sample fusion data and an entity corresponding to the pre-labeled sample fusion data, and the sample fusion data comprises the sample text data and the intention category corresponding to the sample text data.
10. The method of claim 9, wherein the encoder, the semantic intent decoder, and the semantic entity decoder are trained by:
acquiring third semantic feature data corresponding to the sample text data through an initial encoder, and predicting semantic intention corresponding to the third semantic feature data through an initial semantic intention decoder;
adjusting the weight parameters of the initial encoder and the initial semantic intention decoder according to the predicted semantic intention and the intention category corresponding to the pre-labeled sample text data so as to train the initial encoder and the initial semantic intention decoder to obtain a first encoder and the semantic intention decoder;
acquiring fourth semantic feature data corresponding to the sample fusion data through the first encoder, and predicting an entity corresponding to the fourth semantic feature data based on an initial semantic entity decoder;
and adjusting the weight parameters of the first encoder and the initial semantic entity decoder according to the entity obtained by prediction and the entity corresponding to the pre-labeled sample fusion data so as to train the first encoder and the initial semantic entity decoder to obtain the encoder and the semantic entity decoder.
11. A method of semantic understanding, the method comprising:
acquiring a first training sample and a second training sample, wherein the first training sample comprises sample text data and an intention category corresponding to the sample text data, the second training sample comprises sample fusion data and a pre-labeled entity corresponding to the sample fusion data, and the sample fusion data comprises the sample text data and the pre-labeled intention category corresponding to the sample text data;
acquiring third semantic feature data corresponding to the sample text data through an initial encoder, and predicting a semantic intention corresponding to the third semantic feature data through an initial semantic intention decoder;
adjusting the weight parameters of the initial encoder and the initial semantic intention decoder according to the predicted semantic intention and the intention category corresponding to the pre-labeled sample text data so as to train the initial encoder and the initial semantic intention decoder to obtain a first encoder and a semantic intention decoder;
acquiring fourth semantic feature data corresponding to the sample fusion data through the first encoder, and predicting an entity corresponding to the fourth semantic feature data based on an initial semantic entity decoder;
and adjusting the weight parameters of the first encoder and the initial semantic entity decoder according to the entity obtained by prediction and the entity corresponding to the pre-labeled sample fusion data so as to train the first encoder and the initial semantic entity decoder to obtain an encoder and a semantic entity decoder.
12. The method of claim 11, wherein adjusting the weighting parameters of the initial encoder and the initial semantic intent decoder according to the predicted semantic intent and the pre-labeled intent class corresponding to the sample text data comprises:
determining a first loss according to the predicted semantic intention and an intention category corresponding to the pre-labeled sample text data;
adjusting the weight parameters of the initial encoder and the initial semantic intent decoder according to the first loss.
13. The method according to claim 11 or 12, wherein the adjusting the weighting parameters of the first encoder and the initial semantic entity decoder according to the entity corresponding to the predicted entity and the pre-labeled sample fusion data to train the first encoder and the initial semantic entity decoder comprises:
determining a second loss according to the entity obtained by prediction and the entity corresponding to the pre-labeled sample fusion data;
and adjusting the weight parameters of the first encoder and the initial semantic entity decoder according to the second loss.
14. The method according to any one of claims 11-13, characterized in that the method comprises:
acquiring text data;
acquiring first semantic feature data corresponding to the text data through the encoder, and determining a semantic intention corresponding to the first semantic feature data through the semantic intention decoder to realize intention identification on the text data;
generating fusion data according to the text data and the semantic intention, and acquiring second semantic feature data corresponding to the fusion data based on the encoder;
and determining an entity corresponding to the second semantic feature data based on the semantic entity decoder to realize entity identification associated with the semantic intent in the text data.
15. A semantic understanding apparatus, characterized in that the apparatus comprises:
a receiving and sending unit for acquiring text data;
the processing unit is used for acquiring first semantic feature data corresponding to the text data through an encoder and determining a semantic intention corresponding to the first semantic feature data through a semantic intention decoder so as to realize intention identification on the text data;
the processing unit is further configured to generate fusion data according to the text data and the semantic intent, and obtain second semantic feature data corresponding to the fusion data based on the encoder;
the processing unit is further configured to determine an entity corresponding to the second semantic feature data based on a semantic entity decoder, so as to realize entity identification associated with the semantic intent in the text data.
16. The apparatus of claim 15, wherein the processing unit is further configured to:
determining a first word vector matrix corresponding to the text data;
and inputting the first word vector matrix into an encoder, and acquiring a semantic feature vector output by the encoder as first semantic feature data corresponding to the text data.
17. The apparatus of claim 16, wherein the processing unit is further configured to:
performing word splitting processing on the text data to obtain a plurality of characters included in the text data;
acquiring a character word vector, a position word vector and a character type word vector corresponding to each character in the plurality of characters, wherein the position word vector is used for representing the position of the character in the text data;
summing the character word vector, the position word vector and the character type word vector corresponding to each character to obtain a word vector corresponding to each character;
and generating a first word vector matrix corresponding to the text data according to a plurality of word vectors corresponding to the characters.
18. The apparatus of claim 17, wherein the processing unit is further configured to:
obtaining a character word vector lookup table, a position word vector lookup table and a character type word vector lookup table, wherein the character word vector lookup table comprises n character word vectors corresponding to n characters, the position word vector lookup table comprises m position word vectors corresponding to m positions, the character type word vector lookup table comprises a character type word vector corresponding to non-padding characters and a character type word vector corresponding to padding characters, and n and m are integers greater than 0;
and acquiring, from the character word vector lookup table, the character word vector corresponding to each character of the plurality of characters, acquiring, from the position word vector lookup table, the position word vector corresponding to the position of each character in the text data, and acquiring, from the character type word vector lookup table, the character type word vector corresponding to non-padding characters as the character type word vector of each character.
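The lookup-and-sum scheme of claims 17 and 18 can be sketched as follows. The table contents, embedding width, and function names are illustrative assumptions, not values specified by the patent.

```python
import numpy as np

DIM = 8  # embedding width (assumed for illustration)

def build_tables(vocab, max_len, rng):
    """Build the three lookup tables described in claim 18."""
    char_table = {c: rng.standard_normal(DIM) for c in vocab}          # n characters
    pos_table = {i: rng.standard_normal(DIM) for i in range(max_len)}  # m positions
    type_table = {"non_pad": rng.standard_normal(DIM),                 # non-padding
                  "pad": rng.standard_normal(DIM)}                     # padding
    return char_table, pos_table, type_table

def first_word_vector_matrix(text, char_table, pos_table, type_table):
    """Sum the character, position, and character-type vectors per character
    (claim 17), then stack the sums into the first word-vector matrix."""
    rows = [char_table[c] + pos_table[i] + type_table["non_pad"]
            for i, c in enumerate(text)]
    return np.stack(rows)
```

A text of length L thus yields an L×DIM matrix, which is what the encoder consumes in claim 16.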
19. The apparatus according to any of claims 16-18, wherein the encoder comprises j layers of coding units, the input of the first layer of coding units is the first word vector matrix, the input of any layer of coding units after the first layer is the output of the coding units of the immediately preceding layer, the output of the j-th layer of coding units is the first semantic feature data, and j is an integer greater than 0.
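The j-layer chaining in claim 19 amounts to the loop below. The coding unit itself is a placeholder (a tanh of a linear map), since the claim does not fix its internals.

```python
import numpy as np

def encode(first_word_vector_matrix, layer_weights):
    """Run j coding units in sequence: the first layer consumes the
    word-vector matrix, each later layer consumes the previous layer's
    output, and the j-th layer's output is the first semantic feature
    data (claim 19)."""
    hidden = first_word_vector_matrix
    for w in layer_weights:           # j layers, j > 0
        hidden = np.tanh(hidden @ w)  # placeholder coding unit
    return hidden
```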
20. The apparatus according to any of claims 15-19, wherein the processing unit is further configured to:
and concatenating the text data and the semantic intention to obtain fusion data corresponding to the text data and the semantic intention.
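The splicing step of claim 20 is a plain concatenation. The separator token below is an assumption; the patent does not name one.

```python
def build_fusion_data(text: str, intent: str, sep: str = "[SEP]") -> str:
    """Concatenate the text data with its predicted semantic intention
    to form the fusion data of claim 20."""
    return f"{text}{sep}{intent}"
```

The resulting string is then fed back through the same encoder to obtain the second semantic feature data (claim 21).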
21. The apparatus of claim 20, wherein the processing unit is further configured to:
acquiring a word vector corresponding to each character in a plurality of characters forming the fusion data;
generating a second word vector matrix according to a plurality of word vectors corresponding to a plurality of characters forming the fusion data;
and inputting the second word vector matrix into the encoder, and acquiring the semantic feature vector output by the encoder as second semantic feature data corresponding to the fusion data.
22. The apparatus according to any one of claims 15 to 21, wherein the text data is converted from recorded speech, and the recorded speech is vehicle control speech; the processing unit is further configured to:
and controlling the vehicle to execute a target function according to the semantic intention and the entity.
23. The apparatus according to any of claims 15-22, wherein the encoder is trained based on a first training sample and a second training sample, the semantic intent decoder is trained based on the first training sample, the semantic entity decoder is trained based on the second training sample, the first training sample comprises sample text data and an intent class corresponding to the pre-labeled sample text data, the second training sample comprises sample fusion data and an entity corresponding to the pre-labeled sample fusion data, and the sample fusion data comprises the sample text data and the intent class corresponding to the sample text data.
24. The apparatus of claim 23, wherein the processing unit is further configured to train the encoder, the semantic intent decoder, and the semantic entity decoder by:
acquiring third semantic feature data corresponding to the sample text data through an initial encoder, and predicting semantic intents corresponding to the third semantic feature data through an initial semantic intention decoder;
adjusting the weight parameters of the initial encoder and the initial semantic intention decoder according to the predicted semantic intention and the intention category corresponding to the pre-labeled sample text data so as to train the initial encoder and the initial semantic intention decoder to obtain a first encoder and the semantic intention decoder;
acquiring fourth semantic feature data corresponding to the sample fusion data through the first encoder, and predicting an entity corresponding to the fourth semantic feature data based on an initial semantic entity decoder;
and adjusting the weight parameters of the first encoder and the initial semantic entity decoder according to the entity obtained by prediction and the entity corresponding to the pre-labeled sample fusion data so as to train the first encoder and the initial semantic entity decoder to obtain the encoder and the semantic entity decoder.
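The two-stage procedure of claim 24 can be outlined as below. The model objects and the update rule are opaque placeholders passed in by the caller; only the staging (encoder plus intent decoder first, then the same encoder plus entity decoder) mirrors the claim.

```python
def train_two_stage(encoder, intent_decoder, entity_decoder,
                    first_samples, second_samples, update):
    """Sketch of claim 24's training order; components are callables."""
    # Stage 1: intent. Adjust encoder and intent-decoder weights from the
    # gap between predicted and pre-labeled intention categories; this
    # yields the "first encoder" and the semantic intention decoder.
    for text, intent_label in first_samples:
        features = encoder(text)            # third semantic feature data
        predicted = intent_decoder(features)
        update(predicted, intent_label)
    # Stage 2: entity. Continue from the first encoder and adjust it
    # together with the entity decoder on fusion-data samples.
    for fusion, entity_label in second_samples:
        features = encoder(fusion)          # fourth semantic feature data
        predicted = entity_decoder(features)
        update(predicted, entity_label)
    return encoder, intent_decoder, entity_decoder
```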
25. A semantic understanding apparatus, characterized in that the apparatus comprises:
a transceiver unit, configured to obtain a first training sample and a second training sample, wherein the first training sample comprises sample text data and an intention category corresponding to the sample text data, the second training sample comprises sample fusion data and a pre-labeled entity corresponding to the sample fusion data, and the sample fusion data comprises the sample text data and the pre-labeled intention category corresponding to the sample text data;
the processing unit is used for acquiring third semantic feature data corresponding to the sample text data through an initial encoder and predicting semantic intents corresponding to the third semantic feature data through an initial semantic intent decoder;
the processing unit is further configured to adjust weight parameters of the initial encoder and the initial semantic intention decoder according to the predicted semantic intention and an intention category corresponding to the pre-labeled sample text data, so as to train the initial encoder and the initial semantic intention decoder, thereby obtaining a first encoder and a semantic intention decoder;
the processing unit is further configured to obtain, by the first encoder, fourth semantic feature data corresponding to the sample fusion data, and predict an entity corresponding to the fourth semantic feature data based on an initial semantic entity decoder;
the processing unit is further configured to adjust weight parameters of the first encoder and the initial semantic entity decoder according to the entity obtained by prediction and an entity corresponding to the pre-labeled sample fusion data, so as to train the first encoder and the initial semantic entity decoder, thereby obtaining an encoder and a semantic entity decoder.
26. The apparatus according to claim 25, wherein the processing unit is specifically configured to:
determining a first loss according to the predicted semantic intention and an intention category corresponding to the pre-labeled sample text data;
adjusting the weight parameters of the initial encoder and the initial semantic intent decoder according to the first loss.
27. The apparatus according to claim 25 or 26, wherein the processing unit is specifically configured to:
determining a second loss according to the entity obtained by prediction and the entity corresponding to the pre-labeled sample fusion data;
and adjusting the weight parameters of the first encoder and the initial semantic entity decoder according to the second loss.
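Claims 26 and 27 each reduce to "compute a loss, then adjust weights against it". A cross-entropy loss with a gradient-descent step is one common instantiation; the patent does not fix either choice, so the functions below are assumptions.

```python
import numpy as np

def cross_entropy_loss(probs, label_index):
    """Loss between a predicted class distribution and a pre-labeled class
    (the 'first loss' / 'second loss' of claims 26 and 27)."""
    return -np.log(probs[label_index] + 1e-12)

def gradient_step(weights, grad, lr=0.01):
    """Adjust weight parameters one step against the loss gradient."""
    return weights - lr * grad
```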
28. The apparatus of any one of claims 25-27,
the transceiver unit is further configured to acquire text data;
the processing unit is further configured to acquire, through an encoder, first semantic feature data corresponding to the text data, and to determine, through a semantic intention decoder, a semantic intention corresponding to the first semantic feature data, so as to realize intention identification on the text data;
the processing unit is further configured to generate fusion data according to the text data and the semantic intent, and obtain second semantic feature data corresponding to the fusion data based on the encoder;
the processing unit is further configured to determine an entity corresponding to the second semantic feature data based on a semantic entity decoder, so as to realize entity identification associated with the semantic intent in the text data.
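Claims 25-28 describe a two-pass inference flow that reuses one encoder. A compact sketch, with every component passed in as a callable and the `[SEP]` separator assumed (the interface is not specified by the patent):

```python
def semantic_understanding(text, encoder, intent_decoder, entity_decoder,
                           sep="[SEP]"):
    """Pass 1: intent from the text; pass 2: entity from the text fused
    with that intent, through the same encoder (claim 28)."""
    intent = intent_decoder(encoder(text))    # first semantic feature data
    fusion = f"{text}{sep}{intent}"           # fusion data
    entity = entity_decoder(encoder(fusion))  # second semantic feature data
    return intent, entity
```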
29. A terminal device, characterized in that the terminal device comprises a processor, a transceiver and a memory;
the processor and the transceiver are coupled to the memory, and the processor is configured to read and execute the instructions in the memory to implement the method of any one of claims 1-10, or the method of any one of claims 11-14.
30. A computer program product comprising instructions which, when run on a terminal, cause the terminal to perform the method of any one of claims 1-10, or the method of any one of claims 11-14.
31. A computer-readable storage medium, in which program instructions are stored which, when executed, cause the method of any of claims 1-10 to be performed, or the method of any of claims 11-14 to be performed.
CN202110221290.2A 2021-02-27 2021-02-27 Semantic understanding method and device Pending CN115062620A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110221290.2A CN115062620A (en) 2021-02-27 2021-02-27 Semantic understanding method and device
PCT/CN2021/131803 WO2022179206A1 (en) 2021-02-27 2021-11-19 Semantic understanding method and apparatus


Publications (1)

Publication Number Publication Date
CN115062620A true CN115062620A (en) 2022-09-16

Family

ID=83047769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110221290.2A Pending CN115062620A (en) 2021-02-27 2021-02-27 Semantic understanding method and device

Country Status (2)

Country Link
CN (1) CN115062620A (en)
WO (1) WO2022179206A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116955649A (en) * 2023-07-21 2023-10-27 重庆赛力斯新能源汽车设计院有限公司 Intention recognition method, device, electronic equipment and storage medium

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US10713821B1 (en) * 2019-06-27 2020-07-14 Amazon Technologies, Inc. Context aware text-to-image synthesis
CN110717026B (en) * 2019-10-08 2023-09-22 腾讯科技(深圳)有限公司 Text information identification method, man-machine conversation method and related devices
CN110827831A (en) * 2019-11-15 2020-02-21 广州洪荒智能科技有限公司 Voice information processing method, device, equipment and medium based on man-machine interaction
CN111626062B (en) * 2020-05-29 2023-05-30 思必驰科技股份有限公司 Text semantic coding method and system

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN115994541A (en) * 2023-03-22 2023-04-21 金蝶软件(中国)有限公司 Interface semantic data generation method, device, computer equipment and storage medium
CN115994541B (en) * 2023-03-22 2023-07-07 金蝶软件(中国)有限公司 Interface semantic data generation method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
WO2022179206A1 (en) 2022-09-01

Similar Documents

Publication Publication Date Title
CN111191016B (en) Multi-round dialogue processing method and device and computing equipment
CN103430232B (en) Speech recognition utilizing device docking context
CN107316643B (en) Voice interaction method and device
WO2022179206A1 (en) Semantic understanding method and apparatus
CN112634876B (en) Speech recognition method, device, storage medium and electronic equipment
JP5835197B2 (en) Information processing system
CN111798834B (en) Method and device for identifying polyphone, readable medium and electronic equipment
CN101996629B (en) Method of recognizing speech
US11967308B2 (en) Language model and electronic device including the same
CN113239178A (en) Intention generation method, server, voice control system and readable storage medium
KR20140112360A (en) Vocabulary integration system and method of vocabulary integration in speech recognition
US9368119B2 (en) Methods and apparatus to convert received graphical and/or textual user commands into voice commands for application control
CN110211562A (en) Speech synthesis method, electronic device, and readable storage medium
CN112883966B (en) Image character recognition method, device, medium and electronic equipment
WO2023272616A1 (en) Text understanding method and system, terminal device, and storage medium
CN114822519A (en) Chinese speech recognition error correction method and device and electronic equipment
CN112883968A (en) Image character recognition method, device, medium and electronic equipment
CN117216212A (en) Dialogue processing method, dialogue model training method, device, equipment and medium
US11862178B2 (en) Electronic device for supporting artificial intelligence agent services to talk to users
CN101645716B (en) Vehicle-borne communication system having voice recognition function and recognition method thereof
CN114765025A (en) Method for generating and recognizing speech recognition model, device, medium and equipment
CN111833865B (en) Man-machine interaction method, terminal and computer readable storage medium
CN114860910A (en) Intelligent dialogue method and system
CN115984868A (en) Text processing method, device, medium and equipment
CN108810839A (en) The system and method fed for handling radio data system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination