CN112632962A - Method and device for realizing natural language understanding in human-computer interaction system


Info

Publication number
CN112632962A
Authority
CN
China
Prior art keywords
target
entry
sentence
training
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion.)
Granted
Application number
CN202011565278.5A
Other languages
Chinese (zh)
Other versions
CN112632962B (en)
Inventor
王宝军
张钊
徐坤
张宇洋
尚利峰
李林琳
Current Assignee (The listed assignees may be inaccurate.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN112632962A publication Critical patent/CN112632962A/en
Application granted granted Critical
Publication of CN112632962B publication Critical patent/CN112632962B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a method and a related device for realizing natural language understanding in the field of artificial intelligence. According to the technical scheme, the entries contained in a sentence input by the user can be queried from a dictionary, disambiguation processing is performed on the entry information, and the intention of the sentence and the related key information are understood according to the entry information obtained through the disambiguation processing. The technical scheme provided by the application can improve natural language understanding performance without additional data labeling, and the system for realizing natural language understanding is simple to maintain, so the user experience can be improved.

Description

Method and device for realizing natural language understanding in human-computer interaction system
The present application claims priority from the Chinese patent application entitled "method and apparatus for achieving natural language understanding in a human-computer interaction system", filed with the Chinese Intellectual Property Office on 20/05/2020 under application No. 202010429245.1, the entire contents of which are incorporated herein by reference.
Technical Field
The application relates to the field of artificial intelligence, in particular to a method and a device for realizing natural language understanding in a human-computer interaction system.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
With the continuous development of artificial intelligence technology, natural-language human-computer interaction systems, which enable human-computer interaction through natural language, are becoming increasingly important. Human-computer interaction through natural language requires a system capable of recognizing the specific meaning of human natural language. Recognizing the specific meaning of human natural language is called Natural Language Understanding (NLU). NLU generally refers to recognizing the user's intention and extracting key information from the user's natural language.
NLU is an indispensable part of products that interact with users, such as smart speakers, smart televisions, smart cars, or smart phones; in other words, it is a key module of a human-computer interaction system.
For example, after a mobile phone user inputs the voice "buy an air ticket to Beijing" to a mobile phone assistant, the NLU module in the mobile phone assistant needs to recognize that the user intends to buy an air ticket and extract the key information "destination: Beijing". In this way, the cell phone assistant may further open a ticket-booking application for the user and complete the flight booking service.
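For illustration only (not part of the claimed subject matter), such an understanding result can be pictured as a small intent-plus-slots structure; all field names below are hypothetical:

```python
# Hypothetical sketch of the understanding result for the utterance
# "buy an air ticket to Beijing"; the field names are illustrative only.
understanding_result = {
    "utterance": "buy an air ticket to Beijing",
    "intent": "book_flight",              # the recognized intention
    "slots": {"destination": "Beijing"},  # the extracted key information
}

# A downstream assistant could dispatch on the intent, e.g. open a
# ticket-booking application with the destination pre-filled.
print(understanding_result["intent"], understanding_result["slots"])
```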
Therefore, how to realize natural language understanding becomes a technical problem to be solved urgently.
With the continuous development of artificial intelligence technology, natural-language human-computer interaction systems, which enable human-computer interaction through natural language, are becoming increasingly important. Interaction between human and machine through natural language requires that the system be able to recognize the specific meaning of human natural language, which is called Natural Language Understanding (NLU). Natural language understanding tasks often need to introduce external dictionary knowledge. Take the common entity recognition task as an example: entities in natural language are highly variable. In the case of a song name, any character or character string may be a song name, a song name may have no definite meaning, or a song name may be very long, so it is difficult for a machine learning method to accurately frame the boundary of the entity in a sentence. In addition, when intention recognition and slot extraction are performed, it is difficult to determine the user's intention if the type of the entity cannot be determined. External dictionary knowledge is therefore required as an aid. Current methods for introducing dictionary knowledge into natural language understanding tasks mainly have two problems. On the one hand, matching by words alone produces a great deal of noise, in which the dictionary knowledge is drowned out. On the other hand, path matching can only be performed after words are pre-segmented, so the performance of the system is limited by the accuracy of the segmentation method; moreover, the model needs to be pre-trained for a given dictionary, so the dictionary cannot be dynamically updated for the model.
Disclosure of Invention
The application provides a method and a related device for realizing natural language understanding in a human-computer interaction system; the method can improve natural language understanding performance and thereby improve the user experience.
In a first aspect, the present application provides a method for implementing natural language understanding in a human-computer interaction system. The method comprises the following steps: acquiring target entry information, wherein the target entry information is used for representing entries contained in a target sentence, and the target sentence is a sentence input to the human-computer interaction system by a user; acquiring target indication information by using an entry disambiguation model and based on the target sentence and the target entry information, wherein the target indication information is used for indicating whether the entry indicated by the target entry information conforms to the semantics of the target sentence; and acquiring an understanding result based on the target sentence, the target entry information and the target indication information by using a natural language understanding model, wherein the understanding result comprises the target intention of the user in inputting the target sentence and key information of the target intention.
In the method, after the information of the entries contained in the target sentence is acquired, the entry information is not directly used to assist the natural language understanding model in understanding the target sentence. Instead, the entry disambiguation model is first used to judge whether each entry conforms to the semantics of the target sentence, that is, to check whether the entry can serve as a real entry of the target sentence, and the judgment result is then used to assist the natural language understanding model in acquiring the intention and key information of the target sentence. This can improve the performance of natural language understanding, which can improve the user experience.
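A minimal sketch of this three-step flow, assuming the dictionary lookup and the two models are available as callables (the function and parameter names are assumptions, not the application's API):

```python
from typing import Callable, List

def understand(sentence: str,
               query_dictionary: Callable[[str], List[str]],
               disambiguate: Callable[[str, str], bool],
               nlu_model: Callable) -> dict:
    # 1. Acquire target entry information, e.g. via dictionary lookup.
    entry_info = query_dictionary(sentence)
    # 2. For each candidate entry, acquire indication information: does the
    #    entry conform to the semantics of the sentence? (a binary judgment)
    indications = [disambiguate(sentence, entry) for entry in entry_info]
    # 3. The NLU model uses the sentence, the entry information and the
    #    indication information to produce the intent and its key information.
    return nlu_model(sentence, entry_info, indications)
```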
Generally, training data needs to be labeled for each model that is trained. However, since the indication information output by the entry disambiguation model in the present application can be inferred from the intention and key information in the training data of the natural language understanding model, the training data of both the entry disambiguation model and the natural language understanding model in the present application may be obtained by labeling only the intention and key information of the sentences, that is, by labeling only the training data of the natural language understanding model. This can save the manual labeling cost of the training data and improve the efficiency of acquiring training data, so the training efficiency of the two models can be improved. Especially in scenarios where the natural language understanding function of the human-computer interaction system needs to be updated, the updating efficiency of the two models can be improved, the performance of the human-computer interaction system can be improved in time, and the user experience is ultimately improved.
In some possible implementation manners, the obtaining of the target entry information includes: querying the entries contained in the target sentence from a target dictionary to obtain the target entry information, wherein the target dictionary comprises at least one entry.
In this implementation, the target entry information is obtained by querying the target dictionary. Therefore, when the entries contained in sentences input by users change, only the entries in the target dictionary need to be updated for the entries in user sentences to be recognized. The recognition rate of entries in user sentences can thus be improved conveniently and rapidly, which improves natural language understanding performance, further improves the performance of the human-computer interaction system, and ultimately improves the user experience.
In some possible implementations, the method is performed by an end-side device, wherein the target dictionary comprises a dictionary on the end-side device.
That is, the target entry information of the target sentence can be obtained by dictionary lookup on the end-side device. Compared with obtaining the target entry information of the target sentence through a cloud-side device, this saves transmission time, so the efficiency of natural language understanding can be improved, the efficiency of human-computer interaction is improved, and the user experience is ultimately improved.
In addition, the end-side device can acquire the target entry information of the target sentence based on the dictionary of the end-side device, so that the target entry information of the target sentence can be acquired under the condition that no cloud-side device exists or the cloud-side device cannot be connected, natural language understanding is realized, the application scene of the natural language understanding can be expanded, namely the application scene of the human-computer interaction system is expanded, and the user experience is improved.
In addition, the target entry information of the target sentence can be inquired on the end-side equipment according to the dictionary, so that the dictionary related to the user privacy can be configured on the end-side equipment, the user privacy can be protected, and the user experience can be improved.
In addition, the target entry information of the target sentence can be inquired on the end-side equipment according to the dictionary, so that the dictionary with high user inquiry frequency or commonly used can be configured on the end-side equipment, and compared with the method for acquiring the target entry information from the cloud-side equipment, the end-side equipment can quickly inquire the target entry information of the target sentence, so that the natural language understanding efficiency is improved, the man-machine interaction efficiency is improved, and the user experience is finally improved.
In some possible implementations, the target dictionary includes a dictionary on a cloud-side device, and the obtaining of the target entry information according to the target dictionary and the target sentence includes: sending the target sentence to the cloud-side device; and receiving the target entry information from the cloud-side device.
That is, the end-side device may acquire the target entry information of the target sentence from the cloud-side device. This saves storage space and computing resources on the end-side device, that is, it reduces the capability required of the end-side device to realize natural language understanding; for example, an end-side device with lower performance can also realize natural language understanding efficiently, so the application scenarios of the human-computer interaction system can be expanded, and the user experience is ultimately improved.
In some possible implementations, the method further includes: and acquiring the candidate intention of the target sentence by using an intention recognition model. Wherein the sending the target statement to the cloud-side device includes: and sending the target sentence to the cloud side equipment under the condition that the dictionary corresponding to the candidate intention is judged to be positioned in the cloud side equipment according to a preset corresponding relation, wherein the corresponding relation is used for indicating whether the intention is positioned in the cloud side equipment.
That is, when it is determined, according to the intention of the target sentence, that the dictionary to be used for querying the target entry information of the target sentence is located in the cloud-side device, the cloud-side device is requested to perform the query. In this way, the number of requests to the cloud-side device for dictionary queries can be flexibly controlled and invalid queries avoided, improving the efficiency of natural language understanding and thus the efficiency of human-computer interaction.
In some examples, whether a dictionary to be used for querying target entry information of the target sentence is located in the cloud-side device or the end-side device may be determined according to the intention of the target sentence, and the candidate entry of the target sentence may be acquired from the cloud-side device only if it is determined that the target dictionary is located in the cloud-side device. In this way, invalid queries can be avoided, thereby improving user experience.
In some possible implementations, the entry disambiguation model is a binary classification model.
In a second aspect, the present application provides a model training method, comprising: acquiring first training data, wherein the first training data comprises training sentences, intentions of the training sentences and key information of the training sentences; obtaining entry information, wherein the entry information is used for representing entries contained in the training sentences; acquiring indicating information according to the first training data and the entry information, wherein the indicating information is used for indicating whether the entry represented by the entry information conforms to the intention and the semantics represented by the key information; and acquiring second training data according to the training sentences, the entry information and the indication information, wherein the second training data comprises the training sentences, the entry information and the indication information, and the second training data is used for training an entry disambiguation model, and the entry disambiguation model is used for judging whether the entries represented by the entry information to be processed conform to the semantics of the sentences to be processed or not based on the sentences to be processed and the entry information to be processed.
In the method, in the process of acquiring the training data of the entry disambiguation model and the natural language understanding model, because the second training data of the entry disambiguation model can be obtained by automatically labeling the first training data, only the first training data needs to be manually labeled. The efficiency of acquiring the training data of the two models can thus be improved, so the training efficiency of the two models can be improved, the performance of the human-computer interaction system can be improved in time, and the user experience is ultimately improved.
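A sketch of such automatic labeling, under the simplifying assumption (consistent with the example given later in the detailed description) that a candidate entry conforms to the labeled semantics exactly when it matches a labeled slot value:

```python
def build_second_training_data(first_training_data, entry_candidates):
    """Derive indication labels for the entry disambiguation model from
    already-labeled intent/slot data. Illustrative sketch only."""
    second_training_data = []
    for sample, entries in zip(first_training_data, entry_candidates):
        slot_values = set(sample["slots"].values())
        for entry in entries:
            indication = 1 if entry in slot_values else 0
            second_training_data.append({
                "sentence": sample["sentence"],
                "entry": entry,
                "indication": indication,  # 1 = conforms to the semantics
            })
    return second_training_data

# Example: the labeled slot "song name: Tomorrow It Will Rain" makes the
# candidate entry "Tomorrow It Will Rain" positive and "tomorrow" negative.
data = build_second_training_data(
    [{"sentence": "I want to hear Tomorrow It Will Rain",
      "slots": {"song name": "Tomorrow It Will Rain"}}],
    [["tomorrow", "Tomorrow It Will Rain"]])
print(data)
```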
In some possible implementations, the method may further include: and obtaining third training data according to the first training data, the entry information and the indication information, wherein the third training data comprises the first training data, the entry information and the indication information, and the third training data is used for training a natural language understanding model in the first aspect or any one implementation manner thereof.
In some possible implementation manners, the obtaining of the entry information includes: querying the entries contained in the sentence from a dictionary to obtain the entry information, wherein the dictionary comprises at least one entry.
In this implementation, since the entry information is obtained by dictionary lookup, the recognition of the entry information of sentences can be updated by updating the dictionary. Because the dictionary can be updated conveniently and quickly, the recognition rate of the entry information of sentences can be improved conveniently and quickly, so the accuracy of natural language understanding of sentences can be improved conveniently and quickly, the accuracy of the human-computer interaction system can be improved, and the user experience is ultimately improved.
In a third aspect, the present application provides a model training method. The method comprises the following steps: acquiring second training data, wherein the second training data comprises training sentences, entry information and indicating information, the entry information is used for indicating entries contained in the training sentences, and the indicating information is used for indicating whether the entries indicated by the entry information meet the intention of the training sentences and the semantics indicated by the key information of the intention; and training an entry disambiguation model according to the second training data, wherein the entry disambiguation model is used for judging whether the entry represented by the entry information to be processed conforms to the semantics of the sentence to be processed or not based on the sentence to be processed and the entry information to be processed.
In some implementations, the second training data is second training data obtained using the second aspect or any one of the implementations.
In some implementations, the training results in the term disambiguation model of the first aspect or any one of its implementations.
Because the training data obtained in the second aspect or any one of its implementations is used, the training efficiency of the entry disambiguation model can be improved, so the performance of the human-computer interaction system can be effectively improved and the user experience improved.
In a fourth aspect, the present application provides a model training method. The method comprises the following steps: acquiring third training data, wherein the third training data comprises the training sentences, the intentions, the key information, the entry information and the indication information; and training a natural language understanding model according to the third training data, wherein the natural language understanding model is used for acquiring the intention of the sentence to be understood input by the user and key information of the intention based on the sentence to be understood, the first auxiliary information and the second auxiliary information, the first auxiliary information is used for representing entries contained in the sentence to be understood, and the second auxiliary information is used for indicating whether the entries represented by the first auxiliary information conform to the semantics of the sentence to be understood.
In some implementations, the third training data is third training data obtained using the second aspect or any one of the implementations.
In some implementations, the training results in a natural language understanding model of the first aspect or any one of its implementations.
Because the training data obtained in the second aspect or any one of its implementations is used, the training efficiency of the natural language understanding model can be improved, so the performance of the human-computer interaction system can be effectively improved and the user experience improved.
In a fifth aspect, there is provided an apparatus for implementing natural language understanding in a human-computer interaction system, the apparatus comprising means for performing the method of the first aspect or any implementation manner thereof.
In a sixth aspect, there is provided a model training apparatus comprising means for performing the method of the second aspect or any one of its implementations.
In a seventh aspect, a model training apparatus is provided, which includes means for performing the method of the third aspect or any one of the implementations.
In an eighth aspect, there is provided a model training apparatus comprising means for performing the method of the fourth aspect or any one of its implementations.
In a ninth aspect, there is provided an apparatus for implementing natural language understanding in a human-computer interaction system, the apparatus comprising: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to perform the method of the first aspect or any one of the implementations when the program stored in the memory is executed.
In a tenth aspect, there is provided an apparatus for acquiring training data, the apparatus comprising: a memory for storing a program; a processor for executing the memory-stored program, the processor being configured to perform the method of the second aspect or any one of the implementations when the memory-stored program is executed.
In an eleventh aspect, there is provided an apparatus for training a model, the apparatus comprising: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to perform the method of the third aspect or any one of the implementations when the program stored in the memory is executed.
In a twelfth aspect, there is provided an apparatus for training a model, the apparatus comprising: a memory for storing a program; a processor for executing the program stored in the memory, and when the program stored in the memory is executed, the processor is configured to perform the method of the fourth aspect or any one of the implementations.
In a thirteenth aspect, a computer readable medium is provided, which stores program code for execution by a device, the program code being for performing the method of the first aspect, the second aspect, the third aspect, the fourth aspect, or any implementation thereof.
In a fourteenth aspect, a computer program product containing instructions is provided, which when run on a computer causes the computer to perform the method of the first, second, third, fourth aspect or any implementation thereof.
In a fifteenth aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to perform the method in the first aspect, the second aspect, the third aspect, the fourth aspect, or any implementation manner thereof.
Optionally, as an implementation manner, the chip may further include a memory, where instructions are stored in the memory, and the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the method in the first aspect, the second aspect, the third aspect, or the fourth aspect, or any one of the implementation manners.
In a sixteenth aspect, there is provided a computing device comprising: a memory for storing a program; a processor configured to execute the memory-stored program, and when the memory-stored program is executed, the processor is configured to perform the method of the first aspect, the second aspect, the third aspect, the fourth aspect, or any implementation manner thereof.
In a seventeenth aspect, a method for implementing natural language understanding in a human-computer interaction system is provided, the method comprising: acquiring a plurality of target entries matched with target sentences in one or more dictionaries, wherein the target sentences are sentences input to a human-computer interaction system by a user; acquiring a plurality of sequences of a target statement according to the target entries, wherein each sequence in the sequences corresponds to one target entry; acquiring a plurality of first sequence representations corresponding to the plurality of sequences according to the plurality of sequences and the target sentence; determining whether each target entry of the plurality of target entries conforms to the semantics of the target sentence according to the plurality of first sequence representations; and performing natural language understanding processing on the target sentence according to one or more first sequence representations corresponding to the target entries conforming to the semantics of the target sentence to obtain a processing result.
According to the method for realizing natural language understanding in the human-computer interaction system of the present application, dictionary knowledge is embedded, so natural language understanding processing can be better realized. The natural language is first matched against the dictionary and converted into sequences, and the sequence representation of each sequence is then obtained from the sequence. Therefore, after the dictionary is updated or expanded, the neural network model does not need to be updated, which improves the generalization capability of the model.
With reference to the seventeenth aspect, in some implementations of the seventeenth aspect, each of the plurality of sequences includes type information of one target entry and position information of the target entry in the target sentence.
The type information is used for indicating the type of the matching entity, and the position information is used for indicating the position of the matching entity in the sequence, so that the subsequent natural language understanding processing is facilitated.
With reference to the seventeenth aspect, in some implementations of the seventeenth aspect, obtaining a plurality of first sequence representations corresponding to a plurality of sequences from the plurality of sequences and the target sentence includes: obtaining a low-dimensional representation of each of a plurality of sequences; obtaining a context representation of a target statement; the low-dimensional representation of each sequence in the plurality of sequences and the context representation of the target sentence are fused to obtain a plurality of first sequence representations corresponding to the plurality of sequences.
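A minimal NumPy sketch of one possible fusion step; the patent text does not fix a particular fusion operator, so concatenation followed by a learned projection is assumed here:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_seq, d_ctx, d_out = 6, 32, 64, 64

low_dim = rng.normal(size=(seq_len, d_seq))  # low-dimensional sequence representation
context = rng.normal(size=(seq_len, d_ctx))  # context representation of the target sentence

# Assumed fusion: concatenate per position, then project (weights would be learned).
W = rng.normal(size=(d_seq + d_ctx, d_out))
first_sequence_repr = np.tanh(np.concatenate([low_dim, context], axis=-1) @ W)
print(first_sequence_repr.shape)  # (6, 64): one fused first sequence representation
```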
With reference to the seventeenth aspect, in some implementations of the seventeenth aspect, determining whether each target entry of the plurality of target entries conforms to semantics of the target sentence according to the plurality of first sequence representations includes: acquiring attention of a second sequence representation to other first sequence representations except the second sequence representation in the plurality of first sequence representations, wherein the second sequence representation is one of the plurality of first sequence representations; and determining whether the target entry corresponding to the second sequence representation conforms to the semantics of the target sentence or not according to the second sequence representation and the attention.
Since there is no restriction on the grammatical structure of the sentence input by the user, most of the obtained target entries are matching results that do not conform to the semantics of the target sentence, and accordingly most of the obtained first sequence representations do not conform to the semantics of the target sentence either, so disambiguation processing must be performed on the first sequence representations. When disambiguation processing is carried out, not only the relation between a matching entity and its context within one sequence is considered, but also the relations between different sequences are considered, so the disambiguation result can be more accurate.
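As an illustrative sketch of this inter-sequence attention (the pooling and scoring choices below are assumptions, not the application's exact design):

```python
import numpy as np

def disambiguation_probability(second_repr, other_reprs, w):
    """Attend from one candidate's sequence representation over the other
    candidates' representations, then score the pair. Sketch only."""
    query = second_repr.mean(axis=0)                    # pool the candidate
    others = np.stack([r.mean(axis=0) for r in other_reprs])
    att = np.exp(others @ query)
    att = att / att.sum()                               # attention over the other sequences
    summary = att @ others                              # attended inter-sequence context
    logit = np.concatenate([query, summary]) @ w        # w would be learned
    return 1.0 / (1.0 + np.exp(-logit))                 # P(entry fits the sentence)

rng = np.random.default_rng(1)
first_sequence_reprs = [rng.normal(size=(6, 64)) for _ in range(3)]
w = rng.normal(size=128)
print(disambiguation_probability(first_sequence_reprs[0], first_sequence_reprs[1:], w))
```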
With reference to the seventeenth aspect, in some implementations of the seventeenth aspect, performing natural language understanding processing on the target sentence according to one or more first sequence representations corresponding to target entries that conform to the semantics of the target sentence includes: obtaining a context representation of the target sentence; fusing each of the one or more first sequence representations corresponding to target entries that conform to the semantics of the target sentence with the context representation of the target sentence to obtain one or more third sequence representations; and performing natural language understanding processing on the target sentence according to the one or more third sequence representations.
In an eighteenth aspect, a method for training a neural network model, the neural network model including a first sub-network model, a second sub-network model, and a third sub-network model, the method comprising: acquiring first training data, wherein the first training data comprises training sentences and a plurality of first sequence representations matched with a target dictionary; training the first sub-network model according to the first training data to obtain a trained first sub-network model; acquiring second training data, wherein the second training data comprises an output result of the trained first sub-network model and a first sequence representation meeting preset requirements in the plurality of first sequence representations; training the second sub-network model according to the second training data to obtain a trained second sub-network model; acquiring third training data, wherein the third training data comprises an output result of the trained second sub-network model and a processing result of natural language understanding processing of the training sentences; and training the third sub-network model according to the third training data to obtain the trained third sub-network model.
The training of the neural network model in the embodiment of the application is an end-to-end training method, with a simple process and a high training speed.
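A schematic of the staged data flow described above, with `fit` standing in for an ordinary supervised training loop (all names are placeholders, not the application's API):

```python
def train_three_stage(matcher, disambiguator, nlu, fit,
                      sentences, stage1_labels, stage2_labels, stage3_labels):
    """Illustrative staged training of the three sub-network models:
    each later stage consumes the outputs of the trained earlier stage."""
    fit(matcher, sentences, stage1_labels)          # 1st: entry matching model
    stage1_out = [matcher(s) for s in sentences]
    fit(disambiguator, stage1_out, stage2_labels)   # 2nd: entry disambiguation model
    stage2_out = [disambiguator(o) for o in stage1_out]
    fit(nlu, stage2_out, stage3_labels)             # 3rd: natural language understanding model
    return matcher, disambiguator, nlu
```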
With reference to the eighteenth aspect, in some implementations of the eighteenth aspect, the first sub-network model is an entry matching model, and is configured to obtain, according to a target sentence and a target dictionary input by a user, a plurality of sequence representations matched with the target dictionary, where each sequence representation in the plurality of sequence representations corresponds to one target entry, and the target entry is an entry matched with the target sentence and the target dictionary; the second sub-network model is a term disambiguation model and is used for determining whether a target term corresponding to each sequence representation in the sequence representations conforms to the semantics of the target sentence according to the target sentence and the sequence representations; and the third sub-network model is a natural language understanding model and is used for performing natural language understanding processing on the target statement according to the sequence representation corresponding to the target entry conforming to the target statement semantics.
With reference to the eighteenth aspect, in some implementations of the eighteenth aspect, each of the plurality of first sequence representations corresponds to one first target entry, the first target entry is an entry matched with the target dictionary for the training sentence, and the first sequence representation meeting the preset requirement is a sequence representation corresponding to the first target entry conforming to the semantic meaning of the training sentence.
In a nineteenth aspect, an apparatus for implementing natural language understanding in a human-computer interaction system is provided, including: the acquisition module is used for acquiring a plurality of target entries matched with target sentences in one or more dictionaries, wherein the target sentences are sentences input to the human-computer interaction system by a user; the processing module is used for acquiring a plurality of sequences of the target statement according to the target entries, wherein each sequence in the sequences corresponds to one target entry; the processing module is further used for acquiring a plurality of first sequence representations corresponding to the sequences according to the sequences and the target statement; the processing module is further used for determining whether each target entry in the plurality of target entries conforms to the target sentence semantics according to the plurality of first sequence representations; and the processing module is also used for performing natural language understanding processing on the target statement according to the first sequence representation corresponding to one or more target entries conforming to the semantics of the target statement so as to obtain a processing result.
With reference to the nineteenth aspect, in some implementations of the nineteenth aspect, each sequence of the plurality of sequences contains type information of one target entry and position information of the target entry in the target sentence.
With reference to the nineteenth aspect, in some implementations of the nineteenth aspect, the acquiring, by the processing module, a plurality of first sequence representations corresponding to the plurality of sequences according to the plurality of sequences and the target sentence includes: obtaining a low-dimensional representation of each of a plurality of sequences; obtaining a context representation of a target statement; the low-dimensional representation of each sequence in the plurality of sequences and the context representation of the target sentence are fused to obtain a plurality of first sequence representations corresponding to the plurality of sequences.
With reference to the nineteenth aspect, in some implementations of the nineteenth aspect, the determining, by the processing module, whether each target entry of the plurality of target entries conforms to the target sentence semantic according to the plurality of first sequence representations includes: acquiring attention of a second sequence representation to other first sequence representations except the second sequence representation in the plurality of first sequence representations, wherein the second sequence representation is one of the plurality of first sequence representations; and determining whether the target entry corresponding to the second sequence representation conforms to the target sentence semantics according to the second sequence representation and the attention.
With reference to the nineteenth aspect, in some implementations of the nineteenth aspect, the performing, by the processing module, natural language understanding processing on the target sentence according to the first sequence representation corresponding to the one or more target terms conforming to the target sentence semantics includes: obtaining a context representation of a target statement; fusing a first sequence representation corresponding to each target entry conforming to the target sentence semantics in a first sequence representation corresponding to one or more target entries conforming to the target sentence semantics with the context representation of the target sentence to obtain one or more third sequence representations corresponding to the one or more target entries conforming to the target sentence semantics; and performing natural language understanding processing on the target sentence according to one or more third sequence representations.
In a twentieth aspect, there is provided a training apparatus of a neural network model including a first sub-network model, a second sub-network model, and a third sub-network model, the apparatus including: the acquisition module is used for acquiring first training data, wherein the first training data comprises training sentences and a plurality of first sequence representations matched with the training sentences and the target dictionary; the training module is used for training the first sub-network model according to the first training data to obtain a trained first sub-network model; the obtaining module is further configured to obtain second training data, where the second training data includes an output result of the trained first sub-network model and a first sequence representation that meets a preset requirement in the plurality of first sequence representations; the training module is further used for training the second sub-network model according to the second training data to obtain a trained second sub-network model; the acquisition module is further used for acquiring third training data, wherein the third training data comprise an output result of the trained second sub-network model and a processing result of natural language understanding processing of the training sentences; the training module is further configured to train the third sub-network model according to the third training data to obtain a trained third sub-network model.
With reference to the twentieth aspect, in some implementations of the twentieth aspect, the first sub-network model is an entry matching model, and is configured to obtain, according to a target sentence and a target dictionary input by a user, a plurality of sequence representations matching the target dictionary, where each sequence representation in the plurality of sequence representations corresponds to a target entry, and the target entry is an entry matching the target sentence and the target dictionary; the second sub-network model is a term disambiguation model and is used for determining whether a target term corresponding to each sequence representation in the sequence representations conforms to the semantics of the target sentence according to the target sentence and the sequence representations; and the third sub-network model is a natural language understanding model and is used for performing natural language understanding processing on the target statement according to the sequence representation corresponding to the target entry conforming to the target statement semantics.
With reference to the twentieth aspect, in some implementations of the twentieth aspect, each of the plurality of first sequence representations corresponds to a first target entry, the first target entry is an entry matched with the target dictionary for the training sentence, and the first sequence representation meeting the preset requirement is a sequence representation corresponding to the first target entry conforming to the semantic meaning of the training sentence.
In a twenty-first aspect, an apparatus for implementing natural language understanding in a human-computer interaction system is provided, which includes: a processor coupled with the memory; the memory is used for storing instructions; the processor is configured to execute the instructions stored in the memory to cause the apparatus to implement the method of any one of the implementations of the seventeenth aspect described above.
In a twenty-second aspect, there is provided a model training apparatus, comprising: a processor coupled with the memory; the memory is used for storing instructions; the processor is configured to execute instructions stored in the memory to cause the apparatus to implement the method of the eighteenth aspect.
A twenty-third aspect provides a computer readable medium storing program code for execution by a device, the program code for performing the method of any one of the implementations of the seventeenth and eighteenth aspects.
A twenty-fourth aspect provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the seventeenth and eighteenth aspects described above.
In a twenty-fifth aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to perform the methods in the seventeenth and eighteenth aspects.
Optionally, as an implementation manner, the chip may further include a memory, the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, and when the instructions are executed, the processor is configured to execute the methods in the seventeenth aspect and the eighteenth aspect.
In a twenty-sixth aspect, there is provided a computing device comprising: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to perform the methods of the seventeenth and eighteenth aspects described above when the program stored in the memory is executed.
Drawings
FIG. 1 is a schematic flow chart diagram of a method of acquiring training data according to one embodiment of the present application;
FIG. 2 is a schematic flow chart diagram of a method of training a model according to one embodiment of the present application;
FIG. 3 is a schematic structural diagram of an entry disambiguation model according to an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram of a method of training a model according to another embodiment of the present application;
FIG. 5 is a schematic flow chart diagram of a method of implementing natural language understanding of one embodiment of the present application;
FIG. 6 is a schematic flow chart diagram of a method of implementing natural language understanding of another embodiment of the present application;
FIG. 7 is a schematic diagram of a system architecture of one embodiment of the present application;
FIG. 8 is a schematic deployment diagram of an apparatus for training a model according to an embodiment of the present application;
FIG. 9 is an exemplary block diagram of a computing device according to one embodiment of the present application;
FIG. 10 is a schematic diagram of a system architecture of another embodiment of the present application;
FIG. 11 is a dictionary deployment diagram for one embodiment of the present application;
FIG. 12 is a schematic block diagram of an apparatus for acquiring training data according to an embodiment of the present application;
FIG. 13 is a schematic block diagram of an apparatus for training a model according to an embodiment of the present application;
FIG. 14 is a schematic block diagram of an apparatus for implementing natural language understanding according to an embodiment of the present application;
FIG. 15 is a schematic block diagram of an apparatus according to an embodiment of the present application;
FIG. 16 is a schematic block diagram of a computer program product of one embodiment of the present application;
FIG. 17 is a schematic diagram of a system architecture involved in the method for implementing natural language understanding in the human-computer interaction system of the present application;
FIG. 18 is a block diagram of the human-computer interaction system involved in the method for implementing natural language understanding of the present application;
FIG. 19 is a schematic flow chart diagram of a method for implementing natural language understanding in the human-computer interaction system of the present application;
FIG. 20 is a schematic flow chart diagram of a method of training a neural network model of the present application;
FIG. 21 is a schematic diagram of the fusion of a low-dimensional representation with a contextual representation of the present application;
FIG. 22 is a schematic illustration of sequence disambiguation of the present application;
FIG. 23 is a schematic illustration of natural language processing of disambiguation results of the present application;
FIG. 24 is a schematic block diagram of an apparatus for implementing natural language understanding in a human-computer interaction system of the present application;
FIG. 25 is a schematic block diagram of a training apparatus of the neural network model of the present application;
FIG. 26 is a hardware configuration diagram of an apparatus of the present application.
Detailed Description
To facilitate understanding of the embodiments of the present application, some terms or concepts used in the embodiments of the present application will be described below.
(1) Entry disambiguation model
The input of the entry disambiguation model comprises a sentence and entry information, the entry information is used for representing entries contained in the sentence, and the entry disambiguation model is used for judging whether the entries represented by the entry information accord with the semantics of the sentence or not.
The entry information may indicate one or more entries contained in the sentence, and each entry may be a reasonable entry or an unreasonable entry in the sentence. A "reasonable" entry here refers to an entry that conforms to the semantics of the sentence; conversely, an entry that does not conform is an unreasonable entry.
For example, for the sentence "I want to hear Tomorrow It Will Rain" (where "Tomorrow It Will Rain" is a song title), the entries in the sentence may include "tomorrow" and "Tomorrow It Will Rain", but only "Tomorrow It Will Rain" conforms to the semantics, i.e., is a reasonable entry, while "tomorrow" does not conform to the semantics, i.e., is an unreasonable entry.
The entry information may be the entry itself or the position of the entry in the sentence. For example, for the sentence "I want to hear Tomorrow It Will Rain", with character positions in the sentence indexed from 0, the entry information may be "3-8", that is, the content from index 3 to index 8 in the sentence is the entry.
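A small illustration of this inclusive, 0-based span convention (the English sentence and span below are made up for demonstration):

```python
# Entry information given as an inclusive, 0-based character span (illustrative).
sentence = "play it will rain tomorrow"
span = (5, 25)  # hypothetical span covering the song title
entry = sentence[span[0]:span[1] + 1]  # Python slice ends are exclusive, hence +1
print(entry)  # "it will rain tomorrow"
```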
(2) Natural language understanding model
The input of the natural language understanding model comprises a sentence, first auxiliary information and second auxiliary information, wherein the first auxiliary information is used for representing entries contained in the sentence, the second auxiliary information is used for indicating whether the entries represented by the first auxiliary information conform to the semantics of the sentence, and the natural language understanding model is used for acquiring the intention of the sentence input by a user and key information of the intention based on the sentence, the first auxiliary information and the second auxiliary information.
The first auxiliary information may be understood as the above-mentioned term information, and the second auxiliary information may be understood as information output by the above-mentioned term disambiguation model based on the sentence and the term information.
Here, the intention refers to the purpose that the user wants to express through the input sentence. For example, if the user enters the sentence "I want to listen to Tomorrow It Will Rain", the user's intention is to listen to a song. In the embodiment of the present application, the intention of the user in inputting a sentence is also referred to as the intention of the sentence.
Wherein, the key information can also be understood as the slot position corresponding to the intention. For example, when the user intends to listen to a song, the name of the song is the slot.
The entries and slots in the embodiments of the present application are substantially different. A slot is determined according to the corresponding intent. For example, the entry "Beijing" may serve as the slot "destination: Beijing" in some air ticket booking intents, and may serve as the slot "starting point: Beijing" in other air ticket booking intents.
(3) Dictionary for storing dictionary data
Refers to a collection of words; in the embodiments of the present application, a collection of words having common attributes, collected or sorted for a particular use, such as a place name dictionary, a person name dictionary, or a song dictionary. In some cases, it may also be extended to any set of words, such as a word segmentation dictionary.
In the embodiment of the present application, the dictionary may also be referred to as a dictionary or a dictionary. The dictionary of the embodiment of the application contains one or more entries.
An entry in the present application is also called a word or a term, and may be a single character, a word, or a combination of characters and words.
Entries in the present application may include concrete things, well-known figures, abstract concepts, literary works, hot events, combinations of Chinese words, specific topics, and the like. A concrete thing may also be understood as an entity.
(4) End-side device
The end-side device in the embodiment of the present application may be a mobile phone with computing capability, a tablet personal computer (TPC), a media player, a smart home device, a laptop computer (LC), a personal digital assistant (PDA), a personal computer (PC), a camera, a camcorder, a smart watch, a wearable device (WD), a vehicle, or the like. It can be understood that the specific form of the end-side device is not limited in the embodiments of the present application.
(5) Sequence labeling
One of the common tasks of natural language processing: given a sequence, mark or label each element in the sequence. Word segmentation tasks and entity recognition tasks are examples.
(6) Sequence modeling
Representing a sequence as one or more low-dimensional dense vectors.
(7) Groove position
The key information corresponding to the intention in the user's sentence. For example, if the intention is to book an air ticket, the corresponding slots may be the takeoff time, the landing time, and the like.
(8) Language model
A model of the probability distribution of sentences, which can be used to judge whether a sentence is a natural language sequence.
(9) Pre-trained language model
A language model that can be directly fine-tuned for downstream tasks, greatly shortening the training time of those tasks.
(10) Word representation (embedding)
Word representation in natural language processing refers to the representation of characters or words as low-dimensional vectors, by which the relationships between words can be computed; these low-dimensional vectors have the property that words whose vectors are close in distance have close meanings.
(11) Self-attentive mechanism (self-attention)
The attention mechanism mimics the internal process of biological observation behavior: a mechanism that aligns internal experience with external perception to increase the granularity of observation of a partial region. The attention mechanism can quickly extract important features from sparse data and is therefore widely applied to natural language processing tasks, particularly machine translation. The self-attention mechanism is an improvement on the attention mechanism that reduces reliance on external information and is better at capturing internal correlations of data or features.
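For reference, a minimal scaled dot-product self-attention computation in NumPy (the standard formulation, not something specific to this application):

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Standard scaled dot-product self-attention over a token sequence x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])       # pairwise relevance of tokens
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v                            # context-mixed representations

x = np.random.default_rng(2).normal(size=(5, 16))  # 5 tokens, dimension 16
w = np.eye(16)                                     # toy projection weights
print(self_attention(x, w, w, w).shape)            # (5, 16)
```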
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow chart of a method of acquiring training data according to an embodiment of the present application. As shown in fig. 1, the method may include S110 to S150.
S110, obtaining first training data, wherein the first training data comprises training sentences, intentions of the training sentences and key information of the intentions.
Wherein the intention and the key information may be label data of the training sentences, i.e. data that plays a supervising role in the training process. The intent and key information may be manually labeled according to the semantics of the training sentence.
For example, for a training sentence "buy a ticket to beijing", the intention of the training sentence "buy the ticket" can be manually labeled, and the key information (or slot) manually labeling the intention of the training sentence is "destination: beijing ″.
And S120, obtaining entry information, wherein the entry information is used for representing entries contained in the training sentence.
For example, suppose the training sentence is "I want to hear Tomorrow Will Rain." One piece of entry information may represent an entry covering positions 1 to 2 of the sentence, another an entry covering positions 6 to 8, and another an entry covering positions 3 to 8.
In some examples, the entry information of the training sentence may be obtained through a dictionary query. In this example, the method may include: acquiring a dictionary, and acquiring the entry information of the training sentence according to the dictionary and the training sentence. Generally, the entry information may include the entries that appear in both the dictionary and the training sentence.
For example, dictionary matching may be performed on a training sentence input by the user through voice or text using a fast query method, yielding a matching vector. The matching vector may include a plurality of matching results; longest-match processing may be performed on matching results that overlap with each other, finally obtaining matching results that do not contain one another. An example of a fast query method is the word-lookup tree (TRIE) method.
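A minimal sketch of such a dictionary match with a word-lookup tree is given below, assuming a simple dict-of-dicts trie; the entry set and the 1-based inclusive position convention are illustrative choices, not requirements of this application.

```python
END = "$"  # marks that a dictionary entry ends at this trie node

def build_trie(entries):
    """entries: iterable of (term, term_type) pairs, e.g. ("Beijing", "place")."""
    root = {}
    for term, term_type in entries:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[END] = term_type
    return root

def match_all(sentence, root):
    """Return all (start, end, type) matches, 1-based inclusive positions."""
    matches = []
    for i in range(len(sentence)):
        node = root
        for j in range(i, len(sentence)):
            if sentence[j] not in node:
                break
            node = node[sentence[j]]
            if END in node:
                matches.append((i + 1, j + 1, node[END]))
    return matches
```

The overlapping (start, end, type) triples returned by `match_all` form the matching vector, which can then be pruned by the longest-match post-processing described above.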
It is to be understood that obtaining the entry information of the training sentence through the dictionary query is only an example, and the embodiment does not limit the implementation manner of obtaining the entry information of the training sentence. For example, entry information in a training sentence may be acquired using an entry recognition model obtained by machine learning.
Compared with obtaining the entry information of the training sentence through preset rules or through a trained model, obtaining it through dictionary query makes it convenient and fast to update the entries in the dictionary when the set of recognizable entries needs to be updated, whereas the other approaches require rewriting the rules or retraining the model. Obtaining the entry information through dictionary query can therefore improve the accuracy of the entry information more efficiently and conveniently, which improves the accuracy of the training data and, in turn, the performance of the trained model. Ultimately, the performance of a human-computer interaction system applying the model can be improved, improving user experience.
And S130, acquiring indication information according to the first training data and the entry information, wherein the indication information is used for indicating whether the entry represented by the entry information conforms to the intention and the semantics represented by the key information.
In an example, if the entry indicated by the entry information is the same as the entry corresponding to the key information in the first training data, the entry represented by the entry information may be considered to conform to the intention and the semantics represented by the key information; otherwise, it may be considered not to conform.
For example, suppose the training sentence included in the first training data is "I want to hear Tomorrow Will Rain.", the intention is "listen to song", and the key information is "song name: Tomorrow Will Rain". When the entry represented by the entry information is the song name "Tomorrow Will Rain", the corresponding indication information indicates that the entry conforms to the intention and the semantics represented by the key information, or conforms to the semantics of the training sentence.
As another example, with the same training sentence, intention, and key information, if the entry represented by the entry information is the movie name "Tomorrow Will Rain", the corresponding indication information indicates that the entry does not conform to the intention and the semantics represented by the key information, or does not conform to the semantics of the training sentence.
S140, second training data is obtained according to the training sentences, the entry information and the indication information, the second training data comprises the training sentences, the entry information and the indication information, and the second training data is used for training an entry disambiguation model.
One implementation manner of obtaining the second training data according to the training sentence, the entry information, and the indication information includes: and combining the training sentences, the entry information and the indication information into second training data, wherein the indication information is used as the labeling data of the training sentences and the entry information.
In this embodiment, a second training datum may be referred to as a corpus.
For example, when the training sentence is "I want to hear Tomorrow Will Rain.", the query results shown in Table 1 can be obtained through dictionary query.
TABLE 1 Dictionary query results

Training sentence | Query result
I want to hear Tomorrow Will Rain. | Song: 1-2
I want to hear Tomorrow Will Rain. | Song: 6-8
I want to hear Tomorrow Will Rain. | Song: 3-8
I want to hear Tomorrow Will Rain. | Movie: 3-8
The second training data can be obtained from the first training data (training sentence "I want to hear Tomorrow Will Rain.", intention "listen to song", key information "song name: Tomorrow Will Rain") and the contents of Table 1 above. An example of the content included in the second training data is shown in Table 2. In Table 2, "0" indicates not conforming to the semantics of the training sentence, and "1" indicates conforming to the semantics of the training sentence.
TABLE 2 Training corpus

Training sentence | Query result | Indication information
I want to hear Tomorrow Will Rain. | Song: 1-2 | 0
I want to hear Tomorrow Will Rain. | Song: 6-8 | 0
I want to hear Tomorrow Will Rain. | Song: 3-8 | 1
I want to hear Tomorrow Will Rain. | Movie: 3-8 | 0
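A sketch of how the rows of Table 2 could be assembled from Table 1 and the first training data, following the rule in S130 (an entry is labeled 1 only when it coincides with the labeled key information); the tuple layout is an assumption made for illustration.

```python
sentence = "I want to hear Tomorrow Will Rain."
gold_entry = ("Song", 3, 8)   # key information: song name at positions 3-8

query_results = [("Song", 1, 2), ("Song", 6, 8), ("Song", 3, 8), ("Movie", 3, 8)]

# Each corpus item pairs the sentence and one query result with its
# indication information (1 = conforms to the sentence semantics, 0 = not).
second_training_data = [
    (sentence, entry, int(entry == gold_entry)) for entry in query_results
]
```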
S150, third training data is obtained according to the first training data, the entry information and the indication information, the third training data comprises the first training data, the entry information and the indication information, and the third training data is used for training a natural language understanding model.
One implementation manner of obtaining third training data according to the first training data, the entry information, and the indication information includes: and combining the first training data, the entry information and the indication information into third training data, wherein intention and key information are used as label data of the first training data, the entry information and the indication information.
In some examples, the method shown in fig. 1 may be performed by a cloud-side device, though it may of course also be performed by an end-side device. Performing the method on a cloud-side device is more efficient than performing it on an end-side device, i.e., the training data may be acquired more efficiently, because the cloud-side device has more abundant storage and computing resources.
FIG. 2 is a schematic flow chart diagram of a method of training a model according to one embodiment of the present application. The method may include S210 and S220.
S210, obtaining second training data, where the second training data is obtained by using the method shown in fig. 1.
One way to obtain the second training data is to receive the second training data obtained by the device using the method shown in fig. 1 from the other device.
Another way to obtain the second training data is for the training device to obtain it using the method shown in fig. 1.
S220, training a preset first model according to the second training data to obtain an entry disambiguation model.
The first model may include an ELMo model, a multilayer perceptron (MLP), a conditional random field (CRF) model, or a BERT (Bidirectional Encoder Representations from Transformers) model. The first model may be a classification model, for example a binary classification model; of course, the first model may also be a multi-class classification model, which is not limited in this embodiment. An example of a BERT model is TinyBERT.
An exemplary structure of the entry disambiguation model is shown in FIG. 3. The entry disambiguation model shown in fig. 3 is formed by any one of the ELMo model, the BERT model, or the CRF model together with an MLP. The sentence and the entry information are input to the ELMo, BERT, or CRF model, whose output is input to the MLP; the MLP outputs 0 or 1, where 1 indicates that the entry indicated by the entry information conforms to the semantics of the sentence and 0 indicates that it does not.
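To make the structure concrete, here is a hedged PyTorch sketch of an encoder-plus-MLP binary classifier in the spirit of Fig. 3. A toy embedding-and-pooling encoder stands in for ELMo/BERT so the example stays self-contained, and all sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EntryDisambiguator(nn.Module):
    """Toy sketch: sentence encoder + MLP that outputs whether a candidate
    entry conforms to the sentence semantics (1) or not (0)."""
    def __init__(self, vocab_size, n_entry_types=8, d_model=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)     # stand-in encoder
        self.ent = nn.Embedding(n_entry_types, d_model)  # entry-type feature
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 2)
        )

    def forward(self, token_ids, entry_type_id):
        sent = self.tok(token_ids).mean(dim=1)   # (batch, d_model), mean-pooled
        ent = self.ent(entry_type_id)            # (batch, d_model)
        return self.mlp(torch.cat([sent, ent], dim=-1))  # logits over {0, 1}
```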
In some examples, the method shown in fig. 2 may be performed by a cloud-side device, though it may of course also be performed by an end-side device. Performing the method on a cloud-side device is more efficient than performing it on an end-side device because the cloud-side device has more abundant storage and computing resources; that is, the entry disambiguation model can be trained more efficiently.
When the entry disambiguation model is trained according to the second training data, the indication information in the second training data can be used as the tagging data to supervise the training, and the specific implementation manner can refer to the implementation manner of supervised training of the neural network model in the prior art.
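Continuing the sketch above, supervised training on the second training data would look roughly as follows, with the indication information serving as the cross-entropy label; the data loader is an assumed iterator over already-tensorized corpus items.

```python
model = EntryDisambiguator(vocab_size=10000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for token_ids, entry_type_id, indication in loader:  # loader: assumed iterator
    logits = model(token_ids, entry_type_id)
    loss = loss_fn(logits, indication)  # indication information supervises training
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```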
FIG. 4 is a schematic flow chart diagram of a method of training a model according to another embodiment of the present application. The method may include S410 and S420.
S410, third training data is obtained, where the third training data is obtained by using the method shown in fig. 1.
One way to obtain the third training data is to receive the third training data obtained by the device using the method shown in fig. 1 from the other device.
Another way to obtain the third training data is for the training device to obtain it using the method shown in fig. 1.
And S420, training a preset second model according to the third training data to obtain a natural language understanding model.
In some examples, the natural language understanding model may be formed by an MLP together with either a BERT model or a long short-term memory (LSTM) network.
In some examples, the method shown in fig. 4 may be performed by a cloud-side device, though it may of course also be performed by an end-side device. Performing the method on a cloud-side device is more efficient than performing it on an end-side device because the cloud-side device has more abundant storage and computing resources; that is, the natural language understanding model can be trained more efficiently.
When the natural language understanding model is trained according to the third training data, the intention and the key information in the third training data can be used as the labeling data to supervise the training, and the specific implementation mode can refer to the implementation mode of supervised training of the neural network model in the prior art.
FIG. 5 is a schematic flow chart diagram of a method of implementing natural language understanding of one embodiment of the present application. The method may include S510 to S550.
S510, obtaining target entry information, wherein the target entry information is used for representing entries contained in a target sentence, and the target sentence is a sentence input to a man-machine interaction system by a user.
The user can input the target sentence to the man-machine interaction system through voice or words. The man-machine interaction system can be a man-machine interaction system on any intelligent device, for example, a man-machine interaction system on an intelligent device such as a smart phone, an intelligent vehicle and an intelligent sound box.
S530, acquiring target indication information based on the target sentence and the target entry information by using an entry disambiguation model, wherein the target indication information is used for indicating whether the entry indicated by the target entry information conforms to the semantics of the target sentence.
The entry disambiguation model used in this embodiment may be obtained by training with the method in the foregoing embodiments, for example the method shown in fig. 2. One implementation is: the device itself trains with the method shown in fig. 2 to obtain the entry disambiguation model and then uses it.
Another implementation is: the entry disambiguation model is received from a training device and used; the training device trains with the method shown in fig. 2 to obtain the entry disambiguation model. For example, a cloud-side device trains with the method shown in fig. 2 to obtain the entry disambiguation model, and an end-side device receives the entry disambiguation model from the cloud-side device and uses it.
The input of the entry disambiguation model comprises a to-be-processed sentence and to-be-processed entry information, where the to-be-processed entry information represents an entry contained in the to-be-processed sentence, and the entry disambiguation model judges whether the entry represented by the to-be-processed entry information conforms to the semantics of the to-be-processed sentence. For example, the target sentence and the target entry information are input into the entry disambiguation model, and the target indication information output by the entry disambiguation model based on the target sentence and the target entry information is acquired.
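As a hedged illustration of S530 using the toy model sketched earlier, the candidate entries of the target sentence can be filtered one by one; the `Candidate` container and the tensor shapes are hypothetical.

```python
import torch
from collections import namedtuple

Candidate = namedtuple("Candidate", ["start", "end", "type_id"])

def disambiguate(sentence_token_ids, candidates, model):
    """Keep only candidate entries the model judges to conform to the
    target sentence semantics (class 1). sentence_token_ids: (1, seq_len)."""
    kept = []
    for c in candidates:
        logits = model(sentence_token_ids, torch.tensor([c.type_id]))
        if logits.argmax(dim=-1).item() == 1:
            kept.append(c)
    return kept  # basis of the target indication information
```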
In some examples, the target indication information may include a term; in other examples, the target indication information may include a position index of a first word of the term in the target sentence and a position index of a last word of the term in the target sentence. In still other examples, the target indication information may include not only a location index of the entry, but also a type of the entry, such as whether the entry is of a song name type or a movie name type, or of a place name type.
And S550, acquiring an understanding result based on the target sentence, the target entry information and the target indication information by using a natural language understanding model, wherein the understanding result comprises the target intention of the target sentence input by the user and key information of the target intention.
The natural language understanding model used in this embodiment may be obtained by training with the method in the foregoing embodiments, for example the method shown in fig. 4. One implementation is: the device itself trains with the method shown in fig. 4 to obtain the natural language understanding model and then uses it.
Another implementation is: the natural language understanding model is received from a training device and used; the training device trains with the method shown in fig. 4 to obtain the natural language understanding model. For example, a cloud-side device trains with the method shown in fig. 4 to obtain the natural language understanding model, and an end-side device receives the natural language understanding model from the cloud-side device and uses it.
The input of the natural language understanding model comprises a sentence to be understood, first auxiliary information and second auxiliary information, wherein the first auxiliary information is used for representing an entry contained in the sentence to be understood, the second auxiliary information is used for indicating whether the entry represented by the first auxiliary information conforms to the semantic meaning of the sentence to be understood, and the natural language understanding model is used for acquiring the intention of the sentence to be understood and the key information of the intention input by a user based on the sentence to be understood, the first auxiliary information and the second auxiliary information.
For example, the target sentence, the target entry information, and the target instruction information are input to the natural language understanding model, and an understanding result output by the natural language understanding model based on the target sentence, the target entry information, and the target instruction information is acquired.
In this implementation, the relationship between the inputs and outputs of the entry disambiguation model and the natural language understanding model means that the training data of the entry disambiguation model requires no additional data labeling, which saves cost, improves training efficiency, and accelerates the realization of natural language understanding.
In this embodiment, as shown in fig. 6, S510 may include: s504, a target dictionary is obtained, and the target dictionary comprises at least one entry; and S508, querying the vocabulary entry contained in the target sentence from the target dictionary to obtain the target vocabulary entry information.
The dictionary in this embodiment may include a music dictionary, a movie dictionary, an application dictionary, a place name dictionary, or a person name dictionary provided by a third-party service provider, or a user-defined dictionary. For example, in a user-defined dictionary, the user may have defined the name of a smart speaker as "ironmen".
Compared with the existing method for acquiring the entry information, the method has the advantages of strong generalization and simplicity in maintenance.
In some implementations, a pre-set target dictionary may be loaded locally by the end-side device, or the target dictionary may be dynamically loaded by the end-side device from applications on the device. For example, a phone assistant on a smartphone may dynamically load dictionaries such as the contact list in the phone application; the singer, song title, and album dictionaries in a music playback application; or the actor and movie title dictionaries in a video playback application.
In other implementations, the target dictionary may be obtained by the end-side device from the cloud-side device. For example, the end-side device transmits request information to the cloud-side device, the cloud-side device transmits a dictionary based on the request information, and the end-side device takes the received dictionary as a target dictionary.
In everyday end-side device applications, it is often desirable to deploy dictionaries on the end-side device to reduce the latency and cost of end-cloud interactions. However, everyday dictionaries such as music dictionaries and place name dictionaries are usually very large, possibly requiring gigabytes (GB) of storage space, while the storage and computing resources of the end-side device are relatively limited. Therefore, some dictionaries may be deployed on the end-side device and others on the cloud-side device; that is, when target entry information is queried according to the dictionaries, end-cloud cooperation is required.
For example, as shown in fig. 11, a full dictionary may be deployed on the cloud side, while a hot-word dictionary, a common dictionary, a personal dictionary, and the like may be deployed on the end-side device. In this way, when the dictionary needs to be queried, the end-side device may send the target sentence to the cloud-side device; the cloud-side device performs the query to obtain entry matching vector 1 corresponding to the target sentence. Meanwhile, the end-side device obtains entry matching vector 2 corresponding to the target sentence based on the dictionary query on the end-side device. After the end-side device receives entry matching vector 1 from the cloud-side device, the entry disambiguation model disambiguates it together with entry matching vector 2 obtained by the end-side device's own query, and the natural language understanding model performs natural language understanding according to the disambiguation result.
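A rough sketch of this end-cloud cooperative lookup follows; `match_all` is the trie matcher sketched earlier, and the cloud client interface is a hypothetical placeholder, not a real API.

```python
def query_entries(sentence, local_tries, cloud_client):
    # Entry matching vector 2: hot-word/common/personal dictionaries on device.
    local_matches = []
    for trie in local_tries:
        local_matches += match_all(sentence, trie)
    # Entry matching vector 1: full dictionary queried on the cloud side.
    cloud_matches = cloud_client.match(sentence)  # assumed RPC interface
    # Both vectors are then passed jointly to the entry disambiguation model.
    return local_matches + cloud_matches
```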
There may be many entry matching results obtained by the cloud-side device and the end-side device through dictionary queries; some of them are reasonable, while others are mere character matches that are unreasonable at the semantic level. Through disambiguation by the entry disambiguation model, the unreasonable matching results, which amount to noise, can be filtered out, helping the natural language understanding model obtain more accurate understanding results.
Here, the full dictionary is a dictionary containing all entries; the hot-word dictionary and the common dictionary are dictionaries with high usage frequency; and the personal dictionary is a dictionary specific to each end-side device, which may contain user-defined entries.
According to this end-cloud cooperation method, the abundant dictionary knowledge on the cloud can be fully utilized while the entry disambiguation model and the natural language understanding model are deployed on the end-side device, improving the accuracy of natural language understanding and saving the operating and deployment costs of the end-side device. Moreover, some dictionaries, such as the personal dictionary and the contact list, can be stored on the end-side device, which protects user privacy. Deploying the entry disambiguation model and the natural language understanding model on the end-side device also avoids the latency caused by network transmission and the inability to perform natural language understanding without a network.
In one example, the intent of a target sentence may be recognized by the end-side device using a lightweight intent recognition model, and whether the target dictionary corresponding to that intent is deployed on the cloud-side device or the end-side device may be determined according to a preset correspondence between intents and devices. If the target dictionary is deployed on the end-side device, the dictionary on the end-side device is queried; otherwise, the dictionary on the cloud-side device is queried. This implementation can control the frequency of communication with the cloud-side device more flexibly, improving user experience.
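A sketch of this routing decision, under the assumption of a lightweight on-device intent recognizer and a preset intent-to-device table (both hypothetical interfaces):

```python
def choose_dictionary_side(sentence, intent_model, intent_to_device):
    intent = intent_model.predict(sentence)       # lightweight on-device model
    if intent_to_device.get(intent, "cloud") == "end":
        return "local"   # query dictionaries on the end-side device
    return "cloud"       # query the full dictionary on the cloud-side device
```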
FIG. 7 is a schematic diagram of a system architecture of one embodiment of the present application. As shown in FIG. 7, the system architecture 700 includes an execution device 710, a training device 720, a database 730, a client device 740, a data storage system 750, and a data collection device 760.
The data collection device 760 is used to collect training data. After the training data is collected, the data collection device 760 stores the training data in the database 730.
For example, the data collection apparatus 760 may read a preset first training data and a preset dictionary; executing the method shown in fig. 1, and obtaining second training data and third training data based on the first training data; the second training data and the third training data are then stored in database 730.
In some application scenarios, the training device 720 may train a designated neural network model using the second training data maintained in the database 730 to obtain the target model 701. For example, the training device 720 may perform the method shown in FIG. 2 to train an entry disambiguation model. In this case, the target model 701 is the entry disambiguation model. In embodiments of the present application, the target model may also be referred to as a target rule.
In other application scenarios, the training device 720 may train a designated neural network model using the third training data maintained in the database 730 to obtain the target model 701. For example, the training device 720 may perform the method shown in FIG. 4 to train a natural language understanding model. In this case, the target model 701 is the natural language understanding model.
It should be noted that, in practical applications, the training data maintained in the database 730 does not necessarily come from the collection of the data collection device 760, and may be received from other devices. It should be noted that the training device 720 does not necessarily perform the training of the target model 701 based on the training data maintained by the database 730, and may also obtain the training data from the cloud or other places for performing the model training.
The target model 701 trained according to the training apparatus 720 may be applied to different systems or apparatuses, such as the execution apparatus 710 in fig. 7.
For example, after the training device 720 has trained the entry disambiguation model and the natural language understanding model, the two models may be deployed in the computing module 711 of the execution device 710. That is, the computing module 711 of the executing device 710 has the entry disambiguation model and the natural language understanding model trained by the training device 720 deployed therein.
The execution device 710 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an Augmented Reality (AR)/Virtual Reality (VR), a vehicle-mounted terminal, or a chip that can be applied to the above devices, or a server or a cloud.
In fig. 7, the execution device 710 is configured with an input/output (I/O) interface 712 for data interaction with external devices, and a user may input data to the I/O interface 712 via the client device 740. For example, a user may enter a speech statement, a text statement, or the like through the client device 740.
In addition, the execution device 710 includes a computing module 711, and the computing module 711 includes the target model 701 trained by the training device 720.
While the computing module 711 of the execution device 710 performs related processing on the data to be processed using the target model 701, the execution device 710 may call data, code, and the like in the data storage system 750 for the corresponding processing, and may store the data, instructions, and the like obtained by that processing into the data storage system 750. Finally, the I/O interface 712 presents the processing result to the client device 740 for presentation to the user.
For example, a target dictionary may be stored in the data storage system 750, the execution device 710 may perform the method shown in fig. 5, obtain the intention of the user and key information of the intention based on the target dictionary in the data storage system 750, execute the corresponding task according to the intention and the key information, and send the result obtained by executing the corresponding task to the client device 740 through the I/O interface, so that the client device 740 provides the execution result of the task to the user.
It is understood that the execution device 710 and the client device 740 in the embodiment of the present application may be the same device, for example, the same terminal device.
For example, in the case that the execution device 710 and the client device 740 are the same smartphone, the smartphone may acquire a target sentence input by a user through a microphone, a keyboard, or a handwriting screen, and the smartphone assistant of the smartphone may execute the method shown in fig. 5, acquire an intention of the target sentence and key information of the intention, call a corresponding third-party application (e.g., a ticket ordering application, a calling application, or a music playing application) according to the intention, and output the key information to the third-party application, so that the third-party application may perform a task according to the key information. After the third-party application program obtains the task result, the mobile phone assistant of the smart phone can display the task result to the user through a display screen or a loudspeaker and other devices.
In the system shown in fig. 7, the user may manually give input data, which may be operated through an interface provided by the I/O interface 712. Alternatively, the client device 740 may automatically send input data to the I/O interface 712; if automatically sending input data requires the user's authorization, the user may set the corresponding permissions in the client device 740. The user may view the results output by the execution device 710 at the client device 740, and the specific presentation form may be a display, a sound, an action, and the like. The client device 740 may also serve as a data collection terminal, collecting the input data of the I/O interface 712 and the output results of the I/O interface 712 as new sample data, as shown in fig. 7, and storing them in the database 730. Of course, instead of being collected by the client device 740, the input data of the I/O interface 712 and the output results of the I/O interface 712 shown in fig. 7 may be directly stored in the database 730 as new sample data by the I/O interface 712.
It is to be understood that fig. 7 is only a schematic diagram of one system architecture provided in the embodiments of the present application, and the position relationship between the devices, modules, etc. shown in the diagram does not constitute any limitation, for example, in fig. 7, the data storage system 750 is an external memory with respect to the execution device 710, and in other cases, the data storage system 750 may be disposed in the execution device 710.
Fig. 8 is a deployment diagram of an apparatus for training a model according to an embodiment of the present application. The apparatus may be deployed in a cloud environment, where a cloud environment provides cloud services to users by using base resources in a cloud computing mode. A cloud environment includes a cloud data center and a cloud service platform; the cloud data center includes a large number of infrastructure resources (including computing resources, storage resources, and network resources) owned by a cloud service provider, and the computing resources may include a large number of computing devices (e.g., servers).
In some examples, the apparatus may be a server in the cloud data center for training the entry disambiguation model, or a virtual machine created in the cloud data center for training the entry disambiguation model. It may also be a software apparatus deployed on a server or a virtual machine in the cloud data center; the software apparatus is used for training the entry disambiguation model and may be deployed in a distributed manner on a plurality of servers, on a plurality of virtual machines, or across virtual machines and servers.
In other examples, the apparatus may be a server in the cloud data center for training the natural language understanding model, or a virtual machine created in the cloud data center for training the natural language understanding model. It may also be a software apparatus deployed on a server or a virtual machine in the cloud data center; the software apparatus is used for training the natural language understanding model and may be deployed in a distributed manner on a plurality of servers, on a plurality of virtual machines, or across virtual machines and servers.
As shown in fig. 8, the apparatus may be abstracted by the cloud service provider on the cloud service platform as a cloud service for training an entry disambiguation model or a cloud service for training a natural language understanding model. After the user purchases the cloud service on the cloud service platform, the cloud environment provides the user with the corresponding cloud service.
For example, a user may upload first training data to the cloud environment through an application program interface (API) or a web interface provided by the cloud service platform. The training device receives the first training data, acquires second and third training data using the method shown in fig. 1, acquires an entry disambiguation model using the method shown in fig. 2, and acquires a natural language understanding model using the method shown in fig. 4; finally, the training device returns the acquired entry disambiguation model and natural language understanding model to the execution device used by the user. The user may then input a target sentence to the execution device, and the execution device may perform the method illustrated in fig. 5, acquire the intention of the target sentence and the key information of the intention, and execute the related task according to the intention and the key information.
When the training apparatus is a software apparatus, it may also be deployed alone on a computing device in any environment, for example, alone on a single computing device or alone on a computing device in a data center.
FIG. 9 is an exemplary block diagram of a computing device according to one embodiment of the present application. As shown in fig. 9, computing device 900 includes a bus 901, a processor 902, a communication interface 903, and a memory 904.
The processor 902, the memory 904, and the communication interface 903 communicate over the bus 901. The processor 902 may be a central processing unit (CPU). The memory 904 may include volatile memory, such as random access memory (RAM). The memory 904 may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD). The memory 904 stores the executable code included in the training apparatus, and the processor 902 reads the executable code in the memory 904 to perform the training method. The memory 904 may also include other software modules required for running processes, such as an operating system. The operating system may be LINUX™, UNIX™, WINDOWS™, or the like.
For example, the processor 902 of the computing device 900 may read executable code in the memory 904 to implement the method shown in fig. 1 to obtain the second training data and the third training data.
As another example, the processor 902 of the computing device 900 may read executable code in the memory 904 to implement the method illustrated in fig. 2 to obtain an entry disambiguation model.
As another example, the processor 902 of the computing device 900 may read executable code in the memory 904 to implement the method illustrated in fig. 4 to obtain the natural language understanding model.
For another example, the processor 902 of the computing device 900 may read executable code in the memory 904 to implement the method illustrated in fig. 5 to obtain the intent of the statement input by the user and key information of the intent.
Fig. 10 is a schematic diagram of a system architecture according to another embodiment of the present application. The execution device 1010 is implemented by one or more servers, optionally in cooperation with other computing devices, such as: data storage, routers, load balancers, and the like. The enforcement devices 1010 may be disposed on one physical site or distributed across multiple physical sites. The execution device 1010 may use data in the data storage system 1050 or call program code in the data storage system 1050 to implement a method as shown in at least one of fig. 1, 2, and 4.
For example, the execution device 1010 may have various dictionaries deployed therein, as well as first training data comprising training sentences and their labeled intentions and key information. The execution device 1010 then executes the method shown in fig. 1 based on the dictionaries and the first training data to obtain second training data and third training data; thereafter, the execution device 1010 performs the method shown in fig. 2 based on the second training data to obtain the entry disambiguation model, and performs the method shown in fig. 4 based on the third training data to obtain the natural language understanding model.
The user may operate respective user devices (e.g., local device 1001 and local device 1002) to interact with the execution device 1010. Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, and so forth.
The local devices of each user may interact with the execution device 1010 via a communication network of any communication mechanism/standard, such as a wide area network, a local area network, a peer-to-peer connection, or any combination thereof.
After the performing device 1010 has trained the entry disambiguation model and the natural language understanding model, the entry disambiguation model and the natural language understanding model are transmitted to the user devices (e.g., the local device 1001 and the local device 1002) over the communication network.
After the local device 1001 or the local device 1002 receives the entry disambiguation model and the natural language understanding model, the two models may be deployed, and in a case where a target sentence input by a user is received, the method shown in fig. 5 is executed based on the two models, and an intention of the target sentence and key information of the intention are acquired.
In another implementation, one or more aspects of the execution device 1010 may be implemented by each local device, e.g., the local device 1001 may provide the execution device 1010 with first training data, or with a dictionary, or with training sentences.
Fig. 12 is a schematic block diagram of an apparatus 1100 for acquiring training data according to an embodiment of the present application. The apparatus 1100 may include an obtaining module 1110, a determination module 1120, and a generation module 1130. The apparatus 1100 may be used to implement the method shown in fig. 1.
For example, the obtaining module 1110 may be configured to perform S110 to S120, the determining module 1120 may be configured to perform S130, and the generating module 1130 may be configured to perform S140 and S150.
FIG. 13 is a schematic block diagram of an apparatus 1200 for training a model according to an embodiment of the present application. The apparatus 1200 may include an obtaining module 1210 and a training module 1220. The apparatus 1200 may be used to implement the method shown in fig. 2 or fig. 4.
For example, the obtaining module 1210 may be configured to perform S210, and the training module 1220 may be configured to perform S220. For another example, the obtaining module 1210 may be configured to perform S410, and the training module 1220 may be configured to perform S420.
Fig. 14 is a schematic block diagram of an apparatus 1300 for implementing natural language understanding according to an embodiment of the present application. The apparatus 1300 may include an obtaining module 1310, a disambiguation module 1320, and an understanding module 1330. The apparatus 1300 may be used to implement the method shown in fig. 5.
For example, the obtaining module 1310 may be used to perform S510, the disambiguation module 1320 may be used to perform S530, and the understanding module 1330 may be used to perform S550.
Fig. 15 is a schematic block diagram of an apparatus 1400 according to an embodiment of the present application. The apparatus 1400 comprises a processor 1402, a communication interface 1403, and a memory 1404. One example of the apparatus 1400 is a chip.
The processor 1402, the memory 1404, and the communication interface 1403 may communicate with each other via a bus. The memory 1404 stores executable code, which the processor 1402 reads from the memory 1404 to perform the corresponding method. The memory 1404 may also include other software modules required for running processes, such as an operating system. The operating system may be LINUX™, UNIX™, WINDOWS™, or the like.
For example, the executable code in the memory 1404 is used to implement the methods shown in fig. 1, 2, 4, and 5, and the processor 1402 reads the executable code in the memory 1404 to perform the method shown in any of fig. 1, 2, 4, and 5.
The processor 1402 may be a CPU. The memory 1404 may include volatile memory, such as random access memory (RAM). The memory 1404 may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
In some embodiments of the present application, the disclosed methods may be implemented as computer program instructions encoded on a computer-readable storage medium in a machine-readable format, or encoded on other non-transitory media or articles of manufacture. Fig. 16 schematically illustrates a conceptual partial view of an example computer program product, arranged in accordance with at least some embodiments presented herein, that includes a computer program for executing a computer process on a computing device. In one embodiment, the example computer program product 1500 is provided using a signal bearing medium 1501. The signal bearing medium 1501 may include one or more program instructions 1502 which, when executed by one or more processors, may provide the functions, or portions of the functions, described above with respect to the methods shown in any of fig. 1, 2, 4, and 5. Thus, for example, in the embodiment shown in fig. 5, one or more features of S510-S550 may be undertaken by one or more instructions associated with the signal bearing medium 1501. As another example, referring to the embodiment shown in fig. 4, one or more features of S410-S420 may be undertaken by one or more instructions associated with the signal bearing medium 1501.
In some examples, the signal bearing medium 1501 may include a computer-readable medium 1503, such as, but not limited to, a hard disk drive, a compact disc (CD), a digital video disc (DVD), digital tape, memory, read-only memory (ROM), or random access memory (RAM). In some implementations, the signal bearing medium 1501 may include a computer-recordable medium 1504, such as, but not limited to, memory, read/write (R/W) CDs, R/W DVDs, and the like. In some implementations, the signal bearing medium 1501 may include a communication medium 1505, such as, but not limited to, digital and/or analog communication media (e.g., fiber optic cables, waveguides, wired communication links, wireless communication links, etc.). Thus, for example, the signal bearing medium 1501 may be conveyed by a wireless form of the communication medium 1505 (e.g., a wireless communication medium conforming to the IEEE 802.11 standard or another transmission protocol). The one or more program instructions 1502 may be, for example, computer-executable instructions or logic-implemented instructions. In some examples, the aforementioned computing devices may be configured to provide various operations, functions, or actions in response to program instructions 1502 conveyed to the computing device through one or more of the computer-readable medium 1503, the computer-recordable medium 1504, and/or the communication medium 1505. It should be understood that the arrangements described herein are for illustrative purposes only. Thus, those skilled in the art will appreciate that other arrangements and other elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used instead, and that some elements may be omitted altogether depending on the desired results. In addition, many of the described elements are functional entities that may be implemented as discrete or distributed components, or in conjunction with other components, in any suitable combination and location.
Many natural language understanding scenarios need to incorporate prior knowledge, such as the intent classification and slot extraction scenarios in a phone assistant. The "same song" in "play same song" may be a video or an audio track, and the intention cannot be judged without introducing external knowledge; the "xxx" in "call xxx by telephone" is the name of a person in the contact list, and the slot information cannot be accurately extracted if the contact list is not given. Smart home appliances are also increasingly popular; for example, smart speakers are standard in many households, and people like to give a smart speaker a special name, so without this prior knowledge the smart speaker cannot be woken up by that special name. Therefore, when intent classification and slot extraction need to be performed accurately, it is important to introduce external knowledge into the NLP model. According to the method for realizing natural language understanding in a human-computer interaction system provided by the embodiments of the present application, external knowledge can be introduced into natural language understanding, and the introduced external knowledge can assist tasks such as sequence labeling and sequence classification, thereby remarkably improving the accuracy and efficiency of sequence labeling and sequence classification.
Fig. 17 is a schematic diagram of a system architecture involved in a method for implementing natural language understanding in a human-computer interaction system according to an embodiment of the present application. As shown in fig. 17, the system architecture includes a data storage and algorithm training unit 1, an algorithm deployment unit 2, an algorithm deployment unit 3, and an algorithm deployment unit 4; the number of algorithm deployment units may vary and is not limited to the number shown in fig. 17. The data storage and algorithm training unit 1 is used for storing dictionaries and supporting the training of models, and for sending the trained models to the algorithm deployment units (terminal devices) 2, 3, 4, and so on; the algorithm deployment units are used for executing the algorithm (i.e., running the models).
Fig. 18 is a schematic diagram of the modules involved in a method for implementing natural language understanding in a human-computer interaction system according to an embodiment of the present application. As shown in fig. 18, the modules include a fast matching module, a disambiguation module, and a task correlation module. The fast matching module is used for quickly matching the sentence input by the user against the dictionary and then converting the matching results into low-dimensional representations. Since the matching results generally include multiple results, most of which are unreasonable, the disambiguation module classifies the low-dimensional representations of the matching results to determine the reasonable and unreasonable ones. Before the disambiguation module classifies the low-dimensional representations, the task correlation module passes the sentence input by the user through an encoder (such as a Bidirectional Encoder Representations from Transformers (BERT) model) to obtain a context representation of the input sentence, fuses the low-dimensional representations of the matching results with the context representation of the input sentence, and then inputs the fused results into the disambiguation module. The task correlation module then fuses the reasonable results output by the disambiguation module with the context representation of the input sentence to obtain a secondary fusion result; finally, the secondary fusion result is input into the task module for intention classification and slot extraction, thereby realizing the understanding of natural language.
Fig. 19 is a schematic flowchart of a method for implementing natural language understanding in the human-computer interaction system according to an embodiment of the present application. As shown in fig. 19, the method includes steps S1901 to S1905. The method is an end-to-end method and may be executed by the data storage and algorithm training unit 1 shown in fig. 17. The steps are described below.
S1901, a plurality of target entries matched with the target sentences in one or more dictionaries are obtained, and the target sentences are sentences input to the man-machine interaction system by the user.
Specifically, a target sentence input by the user to the human-computer interaction system is obtained, and the target sentence is matched against one or more dictionaries. The specific number and types of the dictionaries can be preset manually; for example, if the human-computer interaction system is a smart speaker, the dictionary types may be determined from the application scenario to be music dictionaries, audiobook dictionaries, and the like, and the dictionaries may also be user-defined dictionaries. Matching the target sentence against the dictionaries may be performed by a fast lookup algorithm, such as a dictionary tree (trie) algorithm, in which each word of the target sentence is looked up in the dictionaries in left-to-right order; a matched target entry generally includes several words. For example, for the target sentence "I want to listen to Blue and White Porcelain" with a music dictionary and a movie dictionary, "Blue and White" and "I" may both have matches in the music dictionary, that is, both may be song names, and "Blue and White" may also have a match in the movie dictionary, that is, it may be a movie name.
S1902, a plurality of sequences of the target sentence are obtained according to the plurality of target terms, and each of the plurality of sequences corresponds to one target term.
Specifically, a sequence corresponding to the target sentence is constructed for each target entry, and each sequence marks the matched entity, including the position and the type of the matched entity. For example, taking "Blue and White Porcelain" matched in the music dictionary as an example, the sequence constructed for "Blue and White Porcelain" needs to label "Blue and White Porcelain" as a song name and record its position in the sequence; specific reference may be made to Table 1 below.
Alternatively, only the position of the matched entity may be labeled in the sequence, without labeling the type of the matched entity.
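One plausible encoding of such a sequence is BIO-style tagging of the matched span with its type; this is an illustrative choice, since the application does not fix the exact scheme.

```python
def build_sequence(tokens, start, end, entry_type):
    """Mark the matched entity span [start, end] (0-based, inclusive)."""
    tags = ["O"] * len(tokens)
    tags[start] = "B-" + entry_type
    for i in range(start + 1, end + 1):
        tags[i] = "I-" + entry_type
    return list(zip(tokens, tags))

tokens = ["I", "want", "to", "listen", "to", "Blue", "and", "White", "Porcelain"]
seq = build_sequence(tokens, 5, 8, "song")
# ... ('Blue', 'B-song'), ('and', 'I-song'), ('White', 'I-song'), ('Porcelain', 'I-song')
```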
S1903, a plurality of first sequence representations corresponding to the plurality of sequences are acquired from the plurality of sequences and the target sentence.
Specifically, for the sequences obtained in S1902, before subsequent natural language understanding can proceed, the sequences need to be expressed in a manner that a computer can process; that is, each sequence must undergo low-dimensional representation processing to obtain a low-dimensional representation. Meanwhile, a context representation of the target sentence is obtained, specifically by processing the target sentence with an encoder (e.g., a BERT model). The low-dimensional representation of each sequence and the context representation of the target sentence are then fused, specifically by passing them through a self-attention layer of the neural network model, thereby obtaining a plurality of first sequence representations corresponding to the plurality of sequences.
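A hedged PyTorch sketch of this fusion step: the low-dimensional sequence representation attends over the sentence's context representation through a multi-head attention layer. The dimensions and the layer choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

attention = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)

def fuse(seq_repr, context_repr):
    # seq_repr:     (batch, L, 128) low-dimensional representation of one sequence
    # context_repr: (batch, L, 128) context representation from the encoder (e.g. BERT)
    fused, _ = attention(query=seq_repr, key=context_repr, value=context_repr)
    return fused  # a first sequence representation
```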
In the prior art, entries are directly converted into low-dimensional representations, so the set of entries needs to be obtained in advance; when the dictionary is updated or expanded, the newly added entries have no pre-computed low-dimensional representations, and the neural network model must be retrained. In the present application, taking Table 3 as an example, the sentence corresponding to each entry is first converted into a sequence, and the sequence is then converted into a low-dimensional representation; the entity types in the sequences are fixed and limited in number. For example, the entries in Table 3 only have the types "song" and "movie", so even if other song names and movie names are added to the music dictionary and the movie dictionary, they are still converted via the types "song" and "movie". The dictionary can therefore be updated without updating the neural network model, which improves the generalization capability of the neural network model.
And S1904, determining whether each target entry in the target entries conforms to the semantic meaning of the target sentence according to the first sequence representations.
According to the semantics and syntactic structure of the target sentence input by the user, it can be understood that most of the target entries obtained in S1901 above are matching results that do not conform to the semantics of the target sentence, so most of the resulting first sequence representations are unreasonable results. When input into the subsequent neural network model, these unreasonable results become interference noise that may degrade the model; therefore, disambiguation processing needs to be performed on the plurality of first sequence representations. Specifically, the attention of a second sequence representation to the other first sequence representations among the plurality of first sequence representations is obtained, where the second sequence representation is one of the plurality of first sequence representations and the attention is the influence of the other first sequence representations on the second sequence representation. Concretely, the second sequence representation is processed through a self-attention layer of the neural network model to obtain its attention to the other first sequence representations. In this way, the disambiguation considers not only the relationship between a matched entity and its context within one sequence, but also the relationships between different sequences, which makes the disambiguation result more accurate. The second sequence representation is then disambiguated through a linear layer of the neural network, which may perform binary classification on the second sequence representation, yielding the result that the second sequence representation either conforms or does not conform to the semantics of the target sentence. For example, the binary classification may be a 0-1 classification, classifying results that conform to the semantics of the target sentence as 1 and results that do not as 0.
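Continuing the sketch above, inter-sequence attention followed by a linear 0-1 classifier could look like this; pooling each sequence into a single vector is an illustrative simplification.

```python
import torch.nn as nn

cross_attention = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
classifier = nn.Linear(128, 2)  # linear layer performing the binary classification

def disambiguate_sequences(first_seq_reprs):
    # first_seq_reprs: (1, N, 128), one pooled vector per candidate sequence.
    # Each (second) sequence representation attends to all the others.
    attended, _ = cross_attention(first_seq_reprs, first_seq_reprs, first_seq_reprs)
    logits = classifier(attended)        # (1, N, 2)
    return logits.argmax(dim=-1)         # 1 = conforms to the target sentence
```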
S1905, natural language understanding processing is carried out on the target sentence according to the first sequence expression corresponding to the one or more target entries conforming to the semantics of the target sentence, and a processing result is obtained.
Specifically, the natural language understanding tasks of the embodiment of the present application include intention classification, slot extraction, sentence classification, entity recognition, and the like, and the target sentence may be processed according to the specific task. Taking intention classification and slot extraction as examples, the first sequence representations corresponding to one or more reasonable target entries are processed by an intention classification layer and a slot extraction layer of the neural network model, so as to obtain the intention information and slot information of the target sentence.
Optionally, before the natural language understanding processing is performed on the target sentence according to the first sequence representations corresponding to the one or more reasonable target entries, the method for implementing natural language understanding in the human-computer interaction system according to the embodiment of the application further includes: fusing the first sequence representation corresponding to each reasonable target entry with the context representation of the target sentence, to obtain one or more third sequence representations corresponding to the one or more reasonable target entries. The natural language understanding processing is then performed on the target sentence according to the third sequence representations to obtain the processing result.
According to the method for realizing natural language understanding in the man-machine interaction system provided by the embodiment of the application, dictionary knowledge is embedded, so that natural language understanding processing can be realized better. The natural language is first matched against the dictionary and converted into sequences, and the low-dimensional representations are then obtained from the sequences. Therefore, after the dictionary is updated or expanded, the neural network model does not need to be updated, which improves the generalization capability of the model.
Fig. 20 is a schematic flowchart of a training method of a neural network model according to an embodiment of the present application, where the neural network model includes a first sub-network model, a second sub-network model, and a third sub-network model. The first sub-network model may be used to implement steps 1901 to 1903 in fig. 19 described above, the second sub-network model to implement step 1904, and the third sub-network model to implement step 1905. As shown in fig. 20, the training method of the neural network model according to the embodiment of the present application includes steps 2001 to 2006, which are described below.
S2001, first training data is obtained, wherein the first training data comprises training sentences and a plurality of first sequence representations of the training sentences matched with the target dictionary.
Specifically, the training sentence is matched against the target dictionary to obtain a plurality of target entries; then, a sequence of the training sentence is established for each target entry, where each sequence contains the position and type of the entity corresponding to one target entry; the plurality of sequences are converted into low-dimensional representations, and the training sentence is processed to obtain its context representation; finally, each of the plurality of sequences is fused with the context representation of the training sentence, to obtain a plurality of matching results of the training sentence matched with the target dictionary.
S2002, training the first sub-network model according to the first training data to obtain a trained first sub-network model.
Inputting a training sentence into a first sub-network model to be trained, wherein the first sub-network model already introduces a target dictionary, and training the first sub-network model by taking a plurality of matching results of the training sentence matched with the target dictionary as training targets of the first sub-network model.
And S2003, acquiring second training data, wherein the second training data comprises the output result of the trained first sub-network model and a first sequence representation meeting preset requirements in the plurality of first sequence representations.
Specifically, a matching result that meets the preset requirement is selected from the plurality of matching results in S2001. Each of the plurality of first sequence representations corresponds to a first target entry, where the first target entry is an entry of the training sentence matched with the target dictionary; the first sequence representation meeting the preset requirement is the sequence representation corresponding to a first target entry that conforms to the semantics of the training sentence.
And S2004, training the second sub-network model according to the second training data to obtain the trained second sub-network model.
And inputting the output result of the first sub-network model meeting the preset requirement into a second sub-network model to be trained, taking the matching result meeting the preset requirement as a training target of the second sub-network model, and training the second sub-network model.
And S2005, acquiring third training data, wherein the third training data comprises an output result of the trained second sub-network model and a processing result of natural language understanding processing of the training sentence.
And S2006, training the third sub-network model according to the third training data to obtain the trained third sub-network model.
And inputting the output result of the second sub-network model meeting the preset requirement into a third sub-network model to be trained, taking the processing result of the natural language understanding processing of the training sentence as a training target of the third sub-network model, and training the third sub-network model.
Therefore, training of the neural network model of the embodiment of the application can be completed through the steps 2001 to 2006, and the trained neural network model can be used for achieving the method for achieving natural language understanding in the human-computer interaction system of the embodiment of the application.
Optionally, the first sub-network model is an entry matching model, and is configured to obtain, according to a target sentence and a target dictionary input by a user, a plurality of sequence representations matched with the target dictionary, where each sequence representation in the plurality of sequence representations corresponds to one target entry, and the target entry is an entry matched with the target dictionary for the target sentence; the second sub-network model is a term disambiguation model and is used for determining whether a target term corresponding to each sequence representation in the sequence representations conforms to the semantics of the target sentence according to the target sentence and the sequence representations; and the third sub-network model is a natural language understanding model and is used for performing natural language understanding processing on the target statement according to the sequence representation corresponding to the target entry conforming to the target statement semantics.
The training of the neural network model in the embodiment of the application is an end-to-end training method, and is simple in process and high in training speed.
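By way of illustration only, the staged procedure of steps 2001 to 2006 can be sketched as follows; all identifiers and hyperparameters are assumptions, and the loss functions and data loaders are placeholders:

    import torch

    def train_stage(model, batches, loss_fn, epochs=3, lr=1e-4):
        # Generic single-stage trainer: each sub-network is trained in turn
        # against its own training target.
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for inputs, target in batches:
                opt.zero_grad()
                loss = loss_fn(model(inputs), target)
                loss.backward()
                opt.step()
        return model

    # Stage 1 (S2001-S2002): entry matching model, trained toward the matching results.
    # Stage 2 (S2003-S2004): entry disambiguation model, fed the trained stage-1 outputs.
    # Stage 3 (S2005-S2006): natural language understanding model, fed stage-2 outputs.
    # matcher = train_stage(matcher, stage1_batches, matching_loss)
    # disambiguator = train_stage(disambiguator, stage2_batches, torch.nn.CrossEntropyLoss())
    # nlu_model = train_stage(nlu_model, stage3_batches, nlu_loss)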
The method for implementing natural language understanding in the human-computer interaction system according to the embodiment of the present application may be implemented through the following steps, which are described below with reference to fig. 21 to 23.
(1) Dictionary fast matching
The dictionaries contain the entries relevant to the downstream tasks and their corresponding types, e.g., a music dictionary, a movie dictionary, and so on. Multiple dictionaries may be used for matching, and the number and types of dictionaries can be preset manually. As shown in fig. 21, for each input sentence, entries and types appearing in the dictionaries are first searched from left to right in sentence order, starting from each word; each sentence can be searched in multiple dictionaries, and the same entry can correspond to multiple types. The entries appearing in the dictionaries can be found by traversing each word with a fast search algorithm, for example, a dictionary tree (trie tree) algorithm, and the entries are then sorted from longest to shortest by the length of the matched text, yielding a fast matching set.
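A small sketch of such trie-based matching follows; the dictionary contents, the tuple layout of the hits, and the character-level (rather than word-level) scan are illustrative assumptions:

    class Trie:
        def __init__(self):
            self.children, self.types = {}, []  # types is non-empty only where an entry ends

        def insert(self, entry: str, entry_type: str):
            node = self
            for ch in entry:
                node = node.children.setdefault(ch, Trie())
            node.types.append(entry_type)

    def match(sentence: str, trie: Trie):
        """Scan from every start position; collect (start, entry, type) hits.
        Word-boundary handling is elided, so short entries match noisily,
        which is exactly what the later disambiguation step must filter."""
        hits = []
        for i in range(len(sentence)):
            node = trie
            for j in range(i, len(sentence)):
                node = node.children.get(sentence[j])
                if node is None:
                    break
                for t in node.types:  # the same entry may carry several types
                    hits.append((i, sentence[i:j + 1], t))
        # Sort from longest matched text to shortest, as in the fast matching set.
        return sorted(hits, key=lambda h: len(h[1]), reverse=True)

    trie = Trie()
    for entry, t in [("blue and white porcelain", "song"), ("blue and white", "movie"),
                     ("blue and white", "song"), ("i", "song")]:
        trie.insert(entry, t)
    print(match("i want to listen to blue and white porcelain", trie))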
(2) Constructing labels for matching results
For each matching result, a word-level feature sequence of the matched dictionary type is constructed, in one-to-one correspondence with the word sequence of the sentence. Table 3 shows an example of matching the sentence "I want to listen to blue and white porcelain" against the dictionaries, where each row represents one sequence. For example, in the first sequence of the dictionary query, "blue and white porcelain" is matched in the music dictionary; the O mark indicates that a word has no corresponding dictionary feature, the B-song mark indicates the start position of a song-name entity, and the I-song mark indicates a subsequent part of the song-name entity. Each dictionary feature sequence thus contains the position and type information of one entity.
TABLE 3
    Matched entry                      I       want    listen    blue and white    porcelain
    music: blue and white porcelain    O       O       O         B-song            I-song
    music: blue and white              O       O       O         B-song            O
    movie: blue and white              O       O       O         B-movie           O
    music: I                           B-song  O       O         O                 O
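A sketch of how such feature sequences could be built from the matches, assuming word-level tokens and a (start, length, type) hit layout; the function name is illustrative:

    def build_feature_sequence(tokens, start, length, entry_type):
        """One BIO tag sequence per matched entry: the position and type of one entity."""
        tags = ["O"] * len(tokens)
        tags[start] = f"B-{entry_type}"
        for k in range(start + 1, start + length):
            tags[k] = f"I-{entry_type}"
        return tags

    tokens = ["I", "want", "listen", "blue and white", "porcelain"]
    # "blue and white porcelain" matched as a song spanning tokens 3..4:
    print(build_feature_sequence(tokens, 3, 2, "song"))
    # -> ['O', 'O', 'O', 'B-song', 'I-song']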
(3) Obtaining a low-dimensional representation of a tag sequence of a converged context
The mark sequence generated in (2) is in fact a sequence of mark IDs, which must be further converted into word representations before being input into the model. The mark sequence is processed by the word representation layer of the neural network to obtain a low-dimensional representation rep_match of the mark sequence. As shown in FIG. 5, the input sentence is processed by an encoder (e.g., a BERT model) to obtain a context representation rep_input. Using the self-attention mechanism, the low-dimensional representation rep_match of the mark sequence and the context representation rep_input of the input sentence are fused, whereby a mark representation rep_c-match of the fused context is obtained.
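A sketch of this fusion step; the rep names follow the text above, while the layer choices, dimensions, and the concatenate-then-attend scheme are assumptions:

    import torch
    import torch.nn as nn

    class ContextFusion(nn.Module):
        """Fuses the tag-sequence representation rep_match with the sentence
        context representation rep_input via self-attention -> rep_c_match."""

        def __init__(self, hidden_size: int = 768, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

        def forward(self, rep_match: torch.Tensor, rep_input: torch.Tensor) -> torch.Tensor:
            # Concatenate along the token axis so the tag sequence can attend
            # to the sentence tokens (and vice versa), then keep the tag positions.
            joint = torch.cat([rep_match, rep_input], dim=1)
            fused, _ = self.attn(joint, joint, joint)
            return fused[:, : rep_match.size(1)]  # rep_c_match

    rep_c_match = ContextFusion()(torch.randn(1, 5, 768), torch.randn(1, 5, 768))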
(4) Dictionary disambiguation
Most of the dictionary matching results obtained above are false matches. If they were fed into the subsequent models indiscriminately, they would introduce noise: not only would the expressive power of the sequence model fail to improve, the model could actually be degraded. The matching results therefore need to be disambiguated, and the disambiguation step suppresses most of the noisy matching results.
As shown in FIG. 22, after the mark representation rep_c-match of the fused context is obtained, the [CLS] mark vector corresponding to rep_c-match is extracted; through the self-attention mechanism, the [CLS] position mark contains the information of the whole sequence. To take the relations between results into account when screening and disambiguating, the group of [CLS] marks corresponding to the same sentence is processed through a self-attention layer of the neural network model, so that each obtains attention over the other dictionary feature sequences, and prediction classification is then performed through a linear layer of the neural network model. In this process, the vector finally used for classification encodes, through the neural network, the type and position information of the dictionary features together with the semantic information of the original sentence, so semantically reasonable dictionary matching results can be identified by classification and the matching results thereby screened: a reasonable result is judged as 1, and an unreasonable result as 0.
The classification result adopts 0-1 classification to indicate whether the dictionary label attached to a matching result is reasonable. If the dictionary categories are not of interest, only the 0-1 classes may be used when reconstructing the labels.
Table 4 shows the disambiguation results for the dictionary lookup sequences in Table 3. In the sentence "I want to listen to blue and white porcelain", among the matched results "music: blue and white porcelain", "music: blue and white", "movie: blue and white", and "music: I", only "music: blue and white porcelain" is a possibly correct entity in the original text; the other results are semantically unreasonable. Such unreasonable dictionary matching results can be filtered out by the classification.
TABLE 4
    Matched entry                      Classification
    music: blue and white porcelain    1
    music: blue and white              0
    movie: blue and white              0
    music: I                           0
(5) Feature fusion
As shown in FIG. 23, disambiguation yields the more reasonable matching results among the tag sequences, and these are fused and input to the task-related modules. Specifically, the rep_c-match corresponding to each sequence whose classification result is 1 is selected and further fused with rep_input through the self-attention layer, obtaining a context representation rep_fuse into which dictionary knowledge has been fused. Since rep_fuse is a text representation that incorporates dictionary knowledge, it can subsequently be input into the relevant task networks to improve task performance.
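A sketch of this selection-and-fusion step, reusing the ContextFusion module sketched in (3); the 0/1 label layout and function name are assumptions:

    def fuse_reasonable(rep_c_match_list, keep_labels, rep_input, fusion):
        """Keep only sequences classified as 1, then fuse each with the
        sentence context representation to obtain rep_fuse."""
        rep_fuse_list = []
        for rep_c_match, keep in zip(rep_c_match_list, keep_labels):
            if keep == 1:  # semantically reasonable matching result
                rep_fuse_list.append(fusion(rep_c_match, rep_input))
        return rep_fuse_list

    # e.g., for the Table 4 example, only the first candidate survives:
    # rep_fuse = fuse_reasonable(candidates, [1, 0, 0, 0], rep_input, ContextFusion())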
(6) Intent classification and slot extraction
After rep_fuse is obtained, the context representation rep_fuse with dictionary knowledge fused in is combined with the context representation rep_input of the input sentence, and rep_fuse is then input into an intention classifier and a slot classifier respectively, to perform intention classification and slot extraction.
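A sketch of the two task heads under assumed label counts and pooling choices; none of the identifiers come from the embodiment:

    import torch
    import torch.nn as nn

    class IntentSlotHeads(nn.Module):
        def __init__(self, hidden_size: int = 768, num_intents: int = 10, num_slot_tags: int = 5):
            super().__init__()
            self.intent_head = nn.Linear(hidden_size, num_intents)  # sentence-level decision
            self.slot_head = nn.Linear(hidden_size, num_slot_tags)  # token-level decision

        def forward(self, rep_fuse: torch.Tensor):
            # rep_fuse: (batch, seq_len, hidden). Mean-pool for the intention,
            # classify every token position for the slots.
            intent_logits = self.intent_head(rep_fuse.mean(dim=1))
            slot_logits = self.slot_head(rep_fuse)
            return intent_logits, slot_logits

    intent, slots = IntentSlotHeads()(torch.randn(1, 5, 768))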
Fig. 24 is a schematic block diagram of an apparatus for implementing natural language understanding in a human-computer interaction system according to an embodiment of the present application, and as shown in fig. 24, the apparatus includes an obtaining module 2410 and a processing module 2420, where:
an obtaining module 2410, configured to obtain multiple target entries matched in one or more dictionaries by a target sentence, where the target sentence is a sentence input by a user to the human-computer interaction system.
A processing module 2420, configured to obtain multiple sequences of the target sentence according to multiple target entries, where each sequence in the multiple sequences corresponds to one target entry.
The processing module 2420 is further configured to obtain a plurality of first sequence representations corresponding to the plurality of sequences according to the plurality of sequences and the target sentence.
The processing module 2420 is further configured to determine whether each target entry of the plurality of target entries is reasonable or unreasonable according to the plurality of first sequence representations.
The processing module 2420 is further configured to perform natural language understanding processing on the target sentence according to the first sequence representation corresponding to the one or more reasonable target terms, so as to obtain a processing result.
Optionally, each sequence in the plurality of sequences includes type information of a matching entity and position information of the matching entity in the sequence.
Optionally, the obtaining, by the processing module 2420, a plurality of first sequence representations corresponding to the plurality of sequences according to the plurality of sequences and the target statement includes: obtaining a low-dimensional representation of each of the plurality of sequences; obtaining a context representation of the target statement; fusing the low-dimensional representation of each sequence of the plurality of sequences with the context representation of the target sentence to obtain a plurality of first sequence representations corresponding to the plurality of sequences.
Optionally, the determining, by the processing module 2420, whether each target entry of the plurality of target entries is reasonable or unreasonable according to the plurality of first sequence representations includes: obtaining attention of a second sequence representation to other first sequence representations except the second sequence representation in the plurality of first sequence representations, wherein the second sequence representation is one of the plurality of first sequence representations; and determining that the target entry corresponding to the second sequence representation is reasonable or unreasonable according to the second sequence representation and the attention.
Optionally, before the processing module 2420 performs the natural language understanding processing on the target sentence according to the first sequence representation corresponding to the one or more reasonable target terms, the processing module 2420 is further configured to: obtaining a context representation of the target statement; and fusing the first sequence representation corresponding to each reasonable target entry in the first sequence representations corresponding to the one or more reasonable target entries with the context representation of the target sentence to obtain one or more third sequence representations corresponding to the one or more reasonable target entries.
It should be understood that the apparatus 2400 for implementing natural language understanding in the human-computer interaction system may be configured to implement the method for implementing natural language understanding in the human-computer interaction system in fig. 19 and fig. 21 to fig. 23, and for the specific implementation steps, reference may be made to the description of fig. 19 and fig. 21 to fig. 23, and for brevity, no further description is given here in this embodiment of the present application.
Fig. 25 shows a schematic block diagram of a training apparatus of a neural network model including a first sub-network model, a second sub-network model, and a third sub-network model according to an embodiment of the present application. As shown in fig. 25, the training apparatus includes an acquisition module 2510 and a training module 2520, wherein:
an obtaining module 2510, configured to obtain first training data, where the first training data includes a training sentence and a plurality of matching results of the training sentence and a target dictionary;
a training module 2520, configured to train the first sub-network model according to the first training data to obtain a trained first sub-network model;
the obtaining module 2510 is further configured to obtain second training data, where the second training data includes an output result of the trained first sub-network model and a matching result that meets a preset requirement in the plurality of matching results;
the training module 2520 is further configured to train the second sub-network model according to the second training data to obtain a trained second sub-network model;
the obtaining module 2510 is further configured to obtain third training data, where the third training data includes an output result of the trained second sub-network model and a processing result of the training sentence for natural language understanding processing;
the training module 2520 is further configured to train the third sub-network model according to the third training data to obtain a trained third sub-network model.
It should be understood that the training apparatus 2500 for a neural network model can be used to implement the above training method for a neural network model in fig. 20, and specific implementation steps can refer to the above description for fig. 20, and for brevity, no further description is given here in this application embodiment.
Fig. 26 is a hardware configuration diagram of an apparatus according to an embodiment of the present application. The apparatus 2600 shown in fig. 26 (which apparatus 2600 may be, in particular, a computer device) includes a memory 2601, a processor 2602, a communication interface 2603, and a bus 2604. The memory 2601, the processor 2602 and the communication interface 2603 are communicatively connected to each other via a bus 2604.
The memory 2601 may be a Read Only Memory (ROM), a static memory device, a dynamic memory device, or a Random Access Memory (RAM). The memory 2601 may store a program, and when the program stored in the memory 2601 is executed by the processor 2602, the processor 2602 is configured to perform the steps of the method for implementing natural language understanding and the training method of the neural network model in the human-computer interaction system according to the embodiment of the present application.
The processor 2602 may adopt a general Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more integrated circuits, to execute related programs, so as to implement the method for implementing natural language understanding and the training method for the neural network model in the human-computer interaction system according to the embodiment of the present application.
The processor 2602 may also be an integrated circuit chip having signal processing capabilities. In the implementation process, the steps of the method for implementing natural language understanding and the training method for the neural network model in the human-computer interaction system of the present application may be implemented by integrated logic circuits of hardware in the processor 2602 or instructions in the form of software.
The processor 2602 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or a register. The storage medium is located in the memory 2601; the processor 2602 reads the information in the memory 2601 and, in combination with its hardware, completes the functions to be executed by the units included in the apparatus for implementing natural language understanding and the training apparatus for the neural network model in the human-computer interaction system according to the embodiment of the present application, or executes the method for implementing natural language understanding and the training method for the neural network model in the human-computer interaction system according to the embodiment of the present application.
Communication interface 2603 enables communication between apparatus 2600 and other devices or communication networks using transceiver means such as, but not limited to, a transceiver. For example, target sentences or training data input by the user may be acquired through the communication interface 2603.
The bus 2604 may include a pathway to transfer information between various components of the device 2600 (e.g., the memory 2601, the processor 2602, the communication interface 2603).
It should be noted that although the apparatus 2600 described above shows only a memory, a processor, and a communication interface, in a specific implementation, those skilled in the art will appreciate that the apparatus 2600 may also include other devices necessary for normal operation. Also, those skilled in the art will appreciate that the apparatus 2600 may also include hardware devices to implement other additional functions, according to particular needs. Furthermore, those skilled in the art will appreciate that the apparatus 2600 may also include only the devices necessary to implement the embodiments of the present application, and not necessarily all of the devices shown in fig. 26.
According to the method provided by the embodiment of the present application, the present application further provides a computer program product, including: computer program code which, when run on a computer, causes the computer to perform the method of any one of the embodiments shown in figures 1, 2, 4, 5, 6, 19, 20.
It should be noted that, all or part of the computer program code may be stored in the first storage medium, where the first storage medium may be packaged with the processor or may be packaged separately from the processor, and this application is not limited in this respect.
According to the method provided by the embodiment of the present application, the present application also provides a chip system, including: a processor for calling and running the computer program from the memory so that the communication device on which the chip system is installed performs the method of any one of the embodiments shown in fig. 1, 2, 4, 5, 6, 19, 20.
According to the method provided by the embodiment of the present application, the present application further provides a computer readable medium, which stores program codes, and when the computer program codes run on a computer, the computer is caused to execute the method of any one of the embodiments shown in fig. 1, 2, 4, 5, 6, 19 and 20.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: u disk, removable hard disk, read only memory, random access memory, magnetic or optical disk, etc. for storing program codes.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (40)

1. A method for realizing natural language understanding in a human-computer interaction system is characterized by comprising the following steps:
acquiring target entry information, wherein the target entry information is used for representing entries contained in a target sentence, and the target sentence is a sentence input to the human-computer interaction system by a user;
using a term disambiguation model, and based on the target statement and the target term information, acquiring target indication information, wherein the target indication information is used for indicating whether the term indicated by the target term information conforms to the semantics of the target statement;
and acquiring an understanding result based on the target sentence, the target entry information and the target indication information by using a natural language understanding model, wherein the understanding result comprises the target intention of the target sentence input by the user and key information of the target intention.
2. The method of claim 1, wherein the obtaining target entry information comprises:
and querying the vocabulary entry contained in the target sentence from a target dictionary to obtain the target vocabulary entry information, wherein the target dictionary comprises at least one vocabulary entry.
3. The method of claim 2, wherein the method is performed by an end-side device, wherein the target dictionary comprises a dictionary on the end-side device.
4. The method of claim 1, wherein the obtaining target entry information comprises:
sending the target statement to cloud side equipment;
and receiving the target entry information from the cloud side equipment.
5. The method of claim 4, further comprising:
acquiring candidate intentions of the target statement by using an intention recognition model; wherein the sending the target statement to the cloud-side device includes:
and sending the target sentence to the cloud side equipment under the condition that the dictionary corresponding to the candidate intention is judged to be positioned in the cloud side equipment according to a preset corresponding relation, wherein the corresponding relation is used for indicating whether the intention is positioned in the cloud side equipment.
6. The method of any of claims 1 to 5, wherein the term disambiguation model is a binary classification model.
7. A method of model training, comprising:
acquiring first training data, wherein the first training data comprises training sentences, intentions input by users into the training sentences and key information of the intentions;
obtaining entry information, wherein the entry information is used for representing entries contained in the training sentences;
acquiring indicating information according to the first training data and the entry information, wherein the indicating information is used for indicating whether the entry represented by the entry information conforms to the intention and the semantics represented by the key information;
and acquiring second training data according to the training sentences, the entry information and the indication information, wherein the second training data comprises the training sentences, the entry information and the indication information, and the second training data is used for training an entry disambiguation model, and the entry disambiguation model is used for judging whether the entries represented by the entry information to be processed conform to the semantics of the sentences to be processed or not based on the sentences to be processed and the entry information to be processed.
8. The method of claim 7, further comprising:
and obtaining third training data according to the first training data, the entry information and the indication information, wherein the third training data comprises the first training data, the entry information and the indication information, and the third training data is used for training a natural language understanding model, wherein the natural language understanding model is used for obtaining the intention of the sentence to be understood input by the user and key information of the intention based on the sentence to be understood, first auxiliary information and second auxiliary information, the first auxiliary information is used for representing the entry contained in the sentence to be understood, and the second auxiliary information is used for indicating whether the entry represented by the first auxiliary information conforms to the semantic meaning of the sentence to be understood.
9. The method according to claim 7 or 8, wherein the obtaining entry information comprises:
and querying the vocabulary entry contained in the sentence from a dictionary to obtain the vocabulary entry information, wherein the dictionary comprises at least one vocabulary entry.
10. An apparatus for implementing natural language understanding in a human-computer interaction system, comprising:
the acquisition module is used for acquiring target entry information, wherein the target entry information is used for representing entries contained in a target sentence, and the target sentence is a sentence input to the human-computer interaction system by a user;
the disambiguation module is used for acquiring target indication information based on the target statement and the target entry information by using an entry disambiguation model, wherein the target indication information is used for indicating whether an entry indicated by the target entry information conforms to the semantics of the target statement;
and the understanding module is used for acquiring an understanding result based on the target statement, the target entry information and the target indication information by using a natural language understanding model, wherein the understanding result comprises the target intention of the target statement input by the user and key information of the target intention.
11. The apparatus of claim 10, wherein the obtaining module is specifically configured to:
and querying the vocabulary entry contained in the target sentence from a target dictionary to obtain the target vocabulary entry information, wherein the target dictionary comprises at least one vocabulary entry.
12. The apparatus of claim 11, wherein the apparatus is included in an end-side device, and wherein the target dictionary comprises a dictionary on the end-side device.
13. The apparatus of claim 10, wherein the obtaining module is specifically configured to:
sending the target statement to cloud side equipment;
and receiving the target entry information from the cloud side equipment.
14. The apparatus of claim 13, wherein the obtaining module is specifically configured to:
acquiring candidate intentions of the target statement by using an intention recognition model;
and sending the target sentence to the cloud side equipment under the condition that the dictionary corresponding to the candidate intention is judged to be positioned in the cloud side equipment according to a preset corresponding relation, wherein the corresponding relation is used for indicating whether the intention is positioned in the cloud side equipment.
15. The apparatus of any of claims 10 to 14, wherein the term disambiguation model is a binary classification model.
16. A model training apparatus, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring first training data, and the first training data comprises training sentences, intentions of users for inputting the training sentences and key information of the intentions;
the acquisition module is further used for acquiring entry information, and the entry information is used for representing entries contained in the training sentences;
a judging module, configured to obtain indication information according to the first training data and the entry information, where the indication information is used to indicate whether an entry represented by the entry information conforms to the intention and semantics represented by the key information;
and the generation module is used for acquiring second training data according to the training sentences, the entry information and the indication information, wherein the second training data comprises the training sentences, the entry information and the indication information, and the second training data is used for training an entry disambiguation model, and the entry disambiguation model is used for judging whether entries represented by the entry information to be processed conform to the semantics of the sentences to be processed or not based on the sentences to be processed and the entry information to be processed.
17. The apparatus according to claim 16, wherein the generating module is further configured to obtain third training data according to the first training data, the term information, and the indication information, the third training data includes the first training data, the term information, and the indication information, and the third training data is used to train a natural language understanding model, wherein the natural language understanding model is used to obtain an intention of a user to input the sentence to be understood and key information of the intention based on the sentence to be understood, first auxiliary information and second auxiliary information, the first auxiliary information is used to indicate a term included in the sentence to be understood, and the second auxiliary information is used to indicate whether the term represented by the first auxiliary information conforms to semantics of the sentence to be understood.
18. The apparatus according to claim 16 or 17, wherein the obtaining module is specifically configured to:
and querying the vocabulary entry contained in the sentence from a dictionary to obtain the vocabulary entry information, wherein the dictionary comprises at least one vocabulary entry.
19. An apparatus for implementing natural language understanding in a human-computer interaction system, comprising: a processor coupled with a memory;
the memory is to store instructions;
the processor is configured to execute instructions stored in the memory to cause the apparatus to implement the method of any of claims 1 to 6.
20. A model training apparatus, comprising: a processor coupled with a memory;
the memory is to store instructions;
the processor is configured to execute instructions stored in the memory to cause the apparatus to implement the method of any of claims 7 to 9.
21. A computer-readable medium comprising instructions that, when executed on a processor, cause the processor to implement the method of any one of claims 1 to 9.
22. A method for realizing natural language understanding in a human-computer interaction system is characterized by comprising the following steps:
acquiring a plurality of target entries matched with target sentences in one or more dictionaries, wherein the target sentences are sentences input to the human-computer interaction system by a user;
acquiring a plurality of sequences of the target statement according to a plurality of target entries, wherein each sequence in the plurality of sequences corresponds to one target entry;
acquiring a plurality of first sequence representations corresponding to the plurality of sequences according to the plurality of sequences and the target statement;
determining whether each of the plurality of target terms conforms to semantics of the target sentence according to the plurality of first sequence representations;
and performing natural language understanding processing on the target sentence according to one or more first sequence representations corresponding to the target entries conforming to the semantics of the target sentence to obtain a processing result.
23. The method of claim 22, wherein each of the sequences comprises type information of the target entry and position information of the target entry in the target sentence.
24. The method of claim 22 or 23, wherein said obtaining a plurality of first sequence representations corresponding to the plurality of sequences from the plurality of sequences and the target sentence comprises:
obtaining a low-dimensional representation of each of the plurality of sequences;
obtaining a context representation of the target statement;
fusing the low-dimensional representation of each sequence of the plurality of sequences with the context representation of the target sentence to obtain a plurality of first sequence representations corresponding to the plurality of sequences.
25. The method of any of claims 22 to 24, wherein said determining from the first plurality of sequential representations whether each of the plurality of target terms conforms to the semantics of the target sentence comprises:
obtaining attention of a second sequence representation to other first sequence representations except the second sequence representation in the plurality of first sequence representations, wherein the second sequence representation is one of the plurality of first sequence representations;
and determining whether the target entry corresponding to the second sequence representation conforms to the semantics of the target sentence according to the second sequence representation and the attention.
26. The method according to any one of claims 22 to 25, wherein said natural language understanding processing of the target sentence according to the first sequence representation corresponding to one or more target terms conforming to the semantics of the target sentence comprises:
obtaining a context representation of the target statement;
fusing the first sequence representation corresponding to each target entry conforming to the semantics of the target sentence with the context representation of the target sentence to obtain one or more third sequence representations corresponding to the one or more target entries conforming to the semantics of the target sentence;
performing natural language understanding processing on the target sentence according to the one or more third sequence representations.
27. A method of training a neural network model, the neural network model comprising a first sub-network model, a second sub-network model, and a third sub-network model, the method comprising:
obtaining first training data, the first training data comprising a training sentence and a plurality of first sequence representations of the training sentence matching a target dictionary;
training the first sub-network model according to the first training data to obtain a trained first sub-network model;
acquiring second training data, wherein the second training data comprises an output result of the trained first sub-network model and a first sequence representation meeting preset requirements in the plurality of first sequence representations;
training the second sub-network model according to the second training data to obtain a trained second sub-network model;
acquiring third training data, wherein the third training data comprises an output result of the trained second sub-network model and a processing result of natural language understanding processing of the training sentences;
and training the third sub-network model according to the third training data to obtain a trained third sub-network model.
28. The method of claim 27, wherein the first sub-network model is a term matching model for obtaining a plurality of sequence representations matching a target dictionary according to the target sentence input by a user and the target dictionary, each sequence representation in the plurality of sequence representations corresponds to a target term, and the target term is a term matching the target sentence with the target dictionary;
the second sub-network model is an entry disambiguation model used for determining whether a target entry corresponding to each sequence representation in the sequence representations conforms to the semantics of the target sentence according to the target sentence and the sequence representations;
and the third sub-network model is a natural language understanding model and is used for performing natural language understanding processing on the target statement according to the sequence representation corresponding to the target entry conforming to the target statement semantics.
29. The method according to claim 27 or 28, wherein each of the plurality of first sequence representations corresponds to a first target entry, the first target entry is an entry of the training sentence matched with the target dictionary, and the first sequence representation satisfying the preset requirement is a sequence representation corresponding to the first target entry conforming to the semantic meaning of the training sentence.
30. An apparatus for implementing natural language understanding in a human-computer interaction system, comprising:
the acquisition module is used for acquiring a plurality of target entries matched with target sentences in one or more dictionaries, wherein the target sentences are sentences input to the human-computer interaction system by a user;
the processing module is used for acquiring a plurality of sequences of the target statement according to a plurality of target entries, wherein each sequence in the plurality of sequences corresponds to one target entry;
the processing module is further configured to obtain a plurality of first sequence representations corresponding to the plurality of sequences according to the plurality of sequences and the target sentence;
the processing module is further configured to determine whether each target entry of the plurality of target entries conforms to the semantics of the target sentence according to the plurality of first sequence representations;
the processing module is further configured to perform natural language understanding processing on the target sentence according to the first sequence representation corresponding to one or more target entries conforming to the semantics of the target sentence, so as to obtain a processing result.
31. The apparatus of claim 30, wherein each of the sequences comprises type information of the target entry and position information of the target entry in the target sentence.
32. The apparatus of claim 30 or 31, wherein the processing module obtains a plurality of first sequence representations corresponding to the plurality of sequences according to the plurality of sequences and the target sentence, and comprises:
obtaining a low-dimensional representation of each of the plurality of sequences;
obtaining a context representation of the target statement;
fusing the low-dimensional representation of each sequence of the plurality of sequences with the context representation of the target sentence to obtain a plurality of first sequence representations corresponding to the plurality of sequences.
33. The apparatus of any of claims 30 to 32, wherein the processing module determines whether each of the plurality of target terms conforms to the semantics of the target sentence according to the first plurality of sequential representations, comprising:
obtaining attention of a second sequence representation to other first sequence representations except the second sequence representation in the plurality of first sequence representations, wherein the second sequence representation is one of the plurality of first sequence representations;
and determining whether the target entry corresponding to the second sequence representation conforms to the semantics of the target sentence according to the second sequence representation and the attention.
34. The apparatus according to any one of claims 30 to 33, wherein the processing module performs natural language understanding processing on the target sentence according to one or more first sequence representations corresponding to target terms that conform to the semantics of the target sentence, including:
obtaining a context representation of the target statement;
fusing the first sequence representation corresponding to each target entry conforming to the semantics of the target sentence with the context representation of the target sentence to obtain one or more third sequence representations corresponding to the one or more target entries conforming to the semantics of the target sentence;
performing natural language understanding processing on the target sentence according to the one or more third sequence representations.
35. An apparatus for training a neural network model, the neural network model including a first sub-network model, a second sub-network model, and a third sub-network model, the apparatus comprising:
an acquisition module configured to acquire first training data, the first training data including a training sentence and a plurality of first sequence representations of the training sentence matching a target dictionary;
the training module is used for training the first sub-network model according to the first training data to obtain a trained first sub-network model;
the obtaining module is further configured to obtain second training data, where the second training data includes an output result of the trained first sub-network model and a first sequence representation that meets a preset requirement in the plurality of first sequence representations;
the training module is further configured to train the second sub-network model according to the second training data to obtain a trained second sub-network model;
the obtaining module is further configured to obtain third training data, where the third training data includes an output result of the trained second sub-network model and a processing result of the training sentence for natural language understanding processing;
the training module is further configured to train the third sub-network model according to the third training data to obtain a trained third sub-network model.
36. The apparatus of claim 35, wherein the first sub-network model is a term matching model, and is configured to obtain a plurality of sequence representations matching a target dictionary according to the target sentence input by a user and the target dictionary, each sequence representation in the plurality of sequence representations corresponds to a target term, and the target term is a term matching the target sentence with the target dictionary;
the second sub-network model is an entry disambiguation model used for determining whether a target entry corresponding to each sequence representation in the sequence representations conforms to the semantics of the target sentence according to the target sentence and the sequence representations;
and the third sub-network model is a natural language understanding model and is used for performing natural language understanding processing on the target statement according to the sequence representation corresponding to the target entry conforming to the target statement semantics.
37. The apparatus according to claim 35 or 36, wherein each of the plurality of first sequence representations corresponds to a first target entry, the first target entry is an entry matched with the target dictionary in the training sentence, and the first sequence representation satisfying the preset requirement is a sequence representation corresponding to the first target entry conforming to the semantic meaning of the training sentence.
38. An apparatus for implementing natural language understanding in a human-computer interaction system, comprising: a processor coupled with a memory;
the memory is to store instructions;
the processor is configured to execute instructions stored in the memory to cause the apparatus to implement the method of any of claims 22 to 26.
39. A model training apparatus, comprising: a processor coupled with a memory;
the memory is to store instructions;
the processor is configured to execute instructions stored in the memory to cause the apparatus to implement the method of any of claims 27 to 29.
40. A computer-readable medium comprising instructions that, when executed on a processor, cause the processor to implement the method of any one of claims 22 to 26 or 27 to 29.
CN202011565278.5A 2020-05-20 2020-12-25 Method and device for realizing natural language understanding in man-machine interaction system Active CN112632962B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2020104292451 2020-05-20
CN202010429245.1A CN111737972A (en) 2020-05-20 2020-05-20 Method and device for realizing natural language understanding in human-computer interaction system

Publications (2)

Publication Number Publication Date
CN112632962A true CN112632962A (en) 2021-04-09
CN112632962B CN112632962B (en) 2023-11-17

Family

ID=72647474

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010429245.1A Pending CN111737972A (en) 2020-05-20 2020-05-20 Method and device for realizing natural language understanding in human-computer interaction system
CN202011565278.5A Active CN112632962B (en) 2020-05-20 2020-12-25 Method and device for realizing natural language understanding in man-machine interaction system

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202010429245.1A Pending CN111737972A (en) 2020-05-20 2020-05-20 Method and device for realizing natural language understanding in human-computer interaction system

Country Status (1)

Country Link
CN (2) CN111737972A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106383875A (en) * 2016-09-09 2017-02-08 北京百度网讯科技有限公司 Artificial intelligence-based man-machine interaction method and device
CN109726385A (en) * 2017-10-31 2019-05-07 株式会社Ntt都科摩 Word sense disambiguation method and device, and word sense expansion method and device
CN108932945A (en) * 2018-03-21 2018-12-04 北京猎户星空科技有限公司 Voice instruction processing method and device
CN108664472A (en) * 2018-05-08 2018-10-16 腾讯科技(深圳)有限公司 Natural language processing method, apparatus and device
CN108920497A (en) * 2018-05-23 2018-11-30 北京奇艺世纪科技有限公司 Man-machine interaction method and device
CN109948149A (en) * 2019-02-28 2019-06-28 腾讯科技(深圳)有限公司 File classification method and device
CN110334344A (en) * 2019-06-13 2019-10-15 腾讯科技(深圳)有限公司 Semantic intent recognition method, apparatus, device and storage medium
CN110309514A (en) * 2019-07-09 2019-10-08 北京金山数字娱乐科技有限公司 Semantic recognition method and device
CN110442697A (en) * 2019-08-06 2019-11-12 上海灵羚科技有限公司 Man-machine interaction method, system, computer device and storage medium
CN111046653A (en) * 2019-11-14 2020-04-21 深圳市优必选科技股份有限公司 Sentence recognition method, sentence recognition device and intelligent device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen Zhenrui: "Research on Multi-Sense Word Vector Representation and Semantic Similarity Calculation", China Master's Theses Full-text Database, Information Science and Technology, no. 03, pages 138-1464 *

Also Published As

Publication number Publication date
CN111737972A (en) 2020-10-02
CN112632962B (en) 2023-11-17

Similar Documents

Publication Publication Date Title
CN112685565B (en) Text classification method based on multimodal information fusion and related devices
CN107679039B (en) Method and device for determining statement intention
CN112100349B (en) Multi-turn dialogue method and device, electronic device and storage medium
CN106897428B (en) Text classification feature extraction method and text classification method and device
CN109165302B (en) Multimedia file recommendation method and device
KR102288249B1 (en) Information processing method, terminal, and computer storage medium
CN104735468B (en) Method and system for synthesizing images into new videos based on semantic analysis
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN112749326B (en) Information processing method, information processing device, computer equipment and storage medium
CN112131449A (en) Implementation method of cultural resource cascade query interface based on Elasticsearch
CN113095080B (en) Theme-based semantic recognition method and device, electronic equipment and storage medium
CN113688951B (en) Video data processing method and device
CN108304376B (en) Text vector determination method and device, storage medium and electronic device
CN112528658B (en) Hierarchical classification method, hierarchical classification device, electronic equipment and storage medium
CN112015928A (en) Information extraction method and device of multimedia resource, electronic equipment and storage medium
CN112052333A (en) Text classification method and device, storage medium and electronic equipment
JP2022518645A (en) Video distribution timeliness determination method and device
US11036996B2 (en) Method and apparatus for determining (raw) video materials for news
CN112712056A (en) Video semantic analysis method and device, storage medium and electronic equipment
CN113011126A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN115062150B (en) Text classification method and device, electronic equipment and storage medium
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
CN116956117A (en) Method, device, equipment, storage medium and program product for identifying label
CN112632962B (en) Method and device for realizing natural language understanding in man-machine interaction system
CN115129902A (en) Media data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant