CN113936660A - Intelligent speech understanding system with multiple speech understanding engines and intelligent speech interaction method


Info

Publication number
CN113936660A
Authority
CN
China
Prior art keywords
speech
understanding
engine
voice
natural language
Prior art date
Legal status
Pending
Application number
CN202111201895.1A
Other languages
Chinese (zh)
Inventor
武晓梅
Current Assignee
Shuimu Think Tank Beijing Technology Co ltd
Original Assignee
Shuimu Think Tank Beijing Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shuimu Think Tank Beijing Technology Co ltd
Priority to CN202111201895.1A
Publication of CN113936660A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics
    • G10L2015/228 Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics of application context

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

An intelligent speech understanding system having a plurality of speech understanding engines, and an intelligent speech interaction method. The intelligent speech understanding system comprises a first speech understanding engine that processes speech in a non-transcription manner, a second speech understanding engine that processes speech in a transcription manner, and an understanding result judging unit. The speech processing unit of the first speech understanding engine processes the speech into speech data in coded sequence form, and its natural language understanding unit obtains the intent corresponding to the speech from the coded sequence data through a natural language understanding model. The speech processing unit of the second speech understanding engine transcribes the speech into speech data in text form, and its natural language understanding unit obtains the intent corresponding to the speech from the text data through a natural language understanding model. The understanding result judging unit determines the intent corresponding to the speech from the understanding results of the two speech understanding engines.

Description

Intelligent speech understanding system with multiple speech understanding engines and intelligent speech interaction method
Technical Field
The invention relates to intelligent speech processing technology, and in particular to an intelligent speech understanding system with multiple speech understanding engines and an intelligent speech interaction method.
Background
Current intelligent speech processing schemes implement natural language understanding by performing semantic analysis on a text sequence obtained through speech recognition. Specifically, the speech is first transcribed into text, and the text sequence is then analyzed by keyword matching, in combination with context or a knowledge graph, or by means of deep learning, to obtain the meaning (semantics) expressed by the speech. This approach depends heavily on the accuracy of the speech transcription.
Speech transcription technology itself has significant limitations. The accuracy of speech recognition is affected by many factors, and no single unified recognition model can cover them all. For the same word, phrase, or sentence, the voices of different speakers differ in volume, pitch, duration, and so on because of the speakers' vocal characteristics, speaking habits, speaking scene, context, and real-time emotion. External factors such as the distance between the speaker and the sound receiver (microphone), background noise, several people speaking at once (the cocktail party problem), and the bandwidth of the transmission channel (e.g., telephone speech) further increase the complexity and difficulty of recognition. In particular, when a speaker's pronunciation of certain characters or words is similar or identical to the standard pronunciation of other characters or words, recognition deviations or errors readily occur.
For Chinese, the presence of a large number of polyphonic characters makes a high recognition rate even harder to achieve. Although current speech recognition technology can train dedicated recognition models for specific dialects, specific accents, and even specific speakers, these models cannot be unified or switched automatically, so the problem of inaccurate recognition caused by pronunciation heterogeneity cannot be fundamentally solved.
Current speech processing schemes include only one speech understanding engine and use one transcription unit/transcription model. For such schemes the transcription accuracy determines the accuracy of semantic understanding and the transcription result is unique, so only a single transcription model with high accuracy can be selected, for example a model optimized for a certain vertical domain, dialect, or accent. However, in a speech interaction scenario in which accents or dialects cannot be predicted, a better transcription model cannot be selected in advance or switched to automatically; when the transcription model's accuracy drops, the speech understanding effect degrades with it. In addition, optimizing a transcription model requires substantial training data and labor cost.
Disclosure of Invention
According to an aspect of the present invention, there is provided an intelligent speech understanding system, comprising: a first speech understanding engine that processes speech in a non-transcription manner, a second speech understanding engine that processes speech in a transcription manner, and an understanding result judging unit; wherein the first speech understanding engine comprises a speech processing unit and a natural language understanding unit, the speech processing unit of the first speech understanding engine processes the speech to obtain speech data in coded sequence form, and the natural language understanding unit of the first speech understanding engine obtains the intent corresponding to the speech, based on the speech data in coded sequence form, through a natural language understanding model; the second speech understanding engine comprises a speech processing unit and a natural language understanding unit, the speech processing unit of the second speech understanding engine transcribes the speech to obtain speech data in text form, and the natural language understanding unit of the second speech understanding engine obtains the intent corresponding to the speech, based on the speech data in text form, through a natural language understanding model; and the understanding result judging unit determines the intent corresponding to the speech from the understanding result of the first speech understanding engine and the understanding result of the second speech understanding engine.
According to the intelligent speech understanding system of an embodiment of the present invention, optionally, the understanding result of the first speech understanding engine includes a confidence that the speech corresponds to a certain pragmatic information classification node, the understanding result of the second speech understanding engine includes a confidence that the same speech corresponds to a certain pragmatic information classification node, and the understanding result judging unit obtains the speech understanding result of the system according to the threshold set for the first speech understanding engine on the pragmatic information classification node and the threshold set for the second speech understanding engine on that node.
According to the intelligent speech understanding system of an embodiment of the present invention, optionally, the confidence is the probability that the speech corresponds to a pragmatic information classification node of the current speech interaction layer.
According to the intelligent speech understanding system of the embodiment of the present invention, optionally, the first speech understanding engine and the second speech understanding engine perform speech understanding based on the same hierarchically arranged pragmatic information classification nodes.
According to the intelligent speech understanding system of an embodiment of the present invention, optionally, the natural language understanding unit of the first speech understanding engine generates its speech understanding model using paired data of speech data in coded sequence form and pragmatic information classification nodes, the natural language understanding unit of the second speech understanding engine generates its speech understanding model using paired data of speech data in text form and pragmatic information classification nodes, and each of the two natural language understanding units trains its speech understanding model by selecting the pragmatic information classification nodes of one layer of the current speech interaction, of several layers of the current speech interaction, or of all layers of the current speech interaction.
According to the intelligent speech understanding system of an embodiment of the present invention, optionally, the natural language understanding unit of the first speech understanding engine trains its speech understanding model by selecting the speech, or the speech data in coded sequence form, corresponding to the pragmatic information classification nodes collected at one layer of the current speech interaction, or at several or all layers of the current speech interaction, or by selecting training data of the same pragmatic information classification nodes from speech interactions other than the current one;
and the natural language understanding unit of the second speech understanding engine trains its speech understanding model by selecting the speech data in text form corresponding to the pragmatic information classification nodes collected at one layer of the current speech interaction, or at several or all layers of the current speech interaction, or by selecting training data of the same pragmatic information classification nodes from speech interactions other than the current one.
According to the intelligent speech understanding system of an embodiment of the present invention, optionally, obtaining the speech understanding result of the system by the understanding result judging unit includes weighting the understanding result of the first speech understanding engine and the understanding result of the second speech understanding engine.
According to the intelligent speech understanding system of an embodiment of the present invention, optionally, obtaining the speech understanding result of the system by the understanding result judging unit includes weighting the understanding result of the first speech understanding engine and the understanding result of the second speech understanding engine according to the speech length.
According to the intelligent speech understanding system of an embodiment of the present invention, optionally, the intelligent speech understanding system includes an understanding result synchronizing unit, so that the understanding result judging unit determines the speech understanding result of the system from the understanding results obtained by the plurality of speech understanding engines for the same speech.
According to the intelligent speech understanding system of an embodiment of the present invention, optionally, the intelligent speech understanding system records, as entity word information, the speech data in text form corresponding to a pragmatic information classification node that is obtained by the transcription of the second speech understanding engine.
According to another aspect of the present invention, an intelligent voice interaction method is provided, the method comprising: receiving speech; the first speech understanding engine processing the speech to obtain speech data in coded sequence form and understanding the coded sequence data through a natural language understanding model; the second speech understanding engine transcribing the speech to obtain speech data in text form and understanding the text data through a natural language understanding model; judging the intent corresponding to the speech from the understanding result of the first speech understanding engine and the understanding result of the second speech understanding engine; and making a response corresponding to the intent.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description only relate to some embodiments of the present invention and are not limiting on the present invention.
FIG. 1 illustrates an example of an intelligent speech understanding system having multiple speech understanding engines, according to an embodiment of the present invention;
FIG. 2 illustrates an example of a natural language understanding model generation method of an intelligent speech understanding system according to an embodiment of the present invention;
FIG. 3 illustrates an example of an intelligent voice interaction method of an intelligent voice understanding system according to an embodiment of the present invention;
FIG. 4 shows an example of preset pragmatic information classification nodes.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs. The use of "first," "second," and similar terms in the description and claims of the present application do not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. Also, the use of the terms "a" or "an" and the like do not denote a limitation of quantity, but rather denote the presence of at least one.
The intelligent speech understanding system with a plurality of speech understanding engines according to embodiments of the present invention comprises at least one speech understanding engine that processes speech in a non-transcription manner, and the system determines the speech understanding result used for the voice interaction from the understanding results of the individual speech understanding engines.
FIG. 1 illustrates an example of an intelligent speech understanding system having multiple speech understanding engines, according to an embodiment of the present invention. As shown in fig. 1, the first speech understanding engine includes a speech processing unit 101 and a natural language understanding unit 102; the second speech understanding engine comprises a speech processing unit 104 and a natural language understanding unit 105. The first speech understanding engine including the speech processing unit 101 and the natural language understanding unit 102 is a speech understanding engine that processes speech in a non-transcription manner.
The speech processing unit 101 processes speech to obtain speech data in the form of a coded sequence. The speech may be stored speech or real-time speech from a voice interaction, or training speech from a corpus database.
Optionally, the speech is processed in the form of audio samples. For telephone speech, for example, audio sample data is obtained at 8000 Hz, 16-bit, mono. The 8 kHz sampling rate of telephone speech is constrained by the 4 kHz bandwidth of the telephone channel; that 4 kHz limit was imposed to save bandwidth, keeping only the narrow 300-3400 Hz band of the speech spectrum needed to distinguish speech signals. With 4G and even 5G mobile networks, the 4 kHz narrow-band limit is gradually becoming obsolete, and considering that the human ear can hear roughly 16 Hz to 20,000 Hz, higher sampling rates such as 16 kHz or 48 kHz may be used to increase the fidelity of audio (speech) sampling and reduce the loss of speech information due to sampling. The audio sampling parameters may include the sampling rate, bit depth, mono/stereo (typically mono), and sampling duration. The sampling duration may be set according to the interaction scenario and network resources, e.g., 8, 10, 15, or 20 seconds. Optionally, after sampling, the audio signal is processed with Voice Activity Detection (VAD) to detect silent portions in a noisy environment, so that the speech portion can be identified and cut out to obtain valid audio sample data.
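For illustration only (not part of the claimed subject matter), the following is a minimal sketch of such a sampling-plus-VAD front end. It assumes 16-bit mono WAV input and stands in for the VAD with a simple frame-energy rule; the file name, frame size, and energy threshold are arbitrary assumptions.

import wave
import numpy as np

def read_pcm(path):
    """Read a mono 16-bit WAV file and return its samples and sample rate."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1 and w.getsampwidth() == 2
        rate = w.getframerate()           # e.g. 8000 Hz for telephone audio
        pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return pcm, rate

def trim_silence(pcm, rate, frame_ms=20, threshold=500.0):
    """Energy-based stand-in for VAD: keep frames whose RMS exceeds a threshold."""
    frame_len = rate * frame_ms // 1000
    n_frames = len(pcm) // frame_len
    frames = pcm[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames.astype(np.float64) ** 2).mean(axis=1))
    voiced = frames[rms > threshold]      # drop silent frames
    return voiced.reshape(-1)

pcm, rate = read_pcm("utterance.wav")     # hypothetical input file
speech = trim_silence(pcm, rate)          # valid audio sample data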
The audio sample data is then encoded, for example using phonemes as the coding unit, so that a phoneme sequence can be obtained from the audio sample data with an acoustic model. For example, the English International Phonetic Alphabet has 48 phonemes, 20 vowels and 28 consonants; modern Chinese pinyin (the Latin scheme for the phonetic notation of Chinese characters, ISO 7098) has 32 phonemes. A language model may also be combined so that phonemes and phoneme combinations serve as coding units, for example producing a syllable sequence as the encoding result; the encoding may also suit particular language features, such as the initials and finals of Chinese. Furthermore, a mixed coding mode can be adopted for different languages: if the coding rules include both English syllables and Chinese initials/finals, multi-language mixed speech can be encoded. For example, consider the utterance "to save money I live in a motel in the suburbs", spoken in Chinese with the embedded English word "motel". If only Chinese initials and finals are used, it is encoded as "w-ei-l-e-sh-eng-q-i-an-w-o-zh-u-z-ai-j-i-ao-q-u-d-e-m-ou-t-ai-l-i", where the English "motel" is approximated by the pinyin "m-ou-t-ai-l" according to its pronunciation; the difference between Chinese and English pronunciation syllables is visible here. If Chinese initials/finals are mixed with English syllables, the utterance can instead be encoded into a mixed sequence in which "motel" keeps its English syllables (the original shows this sequence as an image), which is closer to the real speech and loses less speech information.
English is also a syllabic language, and Chinese characters are monosyllabic; but if Chinese were encoded with Chinese syllables and English with English syllables, the difference between the syllable inventories of the two languages (about 410 syllables in Chinese; two to three thousand in English) and between their language models would cause large deviations in the coding result. For instance, if a deviation occurs in syllable recognition, the utterance might be encoded as "wei-le-sheng-wo-zhu-zai-jiao-qu-de-mou-tai-li", which differs from the actual speech more than the mixed coding above (shown as an image in the original) does. This differentiated mixed coding can reduce the error rate of syllable recognition. Further coding dimensions can be added according to the characteristics of each language. Chinese, for example, is tonal and can be coded with syllable-plus-tone combinations: Mandarin has only about 410 syllables, of which only about 160 combine with all four tones while the others combine with only some tones, giving roughly 1250 tone-syllable combinations apart from some special syllables and dialectal syllables. English has no tones, but stress can be added as a coding dimension.
The above processing of speech yields speech data in the form of a coded sequence. The coding symbols may be phonemes, syllables, or a combination of phonemes and syllables; the phonemes or syllables may come from different languages, and may even be custom phoneme-level or syllable-level symbols independent of any particular language.
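As a rough illustration of initial/final coding, the sketch below splits pinyin syllables into initials and finals. It is not the patent's encoder: a real system derives the syllables from an acoustic model rather than from given pinyin strings, and splitting conventions (e.g., "q-ian" versus "q-i-an") vary.

# Initials of Hanyu Pinyin, longest first so "zh" matches before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(syl):
    """Split one pinyin syllable into (initial, final); zero-initial
    syllables such as 'an' yield ('', 'an')."""
    for ini in INITIALS:
        if syl.startswith(ini):
            return ini, syl[len(ini):]
    return "", syl

def encode(pinyin_syllables):
    """Flatten a syllable list into an initial/final coding sequence."""
    seq = []
    for syl in pinyin_syllables:
        ini, fin = split_syllable(syl)
        seq.extend(s for s in (ini, fin) if s)
    return seq

# "wei le sheng qian" -> ['w', 'ei', 'l', 'e', 'sh', 'eng', 'q', 'ian']
print(encode(["wei", "le", "sheng", "qian"]))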
The natural language understanding unit 102 obtains semantics corresponding to the speech through a natural language understanding model.
With a speech understanding engine that processes speech in a non-transcription manner, the natural language understanding model can be generated on the basis of pragmatic information, and intelligent voice interaction can be achieved.
According to information science, or generalized information theory, any information can be divided into three levels: syntactic information, semantic information, and pragmatic information. Syntax, semantics, and pragmatics are terms borrowed from semiotics. Syntax concerns the relationships between symbols, i.e., the rules of written language, spoken language, computer coding languages, or behavioral language; semantics concerns the relationship between symbols and entities, i.e., the content and meaning the symbols express; pragmatics concerns the relationship between symbols and their users, i.e., the utility to the user of the content and meaning the symbols express. Syntactic information, also called technical information, studies the formal relationships among symbols and their encoding, transmission, repetition, and reproduction, without regard to their actual meaning or utility. Syntactic information is objective in nature and is the most basic level of information. Semantic information, by contrast, indicates the meaning of the content of the information as well as its authenticity, accuracy, and logical reliability. Understanding the factual material (syntactic information), relating it to the entities it represents, and analyzing and comparing it to form concepts yields semantic information. It is a statement or representation of the state of motion of an object, whose purpose is to ensure that the recipient obtains the actual content of the information. Pragmatic information refers to the utility of the information to the receiver, i.e., its usefulness and value.
In short, syntactic information is the structure of the symbols output by the source, an expression of their objective characteristics, unrelated to the subjective needs of the sink; semantic information considers the meaning of the symbols (or symbol sets); pragmatic information is the information useful to the sink. The three levels are interrelated. The same meaning can be expressed in different languages or words, and the syntactic information contained in the various expressions can differ; and an expression often conveys content of which only a part is useful to the recipient of the message. Thus, in general, the information rate of pragmatic information is less than that of semantic information, which in turn is less than that of syntactic information.
Narrow-sense information theory uses mathematical statistics to study information processing and transmission; it studies the common laws of information transfer in communication and control systems and how to improve the effectiveness and reliability of such systems. The speech recognition technology described above is based on narrow-sense information theory: it first attempts to recover, from the received speech, the symbols (characters) output by the source, which is established at the syntactic information level, and then performs semantic analysis to obtain semantic information. As described above, the speech recognition result is uncertain because of differences at the source (speaker) and the constraints of the channel (sound transmission path and medium), so the accuracy of semantic understanding suffers.
According to Shannon's information theory, information is an increase in certainty. That is, the amount of information does not depend on the length of the message but on how much uncertainty it removes. It follows that the speech recognition process described above adds uncertainty and thus reduces the amount of information, and may even lose the information originally transmitted by the source.
To solve the above problems, the present invention combines the narrow-sense and generalized information theories above and provides a novel speech processing method based on pragmatic information.
First, consider the relationship between speech, text, and information. Speech predates text and is the earlier information carrier. Under ancient technical conditions speech could not be transmitted over distance or stored, so writing was invented. As an information carrier, text, compared with speech, can be transmitted remotely and stored. Moreover, encoding (writing and composing) can shrink the channel bandwidth required to transmit information carried by text; that is, encoding can increase the amount of information a unit of text carries, and can also encrypt the information. Text thus became the most important carrier for recording civilization. Owing to this path dependence on text as the information carrier, current natural speech processing adopts the technical path of first recognizing text from speech and then analyzing the information carried by the text.
Under modern technology, however, long-distance transmission and storage of speech are possible, and with the development of communication and network technology, bandwidth and cost are no longer bottlenecks for speech transmission; people can interact with ever clearer speech. Under such conditions, when the information sent by the source is carried by speech, it is not necessary to recognize text from the speech: the information can be obtained directly at the sink (the receiving end) by analyzing the received speech.
Further, consider the amount of information. As mentioned, the amount of information depends not on the length of the message but on how much uncertainty it removes. For the sink, the amount of information depends on the pragmatic information rather than the semantic information. As a simple example, suppose the source utters the speech or text "yes", whose semantics is affirmation. If the sink wants to know what day of the week it is, this semantic information removes no uncertainty about the question, so its information amount is zero. But if the sink knows that the "yes" is an answer to "is today Tuesday", the message carries the pragmatic information "confirms that today is Tuesday", which reduces the uncertainty about the day of the week (since what the source says may not be true, the uncertainty cannot be completely eliminated). As another example, asked to repay a debt, the source may answer that he is getting married next month, from which the sink obtains the pragmatic information that the source does not want to repay immediately: the expression is informative with respect to what the sink cares about, namely whether the speaker is willing to repay the debt immediately. Pragmatic information may therefore not be contained in the semantic information at all, or may require the receiver's subjective judgment to extract; moreover, for the same message, different receivers, or the same receiver in different contexts, may obtain different pragmatic information and different amounts of information. From the standpoint of the purpose of information interaction, the more pragmatic information and the larger the information amount the receiver obtains, the more effective the interaction.
Based on the above analysis of information, information carriers (speech and text), and information amount, the present invention provides an intelligent speech processing method based on pragmatic information, including a natural language understanding model generation/training method and an intelligent voice interaction method.
Fig. 2 illustrates an example of a natural language understanding model generation method according to an embodiment of the present invention. As shown in fig. 2, in step S201, pragmatic information classification nodes are preset. Fig. 4 shows an example of preset pragmatic information classification nodes. The illustrative example of fig. 4 adopts a multi-layer voice interaction structure comprising layers A, B, and C. The A-layer structure contains five preset pragmatic information classification nodes, A01 through A05, corresponding to five classes of pragmatic information. The pragmatic information classes are set by the receiver based on the interaction requirements. In fig. 4, for example, the A-layer interaction is identity confirmation and may include five pragmatic information classes: A01 (is the person), A02 (asks to repeat), A03 (not the person), A04 (complaint tendency), and A05 (inconvenient to talk). That is, the receiver needs to know to which of these five classes the information carried by the speech from the source belongs. According to the information interaction requirements, the preset classification may comprise a single layer or multiple layers, with pragmatic information classification nodes set in each layer; the same node may be set in different layers.
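One possible in-memory representation of such a layered node structure is sketched below; only the A-layer nodes and B01 are named in this document, so the remaining entries are placeholders.

# Hypothetical encoding of the layered pragmatic-information classification
# nodes of FIG. 4; the B- and C-layer entries are placeholders.
PRAGMATIC_NODES = {
    "A": {                                  # layer A: identity confirmation
        "A01": "is the person",
        "A02": "asks to repeat",
        "A03": "not the person",
        "A04": "complaint tendency",
        "A05": "inconvenient to talk",
    },
    "B": {"B01": "explicit need"},          # only B01 is named in the text
    "C": {},                                # layer C nodes are not enumerated
}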
In step S202, speech is processed to obtain speech data in the form of a coded sequence. Step S202 may be performed by the speech processing unit 101. The speech processing unit 101 processes the speech by the above-described method to obtain speech data in the form of a coded sequence.
The natural language understanding model generation method does not require a particular order for steps S201 and S202; either step may precede the other.
In step S203, the speech data in coded sequence form is associated with a pragmatic information classification node. The purpose of this association is to form training corpus pairs for training the natural language understanding model, i.e., data pairs of speech data in coded sequence form and pragmatic information classification nodes. When training the model, these data pairs serve as the input training data. Since speech data in coded sequence form may not be human-readable, a feasible way to perform the association manually is for a person to listen to the speech before encoding, understand the pragmatic information it contains, and associate the speech with the corresponding pragmatic information classification node, i.e., manual labeling, thereby obtaining the paired data of coded-sequence speech data and classification nodes.
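A minimal sketch of such training pairs might look as follows; the coded sequences and labels are invented for illustration and are not from the patent's corpus.

from typing import List, Tuple

# One training example: (coded speech sequence, pragmatic node label).
CorpusPair = Tuple[List[str], str]

def make_pair(coded_seq: List[str], node: str) -> CorpusPair:
    """Pair an encoded utterance with its manually labeled node."""
    return (coded_seq, node)

corpus: List[CorpusPair] = [
    make_pair(["sh", "i", "d", "e"], "A01"),      # "shi de": yes, it's me
    make_pair(["n", "i", "sh", "u", "o"], "A02"),  # asks to repeat
]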
In step S204, the natural language understanding model is generated using the paired data of speech data in coded sequence form and pragmatic information classification nodes. The generation may be performed by the natural language understanding unit 102. The number of data pairs used for training may be small, a handful, or very large. The informativeness of the training pairs depends on the degree of difference among the coded-sequence speech data, that is, on the differences between the coded sequences that the speech processing unit 101 obtains from different utterances corresponding to the same pragmatic information classification node. Generally, the more differentiated coded-sequence speech data there is, the better the natural language understanding model trains and the higher the understanding accuracy; in other words, the more real speech conditions the corpus covers, the more speech the natural language understanding model can understand.
It should be noted that, because the natural language understanding unit 102 assumes by default that the training pairs are correct, the training effect suffers if two pieces of speech data yield identical or similar coded sequences but are paired with different pragmatic information classification nodes. This can happen because the association between speech data and node is wrong, which from the standpoint of manual labeling amounts to "telling" the machine a wrong understanding result or intent for that speech; it can also happen because the two utterances really are very close yet correspond to different understandings, as may occur between different dialects. The effect on training is, specifically, that when a similar utterance is recognized, the recognition accuracy for the implicated pragmatic information classification nodes decreases and the confidence values drop. For example, if two similar utterances are labeled A01 (is the person) and A03 (not the person) respectively, the model's confidence for both nodes is depressed when a similar utterance is later understood. When training the model, a large amount of correctly labeled paired data with similar speech can be added to dilute the influence of individual mislabeled training data. In general, the similarity between utterances decreases as their length increases, so such labeling errors affect the understanding of short utterances more than that of long ones.
The aim of speech understanding is, for a given piece of speech, to automatically judge the corresponding pragmatic information classification node. For example, in an intelligent voice interaction scenario based on a call center, when the robot asks whether the user is the owner of the called mobile number (corresponding to the layer-A identity inquiry in fig. 4), the called user may answer in many ways, and the user's speech is understood as the intent corresponding to one of the pragmatic information classification nodes A01 to A05. When the natural language understanding model performs understanding, it calculates the confidence of the speech with respect to each pragmatic information classification node and then determines the understanding result from these confidence values. According to how the model will be used, the pragmatic information classification nodes and the corresponding training data can be selected at training time: the nodes of one layer of the current voice interaction, of several layers, or of all layers may be selected to train the model; likewise, the speech (or coded-sequence speech data) corresponding to the nodes collected at one layer, or at several or all layers, of the current voice interaction may be selected, or training data of the same nodes from voice interactions other than the current one may be selected to train the model for the current interaction.
The algorithm for generating the natural language understanding model may be statistics-based, a deep learning/deep neural network algorithm, or a combination of the two. The deep learning/deep neural network algorithm may be one or a combination of Word2vec, RNN, Transformer, and BERT.
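By way of illustration only, the sketch below trains a small PyTorch classifier over coding-symbol sequences for the five A-layer nodes. The architecture (embedding, mean pooling, linear head), the vocabulary size, and the random stand-in batch are all assumptions; the patent leaves the concrete model open (statistical, Word2vec, RNN, Transformer, BERT, or combinations).

import torch
import torch.nn as nn

class CodedSeqClassifier(nn.Module):
    """Embed coding symbols (phonemes/syllables), mean-pool over the
    sequence, and classify into the pragmatic nodes of one layer."""
    def __init__(self, vocab_size, n_nodes, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.head = nn.Linear(dim, n_nodes)

    def forward(self, ids):                 # ids: (batch, seq_len)
        h = self.embed(ids).mean(dim=1)     # mean-pool over the sequence
        return self.head(h)                 # logits per pragmatic node

model = CodedSeqClassifier(vocab_size=1300, n_nodes=5)  # ~1250 tone-syllables
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

ids = torch.randint(1, 1300, (8, 32))     # stand-in batch of coded utterances
labels = torch.randint(0, 5, (8,))        # node indices for A01..A05
opt.zero_grad()
loss = loss_fn(model(ids), labels)        # one training step on the pairs
loss.backward()
opt.step()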
The second speech understanding engine may also train its natural language understanding model using the method of fig. 2. For a transcription-based engine, the speech processing unit 104 transcribes the speech to generate, as speech data in text form, a coded sequence whose coding units are characters, words, punctuation marks, and spaces. When associating text-form speech data with pragmatic information classification nodes, manual labeling can be done by reading the text. Note that because the transcription may be inaccurate, text-form speech data that cannot be made to correspond to any pragmatic information classification node is not used as training data for the natural language understanding model.
The first and second speech understanding engines may be configured as independent engines, configured to work independently within one system, or configured to cooperate with each other.
According to an embodiment of the present invention, the natural language understanding unit 102 and the natural language understanding unit 105 may employ the same or similar natural language understanding models, generated and trained by the method described above.
Fig. 3 illustrates an example of an intelligent voice interaction method according to an embodiment of the present invention. As shown in fig. 3, in step S301, speech is received. The speech may be a question or a response. For example, for a user's question, the intelligent voice interaction system (intelligent voice robot) receives the user's speech and then responds automatically according to its understanding of that speech; or the robot asks a question, the user responds, and the robot then responds automatically according to its understanding of the user's speech. Such question-and-answer exchanges may run for multiple rounds. The user's speech is typically sound produced directly or indirectly by a natural person, and may even be speech produced by a voice assistant or voice robot.
In step S302, the user speech is processed to obtain speech data in the form of a coded sequence. The audio sampling and encoding may employ the method of step S202.
In steps S303 and S304, the speech data in coded sequence form is processed using the natural language understanding model, and the pragmatic information of the speech corresponding to that data is obtained.
Speech understanding according to an embodiment of the present invention means, specifically, automatically judging the pragmatic information classification node corresponding to a given piece of speech. For example, in a call-center-based intelligent voice interaction scenario, when the robot asks whether the user is the owner of the called mobile number (corresponding to the layer-A identity inquiry in fig. 4), the called user may answer in many ways, and the user's speech is understood as the intent corresponding to one of the pragmatic information classification nodes A01 to A05.
The basis of this speech understanding is the presetting of pragmatic information classification nodes in step S201 and the generation and training of the natural language understanding model with the paired data of coded-sequence speech data and classification nodes in step S204. That is, speech understanding based on pragmatic information requires that the pragmatic information classification nodes be preset and that the corresponding natural language understanding models be generated.
Specifically, when understanding with the natural language understanding model, the model calculates the confidence of the speech with respect to each pragmatic information classification node (step S303) and then determines the understanding result from these confidence values (step S304). Confidence here is understood as the probability that the speech corresponds to a particular pragmatic information classification node. Because one or more nodes are preset at a given interaction layer, the model calculates, for the user speech obtained at that layer, the probability corresponding to each node of the layer, and for one utterance the probabilities over the layer's nodes sum to 100% (normalized to 100 or 1.0). For example, the model might compute a confidence of 80 for A01, 0 for A02, 20 for A03, and 0 for A04 and A05.
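One way to obtain such normalized confidences is a softmax over per-node scores, sketched below with logits chosen to roughly reproduce the 80/0/20 example; the patent does not prescribe softmax specifically.

import torch

def node_confidences(logits):
    """Normalize per-node logits into confidences that sum to 100,
    as in the A01..A05 example above."""
    return torch.softmax(logits, dim=-1) * 100

conf = node_confidences(torch.tensor([4.0, -3.0, 2.6, -3.0, -3.0]))
# prints roughly {'A01': 80.0, 'A02': 0.1, 'A03': 19.7, 'A04': 0.1, 'A05': 0.1}
print({f"A0{i + 1}": round(float(c), 1) for i, c in enumerate(conf)})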
There are many ways to determine the understanding result from these confidence values. For a speech understanding engine working alone, the pragmatic information classification node with the highest confidence can be taken as the result, or the node whose confidence exceeds a preset threshold can be taken as the result. In the example above (A01: 80, A02: 0, A03: 20, A04/A05: 0), taking the highest-confidence node yields A01 as the speech understanding result. If instead the confidence threshold is set to 90, no node qualifies as the understanding result, in which case further processing is possible: the user may be asked to speak again and the new utterance understood; if the user has re-spoken more than a set number of times (for example, twice) without reaching the threshold, the interaction may be ended, or the highest-confidence node of the last utterance may be taken as the result, possibly followed by manual handling; or the user may be asked to confirm the highest-confidence node as the result; or the call may be transferred directly to a human.
Multiple thresholds may also be set, for example a first threshold of 80 and a second of 60. If the confidence for some node of the current interaction layer exceeds 80, that node is taken as the understanding result. If it lies between 60 and 80, the user is asked to confirm whether the utterance corresponds to the highest-confidence node; if confirmed, that node is taken as the result, otherwise the user is asked to speak again and the new utterance is understood. If the confidence for every node of the current layer is below 60, the user is asked to speak again; and if re-speaking exceeds the set number of times (for example, twice) without reaching 80, the interaction is ended, or the highest-confidence node of the last utterance is taken as the result, or the call is handled manually.
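The two-threshold rule just described can be sketched as follows; the retry limit and the exact fallback actions are scenario choices, not fixed by the patent.

def decide(confidences, high=80, low=60, attempts=0, max_retries=2):
    """Two-threshold decision rule; returns an action tag plus the
    winning pragmatic node."""
    node = max(confidences, key=confidences.get)
    top = confidences[node]
    if top >= high:
        return "accept", node
    if top >= low:
        return "confirm_with_user", node   # ask the user to confirm the node
    if attempts < max_retries:
        return "reprompt", None            # ask the user to speak again
    return "fallback", node                # end, accept best, or go manual

print(decide({"A01": 80, "A02": 0, "A03": 20, "A04": 0, "A05": 0}))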
The confidence thresholds may be set according to many factors, such as the amount of training data of the natural language understanding model, the confidence distribution over test/production data, the voice interaction hierarchy, and the requirements of the interaction scenario. For example, if the training data is scarce, the mean confidence on test/production data is low, and interaction layers must be moved through readily, a lower threshold may be set. One or more thresholds may be set uniformly across the layers of a multi-layer interaction, per layer, or even per pragmatic information classification node. Threshold settings can change the flow of a specific voice interaction, making it smoother and improving the user experience.
When several speech understanding engines work simultaneously, the confidences calculated by each engine must all be consulted to determine which pragmatic information classification node of the current interaction layer the speech corresponds to. The understanding result judging unit 103 makes this determination from the confidences calculated by the individual speech understanding engines.
According to an embodiment of the present invention, consider the case where two speech understanding engines operate simultaneously, e.g., the first and second speech understanding engines of fig. 1. The same speech to be understood is input to speech processing unit 101 and speech processing unit 104, which generate speech data in different coded sequence forms: the coding elements of the sequence produced by unit 101 are phonemes, syllables, or combinations of the two, while the coding elements produced by unit 104 are text, i.e., unit 104 transcribes the speech into text, which may consist of characters or words, may include spaces and/or punctuation, and may be processed by a word segmentation model and a grammar model.
The natural language understanding units 102 and 105 then run their respective natural language understanding models on the coded-sequence speech data produced by units 101 and 104, each obtaining the confidence that the speech to be understood corresponds to a given pragmatic information classification node. In this computation, the text obtained by the speech-to-text transcription tool is understood as a coded sequence as a whole; there is no need to analyze its semantics with keywords, a grammar library, a knowledge base, and the like.
The natural language understanding models of units 102 and 105 may be generated differently. For example, unit 102 may train on the pragmatic information classification nodes of one layer of the current voice interaction while unit 105 trains on all of them. Optionally, at understanding time the two engines share the same interaction hierarchy and preset nodes, i.e., the same hierarchically arranged pragmatic information classification nodes; for example, both engines use the interaction hierarchy and node arrangement of fig. 4. In this way the two engines' results can verify each other when deciding which node of the current layer the speech corresponds to. Different engines may also adopt different interaction structures and node sets: one engine may use a layered node structure and compute confidences for one layer's nodes at a time, while another uses a single-layer (unlayered) structure and computes confidences for all or part of the nodes each time.
Specifically, if the two engines produce the same result, the understanding result judging unit takes it as the speech understanding result. If neither engine obtains a result (e.g., neither engine's computed confidences reach the threshold needed to decide), the user is prompted to speak again and the new speech is understood. If only one engine obtains a result, that result is taken as the speech understanding result. If both engines obtain results but the results differ, an additional condition is needed for the judgment.
When the two speech understanding engines each obtain an understanding result but the results differ, the multi-engine understanding result is judged by combining the confidences each engine computed for the speech and the pragmatic information classification nodes with the confidence threshold each engine uses to determine a result. For example, if the difference between the two engines' confidence thresholds is greater than a set value (for example, 20), say thresholds of 80 and 40, the understanding result of the engine with the higher threshold is taken as the multi-engine result. If the difference is smaller than the set value (for example, 20), say thresholds of 80 and 70, the user is prompted to re-input speech and the new speech is understood.
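By way of illustration, the decision rules of the preceding two paragraphs may be sketched as follows. The threshold gap of 20 is taken from the example above; the function signature and the convention that None means "no result / re-prompt the user" are assumptions.

```python
from typing import Optional

def multi_engine_result(r1: Optional[str], c1: float, t1: float,
                        r2: Optional[str], c2: float, t2: float,
                        gap: float = 20.0) -> Optional[str]:
    """Combine two engines' understanding results per the rules above.

    r1/r2 are the classification nodes each engine proposes (None when it
    obtained no result); c1/c2 are its confidences, t1/t2 its thresholds.
    Returns the multi-engine result, or None to re-prompt the user.
    """
    ok1 = r1 is not None and c1 >= t1
    ok2 = r2 is not None and c2 >= t2
    if ok1 and ok2:
        if r1 == r2:
            return r1                     # results agree
        if abs(t1 - t2) > gap:            # thresholds far apart (e.g. 80/40):
            return r1 if t1 > t2 else r2  # trust the stricter engine
        return None                       # close thresholds (e.g. 80/70)
    if ok1:
        return r1                         # only the first engine succeeded
    if ok2:
        return r2                         # only the second engine succeeded
    return None                           # neither succeeded
```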
There is also the case where the user speech to be understood is short, only one or two syllables long, which is common in Chinese (since Chinese characters are monosyllabic). Suppose the speech understanding engine that processes speech in a transcription manner obtains an understanding result exceeding its confidence threshold, and the engine that does not process speech in a transcription manner also obtains a result exceeding its threshold, but the two results differ. In that case, the result of the transcription-based engine is taken as the multi-engine result, without prompting the user to re-input speech. The reason is that, for speech only one or two syllables long, transcribing it into one or two words yields more certainty and higher convergence than a longer coded sequence (such as a phoneme-level coded sequence) does. Moreover, when a transcription error occurs, such as "may be" transcribed as "broad", then for a pragmatic information classification node such as B01 "explicit need" in fig. 4 the confidence threshold is simply not reached, so the situation just described, in which both engines exceed their thresholds yet disagree, does not arise. This is not to say that the transcription-based engine is superior to the non-transcription engine; rather, in this particular case the two complement each other. The underlying logic is that a speech understanding engine may understand incorrectly (obtain a result that reaches the confidence threshold but is nevertheless wrong) owing to noise or background sound, incomplete segmentation of the user's speech, speech made incomplete or unclear by network transmission, faulty corpus training, and so on. In such cases, using several speech understanding engines with different speech processing and/or encoding schemes enables mutual verification and error correction of the understanding results, improving the understanding accuracy of the system as a whole. Of course, every engine may still output a wrong result. Optionally, the speech understanding results of the engines are weighted, and they may also be weighted according to speech length. Speech length may be measured in phonemes, syllables, words, and the like.
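The short-speech preference and the optional length weighting might, purely as an illustration, look like the sketch below. The patent leaves the weighting scheme open, so the weight direction and the numeric values here are assumptions.

```python
def short_speech_tiebreak(transcription_result, direct_result, syllables):
    """Both results passed their thresholds but disagree."""
    if syllables <= 2:
        return transcription_result  # one or two syllables: transcription
                                     # converges better, so trust it
    return None                      # otherwise apply the threshold-gap rule

def length_weighted(confidence, syllables, short_weight=1.2, long_weight=0.9):
    """Scale a confidence by an assumed speech-length-dependent weight;
    length could equally be measured in phonemes or words."""
    return confidence * (short_weight if syllables <= 2 else long_weight)
```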
If multiple speech understanding engines work simultaneously, their processing times may differ, so an understanding result synchronizing unit should be provided to ensure that the results judged by the understanding result determination unit 103 all refer to the same user speech. The understanding result synchronizing unit may be placed before the understanding result determination unit 103, or the determination unit 103 may itself implement the synchronization function.
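One way such synchronization might be realized is sketched below; the thread-based scheme, the timeout, and all identifiers are assumptions, since the patent only requires that the judged results refer to the same user speech.

```python
import concurrent.futures

def understand_synchronized(engines, utterance_id, audio, timeout=3.0):
    """Collect all engines' results for one utterance before the
    understanding result determination unit judges them."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(engine.understand, audio) for engine in engines]
        done, not_done = concurrent.futures.wait(futures, timeout=timeout)
        for f in not_done:
            f.cancel()  # drop engines that are too slow for this utterance
        # every returned result is now known to refer to utterance_id
        return utterance_id, [f.result() for f in done]
```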
The multiple speech understanding engines can also work in a complementary fashion: when one speech understanding engine cannot obtain a speech understanding result, another speech understanding engine performs the understanding.
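A minimal sketch of this complementary (fallback) mode, assuming each engine returns None when it cannot reach its confidence threshold:

```python
def complementary_understand(engines, audio):
    """Try each engine in turn; the next engine runs only when the
    previous one fails to produce an understanding result."""
    for engine in engines:
        result = engine.understand(audio)  # None on failure (assumed API)
        if result is not None:
            return result
    return None  # no engine succeeded: e.g. prompt the user to speak again
```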
The speech understanding engine that processes speech in a non-transcription manner and the one that processes speech in a transcription manner can cooperate: besides mutually verifying speech understanding results as described above, they can divide the work according to their respective strengths. For example, the non-transcription engine is responsible for obtaining the intention corresponding to a pragmatic information classification node, while the transcription engine is responsible for obtaining the transcribed text corresponding to that node. Suppose a voice interaction layer includes three pragmatic information classification nodes, "name", "don't want to say" and "didn't hear clearly". If the non-transcription engine understands the user speech as "name", the transcription engine transcribes the speech to obtain the text corresponding to the node "name", such as "Zhang", and the system records that text ("Zhang") as name information. If the non-transcription engine understands the speech as "don't want to say" or "didn't hear clearly" (that is, not "name"), the transcription engine need not transcribe the speech, and the system need not record any transcribed text as name information. Under this scheme, both engines understand the speech to obtain a pragmatic information classification node, the understanding result determination unit decides which node the speech corresponds to, and that result then determines whether the text transcribed from the speech is recorded as the text information corresponding to the node.
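The division of labor in this example can be sketched as follows; the node names follow the example above, while the engine APIs and the record structure are assumptions.

```python
def handle_layer(audio, direct_engine, transcribe_engine, record):
    """Division-of-labor sketch for a layer with the nodes "name",
    "don't want to say" and "didn't hear clearly"."""
    node = direct_engine.classify(audio)            # non-transcription engine
    if node == "name":
        text = transcribe_engine.transcribe(audio)  # e.g. "Zhang"
        record["name"] = text                       # keep the transcribed text
    # for the other two nodes, no transcription and no recording is needed
    return node
```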
According to an embodiment of the present invention, the foregoing intelligent speech understanding system may include more than two speech understanding engines. The engines may mutually verify their respective speech understanding results, or cooperate with a division of labor, in which case the engine that processes speech in a transcription manner obtains the text corresponding to the speech through transcription and that transcribed text is taken as the entity word information corresponding to the intention. The entity word information may be the aforementioned name, address, date, identification number, telephone number, and so on.
Having determined the pragmatic information classification node corresponding to the user's voice, i.e., the speech understanding result, a response corresponding to that node is made at step S305.
The response corresponding to a pragmatic information classification node may be made and completed at the current interaction layer; in a multi-layer voice interaction structure, the flow may instead proceed to the next layer.
When the interaction flow reaches a given voice interaction layer, the user speech is understood at that layer, the pragmatic information classification node corresponding to the user speech is determined, and the response corresponding to that node is made. A response may be a robot voice broadcast (a recording or synthesized speech) prompting the user to speak again, or the robot may end the current voice interaction after the broadcast. For example, in the example shown in fig. 4, if the speech understanding result obtained at layer A is A02, a jump is made inside layer A: the layer-A robot speech is broadcast again, or the dialogue corresponding to A02 is broadcast, and speech understanding is then performed at layer A on the speech the user inputs anew. If the result obtained at layer A is A03, "hang-up" processing is performed, such as hanging up after broadcasting a closing announcement, sending a short message, and so on.
The multi-layer voice interaction structure serves multi-round voice interaction, as shown in fig. 4, where the arrows indicate the flow between layers. For example, in the interaction structure of fig. 4, if the speech understanding result obtained at layer A is A01, the robot voice broadcast (recording or synthesized speech) corresponding to A01 of layer A is made, the interaction flow moves to layer B, speech understanding is performed again at layer B, and it is determined which pragmatic information classification node of layer B the user speech answering the broadcast corresponds to. If the result obtained at layer B is B02, B03, B04, B05 or A02, the robot speech corresponding to that node of layer B is broadcast, the flow moves to layer C, speech understanding is performed again at layer C, and it is determined which node of layer C the user's answering speech corresponds to. The multi-layer voice interaction structure is arranged to match the layered pragmatic information classification nodes; compared with a non-layered arrangement, the layered arrangement keeps the number of classification nodes in each layer small, which can improve understanding accuracy.
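As an illustration, the fig. 4 flow described above and in the next paragraph can be captured in a transition table. The node IDs follow the text; the action names and the detailed layer-C entries are assumptions.

```python
# Maps (layer, understood node) to (response action, next layer).
TRANSITIONS = {
    ("A", "A01"): ("broadcast_A01", "B"),   # announce, then go to layer B
    ("A", "A02"): ("repeat_A",      "A"),   # jump inside layer A
    ("A", "A03"): ("hang_up",       None),  # end the voice interaction
    ("B", "B02"): ("broadcast_B02", "C"),
    ("B", "B03"): ("broadcast_B03", "C"),
    ("B", "B04"): ("broadcast_B04", "C"),
    ("B", "B05"): ("broadcast_B05", "C"),
    ("B", "A02"): ("broadcast_A02", "C"),   # A02 is shared by layers A and B
}

def respond(layer, node):
    """One interaction round: the response and the next layer (None ends)."""
    return TRANSITIONS[(layer, node)]
```

Each round's determined understanding result simply indexes this table, which also makes the per-layer node sets, and the layer-dependent responses discussed next, explicit.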
It may be noted that the same pragmatic information classification node can be given different responses at different layers. For example, the response to A02 at layer A is a jump inside layer A, while the response to A02 at layer B is to broadcast the robot speech corresponding to A02 of layer B and, after receiving the user's new speech in reply, to move on to layer C. Thus, on the one hand, pragmatic information classification nodes can be shared across multiple layers; on the other hand, responses can be configured flexibly according to the specific needs of the voice interaction.
User speech generated during voice interaction, once associated with pragmatic information classification nodes, can be used as training data for the natural language understanding model.
In a multi-layer voice interaction structure, each layer can adopt the same pragmatic information classification nodes, and several different voice interaction structures can likewise adopt the same pragmatic information classification nodes. In this way, user voice data collected under the different voice interaction services corresponding to those structures can be associated with the same pragmatic information classification nodes and thus used as training data for the natural language understanding model.
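A sketch of such an association step follows; the field names are assumptions introduced for illustration.

```python
def log_training_example(dataset, interaction_id, layer, node_id, speech):
    """Associate one user utterance with its pragmatic information
    classification node so it can later train the NLU model."""
    dataset.append({
        "interaction": interaction_id,  # which voice interaction service
        "layer": layer,                 # interaction layer where collected
        "node": node_id,                # shared classification node label
        "speech": speech,               # audio, coded sequence, or text form
    })
```

Because the node label, not the interaction structure, keys each example, data gathered under different services accumulates for the same node.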
With the intelligent speech understanding system having multiple speech understanding engines according to the embodiments of the invention, the engine that processes speech in a transcription manner and the engine that processes speech in a non-transcription manner can cooperate and divide the work, and can verify each other's understanding results synchronously or step by step, so that both the intention corresponding to the speech and the textual (literal) information corresponding to the speech are obtained. Collecting training corpora and building the natural language understanding model according to the hierarchical structure of the pragmatic information classification nodes of the voice interaction greatly reduces the amount of data the model requires and rapidly improves speech understanding accuracy. Through a simple association operation, the voice data obtained in voice interaction is used for fast iteration of the natural language understanding model, greatly reducing the cost of optimizing it. The engine that processes speech in a non-transcription manner understands pragmatic information directly from the speech, avoiding the information loss incurred when speech is transcribed into text; and because the voice interaction structure is not tied to text, one voice interaction structure and its corresponding model can support different dialects, less common languages, mixed languages and other language environments.
The above description is intended to be illustrative of the present invention and not to limit the scope of the invention, which is defined by the claims appended hereto.

Claims (11)

1. An intelligent speech understanding system, comprising:
a first speech understanding engine that processes speech in a non-transcription manner,
a second speech understanding engine that processes speech in a transcription manner,
and an understanding result judging unit (103),
wherein,
the first speech understanding engine includes a speech processing unit (101) and a natural language understanding unit (102), the speech processing unit (101) of the first speech understanding engine processes speech to obtain speech data in a coded sequence, the natural language understanding unit (102) of the first speech understanding engine obtains an intention corresponding to the speech based on the speech data in the coded sequence through a natural language understanding model,
the second speech understanding engine includes a speech processing unit (104) and a natural language understanding unit (105), the speech processing unit (104) of the second speech understanding engine performs transcription processing on the speech to obtain speech data in a text form, the natural language understanding unit (105) of the second speech understanding engine obtains an intention corresponding to the speech based on the speech data in the text form through a natural language understanding model,
the understanding result determination unit (103) determines an intention corresponding to the speech from the understanding result of the first speech understanding engine and the understanding result of the second speech understanding engine.
2. The intelligent speech understanding system according to claim 1, wherein the understanding result of the first speech understanding engine includes a confidence degree that the speech corresponds to a certain pragmatic information classification node, the understanding result of the second speech understanding engine includes a confidence degree that the same speech corresponds to a certain pragmatic information classification node, and the understanding result determining unit (103) obtains the speech understanding result of the intelligent speech understanding system based on a set threshold value of the first speech understanding engine with respect to the pragmatic information classification node and a set threshold value of the second speech understanding engine with respect to the pragmatic information classification node.
3. The intelligent speech understanding system of claim 2, wherein the confidence is the probability that the speech corresponds to a pragmatic information classification node of the current voice interaction layer.
4. The intelligent speech understanding system of claim 1, wherein the first speech understanding engine and the second speech understanding engine perform speech understanding based on the same hierarchically arranged pragmatic information classification nodes.
5. The intelligent speech understanding system according to claim 1, wherein the natural language understanding unit (102) of the first speech understanding engine generates a speech understanding model using paired data of speech data in coded-sequence form and pragmatic information classification nodes, the natural language understanding unit (105) of the second speech understanding engine generates a speech understanding model using paired data of speech data in text form and pragmatic information classification nodes, and the natural language understanding unit (102) of the first speech understanding engine and the natural language understanding unit (105) of the second speech understanding engine respectively select pragmatic information classification nodes of a certain layer of the current voice interaction to train the speech understanding model, or select multi-layer pragmatic information classification nodes of the current voice interaction to train the speech understanding model, or select all pragmatic information classification nodes of the current voice interaction to train the speech understanding model.
6. The intelligent speech understanding system of claim 1, wherein,
the natural language understanding unit (102) of the first speech understanding engine selects speech data in speech form or coded-sequence form corresponding to the pragmatic information classification nodes collected at a certain layer of the current speech interaction to train the speech understanding model, or selects speech data in speech form or coded-sequence form corresponding to the pragmatic information classification nodes collected at multiple layers or all layers of the current speech interaction to train the speech understanding model, or selects training data of the same pragmatic information classification nodes from speech interactions other than the current speech interaction to train the speech understanding model of the current speech interaction;
and the natural language understanding unit (105) of the second speech understanding engine selects speech data in text form corresponding to the pragmatic information classification nodes collected at a certain layer of the current speech interaction to train the speech understanding model, or selects speech data in text form corresponding to the pragmatic information classification nodes collected at multiple layers or all layers of the current speech interaction to train the speech understanding model, or selects training data of the same pragmatic information classification nodes from speech interactions other than the current speech interaction to train the speech understanding model of the current speech interaction.
7. The intelligent speech understanding system according to claim 1, wherein the understanding result determination unit (103) obtaining the speech understanding result of the intelligent speech understanding system includes performing weighting processing on the understanding result of the first speech understanding engine and the understanding result of the second speech understanding engine.
8. The intelligent speech understanding system according to claim 1, wherein the understanding result determination unit (103) obtaining the speech understanding result of the intelligent speech understanding system includes performing weighting processing on the understanding result of the first speech understanding engine and the understanding result of the second speech understanding engine according to a speech length.
9. The intelligent speech understanding system according to claim 1, wherein the intelligent speech understanding system includes an understanding result synchronizing unit such that the understanding result deciding unit (103) determines the speech understanding result of the intelligent speech understanding system from speech understanding results of a plurality of speech understanding engines with respect to the same speech.
10. The intelligent speech understanding system according to claim 1, wherein the intelligent speech understanding system records the speech data in text form corresponding to the pragmatic information classification node, which is obtained by transcription by the second speech understanding engine, as entity word information.
11. An intelligent voice interaction method, wherein the method comprises the following steps:
receiving voice;
the first speech understanding engine processes the speech to obtain speech data in a coding sequence form, and the speech data in the coding sequence form is understood through a natural language understanding model;
the second speech understanding engine transcribes the speech to obtain speech data in a text form, and the speech data in the text form is understood through a natural language understanding model;
judging an intention corresponding to the voice according to an understanding result of the first voice understanding engine and an understanding result of the second voice understanding engine;
a response corresponding to the intent is made.
CN202111201895.1A 2021-10-15 2021-10-15 Intelligent speech understanding system with multiple speech understanding engines and intelligent speech interaction method Pending CN113936660A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111201895.1A CN113936660A (en) 2021-10-15 2021-10-15 Intelligent speech understanding system with multiple speech understanding engines and intelligent speech interaction method

Publications (1)

Publication Number Publication Date
CN113936660A true CN113936660A (en) 2022-01-14

Family

ID=79279857

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination