WO2019029352A1 - Intelligent voice interaction method and system - Google Patents

Intelligent voice interaction method and system

Info

Publication number
WO2019029352A1
WO2019029352A1 (application PCT/CN2018/096705)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
segment
semantic
current
instruction
Prior art date
Application number
PCT/CN2018/096705
Other languages
French (fr)
Chinese (zh)
Inventor
李锐
陈志刚
王智国
胡国平
Original Assignee
科大讯飞股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 科大讯飞股份有限公司 (iFLYTEK Co., Ltd.)
Publication of WO2019029352A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0638 - Interactive procedures
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/1822 - Parsing for meaning understanding
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command
    • G10L15/26 - Speech to text systems
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/22 - Interactive procedures; Man-machine interfaces

Definitions

  • the present invention relates to the field of speech signal processing and natural language understanding, and in particular to an intelligent voice interaction method and system.
  • the embodiment of the invention provides an intelligent voice interaction method and system, so as to avoid erroneous understanding and response in an interaction scenario involving multiple people.
  • An intelligent voice interaction method comprising:
  • otherwise, the inter-role instruction relationship in the current voice segment is determined according to the current voice segment and its corresponding semantic understanding result, and a response is then made according to that inter-role instruction relationship.
  • the method further comprises: constructing a speaker turning point judgment model in advance, and the constructing process of the speaker turning point judgment model comprises:
  • the determining whether the current voice segment is a single voice includes:
  • if at least one frame of speech in the current speech segment has a turning point, it is determined that the current speech segment is not a single voice; otherwise, it is determined that the current speech segment is a single voice.
  • determining that the current voice segment is a single voice includes:
  • the determining, according to the current voice segment and the corresponding semantic understanding result, the command relationship between the roles in the current voice segment includes:
  • the instruction association feature comprises: an acoustic feature and a semantic relevance feature;
  • the acoustic feature comprises any one or more of the following: the average volume of the speech segment, the signal-to-noise ratio of the speech segment, and the relationship angle between the speech segment and the main microphone, where the relationship angle is the angle between the horizontal line and the line connecting the sound source to the main microphone;
  • the semantic relevance feature is a semantic relevance value;
  • Extracting the instruction association feature from the current speech segment and its corresponding semantic understanding result includes:
  • the semantic relevance value of the current speech segment is determined according to the semantic understanding result corresponding to the current speech segment.
  • the method further comprises: pre-establishing a semantic relevance model, wherein the construction process of the semantic relevance model comprises:
  • Determining the semantic relevance value of the current speech segment according to the semantic understanding result corresponding to the current speech segment includes:
  • the semantic related feature comprises: a text word vector corresponding to the interaction voice data, and a service type involved in the user instruction in the interaction voice data.
  • the method further includes: pre-building an instruction association recognition model, where the instruction association recognition model construction process includes:
  • the inter-character instruction relationships include: interference, supplementation, and independence.
  • An intelligent voice interaction system comprising:
  • a receiving module configured to receive user interaction voice data
  • a voice recognition module configured to perform voice recognition on the interactive voice data to obtain a recognized text
  • a semantic understanding module configured to perform semantic understanding on the recognized text, and obtain a semantic understanding result
  • a determining module configured to determine whether the current voice segment is a single voice
  • a response module configured to respond to the semantic understanding result after the determining module determines that the current voice segment is a single voice
  • the instruction relationship identification module is configured to, after the determining module determines that the current voice segment is not a single voice, determine the inter-role instruction relationship in the current voice segment according to the current voice segment and its corresponding semantic understanding result;
  • the system further includes: a speaker turning point judgment model building module, configured to pre-build a speaker turning point judgment model; and the speaker turning point judgment model building module includes:
  • a first topology determining unit configured to determine a topology structure of the speaker turning point judgment model
  • a first data collecting unit configured to collect a plurality of interactive voice data including multiple participants, and perform turning point labeling on the interactive voice data
  • a first parameter training unit configured to use the interactive voice data and the annotation information to obtain a speaker turning point judgment model parameter
  • the determining module includes:
  • a turning point determining unit configured to input the extracted spectral feature into the speaker turning point judgment model, and determine, according to the output of the speaker turning point judgment model, whether there is a turning point in each frame of voice;
  • the determining unit is configured to determine that the current voice segment is not a single voice when at least one frame of voice in the current voice segment has a turning point; otherwise, determine that the current voice segment is a single voice.
  • the determining unit is specifically configured to determine that the current voice segment is not a single voice when multiple consecutive frames of voice in the current voice segment have turning points; otherwise, determine that the current voice segment is a single voice.
  • the instruction relationship identification module comprises:
  • An instruction association feature extraction unit configured to extract an instruction association feature from a current speech segment and a corresponding semantic understanding result thereof
  • the command relationship determining unit is configured to determine an instruction relationship between the roles in the current voice segment according to the instruction association feature.
  • the instruction association feature comprises: an acoustic feature and a semantic relevance feature;
  • the acoustic feature comprises any one or more of the following: the average volume of the speech segment, the signal-to-noise ratio of the speech segment, and the relationship angle between the speech segment and the main microphone, where the relationship angle is the angle between the horizontal line and the line connecting the sound source to the main microphone;
  • the semantic relevance feature is a semantic relevance value;
  • the instruction association feature extraction unit includes:
  • An acoustic feature extraction subunit for extracting the acoustic feature from a current speech segment
  • the semantic relevance feature extraction sub-unit is configured to determine a semantic relevance value of the current speech segment according to a semantic understanding result corresponding to the current speech segment.
  • the system further comprises: a semantic relevance model building module, configured to pre-build a semantic relevance model; the semantic relevance model building module comprises:
  • a second topology determining unit configured to determine a topology of the semantic relevance model
  • a second data collecting unit configured to collect a plurality of interactive voice data including multiple participants as training data, and perform semantic relevance labeling on the training data;
  • a semantic correlation feature extraction unit configured to extract semantic related features of the training data
  • a second training unit configured to use the semantic related feature and the annotation information to obtain a semantic relevance model
  • the semantic relevance feature extraction sub-unit is specifically configured to extract semantic-related features from semantic understanding results corresponding to the current speech segment; input the semantic-related features into the semantic relevance model, according to the semantic relevance model The output gets the semantic relevance value of the current speech segment.
  • the semantic related feature comprises: a text word vector corresponding to the interaction voice data, and a service type involved in the user instruction in the interaction voice data.
  • the third data collecting unit collects a large amount of interactive voice data involving multiple participants as training data, and labels the training data with inter-role association relationships;
  • An instruction association feature extraction unit configured to extract an instruction association feature of the training data
  • a third training unit configured to use the instruction association feature and the annotation information to train the instruction association recognition model
  • the instruction relationship determining unit is specifically configured to input the instruction association feature into the instruction association recognition model, and obtain an instruction relationship between each character in the current voice segment according to the output of the instruction association recognition model.
  • the inter-character instruction relationships include: interference, supplementation, and independence.
  • An intelligent voice interaction device including an interconnected processor and a memory
  • the memory is configured to store program instructions
  • the processor is configured to execute the program instructions to perform:
  • otherwise, the inter-role instruction relationship in the current voice segment is determined according to the current voice segment and its corresponding semantic understanding result, and a response is then made according to that inter-role instruction relationship.
  • the processor is further configured to: construct a speaker turning point judgment model in advance, and the constructing process of the speaker turning point judgment model comprises:
  • the determining whether the current voice segment is a single voice includes:
  • if at least one frame of the current voice segment has a turning point, it is determined that the current voice segment is not a single voice; otherwise, it is determined that the current voice segment is a single voice;
  • Determining, by the processor, the instruction relationship between the roles in the current voice segment according to the current voice segment and its corresponding semantic understanding result includes:
  • the processor is configured to implement any of the above intelligent voice interaction methods.
  • the intelligent voice interaction method and system provided by the embodiments of the present invention address the characteristics of interactive scenes in which multiple people participate: they determine whether the received user interaction voice data is a single voice; if not, the interaction data is analyzed in more detail and more accurately to obtain the inter-role instruction relationships when multiple people participate in the interaction, and an interactive response is made according to those relationships. This solves the intent-understanding errors and erroneous system responses caused by traditional voice interaction schemes that do not consider multi-person interaction, and effectively improves the user experience.
  • FIG. 1 is a flowchart of an intelligent voice interaction method according to an embodiment of the present invention
  • FIG. 2 is a flow chart of constructing a speaker turning point judgment model in an embodiment of the present invention
  • FIG. 3 is a timing diagram of a speaker turning point judgment model in an embodiment of the present invention.
  • FIG. 4 is a flowchart of constructing a semantic relevance model in an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a topology structure of a semantic relevance model according to an embodiment of the present invention.
  • FIG. 6 is a flowchart of constructing an instruction association recognition model in an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of an intelligent voice interaction system according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of a specific structure of an instruction relationship identification module in an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of an angle between a voice segment and a main microphone in an embodiment of the present invention.
  • FIG. 10 is another schematic diagram of an angle between a voice segment and a main microphone in an embodiment of the present invention.
  • FIG. 11 is another schematic structural diagram of an intelligent voice interaction system according to an embodiment of the present invention.
  • an embodiment of the present invention provides an intelligent voice interaction method. Considering the characteristics of interactive scenes in which multiple people participate, the interactive voice data is analyzed and judged in more detail and more accurately, the inter-role instruction relationships are obtained when multiple people participate in the interaction, and a reasonable interactive response is made according to those relationships.
  • FIG. 1 it is a flowchart of an intelligent voice interaction method according to an embodiment of the present invention, which includes the following steps:
  • Step 101 Receive user interaction voice data.
  • specifically, the audio stream can be processed using existing endpoint detection technology to obtain the effective voice in the stream as the user's interactive voice.
  • the endpoint detection technique needs a preset pause duration threshold eos (usually 0.5 s to 1 s): if a voice pause lasts longer than this threshold, the audio stream is cut at that point, and the resulting voice segment is taken as one effective user interaction voice.
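  • As a minimal sketch (not the patent's implementation), the eos cut-off could be applied as follows, assuming per-frame voice-activity decisions are already available; all names are illustrative:

```python
import numpy as np

def segment_by_pause(frames, is_speech, frame_dur=0.01, eos=0.8):
    """Cut a frame stream into segments at pauses longer than eos seconds.

    frames: list of 1-D numpy arrays (one per frame);
    is_speech: per-frame booleans from a voice activity detector.
    """
    segments, current, silence = [], [], 0.0
    for frame, speech in zip(frames, is_speech):
        if speech:
            current.append(frame)
            silence = 0.0
        elif current:
            silence += frame_dur
            if silence > eos:                 # pause exceeded the threshold:
                segments.append(np.concatenate(current))  # close this segment
                current, silence = [], 0.0
    if current:                               # flush the trailing segment
        segments.append(np.concatenate(current))
    return segments
```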
  • Step 102 Perform speech recognition and semantic understanding on the interactive speech data to obtain a recognition text and a semantic understanding result.
  • the speech recognition can be performed in real time, that is, the content spoken by the user as of the current time is recognized in real time.
  • the decoding network is composed of an acoustic model and a language model.
  • the decoding network includes all candidate recognition result paths up to the current time, and the recognition result path with the largest decoding score is selected as the recognition result at the current time. After new user interaction voice data is received, the path with the largest score is re-selected and the previous recognition result is updated.
  • the semantic understanding of speech recognition results may be based on prior art techniques, such as semantic understanding based on grammar rules, semantic understanding based on ontology knowledge base, semantic understanding based on models, etc., and the present invention is not limited thereto.
  • Step 103 Determine whether the current voice segment is single voice. If yes, go to step 104; otherwise, go to step 105.
  • Step 104 respond according to the semantic understanding result.
  • the specific response manner may be, for example, generating a response text and feeding it back to the user, or performing a specific operation corresponding to the semantic understanding result; the embodiment of the present invention does not limit this. If it is a response text, the text can be fed back to the user by voice broadcast; if it is a specific operation, the result of the operation can be presented to the user.
  • Step 105 Determine an instruction relationship between the roles in the current voice segment according to the current voice segment and its corresponding semantic understanding result.
  • the instruction association feature may be first extracted from the current speech segment and its corresponding semantic understanding result; and then the inter-role instruction relationship in the current speech segment is determined according to the instruction association feature.
  • Step 106 respond according to the instruction relationship between the roles.
  • Specifically, the response may follow the inter-role instruction relationship and a preset response strategy. For example, if the second half of an utterance is interference with the first half, only the intent of the first half is responded to; if the second half supplements the first half, the intent of the whole sentence is responded to; and if the two halves are independent (i.e. the second half restarts a new round of dialogue), only the intent of the second half is responded to.
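  • A hedged sketch of this example strategy follows; respond() and the intent arguments are hypothetical placeholders rather than an API from the patent:

```python
def respond(intent):
    # Placeholder for the system's actual response generation.
    print(f"responding to: {intent}")

def respond_by_relationship(relation, first_intent, second_intent, merged_intent):
    if relation == "interference":
        respond(first_intent)        # second half is noise; keep the first intent
    elif relation == "supplement":
        respond(merged_intent)       # both halves form one whole-sentence intent
    elif relation == "independent":
        respond(second_intent)       # second half restarts a new round of dialogue
    else:
        raise ValueError(f"unknown relation: {relation}")
```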
  • the embodiment of the present invention may also adopt a method based on the speaker turning point judgment model.
  • the speaker turning point judgment model may be constructed in advance, and based on the speaker turning point judgment model, whether the current voice segment is a single voice is determined.
  • FIG. 2 it is a construction flow of a speaker turning point judgment model in the embodiment of the present invention, which includes the following steps:
  • Step 201 Determine a topology structure of the speaker turning point judgment model.
  • the topology of the speaker turning point judgment model may use a neural network, such as a DNN (Deep Neural Network), RNN (Recurrent Neural Network), or CNN (Convolutional Neural Network); a BiLSTM (Bidirectional Long Short-Term Memory network) is taken as an example here.
  • the topological structure of the speaker turning point judgment model mainly includes an input layer, hidden layers, and an output layer. The input of the input layer is the spectral feature of each frame of speech, such as a 39-dimensional PLP (Perceptual Linear Predictive) feature; there are, for example, 2 hidden layers; the output layer has 2 nodes, forming a 2-dimensional vector that indicates whether there is a turning point: 1 for a turning point, 0 for none.
  • FIG. 3 is a timing diagram showing a speaker turning point judging model, wherein F1 to Ft represent spectral feature vectors input by the input layer node, and h1 to ht are output vectors of each node of the hidden layer.
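  • A minimal sketch of the described topology, assuming PyTorch; the hidden width of 128 is an assumption, while the 39-dimensional PLP input, 2 hidden layers, and 2-node per-frame output follow the text:

```python
import torch.nn as nn

class TurningPointModel(nn.Module):
    def __init__(self, feat_dim=39, hidden=128, layers=2):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                              bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 2)   # per frame: turning point / none

    def forward(self, feats):                 # feats: (batch, frames, 39) PLP
        h, _ = self.bilstm(feats)             # (batch, frames, 2 * hidden)
        return self.out(h)                    # per-frame 2-class logits
```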
  • Step 202 Collect a plurality of interactive voice data including multiple participants, and perform turning point labeling on the interactive voice data.
  • Step 203 Train on the interactive voice data and the annotation information to obtain the speaker turning point judgment model parameters.
  • the specific training method for the model parameters may adopt the prior art, such as the BPTT (Back Propagation Through Time) algorithm, and will not be described in detail herein.
  • When determining whether the current voice segment is a single voice, the corresponding spectral feature may be extracted from each frame of the current voice segment and input into the speaker turning point judgment model; according to the model output, it can be determined whether each frame of speech has a turning point. A turning point indicates that the speech before and after it comes from different speakers. Correspondingly, if there is a turning point in the current voice segment, the current voice segment is determined not to be a single voice. Of course, in order to avoid misjudgment, it may instead be determined that the current voice segment is not a single voice only when multiple consecutive frames (such as five consecutive frames) in the current voice segment have turning points; otherwise, the current voice segment is determined to be a single voice.
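  • As an illustrative sketch of this decision rule (not the patent's code), the per-frame turning point flags output by the model can be smoothed as follows; min_consecutive=1 reproduces the at-least-one-frame rule, while min_consecutive=5 matches the five-consecutive-frames example:

```python
def is_single_voice(frame_has_turning_point, min_consecutive=5):
    """frame_has_turning_point: per-frame booleans from the judgment model."""
    run = 0
    for flagged in frame_has_turning_point:
        run = run + 1 if flagged else 0
        if run >= min_consecutive:   # enough consecutive flagged frames:
            return False             # a speaker change occurred in the segment
    return True                      # no turning point found: single voice
```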
  • When determining the inter-role instruction relationship, the instruction association feature may be extracted from the current speech segment and its corresponding semantic understanding result, and the inter-role instruction relationship in the current speech segment is then determined according to that feature.
  • the instruction association feature includes: an acoustic feature and a semantic relevance feature. The acoustic feature comprises any one or more of the following: the average volume of the voice segment, the signal-to-noise ratio of the voice segment, and the relationship angle between the voice segment and the main microphone, where the relationship angle refers to the angle between the horizontal line and the line connecting the sound source to the main microphone. FIG. 9 and FIG. 10 show, for a linear microphone array and a ring microphone array respectively, the angle θ between the horizontal line and the line connecting the voice segment's sound source to the main microphone.
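  • For illustration only, two of these acoustic features could be computed as in the sketch below; the relationship angle would come from the microphone array's sound-source localization, so it is passed in, and noise_floor_rms is an assumed input rather than anything specified in the text:

```python
import numpy as np

def acoustic_features(samples, noise_floor_rms, source_angle_deg):
    """samples: 1-D numpy array of the segment; angle supplied by the array."""
    rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2))
    avg_volume_db = 20 * np.log10(rms + 1e-10)                 # average volume
    snr_db = 20 * np.log10((rms + 1e-10) / (noise_floor_rms + 1e-10))  # SNR
    return np.array([avg_volume_db, snr_db, source_angle_deg])
```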
  • the semantic relevance feature may be represented by a value between 0 and 1, that is, a semantic relevance value, which may be determined according to the semantic understanding result corresponding to the current speech segment and a pre-built semantic relevance model.
  • FIG. 4 it is a flowchart of constructing a semantic relevance model in the embodiment of the present invention, which includes the following steps:
  • Step 401 Determine a topology structure of the semantic relevance model
  • the topology of the semantic relevance model may use a neural network; a DNN is taken as an example here.
  • the text word vector passes through convolution and linear transformation layers to obtain low-order word vector features, which are then spliced with the business type feature and sent to the DNN regression network, which finally outputs a semantic relevance value between 0 and 1.
  • Step 402 Collect a plurality of interactive voice data including multiple participants as training data, and perform semantic relevance labeling on the training data;
  • Step 403 extract semantic related features of the training data
  • the semantic related features include a text word vector corresponding to the user interaction voice data, and a service type involved in the user instruction.
  • the extraction of the text word vector can adopt the prior art, for example, using a known word embedding matrix to extract a word vector (such as 50 dimensions) for each word in the recognized text of the two speech segments before and after.
  • the service type involved in the user instruction can be, for example, a 6-dimensional vector covering chat, reservation, weather, navigation, music, and miscellaneous.
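  • A sketch of the topology described above, under stated assumptions (PyTorch; the convolution channels, pooling, and hidden width are not given in the text and are chosen for illustration):

```python
import torch
import torch.nn as nn

class SemanticRelevanceModel(nn.Module):
    def __init__(self, word_dim=50, biz_dim=6, conv_ch=32, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(word_dim, conv_ch, kernel_size=3, padding=1)
        self.proj = nn.Linear(conv_ch, conv_ch)        # linear transformation
        self.dnn = nn.Sequential(                      # regression network
            nn.Linear(conv_ch + biz_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())        # relevance in (0, 1)

    def forward(self, word_vecs, biz_type):            # (B, T, 50), (B, 6)
        h = self.conv(word_vecs.transpose(1, 2)).mean(dim=2)  # pool over words
        h = self.proj(h)                               # low-order word features
        return self.dnn(torch.cat([h, biz_type], dim=1)).squeeze(1)
```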
  • Step 404 Train with the semantic related features and the annotation information to obtain the semantic relevance model.
  • In a specific application, the determination of the inter-role instruction relationship in the voice segment may also be implemented with a pre-trained model; that is, an instruction association recognition model is trained in advance, the extracted instruction association feature is input into the model, and the inter-role instruction relationship in the current speech segment is obtained according to the model output.
  • FIG. 6 it is a flowchart of constructing an instruction association identification model in the embodiment of the present invention, which includes the following steps:
  • Step 601 Determine a topology structure of the instruction association identification model.
  • the instruction association recognition model may adopt a neural network model; a DNN is taken as an example here.
  • the model topology mainly includes an input layer, hidden layers, and an output layer. Each node of the input layer receives a corresponding acoustic or semantic relevance feature; for example, with the three acoustic features above, the input layer has 4 nodes. The hidden layers are the same as those of a common DNN, generally 3-7 layers. The output layer has 3 nodes, outputting the three instruction association relationships, that is, interference, supplement, and independent.
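  • A sketch of this classifier under stated assumptions (PyTorch; the hidden width of 64 is arbitrary, while the 4 inputs and 3 outputs follow the text):

```python
import torch.nn as nn

def build_instruction_model(hidden=64, num_hidden_layers=3):
    layers, in_dim = [], 4            # 3 acoustic features + 1 relevance value
    for _ in range(num_hidden_layers):
        layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
        in_dim = hidden
    layers.append(nn.Linear(in_dim, 3))  # interference / supplement / independent
    return nn.Sequential(*layers)
```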
  • Step 602 Collect a large amount of interactive voice data involving multiple participants as training data, and label the training data with inter-role association relationships;
  • Step 603 Extract an instruction association feature of the training data.
  • the instruction association feature is the aforementioned acoustic feature and semantic relevance feature;
  • the acoustic feature includes: the average volume of the voice segment, the signal-to-noise ratio of the voice segment, and the relationship angle between the voice segment and the main microphone;
  • the semantic relevance feature is a semantic relevance value, which can be extracted from each speech segment of the training data and the corresponding semantic understanding result.
  • the semantic relevance feature can be extracted based on the semantic relevance model.
  • Step 604 Train the instruction association recognition model by using the instruction association feature and the annotation information.
  • When recognizing instruction relationships, the extracted instruction association feature is input into the instruction association recognition model, and the inter-role instruction relationship in the current speech segment is obtained from the model output.
  • the intelligent voice interaction method, addressing the characteristics of interactive scenes in which multiple people participate, determines whether the received user interaction voice data is a single voice; if not, the data is analyzed in more detail and more accurately to obtain the inter-role instruction relationships when multiple people participate in the interaction, and an interactive response is made according to those relationships. This solves the intent-understanding errors and erroneous system responses caused by traditional voice interaction schemes that do not consider multi-person interaction, and effectively improves the user experience.
  • an embodiment of the present invention further provides an intelligent voice interaction system, as shown in FIG. 7, which is a schematic structural diagram of the system, and the system includes the following modules:
  • the receiving module 701 is configured to receive user interaction voice data.
  • the voice recognition module 702 is configured to perform voice recognition on the interactive voice data to obtain the recognized text.
  • the semantic understanding module 703 is configured to perform semantic understanding on the recognized text to obtain a semantic understanding result
  • the determining module 704 is configured to determine whether the current voice segment is a single voice
  • the response module 705 is configured to respond to the semantic understanding result after the determining module 704 determines that the current voice segment is a single voice.
  • the instruction relationship identification module 706 is configured to, after the determining module 704 determines that the current voice segment is not a single voice, determine the inter-role instruction relationship in the current voice segment according to the current voice segment and its corresponding semantic understanding result;
  • the response module 705 is further configured to respond according to the inter-role command relationship determined by the instruction relationship identification module 706.
  • That is, if the current voice segment is a single voice, the response module 705 responds directly to the semantic understanding result; otherwise it responds according to the inter-role instruction relationship in the recognition result. For example, if the second half is interference with the first half, only the intent of the first half is responded to; if the second half supplements the first half, the intent of the whole sentence is responded to; and if the two halves are independent (i.e. the second half restarts a new round of dialogue), only the intent of the second half is responded to. This avoids responding erroneously when multiple people participate in the interaction, and improves the user experience.
  • When the determining module 704 determines whether the current voice segment is a single voice, the prior art may be used, for example multi-speaker recognition technology, or a model-based manner may be adopted, for example a speaker turning point judgment model pre-built by a speaker turning point judgment model building module. The speaker turning point judgment model building module may be a part of the system of the present invention, or may be independent of it.
  • the speaker turning point judgment model may adopt a deep neural network, such as DNN, RNN, CNN, etc.
  • a specific structure of the speaker turning point judgment model building module may include the following units:
  • a first topology determining unit configured to determine a topology structure of the speaker turning point judgment model
  • a first data collecting unit configured to collect a plurality of interactive voice data including multiple participants, and perform turning point labeling on the interactive voice data
  • the first parameter training unit is configured to use the interactive voice data and the annotation information to obtain a speaker turning point judgment model parameter.
  • a specific structure of the determining module 704 may include the following units:
  • a spectrum feature extraction unit configured to extract a spectral feature for each frame of speech in the current speech segment
  • a turning point determining unit configured to input the extracted spectral feature into the speaker turning point judgment model, and determine, according to the output of the speaker turning point judgment model, whether there is a turning point in each frame of voice;
  • the determining unit is configured to determine that the current voice segment is not a single voice when at least one frame of voice in the current voice segment has a turning point; otherwise, determine that the current voice segment is a single voice.
  • the command relationship identification module 706 may specifically extract the instruction association feature from the current speech segment and its corresponding semantic understanding result, and then use the features to determine the command relationship between the roles in the current speech segment.
  • a specific structure of the instruction relationship identification module 706 includes: an instruction association feature extraction unit 761 and an instruction relationship determining unit 762, wherein the instruction association feature extraction unit 761 is configured to extract the instruction association feature from the current speech segment and its corresponding semantic understanding result, and the instruction relationship determining unit 762 is configured to determine the inter-role instruction relationship in the current speech segment according to the instruction association feature.
  • the instruction association feature includes: an acoustic feature and a semantic relevance feature; the acoustic feature includes any one or more of the following: the average volume of the voice segment, the signal-to-noise ratio of the voice segment, and the relationship angle between the voice segment and the main microphone.
  • the semantic relevance feature is a semantic relevance value.
  • the instruction association feature extraction unit may include the following subunits:
  • An acoustic feature extraction subunit configured to extract the acoustic feature from the current speech segment, which may specifically use the prior art;
  • a specific structure of the semantic relevance model building module includes the following units:
  • a second topology determining unit configured to determine a topology of the semantic relevance model
  • a second data collecting unit configured to collect a plurality of interactive voice data including multiple participants as training data, and perform semantic relevance labeling on the training data;
  • a semantic correlation feature extraction unit configured to extract semantic related features of the training data
  • the second training unit is configured to train with the semantic related features and the annotation information to obtain the semantic relevance model.
  • the semantic relevance feature extraction sub-unit may first extract semantic-related features from semantic understanding results corresponding to the current speech segment; and then input the semantic-related features into the semantic relevance model. According to the output of the semantic relevance model, the semantic relevance value of the current speech segment can be obtained.
  • semantic relevance model building module may be used as a part of the system of the present invention, or may be independent of the system of the present invention.
  • the command relationship determining unit 762 may specifically determine the command relationship between the roles in the current voice segment by using a model-based manner.
  • the instruction association recognition model is pre-built by an instruction association recognition model building module.
  • a third topology determining unit configured to determine a topology of the instruction association identification model
  • the third data collecting unit collects a large amount of interactive voice data involving multiple participants as training data, and labels the training data with inter-role association relationships;
  • An instruction association feature extraction unit configured to extract an instruction association feature of the training data
  • the third training unit is configured to train the instruction association recognition model by using the instruction association feature and the annotation information.
  • In specific applications, the instruction association feature may be input into the instruction association recognition model, and the inter-role instruction relationship in the current speech segment is obtained according to the output of the instruction association recognition model.
  • the intelligent voice interaction system, addressing the characteristics of interaction scenes in which multiple people participate, determines whether the received user interaction voice data is a single voice; if not, the data is analyzed in more detail and more accurately to obtain the inter-role instruction relationships when multiple people participate in the interaction, and an interactive response is made according to those relationships. This solves the intent-understanding errors and erroneous system responses caused by traditional voice interaction schemes that do not consider multi-person interaction, and effectively improves the user experience.
  • the intelligent voice interaction system of the invention can be applied to various human-computer interaction devices or devices, has strong adaptability to the interactive environment, and has high response accuracy.
  • FIG. 11 is another schematic structural diagram of the intelligent voice interaction system according to the embodiment of the present invention.
  • the system includes a processor 111 and a memory 112 that are interconnected.
  • the memory 112 is used to store program instructions and can also be used to store data of the processor 111 during processing.
  • the processor 111 is configured to execute the program instructions to perform the intelligent voice interaction method in the above embodiments.
  • the intelligent voice interaction system can be any device with information processing capability such as a robot, a mobile phone, or a computer.
  • the processor 111 may also be referred to as a CPU (Central Processing Unit).
  • the processor 111 may be an integrated circuit chip with signal processing capabilities.
  • the processor 111 can also be a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • the general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed in the present invention are an intelligent voice interaction method and system. The method comprises: receiving user interaction voice; performing voice recognition and semantic understanding on the interaction voice to obtain recognized text and a semantic understanding result; determining whether the current voice segment is voice of a single person; if yes, responding according to the semantic understanding result; otherwise, determining an instruction relationship between roles in the current voice segment according to the current voice segment and the corresponding semantic understanding result, and then responding according to the instruction relationship between the roles. The present invention can improve the responding accuracy rate in a human-machine interaction environment in which multiple persons participate, and improve the user experience.

Description

Intelligent voice interaction method and system
[Technical Field]
The present invention relates to the field of speech signal processing and natural language understanding, and in particular to an intelligent voice interaction method and system.
[Background Art]
With the continuous advancement of artificial intelligence technology, human-machine voice interaction has made great progress: voice assistant apps and human-computer interaction robots of all kinds have emerged, and the desire for natural, convenient human-computer interaction has reached an unprecedented height. Existing human-computer interaction methods mostly determine the user's effective interactive voice based on endpoint detection technology, then perform speech recognition and semantic understanding on that voice, and finally respond according to the semantic understanding result. However, human-computer interaction often involves multiple people. In that case, the voices of the different roles may interfere with one another, supplement one another, or constitute independent interactive instructions; existing methods nevertheless recognize, semantically understand, and respond to the voice data of several people as a single voice instruction, which may ultimately lead to an erroneous interaction.
[Summary of the Invention]
The embodiments of the invention provide an intelligent voice interaction method and system, so as to avoid erroneous understanding and responses in interaction scenarios involving multiple people.
To this end, the present invention provides the following technical solutions:
An intelligent voice interaction method, the method comprising:
receiving user interaction voice data;
performing speech recognition and semantic understanding on the interactive voice data to obtain recognized text and a semantic understanding result;
determining whether the current voice segment is a single voice;
if yes, responding according to the semantic understanding result;
otherwise, determining the inter-role instruction relationship in the current voice segment according to the current voice segment and its corresponding semantic understanding result, and then responding according to that inter-role instruction relationship.
Preferably, the method further comprises pre-building a speaker turning point judgment model, the building process of which comprises:
determining the topology of the speaker turning point judgment model;
collecting a large amount of interactive voice data involving multiple participants, and labeling turning points on the interactive voice data;
training on the interactive voice data and the annotation information to obtain the speaker turning point judgment model parameters;
and the determining whether the current voice segment is a single voice comprises:
for each frame of speech in the current voice segment, extracting its spectral feature;
inputting the extracted spectral features into the speaker turning point judgment model, and determining from the model output whether each frame of speech has a turning point;
if at least one frame of speech in the current voice segment has a turning point, determining that the current voice segment is not a single voice; otherwise, determining that the current voice segment is a single voice.
Preferably, this determination comprises: if multiple consecutive frames of speech in the current voice segment have turning points, determining that the current voice segment is not a single voice; otherwise, determining that the current voice segment is a single voice.
Preferably, determining the inter-role instruction relationship in the current voice segment according to the current voice segment and its corresponding semantic understanding result comprises:
extracting the instruction association feature from the current voice segment and its corresponding semantic understanding result;
determining the inter-role instruction relationship in the current voice segment according to the instruction association feature.
Preferably, the instruction association feature comprises an acoustic feature and a semantic relevance feature; the acoustic feature comprises any one or more of the following: the average volume of the voice segment, the signal-to-noise ratio of the voice segment, and the relationship angle between the voice segment and the main microphone, where the relationship angle refers to the angle between the horizontal line and the line connecting the sound source of the voice segment to the main microphone; the semantic relevance feature is a semantic relevance value;
and extracting the instruction association feature from the current voice segment and its corresponding semantic understanding result comprises:
extracting the acoustic feature from the current voice segment;
determining the semantic relevance value of the current voice segment according to the semantic understanding result corresponding to the current voice segment.
Preferably, the method further comprises pre-building a semantic relevance model, the building process of which comprises:
determining the topology of the semantic relevance model;
collecting a large amount of interactive voice data involving multiple participants as training data, and labeling the training data with semantic relevance;
extracting the semantic related features of the training data;
training with the semantic related features and the annotation information to obtain the semantic relevance model;
and determining the semantic relevance value of the current voice segment according to the semantic understanding result corresponding to the current voice segment comprises:
extracting semantic related features from the semantic understanding result corresponding to the current voice segment;
inputting the semantic related features into the semantic relevance model, and obtaining the semantic relevance value of the current voice segment according to the output of the semantic relevance model.
Preferably, the semantic related features include: the text word vector corresponding to the interactive voice data, and the service type involved in the user instruction in the interactive voice data.
Preferably, the method further comprises pre-building an instruction association recognition model, the building process of which comprises:
determining the topology of the instruction association recognition model;
collecting a large amount of interactive voice data involving multiple participants as training data, and labeling the training data with inter-role association relationships;
extracting the instruction association features of the training data;
training with the instruction association features and the annotation information to obtain the instruction association recognition model;
and determining the inter-role instruction relationship in the current voice segment according to the instruction association feature comprises:
inputting the instruction association feature into the instruction association recognition model, and obtaining the inter-role instruction relationship in the current voice segment according to the output of the instruction association recognition model.
Preferably, the inter-role instruction relationships include: interference, supplement, and independent.
An intelligent voice interaction system, the system comprising:
a receiving module, configured to receive user interaction voice data;
a voice recognition module, configured to perform speech recognition on the interactive voice data to obtain recognized text;
a semantic understanding module, configured to perform semantic understanding on the recognized text to obtain a semantic understanding result;
a determining module, configured to determine whether the current voice segment is a single voice;
a response module, configured to respond to the semantic understanding result after the determining module determines that the current voice segment is a single voice;
an instruction relationship identification module, configured to, after the determining module determines that the current voice segment is not a single voice, determine the inter-role instruction relationship in the current voice segment according to the current voice segment and its corresponding semantic understanding result;
the response module being further configured to respond according to the inter-role instruction relationship determined by the instruction relationship identification module.
Preferably, the system further includes a speaker turning point judgment model building module, configured to pre-build the speaker turning point judgment model; the speaker turning point judgment model building module includes:
a first topology determining unit, configured to determine the topology of the speaker turning point judgment model;
a first data collecting unit, configured to collect a large amount of interactive voice data involving multiple participants, and to label turning points on the interactive voice data;
a first parameter training unit, configured to train on the interactive voice data and the annotation information to obtain the speaker turning point judgment model parameters;
and the determining module includes:
a spectral feature extraction unit, configured to extract the spectral feature of each frame of speech in the current voice segment;
a turning point determining unit, configured to input the extracted spectral features into the speaker turning point judgment model and determine from the model output whether each frame of speech has a turning point;
a determining unit, configured to determine that the current voice segment is not a single voice when at least one frame of voice in the current voice segment has a turning point, and otherwise to determine that the current voice segment is a single voice.
Preferably, the determining unit is specifically configured to determine that the current voice segment is not a single voice when multiple consecutive frames of voice in the current voice segment have turning points, and otherwise to determine that the current voice segment is a single voice.
Preferably, the instruction relationship identification module comprises:
an instruction association feature extraction unit, configured to extract the instruction association feature from the current voice segment and its corresponding semantic understanding result;
an instruction relationship determining unit, configured to determine the inter-role instruction relationship in the current voice segment according to the instruction association feature.
Preferably, the instruction association feature comprises an acoustic feature and a semantic relevance feature; the acoustic feature comprises any one or more of the following: the average volume of the voice segment, the signal-to-noise ratio of the voice segment, and the relationship angle between the voice segment and the main microphone, where the relationship angle refers to the angle between the horizontal line and the line connecting the sound source of the voice segment to the main microphone; the semantic relevance feature is a semantic relevance value;
and the instruction association feature extraction unit includes:
an acoustic feature extraction subunit, configured to extract the acoustic feature from the current voice segment;
a semantic relevance feature extraction subunit, configured to determine the semantic relevance value of the current voice segment according to the semantic understanding result corresponding to the current voice segment.
Preferably, the system further includes a semantic relevance model building module, configured to pre-build the semantic relevance model; the semantic relevance model building module includes:
a second topology determining unit, configured to determine the topology of the semantic relevance model;
a second data collecting unit, configured to collect a large amount of interactive voice data involving multiple participants as training data, and to label the training data with semantic relevance;
a semantic related feature extraction unit, configured to extract the semantic related features of the training data;
a second training unit, configured to train with the semantic related features and the annotation information to obtain the semantic relevance model;
and the semantic relevance feature extraction subunit is specifically configured to extract semantic related features from the semantic understanding result corresponding to the current voice segment, input them into the semantic relevance model, and obtain the semantic relevance value of the current voice segment according to the output of the semantic relevance model.
Preferably, the semantic related features include: the text word vector corresponding to the interactive voice data, and the service type involved in the user instruction in the interactive voice data.
优选地,所述系统还包括:指令关联识别模型构建模块,用于预先构建指令关联识别模型;所述指令关联识别模型构建模块包括;Preferably, the system further includes: an instruction association identification model building module, configured to pre-build an instruction association recognition model; and the instruction association recognition model construction module includes:
第三拓扑结构确定单元,用于确定指令关联识别模型的拓扑结构;a third topology determining unit, configured to determine a topology of the instruction association identification model;
第三数据收集单元,收集大量包含多人参与的交互语音数据作为训练数据,并对所述训练数据进行角色间关联关系标注;The third data collecting unit collects a plurality of interactive voice data including multiple participants as training data, and performs the relationship between the roles of the training data;
指令关联特征提取单元,用于提取所述训练数据的指令关联特征;An instruction association feature extraction unit, configured to extract an instruction association feature of the training data;
第三训练单元,用于利用所述指令关联特征及标注信息训练得到指令关联识别模型;a third training unit, configured to use the instruction association feature and the annotation information to train the instruction association recognition model;
所述指令关系确定单元,具体用于将所述指令关联特征输入所述指令关联识别模型,根据所述指令关联识别模型的输出得到当前语音段中各角色间指令关系。The instruction relationship determining unit is specifically configured to input the instruction association feature into the instruction association recognition model, and obtain an instruction relationship between each character in the current voice segment according to the output of the instruction association recognition model.
优选地,所述各角色间指令关系包括:干扰、补充和独立。Preferably, the inter-character instruction relationships include: interference, supplementation, and independence.
An intelligent voice interaction device includes a processor and a memory connected to each other;
the memory is configured to store program instructions;
the processor is configured to run the program instructions to perform the following:
receiving user interactive voice data;
performing speech recognition and semantic understanding on the interactive voice data to obtain recognized text and a semantic understanding result;
determining whether the current speech segment is single-speaker speech;
if so, responding according to the semantic understanding result;
otherwise, determining the inter-role instruction relationship in the current speech segment according to the current speech segment and its corresponding semantic understanding result, and then responding according to the inter-role instruction relationship.
Preferably, the processor is further configured to build a speaker turning point judgment model in advance, wherein building the speaker turning point judgment model includes:
determining the topology of the speaker turning point judgment model;
collecting a large amount of interactive voice data involving multiple participants, and annotating the interactive voice data with turning points;
training the speaker turning point judgment model parameters using the interactive voice data and the annotation information.
Determining whether the current speech segment is single-speaker speech includes:
extracting spectral features for each frame of speech in the current speech segment;
inputting the extracted spectral features into the speaker turning point judgment model, and determining from the model output whether each frame of speech has a turning point;
if at least one frame of speech in the current speech segment has a turning point, determining that the current speech segment is not single-speaker speech; otherwise, determining that the current speech segment is single-speaker speech.
Determining the inter-role instruction relationship in the current speech segment according to the current speech segment and its corresponding semantic understanding result, as performed by the processor, includes:
extracting instruction association features from the current speech segment and its corresponding semantic understanding result;
determining the inter-role instruction relationship in the current speech segment according to the instruction association features.
In another embodiment, the processor is configured to implement any of the intelligent voice interaction methods described above.
The intelligent voice interaction method and system provided by the embodiments of the present invention address the characteristics of interaction scenarios involving multiple participants: for received user interactive voice data, the system determines whether the data is single-speaker speech; if not, it performs a finer-grained and more accurate analysis of the interaction data to obtain the relationships between the instructions of the various roles, and responds appropriately according to those relationships. This solves the problems of misunderstood user intent and incorrect system responses that arise in traditional voice interaction schemes, which do not consider interaction involving multiple participants, thereby effectively improving the user experience.
[Description of the Drawings]
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings in the following description are merely some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of an intelligent voice interaction method according to an embodiment of the present invention;
FIG. 2 is a flowchart of building the speaker turning point judgment model in an embodiment of the present invention;
FIG. 3 is a timing diagram of the speaker turning point judgment model in an embodiment of the present invention;
FIG. 4 is a flowchart of building the semantic relevance model in an embodiment of the present invention;
FIG. 5 is a schematic diagram of the topology of the semantic relevance model in an embodiment of the present invention;
FIG. 6 is a flowchart of building the instruction association recognition model in an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an intelligent voice interaction system according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a specific structure of the instruction relationship identification module in an embodiment of the present invention;
FIG. 9 is a schematic diagram of the relation angle between a speech segment and the primary microphone in an embodiment of the present invention;
FIG. 10 is another schematic diagram of the relation angle between a speech segment and the primary microphone in an embodiment of the present invention;
FIG. 11 is another schematic structural diagram of the intelligent voice interaction system according to an embodiment of the present invention.
[Detailed Description of the Embodiments]
In order to enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings and specific implementations.
In existing voice interaction systems, user voice instructions are delimited solely by endpoint detection, without considering situations in which multiple people are speaking. As a result, within one round of interaction the second half of an utterance may be interference with the first half, a supplement to the first half, or one of two entirely independent sub-instructions. If these cases are not distinguished, the system may derive an incorrect instruction and respond incorrectly, degrading the user experience. In view of this, an embodiment of the present invention provides an intelligent voice interaction method that, for interaction scenarios involving multiple participants, performs a finer-grained and more accurate analysis of the interactive voice data to obtain the relationships between the instructions of the various roles, and responds appropriately according to those relationships.
As shown in FIG. 1, which is a flowchart of an intelligent voice interaction method according to an embodiment of the present invention, the method includes the following steps:
Step 101: receive user interactive voice data.
Specifically, an audio stream may be processed with existing endpoint detection techniques to obtain the valid speech in the stream as the user's interactive voice. Endpoint detection requires a pause duration threshold eos (usually 0.5s-1s): if a speech pause exceeds this threshold, the audio stream is cut at that point, and the resulting segment is taken as valid user interactive speech.
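By way of illustration only (the patent does not prescribe an implementation), a minimal energy-based endpoint detection sketch in Python might look as follows; the frame length, energy threshold, and the particular eos value are assumptions:

```python
import numpy as np

def detect_segments(samples, sr, frame_ms=25, energy_thresh=1e-4, eos_s=0.8):
    """Cut an audio stream into speech segments at pauses longer than eos_s.

    A frame is treated as speech when its mean energy exceeds energy_thresh;
    a run of non-speech frames longer than eos_s ends the current segment.
    """
    frame_len = int(sr * frame_ms / 1000)
    max_silence = int(eos_s * 1000 / frame_ms)   # pause threshold in frames
    segments, start, silence = [], None, 0
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        is_speech = np.mean(frame.astype(np.float64) ** 2) > energy_thresh
        if is_speech:
            if start is None:
                start = i * frame_len
            silence = 0
        elif start is not None:
            silence += 1
            if silence > max_silence:            # pause exceeded eos: cut here
                segments.append((start, (i - silence + 1) * frame_len))
                start, silence = None, 0
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments
```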
Step 102: perform speech recognition and semantic understanding on the interactive voice data to obtain recognized text and a semantic understanding result.
Speech recognition may be performed in real time, i.e., the content spoken by the user up to the current moment is recognized as it arrives. Specifically, a decoding network is built from an acoustic model and a language model; the network contains all candidate recognition result paths up to the current moment, and the path with the highest decoding score is selected as the recognition result at the current moment. When new user interactive voice data is received, the highest-scoring recognition result path is re-selected and the previous recognition result is updated.
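Purely as an illustrative sketch of this re-selection step (the Hypothesis structure and the decoder API hinted at in the comments are assumptions, not part of the patent):

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    words: list = field(default_factory=list)  # partial recognition result
    score: float = 0.0                         # combined acoustic + LM score

def best_result(hypotheses):
    """Select the highest-scoring candidate path as the current result."""
    return max(hypotheses, key=lambda h: h.score).words

# On each new chunk of audio, an (assumed) decoder extends every candidate
# path, and the displayed result is re-selected, possibly revising output:
# hypotheses = decoder.extend(hypotheses, new_audio_chunk)  # assumed API
# current_text = best_result(hypotheses)
```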
Semantic understanding of the speech recognition result may use existing techniques, such as grammar-rule-based semantic understanding, ontology-knowledge-base-based semantic understanding, or model-based semantic understanding; the present invention is not limited in this respect.
Step 103: determine whether the current speech segment is single-speaker speech. If so, go to step 104; otherwise, go to step 105.
Existing techniques, such as multi-speaker recognition, may be used to determine whether the current speech segment is single-speaker speech.
Step 104: respond according to the semantic understanding result.
The specific response may be, for example, generating a response text and feeding it back to the user, or performing a specific action corresponding to the semantic understanding result; the embodiments of the present invention are not limited in this respect. A response text may be fed back to the user by voice broadcast; for a specific operation, the result of the operation may be presented to the user.
Step 105: determine the inter-role instruction relationship in the current speech segment according to the current speech segment and its corresponding semantic understanding result.
Specifically, instruction association features may first be extracted from the current speech segment and its corresponding semantic understanding result; the inter-role instruction relationship in the current speech segment is then determined according to the instruction association features.
Step 106: respond according to the inter-role instruction relationship.
Specifically, the response may be made according to the inter-role instruction relationship and a preset response policy. For example, if the second half of the utterance is interference with the first half, only the intent of the first half is answered; if the second half supplements the first half, the intent of the whole sentence is answered; if the two halves are independent (i.e., a new round of dialogue is started), only the intent of the second half is answered.
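A minimal sketch of such a response policy follows (illustrative only; the relation labels come from the embodiments, while the handler names are assumptions):

```python
def respond(relation, first_half_intent, second_half_intent, full_intent):
    """Dispatch on the inter-role instruction relationship.

    relation is one of 'interference', 'supplement', 'independent',
    matching the three relationships defined in the embodiments.
    """
    if relation == "interference":      # second half interferes: keep first half
        return execute(first_half_intent)
    if relation == "supplement":        # second half completes the first half
        return execute(full_intent)
    if relation == "independent":       # new round of dialogue: keep second half
        return execute(second_half_intent)
    raise ValueError(f"unknown relation: {relation}")

def execute(intent):                    # placeholder for the actual responder
    return f"responding to intent: {intent}"
```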
Further, in step 103 above, when determining whether the current speech segment is single-speaker speech, the embodiments of the present invention may also use a method based on a speaker turning point judgment model. Specifically, a speaker turning point judgment model may be built in advance and used to determine whether the current speech segment is single-speaker speech.
As shown in FIG. 2, the process of building the speaker turning point judgment model in an embodiment of the present invention includes the following steps:
Step 201: determine the topology of the speaker turning point judgment model.
The topology of the speaker turning point judgment model may use a neural network such as a DNN (deep neural network), RNN (recurrent neural network), or CNN (convolutional neural network). Taking a BiLSTM (bidirectional long short-term memory network) as an example, a BiLSTM can exploit both historical and future information and is therefore well suited to speaker turning point judgment.
The topology of the speaker turning point judgment model mainly includes an input layer, hidden layers, and an output layer. The input layer takes the spectral features of each frame of speech, such as 39-dimensional PLP (Perceptual Linear Predictive) features; the hidden part contains, for example, 2 layers; the output layer has 2 nodes, producing a 2-dimensional vector that indicates whether there is a turning point: 1 for a turning point and 0 for none.
FIG. 3 is a timing diagram of the speaker turning point judgment model, where F1-Ft denote the spectral feature vectors fed to the input layer nodes, and h1-ht denote the output vectors of the hidden layer nodes.
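For illustration only, a BiLSTM with the dimensions mentioned above (39-dimensional PLP input, 2 hidden layers, 2-node output) might be sketched in PyTorch as follows; the hidden size is an assumption:

```python
import torch
import torch.nn as nn

class TurningPointModel(nn.Module):
    """Per-frame speaker turning point classifier (sketch)."""

    def __init__(self, feat_dim=39, hidden=128):
        super().__init__()
        # 2 BiLSTM layers: each frame sees both past and future context
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        # 2 output nodes: turning point (1) vs. no turning point (0)
        self.out = nn.Linear(2 * hidden, 2)

    def forward(self, frames):             # frames: (batch, T, 39)
        h, _ = self.lstm(frames)           # (batch, T, 2*hidden)
        return self.out(h)                 # per-frame logits, (batch, T, 2)

# model = TurningPointModel()
# logits = model(torch.randn(1, 200, 39))   # 200 frames of PLP features
# has_turn = logits.argmax(dim=-1)          # 0/1 decision per frame
```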
Step 202: collect a large amount of interactive voice data involving multiple participants, and annotate the interactive voice data with turning points.
Step 203: train the speaker turning point judgment model parameters using the interactive voice data and the annotation information.
The model parameters may be trained with existing techniques, such as the BPTT (backpropagation through time) algorithm, which is not described in detail here.
Accordingly, based on the speaker turning point judgment model, when determining whether the current speech segment is single-speaker speech, the corresponding spectral features may be extracted from each frame of the current speech segment and fed into the speaker turning point judgment model; the model output then indicates whether each frame of speech has a turning point. A turning point indicates that the speech before and after it belongs to different speakers; accordingly, if any frame of the current speech segment has a turning point, the current speech segment is determined not to be single-speaker speech. Alternatively, to avoid false alarms, the current speech segment may be determined not to be single-speaker speech only when multiple consecutive frames (for example, 5 consecutive frames) all have turning points; otherwise, the current speech segment is determined to be single-speaker speech.
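The consecutive-frame rule could be sketched as follows (illustrative; the window of 5 frames follows the example above):

```python
def is_single_speaker(turn_flags, min_consecutive=5):
    """Return True if the segment looks like single-speaker speech.

    turn_flags: per-frame 0/1 turning point decisions from the model.
    The segment is rejected only when min_consecutive frames in a row
    are all flagged, which suppresses isolated false alarms.
    """
    run = 0
    for flag in turn_flags:
        run = run + 1 if flag else 0
        if run >= min_consecutive:
            return False        # sustained turning point: multiple speakers
    return True
```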
As mentioned above, when determining the inter-role instruction relationship in the current speech segment, instruction association features may first be extracted from the current speech segment and its corresponding semantic understanding result, and the inter-role instruction relationship in the current speech segment may then be determined according to those features.
The instruction association features include acoustic features and a semantic relevance feature. The acoustic features include any one or more of the following: the average volume of the speech segment, the signal-to-noise ratio of the speech segment, and the relation angle between the speech segment and the primary microphone, where the relation angle refers to the angle between the horizontal line and the line connecting the sound source of the speech segment to the primary microphone; FIG. 9 and FIG. 10 show this angle θ for a linear microphone array and a circular microphone array, respectively. These acoustic features can be obtained from the current speech segment. The semantic relevance feature may be expressed as a value between 0 and 1, i.e., the semantic relevance value, which may be determined from the semantic understanding result of the current speech segment and a pre-built semantic relevance model.
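As an illustration, the first two acoustic features might be computed as below; the angle θ would come from the microphone array's direction-of-arrival estimate, which is treated as given here, and the noise-floor estimate is an assumption:

```python
import numpy as np

def acoustic_features(segment, noise_floor):
    """segment: waveform samples; noise_floor: estimated noise power."""
    power = np.mean(segment.astype(np.float64) ** 2)
    avg_volume_db = 10 * np.log10(power + 1e-12)            # average volume
    snr_db = 10 * np.log10(power / (noise_floor + 1e-12))   # signal-to-noise ratio
    return avg_volume_db, snr_db
```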
As shown in FIG. 4, the process of building the semantic relevance model in an embodiment of the present invention includes the following steps:
Step 401: determine the topology of the semantic relevance model.
The topology of the semantic relevance model may use a neural network, for example a DNN. As shown in FIG. 5, the text word vectors pass through convolution and linear transformation layers to obtain low-order word vector features, which are then concatenated with the service type features and fed into a DNN regression network that finally outputs a semantic relevance value between 0 and 1.
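A rough PyTorch rendering of this topology might look as follows (illustrative only; the layer widths and pooling choice are assumptions, while the word vector and service type dimensions follow step 403 below):

```python
import torch
import torch.nn as nn

class SemanticRelevanceModel(nn.Module):
    """Convolution + linear layers over word vectors, concatenated with the
    service type vector, followed by a DNN regressor with sigmoid output."""

    def __init__(self, word_dim=50, max_words=20, biz_dim=6, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(word_dim, hidden, kernel_size=3, padding=1)
        self.proj = nn.Linear(hidden, hidden)          # linear transformation
        self.regressor = nn.Sequential(
            nn.Linear(hidden + biz_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())        # relevance in (0, 1)

    def forward(self, words, biz):
        # words: (batch, max_words, word_dim); biz: (batch, biz_dim)
        h = self.conv(words.transpose(1, 2)).max(dim=2).values  # pooled
        h = torch.relu(self.proj(h))
        return self.regressor(torch.cat([h, biz], dim=1)).squeeze(1)
```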
Step 402: collect a large amount of interactive voice data involving multiple participants as training data, and annotate the training data with semantic relevance.
Step 403: extract the semantically related features of the training data.
The semantically related features include the text word vectors corresponding to the user interactive voice data and the service types involved in the user instructions. Text word vectors may be extracted with existing techniques, for example by using a known word embedding matrix to obtain the word vector of each word in the recognized text (e.g., 50-dimensional), and then concatenating the word vectors of the preceding and following speech segments into a fixed-length vector, zero-padded where necessary, for a total of, e.g., 50*20=1000 dimensions. The service type involved in a user instruction may be, for example, a 6-dimensional vector over chit-chat, ticket booking, weather, navigation, music, and nonsense.
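An illustrative sketch of this feature construction follows; the word-embedding lookup `embed` is an assumed input:

```python
import numpy as np

BIZ_TYPES = ["chit-chat", "ticket booking", "weather",
             "navigation", "music", "nonsense"]

def text_feature(words, embed, word_dim=50, max_words=20):
    """Stack per-word embeddings and zero-pad to a fixed 20x50 block,
    i.e. 50*20 = 1000 dimensions when flattened."""
    mat = np.zeros((max_words, word_dim), dtype=np.float32)
    for i, w in enumerate(words[:max_words]):
        mat[i] = embed[w]                 # assumed word-embedding lookup
    return mat

def biz_feature(biz_type):
    """One-hot 6-dimensional service type vector."""
    vec = np.zeros(len(BIZ_TYPES), dtype=np.float32)
    vec[BIZ_TYPES.index(biz_type)] = 1.0
    return vec
```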
Step 404: train the semantic relevance model using the semantically related features and the annotation information.
Further, in the embodiments of the present invention, the inter-role instruction relationship in a speech segment may also be determined using a pre-trained model: an instruction association recognition model is trained in advance, the extracted instruction association features are fed into the model, and the inter-role instruction relationship in the current speech segment is obtained from the model output.
As shown in FIG. 6, the process of building the instruction association recognition model in an embodiment of the present invention includes the following steps:
Step 601: determine the topology of the instruction association recognition model.
The instruction association recognition model may use a neural network model, for example a DNN. Its topology mainly includes an input layer, hidden layers, and an output layer. Each input layer node receives a corresponding acoustic or semantic relevance feature; if, preferably, all three acoustic features above are used, the input layer has 4 nodes. The hidden layers are ordinary DNN hidden layers, typically 3-7 layers. The output layer has 3 nodes, corresponding to the three instruction association relationships: interference, supplementation, and independence.
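For illustration, this classifier could be built as follows in PyTorch (the layer width is an assumption; 4 inputs = 3 acoustic features + 1 semantic relevance value, 3 outputs = interference / supplement / independent):

```python
import torch.nn as nn

def build_relation_model(hidden=32, depth=4):
    """4-input, 3-class DNN with `depth` hidden layers (3-7 per the text)."""
    layers, in_dim = [], 4
    for _ in range(depth):
        layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
        in_dim = hidden
    layers.append(nn.Linear(in_dim, 3))   # interference / supplement / independent
    return nn.Sequential(*layers)
```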
Step 602: collect a large amount of interactive voice data involving multiple participants as training data, and annotate the training data with inter-role association relationships.
The inter-role association relationships are the three relationships: interference, supplementation, and independence.
Step 603: extract the instruction association features of the training data.
The instruction association features are the acoustic features and the semantic relevance feature mentioned above. The acoustic features include the average volume of the speech segment, the signal-to-noise ratio of the speech segment, and the relation angle between the speech segment and the primary microphone. The semantic relevance feature is the semantic relevance value, which may be extracted from each speech segment of the training data and its corresponding semantic understanding result; the semantic relevance feature may be extracted using the semantic relevance model, as described above, which is not repeated here.
Step 604: train the instruction association recognition model using the instruction association features and the annotation information.
The model may be trained with existing techniques and is not described in detail here.
Based on the instruction association recognition model, when determining the inter-role instruction relationship in the current speech segment, the instruction association features extracted from the current speech segment and its corresponding semantic understanding result are fed into the instruction association recognition model, and the inter-role instruction relationship in the current speech segment is obtained from the model output.
The intelligent voice interaction method provided by the embodiments of the present invention addresses the characteristics of interaction scenarios involving multiple participants: for received user interactive voice data, it determines whether the data is single-speaker speech; if not, it performs a finer-grained and more accurate analysis of the interaction data to obtain the relationships between the instructions of the various roles, and responds appropriately according to those relationships. This solves the problems of misunderstood user intent and incorrect system responses that arise in traditional voice interaction schemes, which do not consider interaction involving multiple participants, thereby effectively improving the user experience.
Correspondingly, an embodiment of the present invention further provides an intelligent voice interaction system. FIG. 7 is a schematic structural diagram of the system, which includes the following modules:
a receiving module 701, configured to receive user interactive voice data;
a speech recognition module 702, configured to perform speech recognition on the interactive voice data to obtain recognized text;
a semantic understanding module 703, configured to perform semantic understanding on the recognized text to obtain a semantic understanding result;
a judgment module 704, configured to determine whether the current speech segment is single-speaker speech;
a response module 705, configured to respond to the semantic understanding result after the judgment module 704 determines that the current speech segment is single-speaker speech;
an instruction relationship identification module 706, configured to determine the inter-role instruction relationship in the current speech segment according to the current speech segment and its corresponding semantic understanding result after the judgment module 704 determines that the current speech segment is not single-speaker speech.
Correspondingly, in this embodiment, the response module 705 is further configured to respond according to the inter-role instruction relationship determined by the instruction relationship identification module 706.
That is, when the current speech is single-speaker speech, the response module 705 responds directly to the semantic understanding result; otherwise it responds according to the inter-role instruction relationship in the semantic understanding result. For example, if the second half of the utterance is interference with the first half, only the intent of the first half is answered; if the second half supplements the first half, the intent of the whole sentence is answered; if the two halves are independent (i.e., a new round of dialogue is started), only the intent of the second half is answered. This avoids incorrect responses when multiple people participate in the interaction and improves the user experience.
It should be noted that, when determining whether the current speech segment is single-speaker speech, the judgment module 704 may use existing techniques such as multi-speaker recognition, or a model-based approach, for example a speaker turning point judgment model built in advance by a speaker turning point judgment model building module. The speaker turning point judgment model building module may be part of the system of the present invention or independent of it; the embodiments of the present invention are not limited in this respect.
As described above, the speaker turning point judgment model may use a deep neural network such as a DNN, RNN, or CNN. A specific structure of the speaker turning point judgment model building module may include the following units:
a first topology determination unit, configured to determine the topology of the speaker turning point judgment model;
a first data collection unit, configured to collect a large amount of interactive voice data involving multiple participants, and to annotate the interactive voice data with turning points;
a first parameter training unit, configured to train the speaker turning point judgment model parameters using the interactive voice data and the annotation information.
Accordingly, based on the speaker turning point judgment model, a specific structure of the judgment module 704 may include the following units:
a spectral feature extraction unit, configured to extract, for each frame of speech in the current speech segment, its spectral features;
a turning point determination unit, configured to input the extracted spectral features into the speaker turning point judgment model and determine, according to the model output, whether each frame of speech has a turning point;
a judgment unit, configured to determine that the current speech segment is not single-speaker speech when at least one frame of speech in the current speech segment has a turning point, and otherwise to determine that the current speech segment is single-speaker speech.
The instruction relationship identification module 706 may specifically extract instruction association features from the current speech segment and its corresponding semantic understanding result, and then use these features to determine the inter-role instruction relationship in the current speech segment. As shown in FIG. 8, a specific structure of the instruction relationship identification module 706 includes an instruction association feature extraction unit 761 and an instruction relationship determination unit 762, where the instruction association feature extraction unit 761 is configured to extract instruction association features from the current speech segment and its corresponding semantic understanding result, and the instruction relationship determination unit 762 is configured to determine the inter-role instruction relationship in the current speech segment according to the instruction association features.
The instruction association features include acoustic features and a semantic relevance feature. The acoustic features include any one or more of the following: the average volume of the speech segment, the signal-to-noise ratio of the speech segment, and the relation angle between the speech segment and the primary microphone. The semantic relevance feature is a semantic relevance value. Accordingly, the instruction association feature extraction unit may include the following subunits:
an acoustic feature extraction subunit, configured to extract the acoustic features from the current speech segment, for which existing techniques may be used;
a semantic relevance feature extraction subunit, configured to determine the semantic relevance value of the current speech segment according to the semantic understanding result corresponding to the current speech segment, for example in a model-based manner using a semantic relevance model built in advance by a semantic relevance model building module.
A specific structure of the semantic relevance model building module includes the following units:
a second topology determination unit, configured to determine the topology of the semantic relevance model;
a second data collection unit, configured to collect a large amount of interactive voice data involving multiple participants as training data, and to annotate the training data with semantic relevance;
a semantic feature extraction unit, configured to extract the semantically related features of the training data;
a second training unit, configured to train the semantic relevance model using the semantically related features and the annotation information.
Accordingly, based on the semantic relevance model, the semantic relevance feature extraction subunit may first extract the semantically related features from the semantic understanding result corresponding to the current speech segment, then input them into the semantic relevance model, and obtain the semantic relevance value of the current speech segment from the model output.
It should be noted that the semantic relevance model building module may be part of the system of the present invention or independent of it; the embodiments of the present invention are not limited in this respect.
The instruction relationship determination unit 762 may specifically determine the inter-role instruction relationship in the current speech segment in a model-based manner, for example using an instruction association recognition model built in advance by an instruction association recognition model building module.
A specific structure of the instruction association recognition model building module includes the following units:
a third topology determination unit, configured to determine the topology of the instruction association recognition model;
a third data collection unit, configured to collect a large amount of interactive voice data involving multiple participants as training data, and to annotate the training data with inter-role association relationships;
an instruction association feature extraction unit, configured to extract the instruction association features of the training data;
a third training unit, configured to train the instruction association recognition model using the instruction association features and the annotation information.
Accordingly, based on the instruction association recognition model, the instruction relationship determination unit 762 may input the instruction association features into the instruction association recognition model and obtain the inter-role instruction relationship in the current speech segment from the model output.
The intelligent voice interaction system provided by the embodiments of the present invention addresses the characteristics of interaction scenarios involving multiple participants: for received user interactive voice data, it determines whether the data is single-speaker speech; if not, it performs a finer-grained and more accurate analysis of the interaction data to obtain the relationships between the instructions of the various roles, and responds appropriately according to those relationships. This solves the problems of misunderstood user intent and incorrect system responses that arise in traditional voice interaction schemes, which do not consider interaction involving multiple participants, thereby effectively improving the user experience. The intelligent voice interaction system of the present invention can be applied to various human-computer interaction devices or apparatuses, adapts well to the interaction environment, and responds with high accuracy.
An embodiment of the present invention further provides another intelligent voice interaction system. FIG. 11 is another schematic structural diagram of the intelligent voice interaction system according to an embodiment of the present invention.
In this embodiment, the system includes a processor 111 and a memory 112 connected to each other. The memory 112 is configured to store program instructions and may also store data used by the processor 111 during processing. The processor 111 is configured to run the program instructions to perform the intelligent voice interaction method of the above embodiments.
Specifically, the intelligent voice interaction system may be any device with information processing capability, such as a robot, a mobile phone, or a computer. The processor 111 may also be called a CPU (central processing unit). The processor 111 may be an integrated circuit chip with signal processing capability. The processor 111 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. Moreover, the system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement them without creative effort.
The embodiments of the present invention have been described in detail above, and specific implementations are used herein to illustrate the present invention. The description of the above embodiments is only intended to help understand the method and apparatus of the present invention. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementations and the scope of application according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (20)

  1. An intelligent voice interaction method, wherein the method comprises:
    receiving user interactive voice data;
    performing speech recognition and semantic understanding on the interactive voice data to obtain recognized text and a semantic understanding result;
    determining whether the current speech segment is single-speaker speech;
    if so, responding according to the semantic understanding result;
    otherwise, determining the inter-role instruction relationship in the current speech segment according to the current speech segment and its corresponding semantic understanding result, and then responding according to the inter-role instruction relationship.
  2. The method according to claim 1, wherein the method further comprises building a speaker turning point judgment model in advance, wherein building the speaker turning point judgment model comprises:
    determining the topology of the speaker turning point judgment model;
    collecting a large amount of interactive voice data involving multiple participants, and annotating the interactive voice data with turning points;
    training the speaker turning point judgment model parameters using the interactive voice data and the annotation information;
    wherein determining whether the current speech segment is single-speaker speech comprises:
    extracting spectral features for each frame of speech in the current speech segment;
    inputting the extracted spectral features into the speaker turning point judgment model, and determining from the output of the speaker turning point judgment model whether each frame of speech has a turning point;
    if at least one frame of speech in the current speech segment has a turning point, determining that the current speech segment is not single-speaker speech; otherwise, determining that the current speech segment is single-speaker speech.
  3. The method according to claim 2, wherein determining that the current speech segment is not single-speaker speech if at least one frame of speech in the current speech segment has a turning point, and otherwise determining that the current speech segment is single-speaker speech, comprises:
    if multiple consecutive frames of speech in the current speech segment all have turning points, determining that the current speech segment is not single-speaker speech; otherwise, determining that the current speech segment is single-speaker speech.
  4. The method according to claim 1, wherein determining the inter-role instruction relationship in the current speech segment according to the current speech segment and its corresponding semantic understanding result comprises:
    extracting instruction association features from the current speech segment and its corresponding semantic understanding result;
    determining the inter-role instruction relationship in the current speech segment according to the instruction association features.
  5. The method according to claim 4, wherein the instruction association features comprise acoustic features and a semantic relevance feature; the acoustic features comprise any one or more of the following: the average volume of the speech segment, the signal-to-noise ratio of the speech segment, and the relation angle between the speech segment and the primary microphone, wherein the relation angle refers to the angle between the horizontal line and the line connecting the sound source of the speech segment to the primary microphone; and the semantic relevance feature is a semantic relevance value;
    wherein extracting the instruction association features from the current speech segment and its corresponding semantic understanding result comprises:
    extracting the acoustic features from the current speech segment;
    determining the semantic relevance value of the current speech segment according to the semantic understanding result corresponding to the current speech segment.
  6. The method according to claim 5, wherein the method further comprises building a semantic relevance model in advance, wherein building the semantic relevance model comprises:
    determining the topology of the semantic relevance model;
    collecting a large amount of interactive voice data involving multiple participants as training data, and annotating the training data with semantic relevance;
    extracting the semantically related features of the training data;
    training the semantic relevance model using the semantically related features and the annotation information;
    wherein determining the semantic relevance value of the current speech segment according to the semantic understanding result corresponding to the current speech segment comprises:
    extracting semantically related features from the semantic understanding result corresponding to the current speech segment;
    inputting the semantically related features into the semantic relevance model, and obtaining the semantic relevance value of the current speech segment from the output of the semantic relevance model.
  7. The method according to claim 6, wherein the semantically related features comprise: text word vectors corresponding to the interactive voice data, and service types involved in the user instructions in the interactive voice data.
  8. The method according to claim 4, wherein the method further comprises building an instruction association recognition model in advance, wherein building the instruction association recognition model comprises:
    determining the topology of the instruction association recognition model;
    collecting a large amount of interactive voice data involving multiple participants as training data, and annotating the training data with inter-role association relationships;
    extracting the instruction association features of the training data;
    training the instruction association recognition model using the instruction association features and the annotation information;
    wherein determining the inter-role instruction relationship in the current speech segment according to the instruction association features comprises:
    inputting the instruction association features into the instruction association recognition model, and obtaining the inter-role instruction relationship in the current speech segment from the output of the instruction association recognition model.
  9. The method according to claim 4, wherein the inter-role instruction relationships comprise: interference, supplementation, and independence.
  10. An intelligent voice interaction system, wherein the system comprises:
    a receiving module, configured to receive user interactive voice data;
    a speech recognition module, configured to perform speech recognition on the interactive voice data to obtain recognized text;
    a semantic understanding module, configured to perform semantic understanding on the recognized text to obtain a semantic understanding result;
    a judgment module, configured to determine whether the current speech segment is single-speaker speech;
    a response module, configured to respond to the semantic understanding result after the judgment module determines that the current speech segment is single-speaker speech;
    an instruction relationship identification module, configured to determine the inter-role instruction relationship in the current speech segment according to the current speech segment and its corresponding semantic understanding result after the judgment module determines that the current speech segment is not single-speaker speech;
    wherein the response module is further configured to respond according to the inter-role instruction relationship determined by the instruction relationship identification module.
  11. The system according to claim 10, wherein the system further comprises a speaker turning point judgment model building module, configured to build the speaker turning point judgment model in advance, the speaker turning point judgment model building module comprising:
    a first topology determination unit, configured to determine the topology of the speaker turning point judgment model;
    a first data collection unit, configured to collect a large amount of interactive voice data involving multiple participants, and to annotate the interactive voice data with turning points;
    a first parameter training unit, configured to train the speaker turning point judgment model parameters using the interactive voice data and the annotation information;
    wherein the judgment module comprises:
    a spectral feature extraction unit, configured to extract, for each frame of speech in the current speech segment, its spectral features;
    a turning point determination unit, configured to input the extracted spectral features into the speaker turning point judgment model and determine, according to the output of the speaker turning point judgment model, whether each frame of speech has a turning point;
    a judgment unit, configured to determine that the current speech segment is not single-speaker speech when at least one frame of speech in the current speech segment has a turning point, and otherwise to determine that the current speech segment is single-speaker speech.
  12. The system according to claim 11, wherein the judgment unit is specifically configured to determine that the current speech segment is not single-speaker speech when multiple consecutive frames of speech in the current speech segment all have turning points, and otherwise to determine that the current speech segment is single-speaker speech.
  13. The system according to claim 10, wherein the instruction relationship identification module comprises:
    an instruction association feature extraction unit, configured to extract instruction association features from the current speech segment and its corresponding semantic understanding result;
    an instruction relationship determination unit, configured to determine the inter-role instruction relationship in the current speech segment according to the instruction association features.
  14. The system according to claim 13, wherein the instruction association features comprise acoustic features and a semantic relevance feature; the acoustic features comprise any one or more of the following: the average volume of the speech segment, the signal-to-noise ratio of the speech segment, and the relation angle between the speech segment and the main microphone, the relation angle being the angle between the horizontal line and the line connecting the sound source of the speech segment to the main microphone; the semantic relevance feature is a semantic relevance value;
    the instruction association feature extraction unit comprises:
    an acoustic feature extraction subunit, configured to extract the acoustic features from the current speech segment;
    a semantic relevance feature extraction subunit, configured to determine the semantic relevance value of the current speech segment according to the semantic understanding result corresponding to the current speech segment.
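  As a rough illustration of the claim-14 acoustic features (a sketch under assumed conventions, not the patented implementation): the average volume can be taken from the mean amplitude, the signal-to-noise ratio from speech versus non-speech energy, and the relation angle from source and microphone coordinates; the vad_mask and the coordinate inputs below are assumptions about what a microphone-array front end would supply.

    import numpy as np

    def acoustic_features(samples, vad_mask, source_xy, mic_xy):
        """Sketch of the claim-14 acoustic features for one speech segment.

        samples:   1-D numpy array of audio samples for the segment.
        vad_mask:  boolean array marking speech samples (assumed to come from a VAD).
        source_xy: (x, y) of the sound source, assumed given by the array front end.
        mic_xy:    (x, y) of the main microphone.
        """
        # Average volume, expressed here in dB relative to full scale (one convention).
        avg_volume_db = 20 * np.log10(np.mean(np.abs(samples)) + 1e-12)

        # Signal-to-noise ratio: speech-sample energy over non-speech-sample energy.
        # Assumes the segment contains both speech and non-speech samples.
        speech = samples[vad_mask]
        noise = samples[~vad_mask]
        snr_db = 10 * np.log10((np.mean(speech ** 2) + 1e-12) /
                               (np.mean(noise ** 2) + 1e-12))

        # Relation angle: angle between the horizontal line and the line from the
        # main microphone to the sound source.
        dx, dy = source_xy[0] - mic_xy[0], source_xy[1] - mic_xy[1]
        angle_deg = np.degrees(np.arctan2(dy, dx))

        return {"avg_volume_db": avg_volume_db, "snr_db": snr_db, "angle_deg": angle_deg}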
  15. The system according to claim 14, wherein the system further comprises a semantic relevance model construction module, configured to construct a semantic relevance model in advance; the semantic relevance model construction module comprises:
    a second topology determination unit, configured to determine the topology of the semantic relevance model;
    a second data collection unit, configured to collect a large amount of interactive voice data involving multiple speakers as training data, and to annotate the training data with semantic relevance;
    a semantic correlation feature extraction unit, configured to extract semantic correlation features from the training data;
    a second training unit, configured to train the semantic relevance model using the semantic correlation features and the annotation information;
    the semantic relevance feature extraction subunit is specifically configured to extract semantic correlation features from the semantic understanding result corresponding to the current speech segment, to input the semantic correlation features into the semantic relevance model, and to obtain the semantic relevance value of the current speech segment from the output of the semantic relevance model.
  16. The system according to claim 15, wherein the semantic correlation features comprise: the text word vectors corresponding to the interactive voice data, and the service type involved in the user instruction in the interactive voice data.
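  A minimal sketch of how the claim-15 and claim-16 features might be assembled and scored, assuming a hypothetical word-embedding lookup embed, an illustrative list of service types, and a hypothetical pre-trained relevance_model; the feature layout and the interfaces are assumptions, not the patented design.

    import numpy as np

    # Illustrative service types a user instruction might involve (assumed list).
    SERVICE_TYPES = ["navigation", "music", "phone", "weather", "chat"]

    def semantic_relevance_value(tokens, service_type, embed, relevance_model):
        """Score how relevant the current utterance is to the ongoing interaction.

        tokens:          words of the recognized text, assumed non-empty.
        service_type:    service type inferred for the user instruction (claim 16).
        embed:           hypothetical lookup, token -> word vector (claim 16).
        relevance_model: hypothetical model; predict(features) -> value in [0, 1].
        """
        # Text word vectors, averaged into one fixed-size segment vector.
        text_vec = np.mean([embed(tok) for tok in tokens], axis=0)

        # Service type encoded as a one-hot vector and appended to the text vector.
        type_vec = np.zeros(len(SERVICE_TYPES))
        type_vec[SERVICE_TYPES.index(service_type)] = 1.0

        return relevance_model.predict(np.concatenate([text_vec, type_vec]))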
  17. The system according to claim 13, wherein the system further comprises an instruction association recognition model construction module, configured to construct an instruction association recognition model in advance; the instruction association recognition model construction module comprises:
    a third topology determination unit, configured to determine the topology of the instruction association recognition model;
    a third data collection unit, configured to collect a large amount of interactive voice data involving multiple speakers as training data, and to annotate the training data with the association relationships among the roles;
    an instruction association feature extraction unit, configured to extract the instruction association features of the training data;
    a third training unit, configured to train the instruction association recognition model using the instruction association features and the annotation information;
    the instruction relationship determination unit is specifically configured to input the instruction association features into the instruction association recognition model, and to obtain the instruction relationships among the roles in the current speech segment from the output of the instruction association recognition model.
  18. The system according to claim 17, wherein the instruction relationships among the roles comprise: interference, supplementation, and independence.
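  Tying claims 13, 17 and 18 together, an illustrative classifier step (assumed feature names taken from the sketches above and a hypothetical pre-trained relation_model, not the patented implementation): the acoustic features and the semantic relevance value are concatenated and mapped to one of the three inter-role instruction relationships.

    import numpy as np

    # The three inter-role instruction relationships named in claim 18.
    RELATIONS = ["interference", "supplementation", "independence"]

    def classify_instruction_relation(acoustic, relevance_value, relation_model):
        """Map instruction association features to an inter-role relationship.

        acoustic:        dict produced by the acoustic_features sketch above.
        relevance_value: semantic relevance value in [0, 1].
        relation_model:  hypothetical classifier; predict_proba(x) -> three scores.
        """
        features = np.array([
            acoustic["avg_volume_db"],
            acoustic["snr_db"],
            acoustic["angle_deg"],
            relevance_value,
        ])
        scores = relation_model.predict_proba(features)
        return RELATIONS[int(np.argmax(scores))]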
  19. An intelligent voice interaction system, comprising a processor and a memory connected to each other;
    the memory is configured to store program instructions;
    the processor is configured to run the program instructions to perform the following:
    receiving user interactive voice data;
    performing speech recognition and semantic understanding on the interactive voice data to obtain a recognized text and a semantic understanding result;
    determining whether the current speech segment is single-speaker speech;
    if so, responding according to the semantic understanding result;
    otherwise, determining the instruction relationships among the roles in the current speech segment according to the current speech segment and its corresponding semantic understanding result, and then responding according to the instruction relationships among the roles.
  20. The system according to claim 19, wherein the processor is further configured to construct a speaker turning-point judgment model in advance, and the construction process of the speaker turning-point judgment model comprises:
    determining the topology of the speaker turning-point judgment model;
    collecting a large amount of interactive voice data involving multiple speakers, and annotating turning points in the interactive voice data;
    training the parameters of the speaker turning-point judgment model using the interactive voice data and the annotation information;
    the determining whether the current speech segment is single-speaker speech comprises:
    extracting spectral features from each frame of speech in the current speech segment;
    inputting the extracted spectral features into the speaker turning-point judgment model, and determining from the output of the model whether each frame of speech contains a turning point;
    if at least one frame of speech in the current speech segment contains a turning point, determining that the current speech segment is not single-speaker speech; otherwise, determining that the current speech segment is single-speaker speech;
    the determining, performed by the processor, of the instruction relationships among the roles in the current speech segment according to the current speech segment and its corresponding semantic understanding result comprises:
    extracting instruction association features from the current speech segment and its corresponding semantic understanding result;
    determining the instruction relationships among the roles in the current speech segment according to the instruction association features.
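  Finally, a high-level sketch of the claim-19 control flow, reusing the hypothetical helpers from the sketches above; the segment attributes, the asr/nlu/respond stand-ins, and the policy of acting unless the competing speech is mere interference are all illustrative assumptions rather than the patented behavior.

    def handle_segment(segment, asr, nlu, turning_point_model, relation_model, respond):
        """One pass of the claim-19 flow for a single speech segment (illustrative)."""
        text = asr(segment)        # speech recognition -> recognized text
        semantics = nlu(text)      # semantic understanding -> result dict (assumed)

        if is_single_speaker(segment.frames, turning_point_model):
            # Single-speaker speech: respond to the semantic result directly.
            respond(semantics)
        else:
            # Multi-speaker speech: first determine the inter-role relationship.
            acoustic = acoustic_features(segment.samples, segment.vad_mask,
                                         segment.source_xy, segment.mic_xy)
            relation = classify_instruction_relation(
                acoustic, semantics.get("relevance", 0.0), relation_model)
            # Assumed policy: act unless the competing speech is mere interference.
            if relation != "interference":
                respond(semantics)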
PCT/CN2018/096705 2017-08-09 2018-07-23 Intelligent voice interaction method and system WO2019029352A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710676203.6A CN107437415B (en) 2017-08-09 2017-08-09 Intelligent voice interaction method and system
CN201710676203.6 2017-08-09

Publications (1)

Publication Number Publication Date
WO2019029352A1 (en)

Family

ID=60460483

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/096705 WO2019029352A1 (en) 2017-08-09 2018-07-23 Intelligent voice interaction method and system

Country Status (2)

Country Link
CN (1) CN107437415B (en)
WO (1) WO2019029352A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107437415B (en) * 2017-08-09 2020-06-02 科大讯飞股份有限公司 Intelligent voice interaction method and system
CN108159687B (en) * 2017-12-19 2021-06-04 芋头科技(杭州)有限公司 Automatic guidance system and intelligent sound box equipment based on multi-person interaction process
CN108053828A (en) * 2017-12-25 2018-05-18 无锡小天鹅股份有限公司 Determine the method, apparatus and household electrical appliance of control instruction
CN108197115B (en) * 2018-01-26 2022-04-22 上海智臻智能网络科技股份有限公司 Intelligent interaction method and device, computer equipment and computer readable storage medium
CN108520749A (en) * 2018-03-06 2018-09-11 杭州孚立计算机软件有限公司 A kind of voice-based grid-based management control method and control device
CN111819626A (en) * 2018-03-07 2020-10-23 华为技术有限公司 Voice interaction method and device
CN108766460B (en) * 2018-05-15 2020-07-10 浙江口碑网络技术有限公司 Voice-based interaction method and system
CN108874895B (en) * 2018-05-22 2021-02-09 北京小鱼在家科技有限公司 Interactive information pushing method and device, computer equipment and storage medium
CN108847225B (en) * 2018-06-04 2021-01-12 上海智蕙林医疗科技有限公司 Robot for multi-person voice service in airport and method thereof
CN109102803A (en) * 2018-08-09 2018-12-28 珠海格力电器股份有限公司 Control method, device, storage medium and the electronic device of household appliance
CN109065051B (en) * 2018-09-30 2021-04-09 珠海格力电器股份有限公司 Voice recognition processing method and device
WO2020211006A1 (en) * 2019-04-17 2020-10-22 深圳市欢太科技有限公司 Speech recognition method and apparatus, storage medium and electronic device
CN112992132A (en) * 2019-12-02 2021-06-18 浙江思考者科技有限公司 AI intelligent voice interaction program bridging one-key application applet
CN111081220B (en) * 2019-12-10 2022-08-16 广州小鹏汽车科技有限公司 Vehicle-mounted voice interaction method, full-duplex dialogue system, server and storage medium
CN111583956B (en) * 2020-04-30 2024-03-26 联想(北京)有限公司 Voice processing method and device
CN111785266A (en) * 2020-05-28 2020-10-16 博泰车联网(南京)有限公司 Voice interaction method and system
CN111897909B (en) * 2020-08-03 2022-08-05 兰州理工大学 Ciphertext voice retrieval method and system based on deep perceptual hashing
CN111968680A (en) * 2020-08-14 2020-11-20 北京小米松果电子有限公司 Voice processing method, device and storage medium
CN114822539A (en) * 2022-06-24 2022-07-29 深圳市友杰智新科技有限公司 Method, device, equipment and storage medium for decoding double-window voice

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800315A (en) * 2012-07-13 2012-11-28 上海博泰悦臻电子设备制造有限公司 Vehicle-mounted voice control method and system
CN104333956A (en) * 2014-11-19 2015-02-04 国网冀北电力有限公司廊坊供电公司 Control method and system for lighting equipment in relay protection machine room
CN104732969A (en) * 2013-12-23 2015-06-24 鸿富锦精密工业(深圳)有限公司 Voice processing system and method
US20160379638A1 (en) * 2015-06-26 2016-12-29 Amazon Technologies, Inc. Input speech quality matching
CN107437415A (en) * 2017-08-09 2017-12-05 科大讯飞股份有限公司 A kind of intelligent sound exchange method and system

Also Published As

Publication number Publication date
CN107437415A (en) 2017-12-05
CN107437415B (en) 2020-06-02

Similar Documents

Publication Publication Date Title
WO2019029352A1 (en) Intelligent voice interaction method and system
CN107993665B (en) Method for determining role of speaker in multi-person conversation scene, intelligent conference method and system
CN108000526B (en) Dialogue interaction method and system for intelligent robot
US11503155B2 (en) Interactive voice-control method and apparatus, device and medium
CN107665708B (en) Intelligent voice interaction method and system
CN108711421B (en) Speech recognition acoustic model establishing method and device and electronic equipment
CN108305643B (en) Method and device for determining emotion information
CN105427858B (en) Realize the method and system that voice is classified automatically
CN108701458B (en) Speech recognition
CN106331893B (en) Real-time caption presentation method and system
WO2017084197A1 (en) Smart home control method and system based on emotion recognition
US11574637B1 (en) Spoken language understanding models
KR20210070213A (en) Voice user interface
CN104036774A (en) Method and system for recognizing Tibetan dialects
CN110599999A (en) Data interaction method and device and robot
CN109119070A (en) A kind of sound end detecting method, device, equipment and storage medium
CN113314119B (en) Voice recognition intelligent household control method and device
JP6915637B2 (en) Information processing equipment, information processing methods, and programs
CN111192659A (en) Pre-training method for depression detection and depression detection method and device
CN111460143A (en) Emotion recognition model of multi-person conversation system
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN113314104A (en) Interactive object driving and phoneme processing method, device, equipment and storage medium
JP6306447B2 (en) Terminal, program, and system for reproducing response sentence using a plurality of different dialogue control units simultaneously
Preciado-Grijalva et al. Speaker fluency level classification using machine learning techniques
CN111128181B (en) Recitation question evaluating method, recitation question evaluating device and recitation question evaluating equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18843314

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18843314

Country of ref document: EP

Kind code of ref document: A1