WO2019029352A1 - Intelligent voice interaction method and system - Google Patents

Intelligent voice interaction method and system

Info

Publication number
WO2019029352A1
WO2019029352A1 (application PCT/CN2018/096705)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
segment
semantic
current
instruction
Prior art date
Application number
PCT/CN2018/096705
Other languages
French (fr)
Chinese (zh)
Inventor
李锐
陈志刚
王智国
胡国平
Original Assignee
科大讯飞股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 科大讯飞股份有限公司 (iFLYTEK Co., Ltd.)
Publication of WO2019029352A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0638 - Interactive procedures
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/1822 - Parsing for meaning understanding
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command
    • G10L15/26 - Speech to text systems
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/22 - Interactive procedures; Man-machine interfaces

Definitions

  • the present invention relates to the field of speech signal processing and natural language understanding, and in particular to an intelligent voice interaction method and system.
  • the embodiment of the invention provides an intelligent voice interaction method and system, so as to avoid erroneous understanding and response in an interaction scenario involving multiple people.
  • An intelligent voice interaction method comprising:
  • otherwise, the inter-role instruction relationship in the current voice segment is determined according to the current voice segment and its corresponding semantic understanding result, and a response is then made according to that inter-role instruction relationship.
  • the method further comprises: constructing a speaker turning point judgment model in advance, and the constructing process of the speaker turning point judgment model comprises:
  • the determining whether the current voice segment is a single voice includes:
  • if at least one frame of speech in the current speech segment has a turning point, it is determined that the current speech segment is not a single voice; otherwise, it is determined that the current speech segment is a single voice.
  • determining that the current voice segment is a single voice includes:
  • the determining, according to the current voice segment and the corresponding semantic understanding result, the command relationship between the roles in the current voice segment includes:
  • the instruction association feature comprises: an acoustic feature and a semantic relevance feature;
  • the acoustic feature comprises any one or more of the following: the average volume of the speech segment, the signal-to-noise ratio of the speech segment, and the relationship angle between the speech segment and the main microphone, where the relationship angle is the angle between the horizontal line and the line connecting the sound source to the main microphone;
  • the semantic relevance feature is a semantic relevance value;
  • Extracting the instruction association feature from the current speech segment and its corresponding semantic understanding result includes:
  • the semantic relevance value of the current speech segment is determined according to the semantic understanding result corresponding to the current speech segment.
  • the method further comprises: pre-establishing a semantic relevance model, wherein the construction process of the semantic relevance model comprises:
  • Determining the semantic relevance value of the current speech segment according to the semantic understanding result corresponding to the current speech segment includes:
  • the semantic related feature comprises: a text word vector corresponding to the interaction voice data, and a service type involved in the user instruction in the interaction voice data.
  • the method further includes: pre-building an instruction association recognition model, where the instruction association recognition model construction process includes:
  • the inter-character instruction relationships include: interference, supplementation, and independence.
  • An intelligent voice interaction system comprising:
  • a receiving module configured to receive user interaction voice data
  • a voice recognition module configured to perform voice recognition on the interactive voice data to obtain a recognized text
  • a semantic understanding module configured to perform semantic understanding on the recognized text, and obtain a semantic understanding result
  • a determining module configured to determine whether the current voice segment is a single voice
  • a response module configured to respond to the semantic understanding result after the determining module determines that the current voice segment is a single voice
  • the instruction relationship identification module is configured to, after the determining module determines that the current voice segment is not a single voice, determine the inter-role instruction relationship in the current voice segment according to the current voice segment and its corresponding semantic understanding result;
  • the system further includes: a speaker turning point judgment model building module, configured to pre-build a speaker turning point judgment model; and the speaker turning point judgment model building module includes:
  • a first topology determining unit configured to determine a topology structure of the speaker turning point judgment model
  • a first data collecting unit configured to collect a plurality of interactive voice data including multiple participants, and perform turning point labeling on the interactive voice data
  • a first parameter training unit configured to use the interactive voice data and the annotation information to obtain a speaker turning point judgment model parameter
  • the determining module includes:
  • a turning point determining unit configured to input the extracted spectral feature into the speaker turning point judgment model, and determine, according to the output of the speaker turning point judgment model, whether there is a turning point in each frame of voice;
  • the determining unit is configured to determine that the current voice segment is not a single voice when at least one frame of voice in the current voice segment has a turning point; otherwise, determine that the current voice segment is a single voice.
  • the determining unit is specifically configured to determine that the current voice segment is not a single voice when multiple consecutive frames of voice in the current voice segment have turning points; otherwise, determine that the current voice segment is a single voice.
  • the instruction relationship identification module comprises:
  • An instruction association feature extraction unit configured to extract an instruction association feature from a current speech segment and a corresponding semantic understanding result thereof
  • the command relationship determining unit is configured to determine an instruction relationship between the roles in the current voice segment according to the instruction association feature.
  • the instruction association feature comprises: an acoustic feature and a semantic relevance feature;
  • the acoustic feature comprises any one or more of the following: the average volume of the speech segment, the signal-to-noise ratio of the speech segment, and the relationship angle between the speech segment and the main microphone, where the relationship angle is the angle between the horizontal line and the line connecting the sound source to the main microphone;
  • the semantic relevance feature is a semantic relevance value;
  • the instruction association feature extraction unit includes:
  • An acoustic feature extraction subunit for extracting the acoustic feature from a current speech segment
  • the semantic relevance feature extraction sub-unit is configured to determine a semantic relevance value of the current speech segment according to a semantic understanding result corresponding to the current speech segment.
  • the system further comprises: a semantic relevance model building module, configured to pre-build a semantic relevance model; the semantic relevance model building module comprises:
  • a second topology determining unit configured to determine a topology of the semantic relevance model
  • a second data collecting unit configured to collect a plurality of interactive voice data including multiple participants as training data, and perform semantic relevance labeling on the training data;
  • a semantic correlation feature extraction unit configured to extract semantic related features of the training data
  • a second training unit configured to use the semantic related feature and the annotation information to obtain a semantic relevance model
  • the semantic relevance feature extraction sub-unit is specifically configured to extract semantic-related features from semantic understanding results corresponding to the current speech segment; input the semantic-related features into the semantic relevance model, according to the semantic relevance model The output gets the semantic relevance value of the current speech segment.
  • the semantic related feature comprises: a text word vector corresponding to the interaction voice data, and a service type involved in the user instruction in the interaction voice data.
  • the third data collecting unit collects a large amount of interactive voice data involving multiple participants as training data, and labels the training data with inter-role association relationships;
  • An instruction association feature extraction unit configured to extract an instruction association feature of the training data
  • a third training unit configured to use the instruction association feature and the annotation information to train the instruction association recognition model
  • the instruction relationship determining unit is specifically configured to input the instruction association feature into the instruction association recognition model, and obtain an instruction relationship between each character in the current voice segment according to the output of the instruction association recognition model.
  • the inter-character instruction relationships include: interference, supplementation, and independence.
  • An intelligent voice interaction device including an interconnected processor and a memory
  • the memory is configured to store program instructions
  • the processor is configured to execute the program instructions to perform:
  • otherwise, the inter-role instruction relationship in the current voice segment is determined according to the current voice segment and its corresponding semantic understanding result, and a response is then made according to that inter-role instruction relationship.
  • the processor is further configured to: construct a speaker turning point judgment model in advance, and the constructing process of the speaker turning point judgment model comprises:
  • the determining whether the current voice segment is a single voice includes:
  • if at least one frame of the current voice segment has a turning point, it is determined that the current voice segment is not a single voice; otherwise, it is determined that the current voice segment is a single voice;
  • Determining, by the processor, the instruction relationship between the roles in the current voice segment according to the current voice segment and its corresponding semantic understanding result includes:
  • the processor is configured to implement any of the above intelligent voice interaction methods.
  • the intelligent voice interaction method and system provided by the embodiments of the present invention address the characteristics of interactive scenes in which multiple people participate: they determine whether the received user interaction voice data is a single voice; if not, the interaction data is analyzed in more detail and more accurately to obtain the inter-role instruction relationships when multiple people participate in the interaction, and an interactive response is made according to those relationships. This solves the intent-understanding errors and erroneous system responses caused by traditional voice interaction schemes that do not consider multi-person interaction, and effectively improves the user experience.
  • FIG. 1 is a flowchart of an intelligent voice interaction method according to an embodiment of the present invention
  • FIG. 2 is a flow chart of constructing a speaker turning point judgment model in an embodiment of the present invention
  • FIG. 3 is a timing diagram of a speaker turning point judgment model in an embodiment of the present invention.
  • FIG. 4 is a flowchart of constructing a semantic relevance model in an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a topology structure of a semantic relevance model according to an embodiment of the present invention.
  • FIG. 6 is a flowchart of constructing an instruction association recognition model in an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of an intelligent voice interaction system according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of a specific structure of an instruction relationship identification module in an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of an angle between a voice segment and a main microphone in an embodiment of the present invention.
  • FIG. 10 is another schematic diagram of an angle between a voice segment and a main microphone in an embodiment of the present invention.
  • FIG. 11 is another schematic structural diagram of an intelligent voice interaction system according to an embodiment of the present invention.
  • an embodiment of the present invention provides an intelligent voice interaction method. Considering the characteristics of interactive scenes in which multiple people participate, the interactive voice data is analyzed and judged in more detail and more accurately, the inter-role instruction relationships are obtained when multiple people participate in the interaction, and a reasonable interactive response is made according to those relationships.
  • FIG. 1 it is a flowchart of an intelligent voice interaction method according to an embodiment of the present invention, which includes the following steps:
  • Step 101 Receive user interaction voice data.
  • specifically, the audio stream can be processed using existing endpoint detection technology to obtain the effective voice in the stream as the user's interactive voice.
  • the endpoint detection technique needs a preset pause duration threshold eos (usually 0.5 s to 1 s): if a voice pause lasts longer than this threshold, the audio stream is cut at that point, and the resulting voice segment is taken as one effective user interaction voice.
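  • As a minimal sketch (not the patent's implementation), the eos cut-off could be applied as follows, assuming per-frame voice-activity decisions are already available; all names are illustrative:

```python
import numpy as np

def segment_by_pause(frames, is_speech, frame_dur=0.01, eos=0.8):
    """Cut a frame stream into segments at pauses longer than eos seconds.

    frames: list of 1-D numpy arrays (one per frame);
    is_speech: per-frame booleans from a voice activity detector.
    """
    segments, current, silence = [], [], 0.0
    for frame, speech in zip(frames, is_speech):
        if speech:
            current.append(frame)
            silence = 0.0
        elif current:
            silence += frame_dur
            if silence > eos:                 # pause exceeded the threshold:
                segments.append(np.concatenate(current))  # close this segment
                current, silence = [], 0.0
    if current:                               # flush the trailing segment
        segments.append(np.concatenate(current))
    return segments
```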
  • Step 102 Perform speech recognition and semantic understanding on the interactive speech data to obtain a recognition text and a semantic understanding result.
  • the speech recognition can be performed in real time, that is, the content spoken by the user as of the current time is recognized in real time.
  • the decoding network is composed of an acoustic model and a language model.
  • the decoding network includes all candidate recognition result paths up to the current time, and the recognition result path with the largest decoding score is selected as the recognition result at the current time. After new user interaction voice data is received, the path with the largest score is re-selected and the previous recognition result is updated.
  • the semantic understanding of speech recognition results may be based on prior art techniques, such as semantic understanding based on grammar rules, semantic understanding based on ontology knowledge base, semantic understanding based on models, etc., and the present invention is not limited thereto.
  • Step 103 Determine whether the current voice segment is single voice. If yes, go to step 104; otherwise, go to step 105.
  • Step 104 respond according to the semantic understanding result.
  • the specific response manner may be, for example, generating a response text and feeding it back to the user, or performing a specific operation corresponding to the semantic understanding result; the embodiment of the present invention does not limit this. If it is a response text, the text can be fed back to the user by voice broadcast; if it is a specific operation, the result of the operation can be presented to the user.
  • Step 105 Determine an instruction relationship between the roles in the current voice segment according to the current voice segment and its corresponding semantic understanding result.
  • the instruction association feature may be first extracted from the current speech segment and its corresponding semantic understanding result; and then the inter-role instruction relationship in the current speech segment is determined according to the instruction association feature.
  • Step 106 respond according to the instruction relationship between the roles.
  • Specifically, the response may follow the inter-role instruction relationship and a preset response strategy. For example, if the second half of an utterance is interference with the first half, only the intent of the first half is responded to; if the second half supplements the first half, the intent of the whole sentence is responded to; and if the two halves are independent (i.e. the second half restarts a new round of dialogue), only the intent of the second half is responded to.
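  • A hedged sketch of this example strategy follows; respond() and the intent arguments are hypothetical placeholders rather than an API from the patent:

```python
def respond(intent):
    # Placeholder for the system's actual response generation.
    print(f"responding to: {intent}")

def respond_by_relationship(relation, first_intent, second_intent, merged_intent):
    if relation == "interference":
        respond(first_intent)        # second half is noise; keep the first intent
    elif relation == "supplement":
        respond(merged_intent)       # both halves form one whole-sentence intent
    elif relation == "independent":
        respond(second_intent)       # second half restarts a new round of dialogue
    else:
        raise ValueError(f"unknown relation: {relation}")
```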
  • the embodiment of the present invention may also adopt a method based on the speaker turning point judgment model.
  • the speaker turning point judgment model may be constructed in advance, and based on the speaker turning point judgment model, whether the current voice segment is a single voice is determined.
  • FIG. 2 it is a construction flow of a speaker turning point judgment model in the embodiment of the present invention, which includes the following steps:
  • Step 201 Determine a topology structure of the speaker turning point judgment model.
  • the topology of the speaker turning point judgment model may use a neural network, such as a DNN (Deep Neural Network), RNN (Recurrent Neural Network), or CNN (Convolutional Neural Network); a BiLSTM (Bidirectional Long Short-Term Memory network) is taken as an example here.
  • the topological structure of the speaker turning point judgment model mainly includes an input layer, hidden layers, and an output layer. The input of the input layer is the spectral feature of each frame of speech, such as a 39-dimensional PLP (Perceptual Linear Predictive) feature; there are, for example, 2 hidden layers; the output layer has 2 nodes, forming a 2-dimensional vector that indicates whether there is a turning point: 1 for a turning point, 0 for none.
  • FIG. 3 is a timing diagram showing a speaker turning point judging model, wherein F1 to Ft represent spectral feature vectors input by the input layer node, and h1 to ht are output vectors of each node of the hidden layer.
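  • A minimal sketch of the described topology, assuming PyTorch; the hidden width of 128 is an assumption, while the 39-dimensional PLP input, 2 hidden layers, and 2-node per-frame output follow the text:

```python
import torch.nn as nn

class TurningPointModel(nn.Module):
    def __init__(self, feat_dim=39, hidden=128, layers=2):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                              bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 2)   # per frame: turning point / none

    def forward(self, feats):                 # feats: (batch, frames, 39) PLP
        h, _ = self.bilstm(feats)             # (batch, frames, 2 * hidden)
        return self.out(h)                    # per-frame 2-class logits
```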
  • Step 202 Collect a plurality of interactive voice data including multiple participants, and perform turning point labeling on the interactive voice data.
  • Step 203 Train on the interactive voice data and the annotation information to obtain the speaker turning point judgment model parameters.
  • the specific training method for the model parameters may adopt the prior art, such as the BPTT (Back Propagation Through Time) algorithm, and will not be described in detail herein.
  • When determining whether the current voice segment is a single voice, the corresponding spectral feature may be extracted from each frame of the current voice segment and input into the speaker turning point judgment model; according to the model output, it can be determined whether each frame of speech has a turning point. A turning point indicates that the speech before and after it comes from different speakers. Correspondingly, if there is a turning point in the current voice segment, the current voice segment is determined not to be a single voice. Of course, in order to avoid misjudgment, it may instead be determined that the current voice segment is not a single voice only when multiple consecutive frames (such as five consecutive frames) in the current voice segment have turning points; otherwise, the current voice segment is determined to be a single voice.
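  • As an illustrative sketch of this decision rule (not the patent's code), the per-frame turning point flags output by the model can be smoothed as follows; min_consecutive=1 reproduces the at-least-one-frame rule, while min_consecutive=5 matches the five-consecutive-frames example:

```python
def is_single_voice(frame_has_turning_point, min_consecutive=5):
    """frame_has_turning_point: per-frame booleans from the judgment model."""
    run = 0
    for flagged in frame_has_turning_point:
        run = run + 1 if flagged else 0
        if run >= min_consecutive:   # enough consecutive flagged frames:
            return False             # a speaker change occurred in the segment
    return True                      # no turning point found: single voice
```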
  • When determining the inter-role instruction relationship, the instruction association feature may be extracted from the current speech segment and its corresponding semantic understanding result, and the inter-role instruction relationship in the current speech segment is then determined according to that feature.
  • the instruction association feature includes: an acoustic feature and a semantic relevance feature. The acoustic feature comprises any one or more of the following: the average volume of the voice segment, the signal-to-noise ratio of the voice segment, and the relationship angle between the voice segment and the main microphone, where the relationship angle refers to the angle between the horizontal line and the line connecting the sound source to the main microphone. FIG. 9 and FIG. 10 show, for a linear microphone array and a ring microphone array respectively, the angle θ between the horizontal line and the line connecting the voice segment's sound source to the main microphone.
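  • For illustration only, two of these acoustic features could be computed as in the sketch below; the relationship angle would come from the microphone array's sound-source localization, so it is passed in, and noise_floor_rms is an assumed input rather than anything specified in the text:

```python
import numpy as np

def acoustic_features(samples, noise_floor_rms, source_angle_deg):
    """samples: 1-D numpy array of the segment; angle supplied by the array."""
    rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2))
    avg_volume_db = 20 * np.log10(rms + 1e-10)                 # average volume
    snr_db = 20 * np.log10((rms + 1e-10) / (noise_floor_rms + 1e-10))  # SNR
    return np.array([avg_volume_db, snr_db, source_angle_deg])
```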
  • the semantic relevance feature may be represented by a value between 0 and 1, that is, a semantic relevance value, which may be determined according to the semantic understanding result corresponding to the current speech segment and a pre-built semantic relevance model.
  • FIG. 4 it is a flowchart of constructing a semantic relevance model in the embodiment of the present invention, which includes the following steps:
  • Step 401 Determine a topology structure of the semantic relevance model
  • the topology of the semantic relevance model may use a neural network; a DNN is taken as an example here.
  • the text word vector passes through convolution and linear transformation layers to obtain low-order word vector features, which are then spliced with the business type feature and sent to the DNN regression network, which finally outputs a semantic relevance value between 0 and 1.
  • Step 402 Collect a plurality of interactive voice data including multiple participants as training data, and perform semantic relevance labeling on the training data;
  • Step 403 extract semantic related features of the training data
  • the semantic related features include a text word vector corresponding to the user interaction voice data, and a service type involved in the user instruction.
  • the extraction of the text word vector can adopt the prior art, for example, using a known word embedding matrix to extract a word vector (such as 50 dimensions) for each word in the recognized text of the two speech segments before and after.
  • the service type involved in the user instruction can be, for example, a 6-dimensional vector covering chat, reservation, weather, navigation, music, and miscellaneous.
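  • A sketch of the topology described above, under stated assumptions (PyTorch; the convolution channels, pooling, and hidden width are not given in the text and are chosen for illustration):

```python
import torch
import torch.nn as nn

class SemanticRelevanceModel(nn.Module):
    def __init__(self, word_dim=50, biz_dim=6, conv_ch=32, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(word_dim, conv_ch, kernel_size=3, padding=1)
        self.proj = nn.Linear(conv_ch, conv_ch)        # linear transformation
        self.dnn = nn.Sequential(                      # regression network
            nn.Linear(conv_ch + biz_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())        # relevance in (0, 1)

    def forward(self, word_vecs, biz_type):            # (B, T, 50), (B, 6)
        h = self.conv(word_vecs.transpose(1, 2)).mean(dim=2)  # pool over words
        h = self.proj(h)                               # low-order word features
        return self.dnn(torch.cat([h, biz_type], dim=1)).squeeze(1)
```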
  • Step 404 Train with the semantic related features and the annotation information to obtain the semantic relevance model.
  • In a specific application, the determination of the inter-role instruction relationship in the voice segment may also be implemented with a pre-trained model; that is, an instruction association recognition model is trained in advance, the extracted instruction association feature is input into the model, and the inter-role instruction relationship in the current speech segment is obtained according to the model output.
  • FIG. 6 it is a flowchart of constructing an instruction association identification model in the embodiment of the present invention, which includes the following steps:
  • Step 601 Determine a topology structure of the instruction association identification model.
  • the instruction association recognition model may adopt a neural network model; a DNN is taken as an example here.
  • the model topology mainly includes an input layer, hidden layers, and an output layer. Each node of the input layer receives a corresponding acoustic or semantic relevance feature; for example, with the three acoustic features above, the input layer has 4 nodes. The hidden layers are the same as those of a common DNN, generally 3-7 layers. The output layer has 3 nodes, outputting the three instruction association relationships, that is, interference, supplement, and independent.
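  • A sketch of this classifier under stated assumptions (PyTorch; the hidden width of 64 is arbitrary, while the 4 inputs and 3 outputs follow the text):

```python
import torch.nn as nn

def build_instruction_model(hidden=64, num_hidden_layers=3):
    layers, in_dim = [], 4            # 3 acoustic features + 1 relevance value
    for _ in range(num_hidden_layers):
        layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
        in_dim = hidden
    layers.append(nn.Linear(in_dim, 3))  # interference / supplement / independent
    return nn.Sequential(*layers)
```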
  • Step 602 Collect a large amount of interactive voice data involving multiple participants as training data, and label the training data with inter-role association relationships;
  • Step 603 Extract an instruction association feature of the training data.
  • the instruction association feature is the aforementioned acoustic feature and semantic relevance feature;
  • the acoustic feature includes: the average volume of the voice segment, the signal-to-noise ratio of the voice segment, and the relationship angle between the voice segment and the main microphone;
  • the semantic relevance feature is a semantic relevance value, which can be extracted from each speech segment of the training data and the corresponding semantic understanding result.
  • the semantic relevance feature can be extracted based on the semantic relevance model.
  • Step 604 Train the instruction association recognition model by using the instruction association feature and the annotation information.
  • When recognizing instruction relationships, the extracted instruction association feature is input into the instruction association recognition model, and the inter-role instruction relationship in the current speech segment is obtained from the model output.
  • the intelligent voice interaction method, addressing the characteristics of interactive scenes in which multiple people participate, determines whether the received user interaction voice data is a single voice; if not, the data is analyzed in more detail and more accurately to obtain the inter-role instruction relationships when multiple people participate in the interaction, and an interactive response is made according to those relationships. This solves the intent-understanding errors and erroneous system responses caused by traditional voice interaction schemes that do not consider multi-person interaction, and effectively improves the user experience.
  • an embodiment of the present invention further provides an intelligent voice interaction system, as shown in FIG. 7, which is a schematic structural diagram of the system, and the system includes the following modules:
  • the receiving module 701 is configured to receive user interaction voice data.
  • the voice recognition module 702 is configured to perform voice recognition on the interactive voice data to obtain the recognized text.
  • the semantic understanding module 703 is configured to perform semantic understanding on the recognized text to obtain a semantic understanding result
  • the determining module 704 is configured to determine whether the current voice segment is a single voice
  • the response module 705 is configured to respond to the semantic understanding result after the determining module 704 determines that the current voice segment is a single voice.
  • the instruction relationship identification module 706 is configured to, after the determining module 704 determines that the current voice segment is not a single voice, determine the inter-role instruction relationship in the current voice segment according to the current voice segment and its corresponding semantic understanding result;
  • the response module 705 is further configured to respond according to the inter-role command relationship determined by the instruction relationship identification module 706.
  • That is, if the current voice segment is a single voice, the response module 705 responds directly to the semantic understanding result; otherwise it responds according to the inter-role instruction relationship in the recognition result. For example, if the second half is interference with the first half, only the intent of the first half is responded to; if the second half supplements the first half, the intent of the whole sentence is responded to; and if the two halves are independent (i.e. the second half restarts a new round of dialogue), only the intent of the second half is responded to. This avoids responding erroneously when multiple people participate in the interaction, and improves the user experience.
  • When the determining module 704 determines whether the current voice segment is a single voice, the prior art may be used, for example multi-speaker recognition technology, or a model-based manner may be adopted, for example a speaker turning point judgment model pre-built by a speaker turning point judgment model building module. The speaker turning point judgment model building module may be a part of the system of the present invention, or may be independent of it.
  • the speaker turning point judgment model may adopt a deep neural network, such as DNN, RNN, CNN, etc.
  • a specific structure of the speaker turning point judgment model building module may include the following units:
  • a first topology determining unit configured to determine a topology structure of the speaker turning point judgment model
  • a first data collecting unit configured to collect a plurality of interactive voice data including multiple participants, and perform turning point labeling on the interactive voice data
  • the first parameter training unit is configured to use the interactive voice data and the annotation information to obtain a speaker turning point judgment model parameter.
  • a specific structure of the determining module 704 may include the following units:
  • a spectrum feature extraction unit configured to extract a spectral feature for each frame of speech in the current speech segment
  • a turning point determining unit configured to input the extracted spectral feature into the speaker turning point judgment model, and determine, according to the output of the speaker turning point judgment model, whether there is a turning point in each frame of voice;
  • the determining unit is configured to determine that the current voice segment is not a single voice when at least one frame of voice in the current voice segment has a turning point; otherwise, determine that the current voice segment is a single voice.
  • the command relationship identification module 706 may specifically extract the instruction association feature from the current speech segment and its corresponding semantic understanding result, and then use the features to determine the command relationship between the roles in the current speech segment.
  • a specific structure of the instruction relationship identification module 706 includes: an instruction association feature extraction unit 761 and an instruction relationship determining unit 762, wherein the instruction association feature extraction unit 761 is configured to extract the instruction association feature from the current speech segment and its corresponding semantic understanding result, and the instruction relationship determining unit 762 is configured to determine the inter-role instruction relationship in the current speech segment according to the instruction association feature.
  • the instruction association feature includes: an acoustic feature and a semantic relevance feature; the acoustic feature includes any one or more of the following: the average volume of the voice segment, the signal-to-noise ratio of the voice segment, and the relationship angle between the voice segment and the main microphone.
  • the semantic relevance feature is a semantic relevance value.
  • the instruction association feature extraction unit may include the following subunits:
  • An acoustic feature extraction subunit configured to extract the acoustic feature from the current speech segment, which may specifically use the prior art;
  • a specific structure of the semantic relevance model building module includes the following units:
  • a second topology determining unit configured to determine a topology of the semantic relevance model
  • a second data collecting unit configured to collect a plurality of interactive voice data including multiple participants as training data, and perform semantic relevance labeling on the training data;
  • a semantic correlation feature extraction unit configured to extract semantic related features of the training data
  • the second training unit is configured to train with the semantic related features and the annotation information to obtain the semantic relevance model.
  • the semantic relevance feature extraction sub-unit may first extract semantic-related features from semantic understanding results corresponding to the current speech segment; and then input the semantic-related features into the semantic relevance model. According to the output of the semantic relevance model, the semantic relevance value of the current speech segment can be obtained.
  • semantic relevance model building module may be used as a part of the system of the present invention, or may be independent of the system of the present invention.
  • the command relationship determining unit 762 may specifically determine the command relationship between the roles in the current voice segment by using a model-based manner.
  • the instruction association recognition model is pre-built by an instruction association recognition model building module.
  • a third topology determining unit configured to determine a topology of the instruction association identification model
  • the third data collecting unit collects a large amount of interactive voice data involving multiple participants as training data, and labels the training data with inter-role association relationships;
  • An instruction association feature extraction unit configured to extract an instruction association feature of the training data
  • the third training unit is configured to train the instruction association recognition model by using the instruction association feature and the annotation information.
  • In specific applications, the instruction association feature may be input into the instruction association recognition model, and the inter-role instruction relationship in the current speech segment is obtained according to the output of the instruction association recognition model.
  • the intelligent voice interaction system, addressing the characteristics of interaction scenes in which multiple people participate, determines whether the received user interaction voice data is a single voice; if not, the data is analyzed in more detail and more accurately to obtain the inter-role instruction relationships when multiple people participate in the interaction, and an interactive response is made according to those relationships. This solves the intent-understanding errors and erroneous system responses caused by traditional voice interaction schemes that do not consider multi-person interaction, and effectively improves the user experience.
  • the intelligent voice interaction system of the invention can be applied to various human-computer interaction devices or devices, has strong adaptability to the interactive environment, and has high response accuracy.
  • FIG. 11 is another schematic structural diagram of the intelligent voice interaction system according to the embodiment of the present invention.
  • the system includes a processor 111 and a memory 112 that are interconnected.
  • the memory 112 is used to store program instructions and can also be used to store data of the processor 111 during processing.
  • the processor 111 is configured to execute the program instructions to perform the intelligent voice interaction method in the above embodiments.
  • the intelligent voice interaction system can be any device with information processing capability such as a robot, a mobile phone, or a computer.
  • the processor 111 may also be referred to as a CPU (Central Processing Unit).
  • the processor 111 may be an integrated circuit chip with signal processing capabilities.
  • the processor 111 can also be a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • the general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed in the present invention are an intelligent voice interaction method and system. The method comprises: receiving user interaction voice; performing voice recognition and semantic understanding on the interaction voice to obtain recognized text and a semantic understanding result; determining whether the current voice segment is voice of a single person; if yes, responding according to the semantic understanding result; otherwise, determining an instruction relationship between roles in the current voice segment according to the current voice segment and the corresponding semantic understanding result, and then responding according to the instruction relationship between the roles. The present invention can improve the responding accuracy rate in a human-machine interaction environment in which multiple persons participate, and improve the user experience.

Description

Intelligent voice interaction method and system
[Technical Field]
The present invention relates to the field of speech signal processing and natural language understanding, and in particular to an intelligent voice interaction method and system.
[Background Art]
With the continuous advancement of artificial intelligence technology, human-machine voice interaction has made great progress: voice assistant apps and human-computer interaction robots of all kinds have emerged, and the desire for natural, convenient human-computer interaction has reached an unprecedented height. Existing human-computer interaction methods mostly determine the user's effective interactive voice based on endpoint detection technology, then perform speech recognition and semantic understanding on that voice, and finally respond according to the semantic understanding result. However, human-computer interaction often involves multiple people. In that case, the voices of the different roles may interfere with one another, supplement one another, or constitute independent interactive instructions; existing methods nevertheless recognize, semantically understand, and respond to the voice data of several people as a single voice instruction, which may ultimately lead to an erroneous interaction.
[Summary of the Invention]
The embodiments of the invention provide an intelligent voice interaction method and system, so as to avoid erroneous understanding and responses in interaction scenarios involving multiple people.
To this end, the present invention provides the following technical solutions:
An intelligent voice interaction method, the method comprising:
receiving user interaction voice data;
performing speech recognition and semantic understanding on the interactive voice data to obtain recognized text and a semantic understanding result;
determining whether the current voice segment is a single voice;
if yes, responding according to the semantic understanding result;
otherwise, determining the inter-role instruction relationship in the current voice segment according to the current voice segment and its corresponding semantic understanding result, and then responding according to that inter-role instruction relationship.
Preferably, the method further comprises pre-building a speaker turning point judgment model, the building process of which comprises:
determining the topology of the speaker turning point judgment model;
collecting a large amount of interactive voice data involving multiple participants, and labeling turning points on the interactive voice data;
training on the interactive voice data and the annotation information to obtain the speaker turning point judgment model parameters;
and the determining whether the current voice segment is a single voice comprises:
for each frame of speech in the current voice segment, extracting its spectral feature;
inputting the extracted spectral features into the speaker turning point judgment model, and determining from the model output whether each frame of speech has a turning point;
if at least one frame of speech in the current voice segment has a turning point, determining that the current voice segment is not a single voice; otherwise, determining that the current voice segment is a single voice.
Preferably, this determination comprises: if multiple consecutive frames of speech in the current voice segment have turning points, determining that the current voice segment is not a single voice; otherwise, determining that the current voice segment is a single voice.
Preferably, determining the inter-role instruction relationship in the current voice segment according to the current voice segment and its corresponding semantic understanding result comprises:
extracting the instruction association feature from the current voice segment and its corresponding semantic understanding result;
determining the inter-role instruction relationship in the current voice segment according to the instruction association feature.
Preferably, the instruction association feature comprises an acoustic feature and a semantic relevance feature; the acoustic feature comprises any one or more of the following: the average volume of the voice segment, the signal-to-noise ratio of the voice segment, and the relationship angle between the voice segment and the main microphone, where the relationship angle refers to the angle between the horizontal line and the line connecting the sound source of the voice segment to the main microphone; the semantic relevance feature is a semantic relevance value;
and extracting the instruction association feature from the current voice segment and its corresponding semantic understanding result comprises:
extracting the acoustic feature from the current voice segment;
determining the semantic relevance value of the current voice segment according to the semantic understanding result corresponding to the current voice segment.
Preferably, the method further comprises pre-building a semantic relevance model, the building process of which comprises:
determining the topology of the semantic relevance model;
collecting a large amount of interactive voice data involving multiple participants as training data, and labeling the training data with semantic relevance;
extracting the semantic related features of the training data;
training with the semantic related features and the annotation information to obtain the semantic relevance model;
and determining the semantic relevance value of the current voice segment according to the semantic understanding result corresponding to the current voice segment comprises:
extracting semantic related features from the semantic understanding result corresponding to the current voice segment;
inputting the semantic related features into the semantic relevance model, and obtaining the semantic relevance value of the current voice segment according to the output of the semantic relevance model.
Preferably, the semantic related features include: the text word vector corresponding to the interactive voice data, and the service type involved in the user instruction in the interactive voice data.
Preferably, the method further comprises pre-building an instruction association recognition model, the building process of which comprises:
determining the topology of the instruction association recognition model;
collecting a large amount of interactive voice data involving multiple participants as training data, and labeling the training data with inter-role association relationships;
extracting the instruction association features of the training data;
training with the instruction association features and the annotation information to obtain the instruction association recognition model;
and determining the inter-role instruction relationship in the current voice segment according to the instruction association feature comprises:
inputting the instruction association feature into the instruction association recognition model, and obtaining the inter-role instruction relationship in the current voice segment according to the output of the instruction association recognition model.
Preferably, the inter-role instruction relationships include: interference, supplement, and independent.
An intelligent voice interaction system, the system comprising:
a receiving module, configured to receive user interaction voice data;
a voice recognition module, configured to perform speech recognition on the interactive voice data to obtain recognized text;
a semantic understanding module, configured to perform semantic understanding on the recognized text to obtain a semantic understanding result;
a determining module, configured to determine whether the current voice segment is a single voice;
a response module, configured to respond to the semantic understanding result after the determining module determines that the current voice segment is a single voice;
an instruction relationship identification module, configured to, after the determining module determines that the current voice segment is not a single voice, determine the inter-role instruction relationship in the current voice segment according to the current voice segment and its corresponding semantic understanding result;
the response module being further configured to respond according to the inter-role instruction relationship determined by the instruction relationship identification module.
Preferably, the system further includes a speaker turning point judgment model building module, configured to pre-build the speaker turning point judgment model; the speaker turning point judgment model building module includes:
a first topology determining unit, configured to determine the topology of the speaker turning point judgment model;
a first data collecting unit, configured to collect a large amount of interactive voice data involving multiple participants, and to label turning points on the interactive voice data;
a first parameter training unit, configured to train on the interactive voice data and the annotation information to obtain the speaker turning point judgment model parameters;
and the determining module includes:
a spectral feature extraction unit, configured to extract the spectral feature of each frame of speech in the current voice segment;
a turning point determining unit, configured to input the extracted spectral features into the speaker turning point judgment model and determine from the model output whether each frame of speech has a turning point;
a determining unit, configured to determine that the current voice segment is not a single voice when at least one frame of voice in the current voice segment has a turning point, and otherwise to determine that the current voice segment is a single voice.
Preferably, the determining unit is specifically configured to determine that the current voice segment is not a single voice when multiple consecutive frames of voice in the current voice segment have turning points, and otherwise to determine that the current voice segment is a single voice.
Preferably, the instruction relationship identification module comprises:
an instruction association feature extraction unit, configured to extract the instruction association feature from the current voice segment and its corresponding semantic understanding result;
an instruction relationship determining unit, configured to determine the inter-role instruction relationship in the current voice segment according to the instruction association feature.
Preferably, the instruction association feature comprises an acoustic feature and a semantic relevance feature; the acoustic feature comprises any one or more of the following: the average volume of the voice segment, the signal-to-noise ratio of the voice segment, and the relationship angle between the voice segment and the main microphone, where the relationship angle refers to the angle between the horizontal line and the line connecting the sound source of the voice segment to the main microphone; the semantic relevance feature is a semantic relevance value;
and the instruction association feature extraction unit includes:
an acoustic feature extraction subunit, configured to extract the acoustic feature from the current voice segment;
a semantic relevance feature extraction subunit, configured to determine the semantic relevance value of the current voice segment according to the semantic understanding result corresponding to the current voice segment.
Preferably, the system further includes a semantic relevance model building module, configured to pre-build the semantic relevance model; the semantic relevance model building module includes:
a second topology determining unit, configured to determine the topology of the semantic relevance model;
a second data collecting unit, configured to collect a large amount of interactive voice data involving multiple participants as training data, and to label the training data with semantic relevance;
a semantic related feature extraction unit, configured to extract the semantic related features of the training data;
a second training unit, configured to train with the semantic related features and the annotation information to obtain the semantic relevance model;
and the semantic relevance feature extraction subunit is specifically configured to extract semantic related features from the semantic understanding result corresponding to the current voice segment, input them into the semantic relevance model, and obtain the semantic relevance value of the current voice segment according to the output of the semantic relevance model.
Preferably, the semantic related features include: the text word vector corresponding to the interactive voice data, and the service type involved in the user instruction in the interactive voice data.
优选地,所述系统还包括:指令关联识别模型构建模块,用于预先构建指令关联识别模型;所述指令关联识别模型构建模块包括;Preferably, the system further includes: an instruction association identification model building module, configured to pre-build an instruction association recognition model; and the instruction association recognition model construction module includes:
第三拓扑结构确定单元,用于确定指令关联识别模型的拓扑结构;a third topology determining unit, configured to determine a topology of the instruction association identification model;
第三数据收集单元,收集大量包含多人参与的交互语音数据作为训练数据,并对所述训练数据进行角色间关联关系标注;The third data collecting unit collects a plurality of interactive voice data including multiple participants as training data, and performs the relationship between the roles of the training data;
指令关联特征提取单元,用于提取所述训练数据的指令关联特征;An instruction association feature extraction unit, configured to extract an instruction association feature of the training data;
第三训练单元,用于利用所述指令关联特征及标注信息训练得到指令关联识别模型;a third training unit, configured to use the instruction association feature and the annotation information to train the instruction association recognition model;
所述指令关系确定单元,具体用于将所述指令关联特征输入所述指令关联识别模型,根据所述指令关联识别模型的输出得到当前语音段中各角色间指令关系。The instruction relationship determining unit is specifically configured to input the instruction association feature into the instruction association recognition model, and obtain an instruction relationship between each character in the current voice segment according to the output of the instruction association recognition model.
优选地,所述各角色间指令关系包括:干扰、补充和独立。Preferably, the inter-character instruction relationships include: interference, supplementation, and independence.
An intelligent voice interaction device includes a processor and a memory connected to each other;
the memory is configured to store program instructions;
the processor is configured to run the program instructions to perform the following:
receiving user interactive voice data;
performing speech recognition and semantic understanding on the interactive voice data to obtain recognized text and a semantic understanding result;
determining whether the current speech segment is single-speaker speech;
if so, responding according to the semantic understanding result;
otherwise, determining the inter-role instruction relationship in the current speech segment according to the current speech segment and its corresponding semantic understanding result, and then responding according to the inter-role instruction relationship.
Preferably, the processor is further configured to build a speaker turning point judgment model in advance, wherein building the speaker turning point judgment model includes:
determining the topology of the speaker turning point judgment model;
collecting a large amount of interactive voice data involving multiple participants, and annotating the interactive voice data with turning points;
training the speaker turning point judgment model parameters using the interactive voice data and the annotation information.
Determining whether the current speech segment is single-speaker speech includes:
extracting spectral features for each frame of speech in the current speech segment;
inputting the extracted spectral features into the speaker turning point judgment model, and determining from the model output whether each frame of speech has a turning point;
if at least one frame of speech in the current speech segment has a turning point, determining that the current speech segment is not single-speaker speech; otherwise, determining that the current speech segment is single-speaker speech.
Determining the inter-role instruction relationship in the current speech segment according to the current speech segment and its corresponding semantic understanding result, as performed by the processor, includes:
extracting instruction association features from the current speech segment and its corresponding semantic understanding result;
determining the inter-role instruction relationship in the current speech segment according to the instruction association features.
In another embodiment, the processor is configured to implement any of the intelligent voice interaction methods described above.
The intelligent voice interaction method and system provided by the embodiments of the present invention address the characteristics of interaction scenarios involving multiple participants: for received user interactive voice data, the system determines whether the data is single-speaker speech; if not, it performs a finer-grained and more accurate analysis of the interaction data to obtain the relationships between the instructions of the various roles, and responds appropriately according to those relationships. This solves the problems of misunderstood user intent and incorrect system responses that arise in traditional voice interaction schemes, which do not consider interaction involving multiple participants, thereby effectively improving the user experience.
[Description of the Drawings]
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings in the following description are merely some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of an intelligent voice interaction method according to an embodiment of the present invention;
FIG. 2 is a flowchart of building the speaker turning point judgment model in an embodiment of the present invention;
FIG. 3 is a timing diagram of the speaker turning point judgment model in an embodiment of the present invention;
FIG. 4 is a flowchart of building the semantic relevance model in an embodiment of the present invention;
FIG. 5 is a schematic diagram of the topology of the semantic relevance model in an embodiment of the present invention;
FIG. 6 is a flowchart of building the instruction association recognition model in an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an intelligent voice interaction system according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a specific structure of the instruction relationship identification module in an embodiment of the present invention;
FIG. 9 is a schematic diagram of the relation angle between a speech segment and the primary microphone in an embodiment of the present invention;
FIG. 10 is another schematic diagram of the relation angle between a speech segment and the primary microphone in an embodiment of the present invention;
FIG. 11 is another schematic structural diagram of the intelligent voice interaction system according to an embodiment of the present invention.
[Detailed Description of the Embodiments]
In order to enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings and specific implementations.
In existing voice interaction systems, user voice instructions are delimited solely by endpoint detection, without considering situations in which multiple people are speaking. As a result, within one round of interaction the second half of an utterance may be interference with the first half, a supplement to the first half, or one of two entirely independent sub-instructions. If these cases are not distinguished, the system may derive an incorrect instruction and respond incorrectly, degrading the user experience. In view of this, an embodiment of the present invention provides an intelligent voice interaction method that, for interaction scenarios involving multiple participants, performs a finer-grained and more accurate analysis of the interactive voice data to obtain the relationships between the instructions of the various roles, and responds appropriately according to those relationships.
As shown in FIG. 1, which is a flowchart of an intelligent voice interaction method according to an embodiment of the present invention, the method includes the following steps:
Step 101: receive user interactive voice data.
Specifically, an audio stream may be processed with existing endpoint detection techniques to obtain the valid speech in the stream as the user's interactive voice. Endpoint detection requires a pause duration threshold eos (usually 0.5s-1s): if a speech pause exceeds this threshold, the audio stream is cut at that point, and the resulting segment is taken as valid user interactive speech.
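By way of illustration only (the patent does not prescribe an implementation), a minimal energy-based endpoint detection sketch in Python might look as follows; the frame length, energy threshold, and the particular eos value are assumptions:

```python
import numpy as np

def detect_segments(samples, sr, frame_ms=25, energy_thresh=1e-4, eos_s=0.8):
    """Cut an audio stream into speech segments at pauses longer than eos_s.

    A frame is treated as speech when its mean energy exceeds energy_thresh;
    a run of non-speech frames longer than eos_s ends the current segment.
    """
    frame_len = int(sr * frame_ms / 1000)
    max_silence = int(eos_s * 1000 / frame_ms)   # pause threshold in frames
    segments, start, silence = [], None, 0
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        is_speech = np.mean(frame.astype(np.float64) ** 2) > energy_thresh
        if is_speech:
            if start is None:
                start = i * frame_len
            silence = 0
        elif start is not None:
            silence += 1
            if silence > max_silence:            # pause exceeded eos: cut here
                segments.append((start, (i - silence + 1) * frame_len))
                start, silence = None, 0
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments
```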
Step 102: perform speech recognition and semantic understanding on the interactive voice data to obtain recognized text and a semantic understanding result.
Speech recognition may be performed in real time, i.e., the content spoken by the user up to the current moment is recognized as it arrives. Specifically, a decoding network is built from an acoustic model and a language model; the network contains all candidate recognition result paths up to the current moment, and the path with the highest decoding score is selected as the recognition result at the current moment. When new user interactive voice data is received, the highest-scoring recognition result path is re-selected and the previous recognition result is updated.
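Purely as an illustrative sketch of this re-selection step (the Hypothesis structure and the decoder API hinted at in the comments are assumptions, not part of the patent):

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    words: list = field(default_factory=list)  # partial recognition result
    score: float = 0.0                         # combined acoustic + LM score

def best_result(hypotheses):
    """Select the highest-scoring candidate path as the current result."""
    return max(hypotheses, key=lambda h: h.score).words

# On each new chunk of audio, an (assumed) decoder extends every candidate
# path, and the displayed result is re-selected, possibly revising output:
# hypotheses = decoder.extend(hypotheses, new_audio_chunk)  # assumed API
# current_text = best_result(hypotheses)
```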
Semantic understanding of the speech recognition result may use existing techniques, such as grammar-rule-based semantic understanding, ontology-knowledge-base-based semantic understanding, or model-based semantic understanding; the present invention is not limited in this respect.
Step 103: determine whether the current speech segment is single-speaker speech. If so, go to step 104; otherwise, go to step 105.
Existing techniques, such as multi-speaker recognition, may be used to determine whether the current speech segment is single-speaker speech.
Step 104: respond according to the semantic understanding result.
The specific response may be, for example, generating a response text and feeding it back to the user, or performing a specific action corresponding to the semantic understanding result; the embodiments of the present invention are not limited in this respect. A response text may be fed back to the user by voice broadcast; for a specific operation, the result of the operation may be presented to the user.
Step 105: determine the inter-role instruction relationship in the current speech segment according to the current speech segment and its corresponding semantic understanding result.
Specifically, instruction association features may first be extracted from the current speech segment and its corresponding semantic understanding result; the inter-role instruction relationship in the current speech segment is then determined according to the instruction association features.
Step 106: respond according to the inter-role instruction relationship.
Specifically, the response may be made according to the inter-role instruction relationship and a preset response policy. For example, if the second half of the utterance is interference with the first half, only the intent of the first half is answered; if the second half supplements the first half, the intent of the whole sentence is answered; if the two halves are independent (i.e., a new round of dialogue is started), only the intent of the second half is answered.
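A minimal sketch of such a response policy follows (illustrative only; the relation labels come from the embodiments, while the handler names are assumptions):

```python
def respond(relation, first_half_intent, second_half_intent, full_intent):
    """Dispatch on the inter-role instruction relationship.

    relation is one of 'interference', 'supplement', 'independent',
    matching the three relationships defined in the embodiments.
    """
    if relation == "interference":      # second half interferes: keep first half
        return execute(first_half_intent)
    if relation == "supplement":        # second half completes the first half
        return execute(full_intent)
    if relation == "independent":       # new round of dialogue: keep second half
        return execute(second_half_intent)
    raise ValueError(f"unknown relation: {relation}")

def execute(intent):                    # placeholder for the actual responder
    return f"responding to intent: {intent}"
```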
Further, in step 103 above, when determining whether the current speech segment is single-speaker speech, the embodiments of the present invention may also use a method based on a speaker turning point judgment model. Specifically, a speaker turning point judgment model may be built in advance and used to determine whether the current speech segment is single-speaker speech.
As shown in FIG. 2, the process of building the speaker turning point judgment model in an embodiment of the present invention includes the following steps:
Step 201: determine the topology of the speaker turning point judgment model.
The topology of the speaker turning point judgment model may use a neural network such as a DNN (deep neural network), RNN (recurrent neural network), or CNN (convolutional neural network). Taking a BiLSTM (bidirectional long short-term memory network) as an example, a BiLSTM can exploit both historical and future information and is therefore well suited to speaker turning point judgment.
The topology of the speaker turning point judgment model mainly includes an input layer, hidden layers, and an output layer. The input layer takes the spectral features of each frame of speech, such as 39-dimensional PLP (Perceptual Linear Predictive) features; the hidden part contains, for example, 2 layers; the output layer has 2 nodes, producing a 2-dimensional vector that indicates whether there is a turning point: 1 for a turning point and 0 for none.
FIG. 3 is a timing diagram of the speaker turning point judgment model, where F1-Ft denote the spectral feature vectors fed to the input layer nodes, and h1-ht denote the output vectors of the hidden layer nodes.
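For illustration only, a BiLSTM with the dimensions mentioned above (39-dimensional PLP input, 2 hidden layers, 2-node output) might be sketched in PyTorch as follows; the hidden size is an assumption:

```python
import torch
import torch.nn as nn

class TurningPointModel(nn.Module):
    """Per-frame speaker turning point classifier (sketch)."""

    def __init__(self, feat_dim=39, hidden=128):
        super().__init__()
        # 2 BiLSTM layers: each frame sees both past and future context
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        # 2 output nodes: turning point (1) vs. no turning point (0)
        self.out = nn.Linear(2 * hidden, 2)

    def forward(self, frames):             # frames: (batch, T, 39)
        h, _ = self.lstm(frames)           # (batch, T, 2*hidden)
        return self.out(h)                 # per-frame logits, (batch, T, 2)

# model = TurningPointModel()
# logits = model(torch.randn(1, 200, 39))   # 200 frames of PLP features
# has_turn = logits.argmax(dim=-1)          # 0/1 decision per frame
```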
Step 202: collect a large amount of interactive voice data involving multiple participants, and annotate the interactive voice data with turning points.
Step 203: train the speaker turning point judgment model parameters using the interactive voice data and the annotation information.
The model parameters may be trained with existing techniques, such as the BPTT (backpropagation through time) algorithm, which is not described in detail here.
Accordingly, based on the speaker turning point judgment model, when determining whether the current speech segment is single-speaker speech, the corresponding spectral features may be extracted from each frame of the current speech segment and fed into the speaker turning point judgment model; the model output then indicates whether each frame of speech has a turning point. A turning point indicates that the speech before and after it belongs to different speakers; accordingly, if any frame of the current speech segment has a turning point, the current speech segment is determined not to be single-speaker speech. Alternatively, to avoid false alarms, the current speech segment may be determined not to be single-speaker speech only when multiple consecutive frames (for example, 5 consecutive frames) all have turning points; otherwise, the current speech segment is determined to be single-speaker speech.
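The consecutive-frame rule could be sketched as follows (illustrative; the window of 5 frames follows the example above):

```python
def is_single_speaker(turn_flags, min_consecutive=5):
    """Return True if the segment looks like single-speaker speech.

    turn_flags: per-frame 0/1 turning point decisions from the model.
    The segment is rejected only when min_consecutive frames in a row
    are all flagged, which suppresses isolated false alarms.
    """
    run = 0
    for flag in turn_flags:
        run = run + 1 if flag else 0
        if run >= min_consecutive:
            return False        # sustained turning point: multiple speakers
    return True
```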
As mentioned above, when determining the inter-role instruction relationship in the current speech segment, instruction association features may first be extracted from the current speech segment and its corresponding semantic understanding result, and the inter-role instruction relationship in the current speech segment may then be determined according to those features.
The instruction association features include acoustic features and a semantic relevance feature. The acoustic features include any one or more of the following: the average volume of the speech segment, the signal-to-noise ratio of the speech segment, and the relation angle between the speech segment and the primary microphone, where the relation angle refers to the angle between the horizontal line and the line connecting the sound source of the speech segment to the primary microphone; FIG. 9 and FIG. 10 show this angle θ for a linear microphone array and a circular microphone array, respectively. These acoustic features can be obtained from the current speech segment. The semantic relevance feature may be expressed as a value between 0 and 1, i.e., the semantic relevance value, which may be determined from the semantic understanding result of the current speech segment and a pre-built semantic relevance model.
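As an illustration, the first two acoustic features might be computed as below; the angle θ would come from the microphone array's direction-of-arrival estimate, which is treated as given here, and the noise-floor estimate is an assumption:

```python
import numpy as np

def acoustic_features(segment, noise_floor):
    """segment: waveform samples; noise_floor: estimated noise power."""
    power = np.mean(segment.astype(np.float64) ** 2)
    avg_volume_db = 10 * np.log10(power + 1e-12)            # average volume
    snr_db = 10 * np.log10(power / (noise_floor + 1e-12))   # signal-to-noise ratio
    return avg_volume_db, snr_db
```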
As shown in FIG. 4, the process of building the semantic relevance model in an embodiment of the present invention includes the following steps:
Step 401: determine the topology of the semantic relevance model.
The topology of the semantic relevance model may use a neural network, for example a DNN. As shown in FIG. 5, the text word vectors pass through convolution and linear transformation layers to obtain low-order word vector features, which are then concatenated with the service type features and fed into a DNN regression network that finally outputs a semantic relevance value between 0 and 1.
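A rough PyTorch rendering of this topology might look as follows (illustrative only; the layer widths and pooling choice are assumptions, while the word vector and service type dimensions follow step 403 below):

```python
import torch
import torch.nn as nn

class SemanticRelevanceModel(nn.Module):
    """Convolution + linear layers over word vectors, concatenated with the
    service type vector, followed by a DNN regressor with sigmoid output."""

    def __init__(self, word_dim=50, max_words=20, biz_dim=6, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(word_dim, hidden, kernel_size=3, padding=1)
        self.proj = nn.Linear(hidden, hidden)          # linear transformation
        self.regressor = nn.Sequential(
            nn.Linear(hidden + biz_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())        # relevance in (0, 1)

    def forward(self, words, biz):
        # words: (batch, max_words, word_dim); biz: (batch, biz_dim)
        h = self.conv(words.transpose(1, 2)).max(dim=2).values  # pooled
        h = torch.relu(self.proj(h))
        return self.regressor(torch.cat([h, biz], dim=1)).squeeze(1)
```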
Step 402: collect a large amount of interactive voice data involving multiple participants as training data, and annotate the training data with semantic relevance.
Step 403: extract the semantically related features of the training data.
The semantically related features include the text word vectors corresponding to the user interactive voice data and the service types involved in the user instructions. Text word vectors may be extracted with existing techniques, for example by using a known word embedding matrix to obtain the word vector of each word in the recognized text (e.g., 50-dimensional), and then concatenating the word vectors of the preceding and following speech segments into a fixed-length vector, zero-padded where necessary, for a total of, e.g., 50*20=1000 dimensions. The service type involved in a user instruction may be, for example, a 6-dimensional vector over chit-chat, ticket booking, weather, navigation, music, and nonsense.
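An illustrative sketch of this feature construction follows; the word-embedding lookup `embed` is an assumed input:

```python
import numpy as np

BIZ_TYPES = ["chit-chat", "ticket booking", "weather",
             "navigation", "music", "nonsense"]

def text_feature(words, embed, word_dim=50, max_words=20):
    """Stack per-word embeddings and zero-pad to a fixed 20x50 block,
    i.e. 50*20 = 1000 dimensions when flattened."""
    mat = np.zeros((max_words, word_dim), dtype=np.float32)
    for i, w in enumerate(words[:max_words]):
        mat[i] = embed[w]                 # assumed word-embedding lookup
    return mat

def biz_feature(biz_type):
    """One-hot 6-dimensional service type vector."""
    vec = np.zeros(len(BIZ_TYPES), dtype=np.float32)
    vec[BIZ_TYPES.index(biz_type)] = 1.0
    return vec
```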
Step 404: train the semantic relevance model using the semantically related features and the annotation information.
Further, in the embodiments of the present invention, the inter-role instruction relationship in a speech segment may also be determined using a pre-trained model: an instruction association recognition model is trained in advance, the extracted instruction association features are fed into the model, and the inter-role instruction relationship in the current speech segment is obtained from the model output.
As shown in FIG. 6, the process of building the instruction association recognition model in an embodiment of the present invention includes the following steps:
Step 601: determine the topology of the instruction association recognition model.
The instruction association recognition model may use a neural network model, for example a DNN. Its topology mainly includes an input layer, hidden layers, and an output layer. Each input layer node receives a corresponding acoustic or semantic relevance feature; if, preferably, all three acoustic features above are used, the input layer has 4 nodes. The hidden layers are ordinary DNN hidden layers, typically 3-7 layers. The output layer has 3 nodes, corresponding to the three instruction association relationships: interference, supplementation, and independence.
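For illustration, this classifier could be built as follows in PyTorch (the layer width is an assumption; 4 inputs = 3 acoustic features + 1 semantic relevance value, 3 outputs = interference / supplement / independent):

```python
import torch.nn as nn

def build_relation_model(hidden=32, depth=4):
    """4-input, 3-class DNN with `depth` hidden layers (3-7 per the text)."""
    layers, in_dim = [], 4
    for _ in range(depth):
        layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
        in_dim = hidden
    layers.append(nn.Linear(in_dim, 3))   # interference / supplement / independent
    return nn.Sequential(*layers)
```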
Step 602: collect a large amount of interactive voice data involving multiple participants as training data, and annotate the training data with inter-role association relationships.
The inter-role association relationships are the three relationships: interference, supplementation, and independence.
Step 603: extract the instruction association features of the training data.
The instruction association features are the acoustic features and the semantic relevance feature mentioned above. The acoustic features include the average volume of the speech segment, the signal-to-noise ratio of the speech segment, and the relation angle between the speech segment and the primary microphone. The semantic relevance feature is the semantic relevance value, which may be extracted from each speech segment of the training data and its corresponding semantic understanding result; the semantic relevance feature may be extracted using the semantic relevance model, as described above, which is not repeated here.
Step 604: train the instruction association recognition model using the instruction association features and the annotation information.
The model may be trained with existing techniques and is not described in detail here.
Based on the instruction association recognition model, when determining the inter-role instruction relationship in the current speech segment, the instruction association features extracted from the current speech segment and its corresponding semantic understanding result are fed into the instruction association recognition model, and the inter-role instruction relationship in the current speech segment is obtained from the model output.
The intelligent voice interaction method provided by the embodiments of the present invention addresses the characteristics of interaction scenarios involving multiple participants: for received user interactive voice data, it determines whether the data is single-speaker speech; if not, it performs a finer-grained and more accurate analysis of the interaction data to obtain the relationships between the instructions of the various roles, and responds appropriately according to those relationships. This solves the problems of misunderstood user intent and incorrect system responses that arise in traditional voice interaction schemes, which do not consider interaction involving multiple participants, thereby effectively improving the user experience.
Correspondingly, an embodiment of the present invention further provides an intelligent voice interaction system. FIG. 7 is a schematic structural diagram of the system, which includes the following modules:
a receiving module 701, configured to receive user interactive voice data;
a speech recognition module 702, configured to perform speech recognition on the interactive voice data to obtain recognized text;
a semantic understanding module 703, configured to perform semantic understanding on the recognized text to obtain a semantic understanding result;
a judgment module 704, configured to determine whether the current speech segment is single-speaker speech;
a response module 705, configured to respond to the semantic understanding result after the judgment module 704 determines that the current speech segment is single-speaker speech;
an instruction relationship identification module 706, configured to determine the inter-role instruction relationship in the current speech segment according to the current speech segment and its corresponding semantic understanding result after the judgment module 704 determines that the current speech segment is not single-speaker speech.
Correspondingly, in this embodiment, the response module 705 is further configured to respond according to the inter-role instruction relationship determined by the instruction relationship identification module 706.
That is, when the current speech is single-speaker speech, the response module 705 responds directly to the semantic understanding result; otherwise it responds according to the inter-role instruction relationship in the semantic understanding result. For example, if the second half of the utterance is interference with the first half, only the intent of the first half is answered; if the second half supplements the first half, the intent of the whole sentence is answered; if the two halves are independent (i.e., a new round of dialogue is started), only the intent of the second half is answered. This avoids incorrect responses when multiple people participate in the interaction and improves the user experience.
It should be noted that, when determining whether the current speech segment is single-speaker speech, the judgment module 704 may use existing techniques such as multi-speaker recognition, or a model-based approach, for example a speaker turning point judgment model built in advance by a speaker turning point judgment model building module. The speaker turning point judgment model building module may be part of the system of the present invention or independent of it; the embodiments of the present invention are not limited in this respect.
As described above, the speaker turning point judgment model may use a deep neural network such as a DNN, RNN, or CNN. A specific structure of the speaker turning point judgment model building module may include the following units:
a first topology determination unit, configured to determine the topology of the speaker turning point judgment model;
a first data collection unit, configured to collect a large amount of interactive voice data involving multiple participants, and to annotate the interactive voice data with turning points;
a first parameter training unit, configured to train the speaker turning point judgment model parameters using the interactive voice data and the annotation information.
Accordingly, based on the speaker turning point judgment model, a specific structure of the judgment module 704 may include the following units:
a spectral feature extraction unit, configured to extract, for each frame of speech in the current speech segment, its spectral features;
a turning point determination unit, configured to input the extracted spectral features into the speaker turning point judgment model and determine, according to the model output, whether each frame of speech has a turning point;
a judgment unit, configured to determine that the current speech segment is not single-speaker speech when at least one frame of speech in the current speech segment has a turning point, and otherwise to determine that the current speech segment is single-speaker speech.
The instruction relationship identification module 706 may specifically extract instruction association features from the current speech segment and its corresponding semantic understanding result, and then use these features to determine the inter-role instruction relationship in the current speech segment. As shown in FIG. 8, a specific structure of the instruction relationship identification module 706 includes an instruction association feature extraction unit 761 and an instruction relationship determination unit 762, where the instruction association feature extraction unit 761 is configured to extract instruction association features from the current speech segment and its corresponding semantic understanding result, and the instruction relationship determination unit 762 is configured to determine the inter-role instruction relationship in the current speech segment according to the instruction association features.
The instruction association features include acoustic features and a semantic relevance feature. The acoustic features include any one or more of the following: the average volume of the speech segment, the signal-to-noise ratio of the speech segment, and the relation angle between the speech segment and the primary microphone. The semantic relevance feature is a semantic relevance value. Accordingly, the instruction association feature extraction unit may include the following subunits:
an acoustic feature extraction subunit, configured to extract the acoustic features from the current speech segment, for which existing techniques may be used;
a semantic relevance feature extraction subunit, configured to determine the semantic relevance value of the current speech segment according to the semantic understanding result corresponding to the current speech segment, for example in a model-based manner using a semantic relevance model built in advance by a semantic relevance model building module.
A specific structure of the semantic relevance model building module includes the following units:
a second topology determination unit, configured to determine the topology of the semantic relevance model;
a second data collection unit, configured to collect a large amount of interactive voice data involving multiple participants as training data, and to annotate the training data with semantic relevance;
a semantic feature extraction unit, configured to extract the semantically related features of the training data;
a second training unit, configured to train the semantic relevance model using the semantically related features and the annotation information.
Accordingly, based on the semantic relevance model, the semantic relevance feature extraction subunit may first extract the semantically related features from the semantic understanding result corresponding to the current speech segment, then input them into the semantic relevance model, and obtain the semantic relevance value of the current speech segment from the model output.
It should be noted that the semantic relevance model building module may be part of the system of the present invention or independent of it; the embodiments of the present invention are not limited in this respect.
The instruction relationship determination unit 762 may specifically determine the inter-role instruction relationship in the current speech segment in a model-based manner, for example using an instruction association recognition model built in advance by an instruction association recognition model building module.
A specific structure of the instruction association recognition model building module includes the following units:
a third topology determination unit, configured to determine the topology of the instruction association recognition model;
a third data collection unit, configured to collect a large amount of interactive voice data involving multiple participants as training data, and to annotate the training data with inter-role association relationships;
an instruction association feature extraction unit, configured to extract the instruction association features of the training data;
a third training unit, configured to train the instruction association recognition model using the instruction association features and the annotation information.
Accordingly, based on the instruction association recognition model, the instruction relationship determination unit 762 may input the instruction association features into the instruction association recognition model and obtain the inter-role instruction relationship in the current speech segment from the model output.
The intelligent voice interaction system provided by the embodiments of the present invention addresses the characteristics of interaction scenarios involving multiple participants: for received user interactive voice data, it determines whether the data is single-speaker speech; if not, it performs a finer-grained and more accurate analysis of the interaction data to obtain the relationships between the instructions of the various roles, and responds appropriately according to those relationships. This solves the problems of misunderstood user intent and incorrect system responses that arise in traditional voice interaction schemes, which do not consider interaction involving multiple participants, thereby effectively improving the user experience. The intelligent voice interaction system of the present invention can be applied to various human-computer interaction devices or apparatuses, adapts well to the interaction environment, and responds with high accuracy.
An embodiment of the present invention further provides another intelligent voice interaction system. FIG. 11 is another schematic structural diagram of the intelligent voice interaction system according to an embodiment of the present invention.
In this embodiment, the system includes a processor 111 and a memory 112 connected to each other. The memory 112 is configured to store program instructions and may also store data used by the processor 111 during processing. The processor 111 is configured to run the program instructions to perform the intelligent voice interaction method of the above embodiments.
Specifically, the intelligent voice interaction system may be any device with information processing capability, such as a robot, a mobile phone, or a computer. The processor 111 may also be called a CPU (central processing unit). The processor 111 may be an integrated circuit chip with signal processing capability. The processor 111 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. Moreover, the system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement them without creative effort.
The embodiments of the present invention have been described in detail above, and specific implementations are used herein to illustrate the present invention. The description of the above embodiments is only intended to help understand the method and apparatus of the present invention. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementations and the scope of application according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (20)

  1. An intelligent voice interaction method, wherein the method comprises:
    receiving user interactive voice data;
    performing speech recognition and semantic understanding on the interactive voice data to obtain recognized text and a semantic understanding result;
    determining whether the current speech segment is single-speaker speech;
    if so, responding according to the semantic understanding result;
    otherwise, determining the inter-role instruction relationship in the current speech segment according to the current speech segment and its corresponding semantic understanding result, and then responding according to the inter-role instruction relationship.
  2. The method according to claim 1, wherein the method further comprises building a speaker turning point judgment model in advance, wherein building the speaker turning point judgment model comprises:
    determining the topology of the speaker turning point judgment model;
    collecting a large amount of interactive voice data involving multiple participants, and annotating the interactive voice data with turning points;
    training the speaker turning point judgment model parameters using the interactive voice data and the annotation information;
    wherein determining whether the current speech segment is single-speaker speech comprises:
    extracting spectral features for each frame of speech in the current speech segment;
    inputting the extracted spectral features into the speaker turning point judgment model, and determining from the output of the speaker turning point judgment model whether each frame of speech has a turning point;
    if at least one frame of speech in the current speech segment has a turning point, determining that the current speech segment is not single-speaker speech; otherwise, determining that the current speech segment is single-speaker speech.
  3. The method according to claim 2, wherein determining that the current speech segment is not single-speaker speech if at least one frame of speech in the current speech segment has a turning point, and otherwise determining that the current speech segment is single-speaker speech, comprises:
    if multiple consecutive frames of speech in the current speech segment all have turning points, determining that the current speech segment is not single-speaker speech; otherwise, determining that the current speech segment is single-speaker speech.
  4. The method according to claim 1, wherein determining the inter-role instruction relationship in the current speech segment according to the current speech segment and its corresponding semantic understanding result comprises:
    extracting instruction association features from the current speech segment and its corresponding semantic understanding result;
    determining the inter-role instruction relationship in the current speech segment according to the instruction association features.
  5. The method according to claim 4, wherein the instruction association features comprise acoustic features and a semantic relevance feature; the acoustic features comprise any one or more of the following: the average volume of the speech segment, the signal-to-noise ratio of the speech segment, and the relation angle between the speech segment and the primary microphone, wherein the relation angle refers to the angle between the horizontal line and the line connecting the sound source of the speech segment to the primary microphone; and the semantic relevance feature is a semantic relevance value;
    wherein extracting the instruction association features from the current speech segment and its corresponding semantic understanding result comprises:
    extracting the acoustic features from the current speech segment;
    determining the semantic relevance value of the current speech segment according to the semantic understanding result corresponding to the current speech segment.
  6. The method according to claim 5, wherein the method further comprises building a semantic relevance model in advance, wherein building the semantic relevance model comprises:
    determining the topology of the semantic relevance model;
    collecting a large amount of interactive voice data involving multiple participants as training data, and annotating the training data with semantic relevance;
    extracting the semantically related features of the training data;
    training the semantic relevance model using the semantically related features and the annotation information;
    wherein determining the semantic relevance value of the current speech segment according to the semantic understanding result corresponding to the current speech segment comprises:
    extracting semantically related features from the semantic understanding result corresponding to the current speech segment;
    inputting the semantically related features into the semantic relevance model, and obtaining the semantic relevance value of the current speech segment from the output of the semantic relevance model.
  7. The method according to claim 6, wherein the semantically related features comprise: text word vectors corresponding to the interactive voice data, and service types involved in the user instructions in the interactive voice data.
  8. The method according to claim 4, wherein the method further comprises building an instruction association recognition model in advance, wherein building the instruction association recognition model comprises:
    determining the topology of the instruction association recognition model;
    collecting a large amount of interactive voice data involving multiple participants as training data, and annotating the training data with inter-role association relationships;
    extracting the instruction association features of the training data;
    training the instruction association recognition model using the instruction association features and the annotation information;
    wherein determining the inter-role instruction relationship in the current speech segment according to the instruction association features comprises:
    inputting the instruction association features into the instruction association recognition model, and obtaining the inter-role instruction relationship in the current speech segment from the output of the instruction association recognition model.
  9. The method according to claim 4, wherein the inter-role instruction relationships comprise: interference, supplementation, and independence.
  10. An intelligent voice interaction system, wherein the system comprises:
    a receiving module, configured to receive user interactive voice data;
    a speech recognition module, configured to perform speech recognition on the interactive voice data to obtain recognized text;
    a semantic understanding module, configured to perform semantic understanding on the recognized text to obtain a semantic understanding result;
    a judgment module, configured to determine whether the current speech segment is single-speaker speech;
    a response module, configured to respond to the semantic understanding result after the judgment module determines that the current speech segment is single-speaker speech;
    an instruction relationship identification module, configured to determine the inter-role instruction relationship in the current speech segment according to the current speech segment and its corresponding semantic understanding result after the judgment module determines that the current speech segment is not single-speaker speech;
    wherein the response module is further configured to respond according to the inter-role instruction relationship determined by the instruction relationship identification module.
  11. The system according to claim 10, wherein the system further comprises a speaker turning point judgment model building module, configured to build the speaker turning point judgment model in advance, the speaker turning point judgment model building module comprising:
    a first topology determination unit, configured to determine the topology of the speaker turning point judgment model;
    a first data collection unit, configured to collect a large amount of interactive voice data involving multiple participants, and to annotate the interactive voice data with turning points;
    a first parameter training unit, configured to train the speaker turning point judgment model parameters using the interactive voice data and the annotation information;
    wherein the judgment module comprises:
    a spectral feature extraction unit, configured to extract, for each frame of speech in the current speech segment, its spectral features;
    a turning point determination unit, configured to input the extracted spectral features into the speaker turning point judgment model and determine, according to the output of the speaker turning point judgment model, whether each frame of speech has a turning point;
    a judgment unit, configured to determine that the current speech segment is not single-speaker speech when at least one frame of speech in the current speech segment has a turning point, and otherwise to determine that the current speech segment is single-speaker speech.
  12. The system according to claim 11, wherein the judgment unit is specifically configured to determine that the current speech segment is not single-speaker speech when multiple consecutive frames of speech in the current speech segment all have turning points, and otherwise to determine that the current speech segment is single-speaker speech.
  13. The system according to claim 10, wherein the instruction relationship identification module comprises:
    an instruction association feature extraction unit, configured to extract instruction association features from the current speech segment and its corresponding semantic understanding result;
    an instruction relationship determination unit, configured to determine the inter-role instruction relationship in the current speech segment according to the instruction association features.
  14. The system according to claim 13, wherein the instruction association features comprise acoustic features and a semantic relevance feature; the acoustic features comprise any one or more of the following: the average volume of the speech segment, the signal-to-noise ratio of the speech segment, and the relation angle between the speech segment and the main microphone, the relation angle being the angle between the horizontal line and the line connecting the sound source of the speech segment to the main microphone; the semantic relevance feature is a semantic relevance value;
    the instruction association feature extraction unit comprises:
    an acoustic feature extraction subunit, configured to extract the acoustic features from the current speech segment;
    a semantic relevance feature extraction subunit, configured to determine the semantic relevance value of the current speech segment according to the semantic understanding result corresponding to the current speech segment.
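  As a rough illustration of the claim-14 acoustic features (a sketch under assumed conventions, not the patented implementation): the average volume can be taken from the mean amplitude, the signal-to-noise ratio from speech versus non-speech energy, and the relation angle from source and microphone coordinates; the vad_mask and the coordinate inputs below are assumptions about what a microphone-array front end would supply.

    import numpy as np

    def acoustic_features(samples, vad_mask, source_xy, mic_xy):
        """Sketch of the claim-14 acoustic features for one speech segment.

        samples:   1-D numpy array of audio samples for the segment.
        vad_mask:  boolean array marking speech samples (assumed to come from a VAD).
        source_xy: (x, y) of the sound source, assumed given by the array front end.
        mic_xy:    (x, y) of the main microphone.
        """
        # Average volume, expressed here in dB relative to full scale (one convention).
        avg_volume_db = 20 * np.log10(np.mean(np.abs(samples)) + 1e-12)

        # Signal-to-noise ratio: speech-sample energy over non-speech-sample energy.
        # Assumes the segment contains both speech and non-speech samples.
        speech = samples[vad_mask]
        noise = samples[~vad_mask]
        snr_db = 10 * np.log10((np.mean(speech ** 2) + 1e-12) /
                               (np.mean(noise ** 2) + 1e-12))

        # Relation angle: angle between the horizontal line and the line from the
        # main microphone to the sound source.
        dx, dy = source_xy[0] - mic_xy[0], source_xy[1] - mic_xy[1]
        angle_deg = np.degrees(np.arctan2(dy, dx))

        return {"avg_volume_db": avg_volume_db, "snr_db": snr_db, "angle_deg": angle_deg}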
  15. The system according to claim 14, wherein the system further comprises a semantic relevance model construction module, configured to construct a semantic relevance model in advance; the semantic relevance model construction module comprises:
    a second topology determination unit, configured to determine the topology of the semantic relevance model;
    a second data collection unit, configured to collect a large amount of interactive voice data involving multiple speakers as training data, and to annotate the training data with semantic relevance;
    a semantic correlation feature extraction unit, configured to extract semantic correlation features from the training data;
    a second training unit, configured to train the semantic relevance model using the semantic correlation features and the annotation information;
    the semantic relevance feature extraction subunit is specifically configured to extract semantic correlation features from the semantic understanding result corresponding to the current speech segment, to input the semantic correlation features into the semantic relevance model, and to obtain the semantic relevance value of the current speech segment from the output of the semantic relevance model.
  16. The system according to claim 15, wherein the semantic correlation features comprise: the text word vectors corresponding to the interactive voice data, and the service type involved in the user instruction in the interactive voice data.
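  A minimal sketch of how the claim-15 and claim-16 features might be assembled and scored, assuming a hypothetical word-embedding lookup embed, an illustrative list of service types, and a hypothetical pre-trained relevance_model; the feature layout and the interfaces are assumptions, not the patented design.

    import numpy as np

    # Illustrative service types a user instruction might involve (assumed list).
    SERVICE_TYPES = ["navigation", "music", "phone", "weather", "chat"]

    def semantic_relevance_value(tokens, service_type, embed, relevance_model):
        """Score how relevant the current utterance is to the ongoing interaction.

        tokens:          words of the recognized text, assumed non-empty.
        service_type:    service type inferred for the user instruction (claim 16).
        embed:           hypothetical lookup, token -> word vector (claim 16).
        relevance_model: hypothetical model; predict(features) -> value in [0, 1].
        """
        # Text word vectors, averaged into one fixed-size segment vector.
        text_vec = np.mean([embed(tok) for tok in tokens], axis=0)

        # Service type encoded as a one-hot vector and appended to the text vector.
        type_vec = np.zeros(len(SERVICE_TYPES))
        type_vec[SERVICE_TYPES.index(service_type)] = 1.0

        return relevance_model.predict(np.concatenate([text_vec, type_vec]))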
  17. The system according to claim 13, wherein the system further comprises an instruction association recognition model construction module, configured to construct an instruction association recognition model in advance; the instruction association recognition model construction module comprises:
    a third topology determination unit, configured to determine the topology of the instruction association recognition model;
    a third data collection unit, configured to collect a large amount of interactive voice data involving multiple speakers as training data, and to annotate the training data with the association relationships among the roles;
    an instruction association feature extraction unit, configured to extract the instruction association features of the training data;
    a third training unit, configured to train the instruction association recognition model using the instruction association features and the annotation information;
    the instruction relationship determination unit is specifically configured to input the instruction association features into the instruction association recognition model, and to obtain the instruction relationships among the roles in the current speech segment from the output of the instruction association recognition model.
  18. The system according to claim 17, wherein the instruction relationships among the roles comprise: interference, supplementation, and independence.
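  Tying claims 13, 17 and 18 together, an illustrative classifier step (assumed feature names taken from the sketches above and a hypothetical pre-trained relation_model, not the patented implementation): the acoustic features and the semantic relevance value are concatenated and mapped to one of the three inter-role instruction relationships.

    import numpy as np

    # The three inter-role instruction relationships named in claim 18.
    RELATIONS = ["interference", "supplementation", "independence"]

    def classify_instruction_relation(acoustic, relevance_value, relation_model):
        """Map instruction association features to an inter-role relationship.

        acoustic:        dict produced by the acoustic_features sketch above.
        relevance_value: semantic relevance value in [0, 1].
        relation_model:  hypothetical classifier; predict_proba(x) -> three scores.
        """
        features = np.array([
            acoustic["avg_volume_db"],
            acoustic["snr_db"],
            acoustic["angle_deg"],
            relevance_value,
        ])
        scores = relation_model.predict_proba(features)
        return RELATIONS[int(np.argmax(scores))]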
  19. An intelligent voice interaction system, comprising a processor and a memory connected to each other;
    the memory is configured to store program instructions;
    the processor is configured to run the program instructions to perform the following:
    receiving user interactive voice data;
    performing speech recognition and semantic understanding on the interactive voice data to obtain a recognized text and a semantic understanding result;
    determining whether the current speech segment is single-speaker speech;
    if so, responding according to the semantic understanding result;
    otherwise, determining the instruction relationships among the roles in the current speech segment according to the current speech segment and its corresponding semantic understanding result, and then responding according to the instruction relationships among the roles.
  20. The system according to claim 19, wherein the processor is further configured to construct a speaker turning-point judgment model in advance, and the construction process of the speaker turning-point judgment model comprises:
    determining the topology of the speaker turning-point judgment model;
    collecting a large amount of interactive voice data involving multiple speakers, and annotating turning points in the interactive voice data;
    training the parameters of the speaker turning-point judgment model using the interactive voice data and the annotation information;
    the determining whether the current speech segment is single-speaker speech comprises:
    extracting spectral features from each frame of speech in the current speech segment;
    inputting the extracted spectral features into the speaker turning-point judgment model, and determining from the output of the model whether each frame of speech contains a turning point;
    if at least one frame of speech in the current speech segment contains a turning point, determining that the current speech segment is not single-speaker speech; otherwise, determining that the current speech segment is single-speaker speech;
    the determining, performed by the processor, of the instruction relationships among the roles in the current speech segment according to the current speech segment and its corresponding semantic understanding result comprises:
    extracting instruction association features from the current speech segment and its corresponding semantic understanding result;
    determining the instruction relationships among the roles in the current speech segment according to the instruction association features.
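  Finally, a high-level sketch of the claim-19 control flow, reusing the hypothetical helpers from the sketches above; the segment attributes, the asr/nlu/respond stand-ins, and the policy of acting unless the competing speech is mere interference are all illustrative assumptions rather than the patented behavior.

    def handle_segment(segment, asr, nlu, turning_point_model, relation_model, respond):
        """One pass of the claim-19 flow for a single speech segment (illustrative)."""
        text = asr(segment)        # speech recognition -> recognized text
        semantics = nlu(text)      # semantic understanding -> result dict (assumed)

        if is_single_speaker(segment.frames, turning_point_model):
            # Single-speaker speech: respond to the semantic result directly.
            respond(semantics)
        else:
            # Multi-speaker speech: first determine the inter-role relationship.
            acoustic = acoustic_features(segment.samples, segment.vad_mask,
                                         segment.source_xy, segment.mic_xy)
            relation = classify_instruction_relation(
                acoustic, semantics.get("relevance", 0.0), relation_model)
            # Assumed policy: act unless the competing speech is mere interference.
            if relation != "interference":
                respond(semantics)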
PCT/CN2018/096705 2017-08-09 2018-07-23 Intelligent voice interaction method and system WO2019029352A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710676203.6A CN107437415B (en) 2017-08-09 2017-08-09 Intelligent voice interaction method and system
CN201710676203.6 2017-08-09

Publications (1)

Publication Number Publication Date
WO2019029352A1 (en)

Family

ID=60460483

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/096705 WO2019029352A1 (en) 2017-08-09 2018-07-23 Intelligent voice interaction method and system

Country Status (2)

Country Link
CN (1) CN107437415B (en)
WO (1) WO2019029352A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107437415B (en) * 2017-08-09 2020-06-02 科大讯飞股份有限公司 Intelligent voice interaction method and system
CN108159687B (en) * 2017-12-19 2021-06-04 芋头科技(杭州)有限公司 Automatic guidance system and intelligent sound box equipment based on multi-person interaction process
CN108053828A (en) * 2017-12-25 2018-05-18 无锡小天鹅股份有限公司 Determine the method, apparatus and household electrical appliance of control instruction
CN108197115B (en) * 2018-01-26 2022-04-22 上海智臻智能网络科技股份有限公司 Intelligent interaction method and device, computer equipment and computer readable storage medium
CN108520749A (en) * 2018-03-06 2018-09-11 杭州孚立计算机软件有限公司 A kind of voice-based grid-based management control method and control device
CN111819626A (en) * 2018-03-07 2020-10-23 华为技术有限公司 Voice interaction method and device
CN108766460B (en) * 2018-05-15 2020-07-10 浙江口碑网络技术有限公司 Voice-based interaction method and system
CN108874895B (en) * 2018-05-22 2021-02-09 北京小鱼在家科技有限公司 Interactive information pushing method and device, computer equipment and storage medium
CN108847225B (en) * 2018-06-04 2021-01-12 上海智蕙林医疗科技有限公司 Robot for multi-person voice service in airport and method thereof
CN109102803A (en) * 2018-08-09 2018-12-28 珠海格力电器股份有限公司 Control method, device, storage medium and the electronic device of household appliance
CN109065051B (en) * 2018-09-30 2021-04-09 珠海格力电器股份有限公司 Voice recognition processing method and device
WO2020211006A1 (en) * 2019-04-17 2020-10-22 深圳市欢太科技有限公司 Speech recognition method and apparatus, storage medium and electronic device
CN112992132A (en) * 2019-12-02 2021-06-18 浙江思考者科技有限公司 AI intelligent voice interaction program bridging one-key application applet
CN111081220B (en) * 2019-12-10 2022-08-16 广州小鹏汽车科技有限公司 Vehicle-mounted voice interaction method, full-duplex dialogue system, server and storage medium
CN111583956B (en) * 2020-04-30 2024-03-26 联想(北京)有限公司 Voice processing method and device
CN111785266A (en) * 2020-05-28 2020-10-16 博泰车联网(南京)有限公司 Voice interaction method and system
CN111897909B (en) * 2020-08-03 2022-08-05 兰州理工大学 Ciphertext voice retrieval method and system based on deep perceptual hashing
CN111968680A (en) * 2020-08-14 2020-11-20 北京小米松果电子有限公司 Voice processing method, device and storage medium
CN114822539A (en) * 2022-06-24 2022-07-29 深圳市友杰智新科技有限公司 Method, device, equipment and storage medium for decoding double-window voice

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800315A (en) * 2012-07-13 2012-11-28 上海博泰悦臻电子设备制造有限公司 Vehicle-mounted voice control method and system
CN104333956A (en) * 2014-11-19 2015-02-04 国网冀北电力有限公司廊坊供电公司 Control method and system for lighting equipment in relay protection machine room
CN104732969A (en) * 2013-12-23 2015-06-24 鸿富锦精密工业(深圳)有限公司 Voice processing system and method
US20160379638A1 (en) * 2015-06-26 2016-12-29 Amazon Technologies, Inc. Input speech quality matching
CN107437415A (en) * 2017-08-09 2017-12-05 科大讯飞股份有限公司 A kind of intelligent sound exchange method and system

Also Published As

Publication number Publication date
CN107437415A (en) 2017-12-05
CN107437415B (en) 2020-06-02

Similar Documents

Publication Publication Date Title
WO2019029352A1 (en) Intelligent voice interaction method and system
CN107993665B (en) Method for determining role of speaker in multi-person conversation scene, intelligent conference method and system
CN108000526B (en) Dialogue interaction method and system for intelligent robot
US11503155B2 (en) Interactive voice-control method and apparatus, device and medium
CN107665708B (en) Intelligent voice interaction method and system
CN108711421B (en) Speech recognition acoustic model establishing method and device and electronic equipment
CN108305643B (en) Method and device for determining emotion information
CN105427858B (en) Realize the method and system that voice is classified automatically
CN108701458B (en) Speech recognition
CN106331893B (en) Real-time caption presentation method and system
WO2017084197A1 (en) Smart home control method and system based on emotion recognition
US11574637B1 (en) Spoken language understanding models
KR20210070213A (en) Voice user interface
CN104036774A (en) Method and system for recognizing Tibetan dialects
CN110599999A (en) Data interaction method and device and robot
CN109119070A (en) A kind of sound end detecting method, device, equipment and storage medium
CN113314119B (en) Voice recognition intelligent household control method and device
JP6915637B2 (en) Information processing equipment, information processing methods, and programs
CN111192659A (en) Pre-training method for depression detection and depression detection method and device
CN111460143A (en) Emotion recognition model of multi-person conversation system
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN113314104A (en) Interactive object driving and phoneme processing method, device, equipment and storage medium
JP6306447B2 (en) Terminal, program, and system for reproducing response sentence using a plurality of different dialogue control units simultaneously
Preciado-Grijalva et al. Speaker fluency level classification using machine learning techniques
CN111128181B (en) Recitation question evaluating method, recitation question evaluating device and recitation question evaluating equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18843314

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18843314

Country of ref document: EP

Kind code of ref document: A1