WO2020135811A1 - Voice interaction method, device, and system - Google Patents

Voice interaction method, device, and system

Info

Publication number
WO2020135811A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
terminal
information
user
voice information
Prior art date
Application number
PCT/CN2019/129631
Other languages
English (en)
French (fr)
Inventor
Zheng Minghui (郑明辉)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP19905540.1A (published as EP3896691A4)
Priority to JP2021537969A (published as JP7348288B2)
Publication of WO2020135811A1
Priority to US17/360,015 (published as US20210327436A1)

Classifications

    • G06V40/168 Feature extraction; Face representation
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G06F9/4418 Suspend and resume; Hibernate and awake
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L17/06 Speaker identification or verification: decision making techniques; pattern matching strategies
    • G10L17/22 Speaker identification or verification: interactive procedures; man-machine interfaces
    • G10L25/51 Speech or voice analysis specially adapted for comparison or discrimination
    • G06T2207/30201 Subject of image: human face
    • G10L2015/088 Word spotting
    • G10L2015/223 Execution procedure of a spoken command
    • G10L2015/225 Feedback of the input speech

Definitions

  • The present application relates to the technical field of human-computer interaction, and in particular to a method, device, and system for human-computer voice interaction.
  • Voice interaction, as a new interaction technology, has been widely adopted across many industries: home smart speakers, voice-controlled in-vehicle terminals, personal voice assistants, voice control of conference systems, and so on. Compared with mouse-and-keyboard interaction and touch interaction, voice interaction offers many advantages, such as being contact-free, freeing the hands and eyes, and being natural (no learning required). Limited by the current state of artificial intelligence, however, voice interaction systems cannot yet fully reproduce the natural, fluent dialogue between people. For example, current systems cannot actively determine whether a nearby speaker is talking to the system, and must rely on a specific triggering event before they start listening for voice commands.
  • The specific event here may be the user uttering a designated, trained wake-up word, producing a particular sound, making a particular gesture, pressing a physical button, or tapping an icon on the screen. Among these, wake-word triggering best preserves the advantages of voice interaction and is currently the most widely used wake-up method in voice interaction systems.
  • After the voice system has been woken up for the first time, once the user's voice command has been executed or a round of conversation has ended, the user must say the wake-up word again in order to issue the next voice command or enter the next round of conversation.
  • Chinese patent application CN108182943A discloses a smart device control method, apparatus, and smart device.
  • The smart device control method includes: after responding to the interaction instruction corresponding to first user voice information, keeping the working state; after receiving second user voice information, obtaining a speech recognition result corresponding to the second user voice information; determining, according to the speech recognition result, whether the correlation between the second user voice information and the first user voice information is greater than or equal to a preset correlation; and, if so, responding to a target interaction instruction, where the target interaction instruction is the interaction instruction corresponding to the second user voice information.
  • With that scheme, the user does not need to wake the smart device up again when interacting with it several times on the same topic. In that scheme, however, the second user voice information must be strongly correlated in content (the same topic) with the first user voice information
  • for the repeated wake-up to be avoided.
  • In practice the topic may switch frequently. For example, after the voice system turns on a desk lamp, the user may immediately want to listen to a song; in such scenarios the user still has to wake the system up again.
  • Chinese patent application CN105912092A discloses that, when the machine detects a sound signal that is not a wake-up word, the system starts human-body/face detection, or uses sound source localization to steer the camera toward the sound and continue image detection; if a human body or face is detected, the machine is woken up and speech recognition starts.
  • The disadvantage of that solution is that it only concerns the wake-up of a single session and ignores the need for continuous sessions.
  • In addition, when the user has not talked to the machine for a long time, a strict wake-up mechanism is necessary; relying only on simple sound volume and image detection as judgment features at that point lowers the wake-up threshold, and the accuracy is not high enough.
  • The present application provides a voice interaction method, terminal device, and system, which reduce redundant wake-ups during voice interaction and improve the user experience by judging the user's willingness to continue the conversation.
  • A voice interaction method includes: detecting an instruction to initiate voice interaction; in response to the instruction, the terminal entering a voice interaction working state; the terminal receiving first voice information and outputting a processing result for the first voice information; the terminal receiving second voice information and determining whether the second voice information and the first voice information were issued by the same user; if they are determined to be the same user, the terminal outputting a processing result responding to the second voice information; and if they are determined to be different users, the terminal ending the voice interaction working state.
  • In a possible design, the terminal determining whether the second voice information and the first voice information were issued by the same user includes: when receiving the first and second voice information, the terminal acquiring the features of the first and second voice information; and the terminal determining, according to a comparison of those features, whether the second voice information and the first voice information were issued by the same user.
  • In a possible design, the voice feature information is voiceprint model information.
  • In a possible design, the terminal determining whether the second voice information and the first voice information were issued by the same user includes: the terminal acquiring the position or distance information of the user when the first and second voice information are respectively received; and the terminal determining, according to that position or distance information, whether the second voice information and the first voice information were issued by the same user.
  • In a possible design, the terminal uses infrared sensing to detect the distance information of the user and a microphone array to detect the position information of the user.
  • In a possible design, the terminal determining whether the second voice information and the first voice information were issued by the same user includes: the terminal acquiring the facial feature information of the user when the first and second voice information are respectively received; and the terminal determining, by comparing that facial feature information, whether the second voice information and the first voice information were issued by the same user.
  • In a possible design, after the terminal determines that the second voice information and the first voice information were issued by the same user, the terminal further determines whether the face orientation of the user satisfies a preset threshold; the terminal outputs the processing result for the second voice information only if the threshold is satisfied, and otherwise ends the voice interaction working state.
  • In a possible design, determining whether the face orientation of the user satisfies a preset threshold includes: determining the offset between the visual center point of the voice interaction interface and the camera position, and determining, according to that offset, whether the face orientation of the user satisfies the preset threshold.
  • In a possible design, the terminal entering the voice interaction working state further includes: the terminal presenting a first voice interaction interface; after the terminal outputs the processing result for the first voice information, the terminal presenting a second voice interaction interface, the first voice interaction interface being different from the second; and the terminal ending the voice interaction working state includes: the terminal dismissing the second voice interaction interface.
  • A terminal for implementing intelligent voice interaction includes a voice interaction module and a conversation-continuation willingness judgment module. The voice interaction module is used to implement intelligent voice interaction and to output a targeted processing result according to the received voice information. The conversation-continuation willingness judgment module is used to judge whether received first voice information and second voice information come from the same user, where the first voice information is the voice information received after the voice interaction module responds to an instruction to initiate voice interaction, and the second voice information is the voice information received after the voice interaction module outputs the processing result for the first voice information.
  • In a possible design, the conversation-continuation willingness judgment module judging whether the received first and second voice information come from the same user includes: the module determining, according to a comparison of the features of the first and second voice information, whether the second voice information and the first voice information were issued by the same user.
  • In a possible design, the voice feature information is voiceprint model information.
  • In a possible design, the conversation-continuation willingness judgment module judging whether the received first and second voice information come from the same user includes: the module determining, according to the position or distance information of the user when the first and second voice information are received, whether the second voice information and the first voice information were issued by the same user.
  • In a possible design, the conversation-continuation willingness judgment module uses infrared sensing to detect the distance information of the user and a microphone array to detect the position information of the user.
  • In a possible design, the conversation-continuation willingness judgment module judging whether the received first and second voice information come from the same user includes: the module determining, according to the facial feature information of the user when the first and second voice information are received, whether the second voice information and the first voice information were issued by the same user.
  • In a possible design, after the conversation-continuation willingness judgment module determines that the second voice information and the first voice information were issued by the same user, it further determines whether the face orientation of the user satisfies a preset threshold.
  • In a possible design, determining whether the face orientation of the user satisfies a preset threshold includes: determining the offset between the visual center point of the voice interaction interface and the camera position, and determining, according to that offset, whether the face orientation of the user satisfies the preset threshold.
  • In a possible design, the terminal further includes a voice interaction interface presentation module, used to present a first voice interaction interface after the terminal enters the voice interaction working state, and to present a second voice interaction interface after the terminal outputs the processing result for the first voice information, the first voice interaction interface being different from the second.
  • An embodiment of the present application provides a conference system for implementing intelligent voice interaction.
  • The conference system includes any terminal of the foregoing aspects and at least one server.
  • The terminal is connected to the at least one server through a network to implement intelligent voice interaction.
  • The servers include: a voiceprint recognition server, a face recognition server, a speech recognition and semantic understanding server, a speech synthesis server, and a conversation willingness recognition server.
  • An embodiment of the present application provides a chip including a processor and a memory. The memory is used to store computer-executable instructions and the processor is connected to the memory; when the chip runs, the processor executes the computer-executable instructions stored in the memory so that the chip performs any of the above intelligent voice interaction methods.
  • An embodiment of the present application provides a computer storage medium storing instructions which, when run on a computer, cause the computer to perform any of the above intelligent voice interaction methods.
  • An embodiment of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the above intelligent voice interaction methods.
  • Any of the devices, computer storage media, computer program products, chips, and systems for intelligent voice interaction provided above are used to implement the corresponding methods provided above; therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding methods, which are not repeated here.
  • FIG. 1 is a schematic diagram of a system for implementing voice interaction according to an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of a voice interaction method according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of an embodiment of determining whether voice information is issued by the same sender according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of an algorithm that takes orientation deviation into account when calculating the user's face orientation according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of an embodiment in which the interaction interface changes during a voice interaction process according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of an intelligent terminal device according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of another intelligent terminal device according to an embodiment of the present invention.
  • FIG. 1 shows a schematic diagram of a system 100 for implementing voice interaction according to an embodiment of the present invention.
  • The system includes a voice terminal 101, a voiceprint recognition server 102, a face recognition server 103, a speech recognition and semantic understanding server 104, a speech synthesis server 105, and a conversation willingness recognition server 106. The intelligent voice terminal 101 is connected to the servers 102-106 through a network. The functions of the network elements are further described as follows:
  • Terminal 101: responsible for voice capture, image capture, wake-up detection, dialogue management, control management, status indication, sound playback, and content display.
  • Specifically, the terminal 101 may be an intelligent voice terminal that can detect an instruction initiated by the user to start voice interaction, such as a wake-up word spoken by the user, a tap on a button that initiates voice interaction, or some user-defined sound or operation, and that, in response to the user-initiated instruction, enters a voice interaction working state. Unlike merely detecting a wake-up word or another instruction that initiates voice interaction, the voice interaction working state (or voice interaction dialogue state) is a state in which the terminal invokes various processing resources to process received voice information and outputs the corresponding processing result or operation.
  • After receiving first voice information issued by the user, the terminal 101 outputs a processing result for the first voice information, for example answering the user's question or triggering an operation such as joining a meeting or turning on the microphone. After executing the instruction corresponding to the first voice information, the terminal 101 can further determine whether the user wishes to continue the conversation.
  • Specifically, when second voice information is received and it is determined that the second voice information and the first voice information were issued by the same user, the processing result for the second voice information is output; if they are determined to be different users, the voice interaction working state is ended. Whether the users are the same can be determined through information such as face recognition, the speaker's position and/or distance, and the user's voice features. The terminal 101 can further determine whether the user is focused on the current conversation and willing to continue it, for example by determining whether the user's face orientation satisfies a preset threshold, and only output the processing result for the second voice information once the threshold is satisfied; otherwise the voice interaction working state ends.
  • The terminal 101 may also take into account the deviation in judging the user's face orientation that arises when the projections of the voice interaction interface and the screen camera in the normal direction do not coincide. Specifically, when judging the user's face orientation, if the terminal display screen is relatively wide, the projection of the visual center of the voice assistant interface and the projection of the camera position in the normal direction may not coincide. In that case, when the user is looking at the voice assistant interface, there may appear to be a face orientation deviation from the camera's point of view; that is, the camera may consider that the user is not facing the screen even though the user is in fact attentively conversing with the voice assistant interface. Therefore, when the user's face orientation is judged with the camera as the reference position, this deviation needs to be considered.
  • The terminal 101 may also indicate its current working state to the user through different UI interfaces, for example presenting a first voice interaction interface when entering the voice interaction working state, presenting a different, for example more concise, second voice interaction interface after outputting the processing result for the first voice information, and dismissing all voice interaction interfaces only after it is determined that the user has no intention of continuing the conversation.
  • The terminal 101 may be a smartphone, a smart home product (such as a smart speaker), a smart in-vehicle device, a smart wearable device, a smart robot, a conference terminal, and so on; all of these are reasonable. It can be understood that the functions required by the terminal 101 during voice interaction may be realized by connecting to the relevant servers through the network, that is, the terminal 101 may work through communication connections with the servers 102-106; the terminal 101 may also itself integrate all or part of the functions necessary for the intelligent voice interaction of the embodiments of the present invention. In addition, the servers 102-106 are only an exemplary functional division; in an implementation they may have different combinations of functions or provide other services for the terminal.
  • Voiceprint recognition server 102: generates a speaker voiceprint model from the voice data collected by the terminal 101, performs speaker voiceprint comparison to confirm the speaker's identity, and returns the result to the willingness recognition server 106.
  • Face recognition server 103: detects faces in the images collected by the voice terminal, can further calculate the face orientation and recognize the user's identity, and returns the result to the willingness recognition server 106.
  • Speech recognition and semantic understanding server 104: converts the voice signals collected and uploaded by the terminal into text and semantics, and sends them to the terminal 101 or other servers for processing.
  • Speech synthesis server 105: synthesizes into speech the text that the terminal 101 requests to be broadcast over the loudspeaker, and sends it back to the terminal 101.
  • Conversation willingness recognition server 106: receives the information returned by the voiceprint recognition server, the face recognition server, or the infrared sensing device and microphone array on the terminal (voiceprint, face, or the position and/or distance of the speaker's sound source), comprehensively judges whether the speaker is willing to continue the conversation, and sends the result to the terminal 101 (a sketch of one possible way to combine these signals follows below).
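The patent does not specify how server 106 weighs the different signals against each other, so the following is only a minimal Python sketch of one plausible fusion rule under stated assumptions: every field name, threshold, and the veto-style combination are illustrative choices of this sketch, not details taken from the application.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TurnEvidence:
        voiceprint_score: Optional[float]   # similarity of current vs. previous speaker voiceprint (0..1)
        same_face: Optional[bool]           # face recognition: same person as the previous turn?
        position_delta_m: Optional[float]   # change in estimated speaker position, in metres
        face_yaw_deg: Optional[float]       # face orientation deviation from the screen, in degrees

    def willing_to_continue(ev: TurnEvidence,
                            voiceprint_thr: float = 0.7,
                            position_thr_m: float = 0.5,
                            yaw_thr_deg: float = 30.0) -> bool:
        # Same-speaker checks: any available modality may veto continuation.
        if ev.voiceprint_score is not None and ev.voiceprint_score < voiceprint_thr:
            return False
        if ev.same_face is False:
            return False
        if ev.position_delta_m is not None and ev.position_delta_m > position_thr_m:
            return False
        # Attention check: the face must be turned towards the screen.
        if ev.face_yaw_deg is not None and abs(ev.face_yaw_deg) > yaw_thr_deg:
            return False
        return True

In this reading, any available modality can veto continuation, mirroring the description above in which a different speaker, a changed position, or a face turned away from the screen each ends the voice interaction working state.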
  • The embodiments of the present invention provide a voice interaction system that adds speaker conversation-willingness recognition to the voice interaction flow, for example judging whether the person who continues speaking is the same person, in order to decide whether to keep responding to received voice information. This allows the user to continue issuing voice commands to the system after a round of conversation ends without saying the wake-up word again, effectively reducing redundant wake-ups.
  • The system of this embodiment also supports using the face orientation captured by the camera to judge the user's willingness to continue the conversation, thereby improving the accuracy of recognizing the speaker's conversation willingness.
  • The system of this embodiment also supports adding to the existing interaction interface a UI presented after the first round of conversation ends (which may include an appropriate delay after the first round of conversation).
  • An embodiment of the present invention further provides a voice interaction method. As shown in FIG. 2, the method includes the following steps:
  • An instruction to initiate voice interaction is detected. This may also be called an instruction that wakes the terminal up to enter the voice interaction state, and it can take many forms, for example a wake-up word such as "time for a meeting" spoken by the user, the user tapping a button that initiates voice interaction, or another user-predefined sound.
  • In response to the initiated voice interaction instruction, the terminal enters the voice interaction working state.
  • The user speaks the wake-up word, for example "time for a meeting" or "Xiaowei, Xiaowei". When the system detects the wake-up word, it plays a response prompt tone and enters the voice command listening state (also a voice interaction working state); the terminal can also pop up a voice assistant user interface on the screen.
  • The interface contains command prompt information, voice system status indications, and other content. The terminal can interact with the user through ASR and NLP services and a dialogue management function, where ASR is automatic speech recognition and NLP is natural language processing.
  • The terminal receives the first voice information and outputs a processing result for the first voice information.
  • The user then speaks a voice command, such as "join the meeting".
  • The voice signal is recognized (locally or by being sent to the speech recognition server), and the recognition result is returned.
  • The conference terminal executes the task of joining the conference according to the returned result.
  • After that, the current round of conversation need not end immediately; that is, there may be a certain delay, and it is not necessary to enter the conversation-willingness judgment state (for example, a semi-wake state) right away, because the user may immediately issue a new instruction. This delay is generally short, for example 5 seconds; the current round of conversation can be considered to end when the delay ends (a sketch of these session states follows below).
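As a reading aid, the steps described here and the wake delay can be summarised as a small state machine. The sketch below is an interpretation, not the patent's implementation: the state names follow the description (wake-word listening, active interaction, semi-wake), the 5-second figure comes from the text, and everything else (method names, the tick-based timeout) is assumed.

    import time
    from enum import Enum, auto

    class State(Enum):
        WAKE_LISTENING = auto()   # only the wake word is listened for
        ACTIVE = auto()           # full voice interaction working state
        SEMI_WAKE = auto()        # wake delay ended: willingness to continue is judged

    class Session:
        def __init__(self, delay_s: float = 5.0):
            self.state = State.WAKE_LISTENING
            self.delay_s = delay_s
            self.deadline = 0.0

        def on_wake_word(self):
            self.state = State.ACTIVE

        def on_result_output(self):
            # A round of conversation was just answered; stay active for a short
            # delay (e.g. 5 s) in case the user immediately issues another command.
            self.deadline = time.monotonic() + self.delay_s

        def tick(self):
            # Called periodically; once the delay has passed, the round of
            # conversation ends and only willingness to continue is judged.
            if self.state is State.ACTIVE and self.deadline and time.monotonic() > self.deadline:
                self.state = State.SEMI_WAKE

        def on_voice(self, same_user: bool, facing_screen: bool):
            # Second voice information received while in the semi-wake state.
            if self.state is State.SEMI_WAKE:
                if same_user and facing_screen:
                    self.state = State.ACTIVE          # respond without a new wake word
                else:
                    self.state = State.WAKE_LISTENING  # end the voice interaction working state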
  • The terminal receives second voice information and determines whether the second voice information and the first voice information were issued by the same user.
  • If they are determined to be the same user, the terminal outputs a processing result responding to the second voice information; if they are determined to be different users, the terminal ends the voice interaction working state.
  • After the terminal outputs the processing result for the first voice information (or after a certain delay), the terminal enters the conversation-willingness judgment state (for example, the semi-wake state). If second voice information is then received, for example because the user needs to invite someone else into the meeting, the user can directly say "call Zhang San" without saying the wake-up word again.
  • The terminal's conversation willingness recognition server judges, according to the speaker's identity or further according to the face orientation, whether the voice instruction is addressed to the voice assistant; only then does the terminal send the voice segment to the speech recognition server for recognition and enter the normal dialogue flow.
  • Referring to FIG. 3, which is a schematic diagram of an embodiment, according to an embodiment of the present invention, of determining whether the second voice information and the first voice information were issued by the same user:
  • Optionally, if a wake-up word is detected, the terminal can of course simply re-enter the voice interaction working state.
  • Whether it is the same person can be judged by comparing voice feature information, for example by voiceprint comparison. Specifically, when the first voice signal is received, the voice feature information of the first voice signal, such as voiceprint information, is acquired; after the second voice information is received, its voice features are likewise extracted and compared. If a certain threshold is satisfied, the two are judged to be the same user; if not, the voice interaction working state is ended (a minimal sketch of such a comparison follows below). In this case, if, after joining the meeting, someone else speaks near the speaker (without the wake-up word), the conversation-willingness recognition server judges that there is no willingness to continue the conversation, because that person is not the same person as the speaker in the previous round, and does not respond.
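A minimal sketch of such a voiceprint comparison, assuming each utterance has already been reduced to a fixed-length speaker embedding (for example an i-vector or x-vector produced by some external extractor); the embedding model and the 0.75 threshold are assumptions of this sketch, not values given in the patent.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def same_speaker(first_embedding: np.ndarray,
                     second_embedding: np.ndarray,
                     threshold: float = 0.75) -> bool:
        # The first embedding is extracted when the first voice information is received,
        # the second one when the second voice information arrives.
        return cosine_similarity(first_embedding, second_embedding) >= threshold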
  • Optionally, when the terminal detects the second voice signal, it can also use infrared sensing to detect the distance or direction of the speaker relative to the terminal in order to judge whether it is the same person, or use face recognition to judge whether it is the same user. It can be understood that, when the first voice signal is received, the distance information or face information of its sender is also acquired and is compared with the distance or face information of the sender of the second voice signal.
  • Face orientation detection: after joining the meeting, the user may have no further voice commands to issue and may want to talk to a colleague nearby, speaking normally while facing the colleague. In that case, the face orientation can further be used to confirm whether the user is facing the screen, so as to determine the user's conversation willingness, for example by calculating the angle by which the user's face orientation deviates, or by using head pose estimation (Head Pose Estimation, HPE) technology, that is, using computer vision and pattern recognition methods to determine the orientation of a person's head in a digital image and identifying the head pose parameters in a spatial coordinate system, namely the head position parameters (x, y, z) and the direction angle parameters (Yaw, Pitch, Roll).
  • If the face orientation detection result does not meet the set threshold, the conversation willingness recognition server can determine that the user has no desire to continue the dialogue, and the system does not respond, that is, it exits the voice interaction working state.
  • The embodiments of the present invention provide a voice interaction method that adds speaker conversation-willingness recognition to the voice interaction flow, for example judging whether the person who continues speaking is the same person, in order to decide whether to keep responding to received voice information. This allows the user to continue issuing voice commands to the system after a round of conversation ends without saying the wake-up word again, effectively reducing redundant wake-ups.
  • The system of this embodiment also supports using the face orientation captured by the camera to judge the user's willingness to continue the conversation, thereby improving the accuracy of recognizing the speaker's conversation willingness. It is worth noting that the recognition of the user's willingness to continue the conversation in the embodiments of the present invention (speaker recognition and face orientation recognition) does not require speech-to-text conversion or semantic analysis, so it is less difficult to deploy and easier to implement.
  • The embodiments of the present invention also consider the deviation in judging the user's face orientation that arises when the projections of the voice interaction interface and the screen camera in the normal direction do not coincide. Specifically, since the usual algorithms judge the user's face orientation with the camera as the reference, if the terminal display screen is relatively wide, the projection of the visual center of the voice assistant interface and the projection of the camera position in the normal direction may not coincide. In that case, when the user is looking at the voice assistant interface (and is willing to converse), there may appear to be a face orientation deviation from the camera's point of view; that is, the camera may consider that the user is not facing the screen. Therefore, when the user's face orientation is judged with the camera as the reference position, this deviation needs to be considered.
  • This embodiment provides a face orientation correction algorithm for detecting the user's face orientation and judging whether it meets the requirement. Two camera cases are distinguished: a fixed camera, and a sound-tracking target-following camera on a pan-tilt unit. When the projections are aligned, if the user is facing the visual interaction interface (that is, facing the camera), the pan-tilt camera produces no angular deviation.
  • For the fixed camera, the microphone array can locate the user's position (the sound source), forming a line from the user to the microphone's voice receiving point; this line and the line from the user's position to the camera position form an included angle, and the angle value is used to confirm whether the user's face orientation meets the requirement.
  • a: the lateral (left-right) deviation angle of the face in the image (in the illustration, a is negative when the face is turned to the right and positive when it is turned to the left);
  • b: the angle between the screen normal and the projection on the horizontal plane of the line from the speaker's sound source to the visual focus of the voice assistant (in the illustration, b is negative when the face is on the right side of the normal plane through the visual focus of the voice assistant);
  • c: the angle between the screen normal and the projection on the horizontal plane of the line from the speaker's face to the camera (in the illustration, c is negative when the face is on the right side of the normal plane through the camera center);
  • The deviation angle Δ2 is the corrected face orientation value calculated by taking into account that the visual center of the voice interaction interface and the camera are not aligned (a hedged sketch of one possible formulation follows below).
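The patent names the angles a, b, and c and the corrected deviation Δ2 but does not spell out the formula that combines them, so the sketch below encodes one plausible reading: the parallax between the assistant interface and the camera, as seen from the user, is (b - c), and subtracting it from the camera-measured yaw a re-expresses the face orientation relative to the interface. Both the combination and the 20-degree tolerance are assumptions of this sketch.

    def corrected_face_deviation(a_deg: float, b_deg: float, c_deg: float) -> float:
        # a_deg: lateral face yaw measured from the camera image (negative = turned right)
        # b_deg: angle between the user-to-interface line (horizontal projection) and the screen normal
        # c_deg: angle between the user-to-camera line (horizontal projection) and the screen normal
        parallax = b_deg - c_deg   # angular offset between interface and camera as seen from the user
        return a_deg - parallax    # assumed combination: yaw re-expressed relative to the interface

    def facing_interface(a_deg: float, b_deg: float, c_deg: float,
                         threshold_deg: float = 20.0) -> bool:
        # threshold_deg is an assumed tolerance, not a value from the patent
        return abs(corrected_face_deviation(a_deg, b_deg, c_deg)) <= threshold_deg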
  • The method of the embodiments of the present invention can detect the user's face orientation during the conversation more accurately, thereby realizing more intelligent and efficient voice interaction, especially for scenarios where the positions of a large screen and of the voice interaction interface on the screen may change flexibly; it can recognize the user's willingness to continue the conversation more accurately and reduce misjudgments.
  • The present invention further provides an embodiment of a voice interaction interface change.
  • The embodiment of the present invention introduces a semi-wake state indication interface into the user interaction interface: in the wake-up listening state, when the system detects a wake-up word, a voice assistant user interface (UI) (the first voice interaction interface) pops up, and the information displayed on the interface includes command prompts, broadcast messages, speech recognition text results, and an animated icon of the assistant's working state.
  • After a round of conversation ends, the interface does not exit completely but shrinks to a small icon (the second voice interaction interface) to remind the user that the system is in the semi-wake state (wake delay); at this point the system determines whether the user is willing to continue the conversation.
  • If it is determined that the user has no willingness to continue, the voice interaction state is exited completely and the wake-up listening state is entered.
  • This embodiment adds to the existing interaction interface a UI presented after the end of the first round of conversation (which may include an appropriate delay after the first round), such as a semi-wake (wake-delay) state UI.
  • This keeps the interface simple, reduces interference, and also effectively reminds the user of the current working state of the system.
  • An embodiment of the present invention further provides a terminal device 600.
  • The terminal device is a terminal that realizes intelligent voice interaction and includes a voice interaction module 601 and a conversation-continuation willingness judgment module 602. The functions of the modules of the terminal device 600 are described as follows:
  • The voice interaction module 601 is used to implement intelligent voice interaction and to output a targeted processing result according to the received voice information.
  • The conversation-continuation willingness judgment module 602 is used to judge whether the received first voice information and second voice information come from the same user, where the first voice information is the voice information received after the voice interaction module responds to the instruction to initiate voice interaction, and the second voice information is the voice information received after the voice interaction module 601 outputs the processing result for the first voice information.
  • The conversation-continuation willingness judgment module 602 can determine, according to a comparison of the features of the first and second voice information, whether the second voice information and the first voice information were issued by the same user.
  • When the voice feature information is voiceprint model information, as shown in FIG. 6, the conversation-continuation willingness judgment module 602 includes a speaker voiceprint generation unit and a comparison unit, which are used to obtain and compare the voiceprints of the first and second voice information, the comparison result being mapped to the judgment of the user's conversation willingness.
  • The conversation-continuation willingness judgment module can also determine, according to the user's position or distance information when the first and second voice information are received, whether the second voice information and the first voice information were issued by the same user.
  • The conversation-continuation willingness judgment module can use infrared sensing to detect the distance information of the user and a microphone array to detect the position information of the user; as shown in FIG. 6, in this case it includes an azimuth/distance acquisition unit and a comparison unit, which are used to obtain and compare the orientation and distance information of the user when the terminal receives the first and second voice information, the comparison result being mapped to the judgment of the user's conversation willingness.
  • The conversation-continuation willingness judgment module can also determine, according to the facial feature information of the user when the first and second voice information are received, whether the second voice information and the first voice information were issued by the same user.
  • In this case, the conversation-continuation willingness judgment module includes a facial feature generation unit and a comparison unit, which are respectively used to obtain and compare the facial features of the user when the terminal receives the first and second voice information, the comparison result being mapped to the judgment of the user's conversation willingness.
  • After determining that both pieces of voice information were issued by the same user, the conversation-continuation willingness judgment module further determines whether the user's face orientation meets a preset threshold.
  • To this end, the conversation-continuation willingness judgment module includes a sound source localization unit and a face detection unit.
  • The sound source localization unit is used to locate the user's position (the sound source) or the direction of the voice through the microphone array (a sketch of a common direction-of-arrival estimate follows below).
  • The face detection unit is used to detect the position of the user's face so that the user's face orientation can be calculated.
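The patent only states that a microphone array is used for localization. As an illustration, the sketch below shows the standard far-field time-difference-of-arrival (TDOA) estimate for a two-microphone array, which is one common way such a unit can be built; the microphone spacing and delay in the example are made up.

    import math

    SPEED_OF_SOUND = 343.0  # metres per second at roughly room temperature

    def doa_from_tdoa(tdoa_s: float, mic_spacing_m: float) -> float:
        # Far-field model for a two-microphone array: the sine of the arrival angle
        # is proportional to the time difference of arrival between the microphones.
        s = max(-1.0, min(1.0, SPEED_OF_SOUND * tdoa_s / mic_spacing_m))
        return math.degrees(math.asin(s))   # angle relative to the array broadside

    # Example with made-up numbers: a 0.1 ms delay across microphones 10 cm apart.
    print(round(doa_from_tdoa(1.0e-4, 0.10), 1), "degrees")   # about 20.1 degrees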
  • When judging the face orientation, the offset between the visual center point of the voice interaction interface and the position of the camera may further be considered; according to this offset it is determined whether the user's face orientation meets the preset threshold, and this result is mapped to the judgment of the user's willingness to continue the conversation.
  • A lip movement detection unit may be used to detect whether the user is speaking, so as to further confirm the user's willingness to continue the conversation. For example, the user's voice may sometimes be relatively low and not detected by the terminal; however, by detecting that the user's lips are moving, together with the preceding same-user judgment and face orientation recognition, it can be confirmed that the user is indeed engaged in further conversation, and the voice interaction state is maintained to avoid exiting prematurely.
  • The terminal further includes a voice interaction interface presentation module 603, which is used to present a first voice interaction interface after the terminal enters the voice interaction working state and to present a second voice interaction interface after the terminal outputs the processing result for the first voice information.
  • The first voice interaction interface is different from the second voice interaction interface.
  • For example, the second voice interaction interface is more concise and avoids disturbing the user.
  • The various information required by the above conversation-continuation willingness judgment module can be collected and obtained by the terminal itself, or can be obtained from related devices or servers connected over a network or by cable; even the conversation-continuation willingness judgment module itself can be realized by a device or server connected over a network or by cable. In that case the terminal only serves as the interface for voice interaction with the user, responsible for collecting user information such as voice and images and for outputting processed voice and image information, with the processing functions moved to the cloud.
  • The "modules" or "units" in FIG. 6 may be application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), electronic circuits, processors and memories that execute one or more software or firmware programs, combinational logic circuits, or other components providing the described functions. If an integrated unit or module is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • FIG. 7 is a schematic structural diagram of a terminal device 700 according to an embodiment of the present application.
  • The structure includes a processor 701, a memory 702, a transceiver 703, a display 704, and a detector 705 (a microphone, and possibly also a camera, an infrared detection device, and the like).
  • The processor 701 is connected to the memory 702 and the transceiver 703; for example, the processor 701 may be connected to the memory 702 and the transceiver 703 through a bus.
  • The processor 701 may be configured so that the terminal device 700 performs the corresponding functions in the foregoing embodiments.
  • The processor 701 may be a central processing unit (CPU), a network processor (NP), a hardware chip, or any combination thereof.
  • The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • The memory 702 is used to store program code and the like.
  • The memory 702 may include volatile memory, such as random access memory (RAM); the memory 702 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 702 may also include a combination of the above kinds of memory.
  • The detector 705 includes an audio pickup device such as a microphone, which sends the voice information issued by the user (such as the first or second voice information) to the processor for processing or for sound field localization; it may also include devices such as a camera or an infrared sensor, which collect user-related information (face, distance, orientation, and so on) and send it to the processor 701 for processing.
  • The transceiver 703 may be a communication module or a transceiver circuit, used to transfer data, signaling, and other information between the terminal device and other network elements, such as the various servers in the foregoing embodiments.
  • The processor 701 may call the program code to perform the operations in the method embodiments described in FIGS. 2-5.
  • The computer program product includes one or more computer instructions.
  • The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted through a computer-readable storage medium.
  • The computer instructions may be sent from one website, computer, server, or data center to another website, computer, server, or data center by wire (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (such as by infrared, radio, or microwave).
  • The computer-readable storage medium may be any available medium that can be accessed by a computer, for example magnetic media (e.g., a floppy disk, hard disk, or magnetic tape), optical media (e.g., a DVD), or semiconductor media (e.g., a solid-state drive (SSD)).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephone Function (AREA)

Abstract

A voice interaction method includes: after detecting an instruction to initiate voice interaction, a terminal enters a voice interaction working state; the terminal receives first voice information and outputs a processing result for the first voice information; the terminal receives second voice information and determines whether the second voice information and the first voice information were issued by the same user; if they are determined to be the same user, a processing result responding to the second voice information is output; if they are determined to be different users, the voice interaction working state is ended. By adding speaker conversation-willingness recognition to the voice interaction flow, for example judging whether the person who continues speaking is the same person, in order to decide whether to keep responding to received voice information, the method allows the user to issue successive voice commands to the system after a round of conversation ends without having to say the wake-up word again, effectively reducing redundant wake-ups in voice interaction, in particular in the voice interaction of conferences.

Description

Voice interaction method, device, and system
Technical Field
The present application relates to the technical field of human-computer interaction, and in particular to a method, device, and system for human-computer voice interaction.
Background
With the rise of artificial intelligence, voice interaction has become widely used as a new interaction technology across many industries: home smart speakers, voice-controlled in-vehicle terminals, personal voice assistants, voice control of conference systems, and so on. Compared with mouse-and-keyboard interaction and touch interaction, voice interaction offers many advantages, such as being contact-free, freeing the hands and eyes, and being natural (no learning required). Limited by the current state of artificial intelligence, however, voice interaction systems cannot yet fully reproduce the natural, fluent dialogue between people. For example, current voice interaction systems cannot actively determine whether a nearby speaker is talking to the system, and must rely on a specific triggering event before they start listening for voice commands. The specific event here may be the user uttering a designated, trained wake-up word, producing a particular sound, making a particular gesture, pressing a physical button, or tapping an icon on the screen; among these, wake-word triggering best preserves the advantages of voice interaction and is currently the most widely used wake-up method in voice interaction systems. However, after the voice system has been woken up for the first time, once the user's voice command has been executed or a round of conversation has ended, the user must say the wake-up word again to issue the next voice command or enter the next round of conversation. This interaction flow is far removed from the way people converse with each other (when people talk, as long as the conversation has not clearly ended, we do not need to keep calling the other person's name to keep it going, even when the topic changes). The undesirable consequence is that users often forget to say the wake-up word while talking to the system, which breaks the continuity of the dialogue. This is a common problem of all current voice interaction systems.
Chinese patent application CN108182943A discloses a smart device control method, apparatus, and smart device. The smart device control method includes: after responding to the interaction instruction corresponding to first user voice information, keeping the working state; after receiving second user voice information, obtaining a speech recognition result corresponding to the second user voice information; determining, according to the speech recognition result, whether the correlation between the second user voice information and the first user voice information is greater than or equal to a preset correlation; and, if so, responding to a target interaction instruction, where the target interaction instruction is the interaction instruction corresponding to the second user voice information. With that scheme, after the smart device has been woken up, the user does not need to wake it up again when interacting with it several times on the same topic. In that scheme, however, the second user voice information must be strongly correlated in content (the same topic) with the first user voice information for the repeated wake-up to be avoided; in practice the topic may switch frequently, for example after asking the voice system to turn on a desk lamp the user may immediately want to listen to a song, and in such scenarios the user still has to wake the system up again. Chinese patent application CN105912092A discloses that, when the machine detects a sound signal that is not a wake-up word, the system starts human-body/face detection, or uses sound source localization to steer the camera toward the sound and continue image detection; if a human body or face is detected, the machine is woken up and speech recognition starts. The drawbacks of that scheme are that it only concerns the wake-up of a single session and ignores the need for continuous sessions; moreover, when the user has not talked to the machine for a long time, a strict wake-up mechanism is necessary, and relying only on simple sound volume and image detection as judgment features lowers the wake-up threshold, so the accuracy is not high enough.
Summary of the Invention
The present application provides a voice interaction method, terminal device, and system, which reduce redundant wake-ups during voice interaction and improve the user experience by judging the user's willingness to continue the conversation.
According to a first aspect, a voice interaction method is provided. The method includes: detecting an instruction to initiate voice interaction; in response to the instruction to initiate voice interaction, the terminal entering a voice interaction working state; the terminal receiving first voice information and outputting a processing result for the first voice information; the terminal receiving second voice information and determining whether the second voice information and the first voice information were issued by the same user; if they are determined to be the same user, the terminal outputting a processing result responding to the second voice information; and if they are determined to be different users, the terminal ending the voice interaction working state.
In a possible design, the terminal determining whether the second voice information and the first voice information were issued by the same user includes: when receiving the first and second voice information, the terminal separately acquiring the features of the first and second voice information; and the terminal determining, according to a comparison of the features of the first and second voice information, whether the second voice information and the first voice information were issued by the same user.
In a possible design, the voice feature information is voiceprint model information.
In a possible design, the terminal determining whether the second voice information and the first voice information were issued by the same user includes: the terminal separately acquiring the position or distance information of the user when the first and second voice information are received; and the terminal determining, according to the user position or distance information, whether the second voice information and the first voice information were issued by the same user.
In a possible design, the terminal uses infrared sensing to detect the distance information of the user and uses a microphone array to detect the position information of the user.
In a possible design, the terminal determining whether the second voice information and the first voice information were issued by the same user includes: the terminal separately acquiring the facial feature information of the user when the first and second voice information are received; and the terminal determining, by comparing the facial feature information of the user, whether the second voice information and the first voice information were issued by the same user.
In a possible design, after the terminal determines that the second voice information and the first voice information were issued by the same user, the terminal further determines whether the face orientation of the user satisfies a preset threshold; the terminal outputs the processing result for the second voice information only if the preset threshold is satisfied, and otherwise the terminal ends the voice interaction working state.
In a possible design, determining whether the face orientation of the user satisfies a preset threshold includes: determining the offset between the visual center point of the voice interaction interface and the camera position, and determining, according to the offset, whether the face orientation of the user satisfies the preset threshold.
In a possible design, the terminal entering the voice interaction working state further includes: the terminal presenting a first voice interaction interface; after the terminal outputs the processing result for the first voice information, the terminal presenting a second voice interaction interface, the first voice interaction interface being different from the second voice interaction interface; and the terminal ending the voice interaction working state includes: the terminal dismissing the second voice interaction interface.
According to a second aspect, a terminal for implementing intelligent voice interaction is provided, including a voice interaction module and a conversation-continuation willingness judgment module. The voice interaction module is configured to implement intelligent voice interaction and to output a targeted processing result according to received voice information. The conversation-continuation willingness judgment module is configured to judge whether received first voice information and second voice information come from the same user, where the first voice information is the voice information received after the voice interaction module responds to an instruction to initiate voice interaction, and the second voice information is the voice information received after the voice interaction module outputs the processing result for the first voice information.
In a possible design, the conversation-continuation willingness judgment module judging whether the received first and second voice information come from the same user includes: the conversation-continuation willingness judgment module determining, according to a comparison of the features of the first and second voice information, whether the second voice information and the first voice information were issued by the same user.
In a possible design, the voice feature information is voiceprint model information.
In a possible design, the conversation-continuation willingness judgment module judging whether the received first and second voice information come from the same user includes: the conversation-continuation willingness judgment module determining, according to the position or distance information of the user when the first and second voice information are received, whether the second voice information and the first voice information were issued by the same user.
In a possible design, the conversation-continuation willingness judgment module uses infrared sensing to detect the distance information of the user and uses a microphone array to detect the position information of the user.
In a possible design, the conversation-continuation willingness judgment module judging whether the received first and second voice information come from the same user includes: the conversation-continuation willingness judgment module determining, according to the facial feature information of the user when the first and second voice information are received, whether the second voice information and the first voice information were issued by the same user.
In a possible design, after the conversation-continuation willingness judgment module determines that the second voice information and the first voice information were issued by the same user, it further determines whether the face orientation of the user satisfies a preset threshold.
In a possible design, determining whether the face orientation of the user satisfies a preset threshold includes: determining the offset between the visual center point of the voice interaction interface and the camera position, and determining, according to the offset, whether the face orientation of the user satisfies the preset threshold.
In a possible design, the terminal further includes a voice interaction interface presentation module, configured to present a first voice interaction interface after the terminal enters the voice interaction working state, and to present a second voice interaction interface after the terminal outputs the processing result for the first voice information, the first voice interaction interface being different from the second voice interaction interface.
According to a third aspect, an embodiment of the present application provides a conference system for implementing intelligent voice interaction. The conference system includes any terminal of the foregoing aspects and at least one server; the terminal is connected to the at least one server through a network to implement intelligent voice interaction. The servers include: a voiceprint recognition server, a face recognition server, a speech recognition and semantic understanding server, a speech synthesis server, and a conversation willingness recognition server.
According to a fourth aspect, an embodiment of the present application provides a chip including a processor and a memory. The memory is configured to store computer-executable instructions and the processor is connected to the memory; when the chip runs, the processor executes the computer-executable instructions stored in the memory so that the chip performs any of the above intelligent voice interaction methods.
According to a fifth aspect, an embodiment of the present application provides a computer storage medium storing instructions which, when run on a computer, cause the computer to perform any of the above intelligent voice interaction methods.
According to a sixth aspect, an embodiment of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the above intelligent voice interaction methods.
In addition, for the technical effects brought by any of the designs of the second to sixth aspects, reference may be made to the technical effects brought by the different designs of the first aspect, and details are not repeated here.
It can be understood that any of the devices, computer storage media, computer program products, chips, and systems for intelligent voice interaction provided above are used to implement the corresponding methods provided above; therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the corresponding methods, and details are not repeated here.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a system for implementing voice interaction according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a voice interaction method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of determining whether voice information is issued by the same sender according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an algorithm that takes orientation deviation into account when calculating the user's face orientation according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment in which the interaction interface changes during a voice interaction process according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an intelligent terminal device according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of another intelligent terminal device according to an embodiment of the present invention.
Detailed Description of the Embodiments
Some of the terms used in the embodiments of the present application are explained below to help those skilled in the art understand them.
FIG. 1 is a schematic diagram of a system 100 for implementing voice interaction according to an embodiment of the present invention. The system includes a voice terminal 101, a voiceprint recognition server 102, a face recognition server 103, a speech recognition and semantic understanding server 104, a speech synthesis server 105, and a conversation willingness recognition server 106. The intelligent voice terminal 101 is connected to the above servers 102-106 through a network. The functions of the network elements are further described as follows:
Terminal 101: responsible for functions such as voice capture, image capture, wake-up detection, dialogue management, control management, status indication, sound playback, and content display.
Specifically, the terminal 101 may be an intelligent voice terminal that can detect an instruction initiated by the user to start voice interaction, such as a wake-up word spoken by the user, a tap on a button that initiates voice interaction, or some user-defined sound or operation, and that, in response to the user-initiated voice interaction instruction, enters a voice interaction working state. Unlike merely detecting a wake-up word or another instruction that initiates voice interaction, the voice interaction working state, which may also be called the voice interaction dialogue state, is a state in which the terminal 101 invokes various processing resources to process received voice information and outputs the corresponding processing result or operation. After receiving first voice information issued by the user, the terminal 101 outputs a processing result for the first voice information, for example answering the user's question or triggering an operation such as joining a conference or turning on a microphone. After executing the instruction corresponding to the first voice information, the terminal 101 may further judge whether the user is willing to continue the conversation. Specifically, when second voice information is received and it is determined that the second voice information and the first voice information were issued by the same user, a processing result for the second voice information is output; if they are determined to be different users, the voice interaction working state is ended. Whether the users are the same may be determined through information such as face recognition, the speaker's position and/or distance, and the user's voice features. The terminal 101 may further judge whether the user is focused on the current conversation and willing to continue it, for example by determining whether the user's face orientation satisfies a preset threshold, and only output the processing result for the second voice information once the preset threshold is satisfied, and otherwise end the voice interaction working state. Further, the terminal 101 may also take into account the deviation in judging the user's face orientation that arises when the projections of the voice interaction interface and the screen camera in the normal direction do not coincide. Specifically, when judging the user's face orientation, if the terminal display screen is relatively wide, the projection of the visual center of the voice assistant interface and the projection of the camera position in the normal direction may not coincide; in that case, when the user is looking at the voice assistant interface, there may appear to be a face orientation deviation from the camera's point of view, that is, the camera may consider that the user is not facing the screen even though the user is in fact attentively conversing with the voice assistant interface. Therefore, when the user's face orientation is judged with the camera as the reference position, this deviation needs to be taken into account.
The terminal 101 may also indicate its current working state to the user through different UI interfaces, for example presenting a first voice interaction interface when entering the voice interaction working state, presenting a different, for example more concise, second voice interaction interface after outputting the processing result for the first voice information, and dismissing all voice interaction interfaces only after it is determined that the user has no intention of continuing the conversation.
The terminal 101 may be a smartphone, a smart home product (such as a smart speaker), a smart in-vehicle device, a smart wearable device, a smart robot, a conference terminal, and so on; all of these are reasonable. It can be understood that the functions required by the terminal 101 during voice interaction may be realized by connecting to the relevant servers through the network, that is, the terminal 101 may work through communication connections with the servers 102-106; the terminal 101 may also itself integrate all or part of the functions necessary for the intelligent voice interaction of the embodiments of the present invention. In addition, the servers 102-106 are only an exemplary functional division; in an implementation they may have different combinations of functions or provide other services for the terminal.
声纹识别服务器102:根据终端101采集到的语音数据,生成话者声纹模型,并进行话者声纹比对,确认话者身份,结果返回会话意愿识别服务器106;
人脸识别服务器103:从语音终端采集到的图像中检测人脸,并可以进一步计算人脸朝向以及进行用户身份识别,结果返回会话意愿识别服务器106;
语音识别和语义理解服务器104:将终端采集上传的语音信号转换为文本和语义,发送给终端101或者其他服务器处理;
语音合成服务器105:将终端101请求扬声器播报的文字合成语音,并送回终端101;
会话意愿识别服务器106:接收声纹识别、人脸识别服务器,或者终端上红外感应装置和麦克风阵列等返回的信息(声纹,人脸或者说话者声源方位和/或距离),综合判断话者是否有继续对话意愿,并将结果发送至终端101;
本发明实施例提供的语音交互系统通过在语音交互流程中增加话者对话意愿识别,例如判断继续说话的人是否是同一个人来决定是否继续响应收到的语音信息,支持用户在一轮会话结束后不必再次说出唤醒词(或者其他唤醒方式)即可连续向系统发出语音指令,有效减少了语音交互过程中的冗余唤醒;同时,旁人插话以及话者与旁人交流的语音信号会被智能过滤,有效减少系统的误响应,从而提升语音交互的流畅性和准确性,改善用户体验;本实施例的系统还支持利用摄像头采集到的人脸朝向来判断用户继续对话的意愿,从而提升话者对话意愿识别的准确度;本实施例的系统支持在现有交互界面中增加第一轮会话(可包括第一轮会话后的适当延时)结束后的UI界面,例如半唤醒(唤醒延时)状态UI,既保证界面的简洁,减少干扰,也能有效提示用户系统当前所处的工作状态。值得指出的是,本发明实施例对用户继续对话意愿的识别(话者识别和人脸朝向识别)不需要进行语音到文字的转换或语义的分析,部署难度较低,更容易实现。
利用附图1中所述的系统,本发明实施例进一步提供了一种语音交互的方法,如图2所示,所述方法包括步骤:
S201、检测到发起语音交互的指示;
也可以称为唤醒终端开始进入语音交互状态的指示,如前所述,发起语音交互的指示可以有多种形式,例如用户说出的唤醒词“开会了”,用户点击发起语音交互的按钮,或者其他用户预定义的声音等。
S202、响应于所述发起的语音交互指示,所述终端进入语音交互工作状态;
用户说出唤醒词“开会了”或“小微小微”,当系统检测到唤醒词后播放应答提示音,进入语音指令收听状态(也是一种语音交互工作状态),终端还可以在屏幕上弹出语音助手用户界面。界面包含命令提示信息、语音系统状态指示等内容;终端可以通过ASR和NLP服务以及对话管理功能与用户进行交互,其中ASR为自动语音识别,NLP为自然语言处理。
S203、所述终端收到第一语音信息,输出针对所述第一语音信息的处理结果;
接着用户说出语音指令,如“加入会议”。语音信号被识别(本地识别或者送往语音识别服务器),并返回识别结果。会议终端根据返回的结果执行加入会议任务。
会议终端根据返回的结果执行加入会议任务后,本轮会话可以不马上结束,即可以有一定的时延,不必立即进入会话意愿判断状态(例如半唤醒状态),因为用户可能还会马上再发出新的指示,这个时延一般较短,例如5秒;可以认为时延结束后,本轮会话结束。
S204、所述终端收到第二语音信息,判断所述第二语音信息与所述第一语音信息的发出者是否为同一用户;如果判断为同一用户,则所述终端输出响应于所述第二语音信息的处理结果;如果判断为不同用户,则所述终端结束所述语音交互工作状态。
终端输出针对第一语音信息的处理结果后(或者经过了一定的时延),终端即进入会话意愿判断状态(例如半唤醒状态),此时终端收到用户发出的第二语音信息,如果该用户需要邀请其他人入会,可以直接说"呼叫张三",而不必再次说出唤醒词。终端通过对话意愿识别服务器,依据话者身份或者进一步根据人脸朝向判断该语音指令是向语音助手发出的,此时终端才会将该语音片段送往语音识别服务器进行识别,进入正常对话流程;
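为便于理解上述S201至S204的整体处理顺序,下面给出一段示意性的Python伪代码草图,仅用于说明流程思路;其中terminal对象及detect_wake_indication、is_same_speaker等方法名,以及5秒的延时取值,均为本示例的假设,并非对实现方式的限定:

```python
import time

WAKE_DELAY_SECONDS = 5  # 假设的会话结束延时,对应"输出结果后不必立即进入会话意愿判断状态"

def voice_interaction_loop(terminal):
    """示意性的语音交互主循环,按S201-S204的顺序组织。"""
    while True:
        # S201: 检测发起语音交互的指示(唤醒词、按键等)
        if not terminal.detect_wake_indication():
            continue

        # S202: 进入语音交互工作状态,可同时呈现第一语音交互界面
        terminal.enter_interaction_state()

        # S203: 收到第一语音信息并输出处理结果(如执行"加入会议")
        first_voice = terminal.listen()
        terminal.output_result(first_voice)
        first_profile = terminal.capture_speaker_profile()  # 声纹/方位/距离/人脸等

        # 输出结果后保留一小段延时,再进入会话意愿判断状态(半唤醒)
        time.sleep(WAKE_DELAY_SECONDS)
        terminal.show_semi_wake_ui()

        # S204: 收到第二语音信息,判断发出者是否为同一用户
        second_voice = terminal.listen()
        second_profile = terminal.capture_speaker_profile()
        if terminal.is_same_speaker(first_profile, second_profile):
            terminal.output_result(second_voice)   # 同一用户:继续响应
        else:
            terminal.exit_interaction_state()      # 不同用户:结束语音交互工作状态
```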
参考图3,为本发明实施例判断第二语音信息与所述第一语音信息的发出者是否相同的一个实施例示意图:
S2041、终端检测到第二语音信号;
可选的,如果检测到唤醒词,终端当然可以重新进入语音交互工作状态;
判断是否同一个人的方法,可以是通过语音特征信息比对,例如声纹比对,具体的,终端在收到第一语音信号的时候,即获取了第一语音信号的声音特征信息,例如声纹信息,在收到第二语音信息后,也提取出第二语音信息的语音特征进行对比,如果满足一定的阈值,则判定为同一个用户,如果不同,则结束语音交互工作状态;这种情况下,如果加入会议后,话者身边有其他人说话(不包含唤醒词),会话意愿识别服务器依据该话者与上轮对话话者非同一人,判断话者无继续对话意愿,不予响应。
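以声纹比对为例,下面给出一个判断两段语音是否来自同一用户的示意性Python草图;这里假设声纹特征已经由声纹模型提取为定长向量,0.8的相似度阈值仅为示例取值,实际阈值需要结合具体声纹模型标定:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.8  # 示例阈值,并非本申请限定的取值

def is_same_speaker_by_voiceprint(embedding_1: np.ndarray, embedding_2: np.ndarray) -> bool:
    """比较第一、第二语音信息的声纹特征向量,余弦相似度达到阈值则判定为同一用户。"""
    cos_sim = float(np.dot(embedding_1, embedding_2) /
                    (np.linalg.norm(embedding_1) * np.linalg.norm(embedding_2)))
    return cos_sim >= SIMILARITY_THRESHOLD

# 调用示例(特征向量为随机生成的演示数据)
emb1 = np.random.rand(192)
emb2 = np.random.rand(192)
print(is_same_speaker_by_voiceprint(emb1, emb2))
```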
可选的,终端检测到第二语音信号时,还可以通过红外感应探测话者与终端的距离或者方位来判断是否是同一人,或者利用人脸识别来判断是否是同一个用户,可以理解的是,终端在收到第一语音信号的时候也获取了第一语音信号发出者的距离信息或者人脸信息,据此和第二语音信号发出者的距离或者人脸信息进行比对判断;
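对于利用红外感应得到的距离信息和麦克风阵列得到的方位信息,可以按下面的示意性草图进行比对;其中10度的方位容差和0.3米的距离容差均为假设值:

```python
def is_same_speaker_by_position(azimuth_1_deg: float, distance_1_m: float,
                                azimuth_2_deg: float, distance_2_m: float,
                                azimuth_tol_deg: float = 10.0,
                                distance_tol_m: float = 0.3) -> bool:
    """根据收到第一、第二语音信息时用户的方位和距离,粗略判断是否为同一用户。"""
    # 处理角度回绕,取两方位角之间的最小夹角
    azimuth_diff = abs((azimuth_1_deg - azimuth_2_deg + 180.0) % 360.0 - 180.0)
    distance_diff = abs(distance_1_m - distance_2_m)
    return azimuth_diff <= azimuth_tol_deg and distance_diff <= distance_tol_m
```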
S2042、人脸朝向检测;如果加入会议后,用户可能没有其他语音指令需要发出,并且想要跟身边的同事对话,用户可能面向同事正常说话,此时,可以进一步通过人脸朝向确认用户是否正面对屏幕来确定用户的会话意愿,例如通过计算用户面部朝向偏差的角度来确认,再如采用头部姿态估计(Head Pose Estimate,HPE)技术来确认,即利用计算机视觉和模式识别的方法在数字图像中判断人头部的朝向,也就是在一个空间坐标系内识别头部的姿态方向参数,即头部位置参数(x,y,z)和方向角度参数(Yaw,Pitch,Roll)。按照估计结果的不同,分为离散的粗糙头部姿态估计(单张图像)和连续的精细头部姿态估计(视频),本发明实施例在此不再赘述。如果人脸朝向检测结果未满足设定阈值要求,对话意愿识别服务器可以判断该用户无持续对话意愿,系统不予响应,即退出语音交互工作状态。
本发明实施例提供的语音交互方法通过在语音交互流程中增加话者对话意愿识别,例如判断继续说话的人是否是同一个人来决定是否继续响应收到的语音信息,支持用户在一轮会话结束后不必再次说出唤醒词(或者其他唤醒方式)即可连续向系统发出语音指令,有效减少了语音交互过程中的冗余唤醒,旁人插话以及话者与旁人交流的语音信号会被智能过滤,有效减少系统的误响应,从而提升语音交互的流畅性和准确性,改善用户体验;
本实施例的系统还支持利用摄像头采集到的人脸朝向来判断用户继续对话的意愿,从而提升话者对话意愿识别的准确度;值得指出的是,本发明实施例对用户继续对话意愿的识别(话者识别和人脸朝向识别)不需要进行语音到文字的转换或语义分析,部署难度较低,更容易实现。
进一步的,本发明实施例还考虑语音交互界面和屏幕摄像头在法线方向上的投影并不重合时导致的对用户面部朝向判断的偏差。具体来说,由于通常的算法都是以摄像头为基准来判断用户面部朝向的,如果终端显示屏幕比较宽大,语音助手界面的视觉中心位置和摄像头的位置在法线方向上的投影可能并不重合,此时用户注视着语音助手界面的时候(具备对话意愿),在摄像头看来,可能是存在面部朝向偏差的,即摄像头可能认为用户并没有正面对着屏幕,因此以摄像头为中心位置来判断用户面部朝向的时候,需要考虑这个偏差。
本实施例中提供了一种人脸朝向修正算法,用于检测用户的人脸朝向并判断其是否满足要求:摄像头分为固定摄像头和带云台巡声目标追踪摄像头两种情形。当投影对齐时,用户如果正对着视觉交互界面(即正对着摄像头),云台摄像头不会产生角度偏差,如果用户面部不是正对着摄像头(交互界面),此时摄像头即可根据人脸朝向算法判断用户是否正面对屏幕,例如通过计算用户面部朝向偏差的角度(△=a)来确认;对于固定摄像头的情形,还可以是通过麦克风阵列定位用户位置(声源),形成用户到麦克风语音接收点的连线,该连线和用户位置与摄像头位置的连线形成一个夹角,通过夹角值确认用户人脸朝向是否满足要求;
同样是云台摄像头的情形,如果语音助手界面的视觉中心位置(可由系统获取或者由语音助手上报)和摄像头的位置(可以是固定配置)在法线方向上的投影并不对齐,二者与用户位置的连线形成一个夹角,那么在计算偏差角度△时,就要考虑这个夹角,如附图4所示,假设:
a=人脸图像的横向(左右)侧偏角度;(图例中,人脸右偏时a值取负数,人脸左偏时a值取正数);
b=话者声源与语音助手视觉焦点连线在水平面上的投影与屏幕法向的夹角(图例中,当人脸处在语音助手视觉焦点法向竖直平面右侧时,b值为负数);
c=话者人脸与摄像头连线在水平面上的投影与屏幕法向的夹角。(图例中,当人脸处在摄像头中心法向竖直平面右侧时,c值为负数);
那么人脸朝向与正视语音助手视觉焦点方向的偏差角度△2=a+(b-c);
这里,偏差角度△2即为考虑了语音交互界面视觉中心和摄像头并不对齐的情况计算出的人脸朝向修正值。
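按照上文给出的符号约定(a为人脸横向侧偏角度,b、c分别为话者声源与语音助手视觉焦点连线、话者人脸与摄像头连线在水平面投影与屏幕法向的夹角,处于右侧时取负值),该修正计算可以写成如下示意性Python草图;其中30度的判定阈值仅为示例:

```python
def corrected_face_deviation(a_deg: float, b_deg: float, c_deg: float) -> float:
    """按文中公式 △2 = a + (b - c) 计算考虑界面与摄像头投影不对齐后的人脸朝向偏差角度。"""
    return a_deg + (b_deg - c_deg)

def is_facing_assistant(a_deg: float, b_deg: float, c_deg: float,
                        threshold_deg: float = 30.0) -> bool:
    """偏差角度的绝对值不超过阈值时,认为用户正视语音助手界面,具有继续对话意愿。"""
    return abs(corrected_face_deviation(a_deg, b_deg, c_deg)) <= threshold_deg

# 示例:人脸左偏10度,声源位于视觉焦点法向平面左侧15度,人脸位于摄像头法向平面左侧5度
print(corrected_face_deviation(10.0, 15.0, 5.0))  # 输出 20.0
```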
通过本发明实施例的方法,可以更加准确的检测用户在对话时候的面部朝向,从而实现更加智能高效的语音交互,特别是对于大屏幕和语音交互界面在屏幕上的位置可能灵活变化的场景,可以实现更精准的用户继续对话意愿识别,减少误判。
如附图5所示,本发明进一步提供了语音交互界面变化的一个实施例,本发明实施例在用户交互界面中引入了半唤醒状态指示界面:唤醒监听状态下,当系统检测到唤醒词时,弹出语音助手用户界面(UI)(第一语音交互界面),界面显示的信息包括命令提示、播报语、语音识别文字结果、助手工作状态动画图标等。本轮会话结束后,进入半唤醒状态,界面不会完全退出,而是收缩为一个小的图标(第二语音交互界面),用以提示用户系统正处于半唤醒状态(唤醒延时),此时系统会判断用户是否有继续对话意愿,半唤醒状态结束后,再完全退出语音交互状态,进入唤醒监听状态。
本实施例通过在现有交互界面中增加第一轮会话(可包括第一轮会话后的适当延时)结束后的UI界面,例如半唤醒(唤醒延时)状态UI,既保证界面的简洁,减少干扰,也能有效提示用户系统当前所处的工作状态。
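上述界面切换可以抽象为一个简单的状态机,如下示意性草图所示;状态名称和事件名称均为本示例的假设:

```python
from enum import Enum, auto

class UiState(Enum):
    LISTENING_FOR_WAKE = auto()   # 唤醒监听状态,不显示语音助手界面
    FULL_ASSISTANT_UI = auto()    # 第一语音交互界面:命令提示、播报语、识别文字等
    SEMI_WAKE_ICON = auto()       # 第二语音交互界面:收缩为小图标,提示半唤醒状态

def next_ui_state(state: UiState, event: str) -> UiState:
    """根据事件在三种界面状态之间切换。"""
    if state is UiState.LISTENING_FOR_WAKE and event == "wake_word_detected":
        return UiState.FULL_ASSISTANT_UI
    if state is UiState.FULL_ASSISTANT_UI and event == "first_result_output":
        return UiState.SEMI_WAKE_ICON
    if state is UiState.SEMI_WAKE_ICON and event == "same_user_continues":
        return UiState.FULL_ASSISTANT_UI
    if state is UiState.SEMI_WAKE_ICON and event == "no_continue_intent":
        return UiState.LISTENING_FOR_WAKE
    return state
```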
本发明实施例进一步提供了一种终端设备600,如附图6所示,该终端设备是一种实现智能语音交互的终端,包括:语音交互模块601和继续对话意愿判断模块602,下面具体描述该终端设备600各个模块的功能:
所述语音交互模块601,用于实现智能语音交互,根据收到的语音信息,输出针对性的处理结果;
所述继续对话意愿判断模块602,用于判断收到的第一语音信息和第二语音信息是否为同一个用户,所述第一语音信息为所述语音交互模块601响应于发起语音交互的指示后收到的语音信息;所述第二语音信息为所述语音交互模块601输出针对所述第一语音信息的处理结果后收到的语音信息。
可选的,所述继续对话意愿判断模块602根据所述第一和第二语音信息特征的比较结果,确定所述第二语音信息与所述第一语音信息的发出者是否为同一用户。
可选的,所述语音特征信息为声纹模型信息,如附图6所示,此时,所述继续对话意愿判断模块602包含话者声纹生成单元和比对单元,分别用于获取第一和第二语音信息的声纹以及进行比对,并将比对结果对应为用户对话意愿的判断结果。
可选的,所述继续对话意愿判断模块根据收到第一和第二语音信息时用户的方位或距离信息,判断所述第二语音信息与所述第一语音信息的发出者是否为同一用户。
可选的,所述继续对话意愿判断模块利用红外感应探测所述用户的距离信息,利用麦克风阵列探测所述用户的方位信息,如附图6所示,此时,所述继续对话意愿判断模块包含方位距离获取单元和比对单元,分别用于获取终端收到第一和第二语音信息时用户的方位和距离信息以及进行比对,并将比对结果对应为用户对话意愿的判断结果。
可选的,所述继续对话意愿判断模块根据收到第一和第二语音信息时用户的面部特征信息,判断所述第二语音信息与所述第一语音信息的发出者是否为同一用户。如附图6所示,此时,所述继续对话意愿判断模块包含面部特征生成单元和比对单元,分别用于获取终端收到第一和第二语音信息时候用户的面部特征以及进行比对,并将比对结果对应为用户对话意愿的判断结果。
可选的,所述继续对话意愿判断模块判断所述第二语音信息与所述第一语音信息的发出者为同一用户以后,进一步判断所述用户的面部朝向是否满足预设的阈值。如附图6所示,此时,所述继续对话意愿判断模块包含声源定位单元和人脸检测单元,声源定位单元用于通过麦克风阵列定位用户位置(声源)或者话音方向,人脸检测单元用于检测用户的面部位置,从而计算出用户面部朝向,具体算法可以参考前述方法实施例S2042中的描述,在此不再赘述。通过获取用户的面部朝向并和一定的阈值进行比对,将比对结果对应为用户对话意愿的判断结果。
可选的,在判断所述用户的面部朝向是否满足预设的阈值时,还可以进一步考虑语音交互界面的视觉中心点和摄像头位置的偏移量,根据所述偏移量,确定所述用户的面部朝向是否满足预设的阈值,并将判断结果对应为用户继续对话意愿的判断结果。
可选的,在判断人脸朝向时候,还可以进一步通过唇动检测单元检测用户是否在说话,以进一步确认用户继续对话意愿,例如,有时候可能用户说话声音比较小,没有被终端检测到,但是通过检测到用户有唇动,加上前面的同一用户的判断以及面部朝向识别,可以确认用户确实在进行进一步的对话,则继续保持语音交互状态,避免过早退出。
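将同一用户判断、面部朝向判断和唇动检测组合起来的综合判断逻辑,可以写成如下示意性草图;这只是一种可能的组合方式,各输入均假设由前述各单元给出:

```python
def has_continue_intent(is_same_user: bool,
                        face_orientation_ok: bool,
                        lip_moving: bool,
                        speech_detected: bool) -> bool:
    """综合判断用户是否有继续对话意愿(仅为一种组合方式的示意)。"""
    if not (is_same_user and face_orientation_ok):
        return False
    # 检测到有效语音,或虽然声音较小未被拾取但检测到唇动,均认为用户在继续对话
    return speech_detected or lip_moving
```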
可选的,所述终端还包括语音交互界面呈现模块603,用于在所述终端进入语音交互工作状态后,呈现第一语音交互界面,以及在所述终端输出针对所述第一语音信息的处理结果后,呈现第二语音交互界面,所述第一语音交互界面不同于所述第二语音交互界面,例如第二语音交互界面更加简洁,避免对用户形成干扰。
可以理解的是,上述继续对话意愿判断模块需要的各种信息,可以通过终端自身收集和获取,也可以是通过网络或者线缆连接到相关的设备或者服务器获取;甚至继续对话意愿判断模块本身,也可以通过网络或者线缆连接的设备或者服务器来实现,即终端只作为一个与用户进行语音交互的界面,负责采集语音,图像等用户信息以及负责输出处理后的语音和图像信息,将其他所有功能云化。
由于本申请实施例提供的终端设备用于执行前述所有实施例中的方法,因此其所能获得的技术效果可参考上述方法实施例,在此不再赘述。
图6中的“模块”或者“单元”可以为专用集成电路(Application Specific Integrated Circuit,ASIC)、电子线路、执行一个或多个软件或固件程序的处理器和存储器、组合逻辑电路和其他提供上述功能的组件。所述集成的单元或者模块如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。
请参阅图7,是本申请实施例提供的一种终端设备700的结构示意图。该结构包括处理器701、存储器702、收发器703、显示器704以及检测器705(麦克风,或进一步包括摄像头,红外检测器件等)。处理器701连接到存储器702和收发器703,例如处理器701可以通过总线连接到存储器702和收发器703。
处理器701可以被配置为支持终端设备700执行前述实施例中相应的功能。该处理器701可以是中央处理器(英文:central processing unit,CPU),网络处理器(英文:network processor,NP),硬件芯片或者其任意组合。上述硬件芯片可以是专用集成电路(英文:application-specific integrated circuit,ASIC),可编程逻辑器件(英文:programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(英文:complex programmable logic device,CPLD),现场可编程逻辑门阵列(英文:field-programmable gate array,FPGA),通用阵列逻辑(英文:generic array logic,GAL)或其任意组合。
存储器702用于存储程序代码等。存储器702可以包括易失性存储器(英文:volatile memory),例如随机存取存储器(英文:random access memory,缩写:RAM);存储器702也可以包括非易失性存储器(英文:non-volatile memory),例如只读存储器(英文:read-only memory,缩写:ROM),快闪存储器(英文:flash memory),硬盘(英文:hard disk drive,缩写:HDD)或固态硬盘(英文:solid-state drive,缩写:SSD);存储器702还可以包括上述种类的存储器的组合。
检测器705包括麦克风等音频拾取设备,用于将用户发出的语音信息(如第一或者第二语音信息)发送给处理器处理或者进行声场定位;还可以包含摄像头,红外感应等测距装置,将用户相关信息(人脸,距离,方位等)采集并发送给处理器701处理;
收发器703(可选)可以是通信模块、收发电路,用于实现前述实施例中终端设备与各个服务器等其他网络单元之间可能的数据、信令等信息的传输。
处理器701可以调用所述程序代码以执行如图2-图5所述方法实施例中的操作。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本发明实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者通过所述计算机可读存储介质进行传输。所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质。例如,可以利用磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk(SSD))来存储或传输所述计算机指令。
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以权利要求的保护范围为准。

Claims (21)

  1. 一种语音交互的方法,其特征在于,所述方法包括:
    终端检测到发起语音交互的指示;
    响应于所述发起语音交互的指示,所述终端进入语音交互工作状态;
    所述终端收到第一语音信息,输出针对所述第一语音信息的处理结果;
    所述终端收到第二语音信息,判断所述第二语音信息与所述第一语音信息的发出者是否为同一用户;
    如果判断为同一用户,则所述终端输出响应于所述第二语音信息的处理结果;
    如果判断为不同用户,则所述终端结束所述语音交互工作状态。
  2. 如权利要求1中所述的方法,其特征在于,所述终端判断所述第二语音信息与所述第一语音信息的发出者是否为同一用户,包括:
    所述终端收到第一和第二语音信息时,分别获取所述第一和第二语音信息的特征;
    所述终端根据所述第一和第二语音信息特征的比较结果,确定所述第二语音信息与所述第一语音信息的发出者是否为同一用户。
  3. 如权利要求1中所述的方法,其特征在于,所述语音特征信息为声纹模型信息。
  4. 如权利要求1中所述的方法,其特征在于,所述终端判断所述第二语音信息与所述第一语音信息的发出者是否为同一用户,包括:
    所述终端分别获取收到第一和第二语音信息时用户的方位或者距离信息;
    所述终端根据所述用户方位或者距离信息,判断所述第二语音信息与所述第一语音信息的发出者是否为同一用户。
  5. 如权利要求4中所述的方法,其特征在于,所述终端利用红外感应探测所述用户的距离信息,根据收到第一和第二语音信息时用户的距离信息确认是否为同一用户;或者
    所述终端利用麦克风阵列探测所述用户的方位信息,根据收到第一和第二语音信息时用户的方位信息确认是否为同一用户。
  6. 如权利要求1中所述的方法,其特征在于,所述终端判断所述第二语音信息与所述第一语音信息的发出者是否为同一用户,包括:
    所述终端分别获取收到第一和第二语音信息时用户的面部特征信息;
    所述终端通过比较所述用户面部特征信息,判断所述第二语音信息与所述第一语音信息的发出者是否为同一用户。
  7. 如权利要求1-6中任一项所述的方法,其特征在于,所述方法还包括,判断所述第二语音信息与所述第一语音信息的发出者为同一用户以后,所述终端进一步判断所述用户的面部朝向是否满足预设的阈值,满足预设的阈值后,所述终端输出针对所述第二语音信息的处理结果,否则所述终端结束所述语音交互工作状态。
  8. 如权利要求7中所述的方法,其特征在于,所述判断所述用户的面部朝向是否满足预设的阈值,包括:确定语音交互界面的视觉中心点和摄像头位置的偏移量,根据所述偏移量,确定所述用户的面部朝向是否满足预设的阈值。
  9. 如权利要求1-8中任一项所述的方法,其特征在于,
    所述终端进入语音交互工作状态进一步包括:所述终端呈现第一语音交互界面;
    所述终端输出针对所述第一语音信息的处理结果后,所述终端呈现第二语音交互界面,所述第一语音交互界面不同于所述第二语音交互界面;
    所述终端结束所述语音交互工作状态,包括:所述终端取消所述第二语音交互界面。
  10. 一种实现智能语音交互的终端,其特征在于,所述终端包括:语音交互模块和继续对话意愿判断模块,
    所述语音交互模块,用于实现智能语音交互,根据收到的语音信息,输出针对性的处理结果;
    继续对话意愿判断模块,用于判断收到的第一语音信息和第二语音信息是否为同一个用户,所述第一语音信息为所述语音交互模块响应于发起语音交互的指示后收到的语音信息;所述第二语音信息为所述语音交互模块输出针对所述第一语音信息的处理结果后收到的语音信息。
  11. 如权利要求10所述的终端,其特征在于,所述继续对话意愿判断模块判断收到的第一语音信息和第二语音信息是否为同一个用户,包括:
    所述继续对话意愿判断模块根据所述第一和第二语音信息特征的比较结果,确定所述第二语音信息与所述第一语音信息的发出者是否为同一用户。
  12. 如权利要求11中所述的终端,其特征在于,所述语音特征信息为声纹模型信息。
  13. 如权利要求10中所述的终端,其特征在于,所述继续对话意愿判断模块判断收到的第一语音信息和第二语音信息是否为同一个用户,包括:
    所述继续对话意愿判断模块根据收到第一和第二语音信息时用户的方位或者距离信息,判断所述第二语音信息与所述第一语音信息的发出者是否为同一用户。
  14. 如权利要求13中所述的终端,其特征在于,所述继续对话意愿判断模块利用红外感应探测所述用户的距离信息,根据收到第一和第二语音信息时用户的距离信息确认是否为同一用户;或者,所述继续对话意愿判断模块利用麦克风阵列探测所述用户的方位信息,根据收到第一和第二语音信息时用户的方位信息确认是否为同一用户。
  15. 如权利要求10中所述的终端,其特征在于,所述继续对话意愿判断模块判断收到的第一语音信息和第二语音信息是否为同一个用户,包括:
    所述继续对话意愿判断模块根据收到第一和第二语音信息时用户的面部特征信息,判断所述第二语音信息与所述第一语音信息的发出者是否为同一用户。
  16. 如权利要求10-15中任一项所述的终端,其特征在于,所述继续对话意愿判断模块判断所述第二语音信息与所述第一语音信息的发出者为同一用户以后,进一步判断所述用户的面部朝向是否满足预设的阈值。
  17. 如权利要求16中所述的终端,其特征在于,所述判断所述用户的面部朝向是否满足预设的阈值,包括:确定语音交互界面的视觉中心点和摄像头位置的偏移量,根据所述偏移量,确定所述用户的面部朝向是否满足预设的阈值。
  18. 如权利要求10-17中任一项所述的终端,其特征在于,所述终端还包括语音交互界面呈现模块,用于在所述终端进入语音交互工作状态后,呈现第一语音交互界面,以及在所述终端输出针对所述第一语音信息的处理结果后,呈现第二语音交互界面,所述第一语音交互界面不同于所述第二语音交互界面。
  19. 一种实现智能语音交互的会议系统,其特征在于,所述会议系统包含如权利要求10到17中所述的任一终端以及至少一个服务器,所述终端通过网络与所述至少一个服务器连接,实现智能语音交互,所述服务器包括:声纹识别服务器,人脸识别服务器,语音识别和语义理解服务器,语音合成服务器和会话意愿识别服务器。
  20. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,该程序被处理器执行时实现如权利要求1至9中任一项所述的方法。
  21. 一种实现智能语音交互的终端,包括存储器、处理器、及存储在存储器上并可在处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现如权利要求1至9中任一项所述的方法。
PCT/CN2019/129631 2018-12-29 2019-12-28 一种语音交互方法,设备和系统 WO2020135811A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP19905540.1A EP3896691A4 (en) 2018-12-29 2019-12-28 VOICE INTERACTION METHOD, DEVICE AND SYSTEM
JP2021537969A JP7348288B2 (ja) 2018-12-29 2019-12-28 音声対話の方法、装置、及びシステム
US17/360,015 US20210327436A1 (en) 2018-12-29 2021-06-28 Voice Interaction Method, Device, and System

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811644940.9 2018-12-29
CN201811644940.9A CN111402900B (zh) 2018-12-29 2018-12-29 一种语音交互方法,设备和系统

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/360,015 Continuation US20210327436A1 (en) 2018-12-29 2021-06-28 Voice Interaction Method, Device, and System

Publications (1)

Publication Number Publication Date
WO2020135811A1 true WO2020135811A1 (zh) 2020-07-02

Family

ID=71128858

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/129631 WO2020135811A1 (zh) 2018-12-29 2019-12-28 一种语音交互方法,设备和系统

Country Status (5)

Country Link
US (1) US20210327436A1 (zh)
EP (1) EP3896691A4 (zh)
JP (1) JP7348288B2 (zh)
CN (1) CN111402900B (zh)
WO (1) WO2020135811A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2022069534A (ja) * 2021-04-16 2022-05-11 阿波▲羅▼智▲聯▼(北京)科技有限公司 投影シーンの表示制御方法、装置、設備、媒体及びプログラム製品
WO2023063965A1 (en) * 2021-10-13 2023-04-20 Google Llc Digital signal processor-based continued conversation

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210042520A (ko) * 2019-10-10 2021-04-20 삼성전자주식회사 전자 장치 및 이의 제어 방법
CN111833876A (zh) * 2020-07-14 2020-10-27 科大讯飞股份有限公司 会议发言控制方法、系统、电子设备及存储介质
CN112017629B (zh) * 2020-07-15 2021-12-21 马上消费金融股份有限公司 语音机器人的会话控制方法及设备、存储介质
CN111951795B (zh) * 2020-08-10 2024-04-09 中移(杭州)信息技术有限公司 语音交互方法、服务器、电子设备和存储介质
CN112133296B (zh) * 2020-08-27 2024-05-21 北京小米移动软件有限公司 全双工语音控制方法、装置、存储介质及语音设备
CN112908322A (zh) * 2020-12-31 2021-06-04 思必驰科技股份有限公司 用于玩具车的语音控制方法和装置
CN113314120B (zh) * 2021-07-30 2021-12-28 深圳传音控股股份有限公司 处理方法、处理设备及存储介质
CN113643728B (zh) * 2021-08-12 2023-08-22 荣耀终端有限公司 一种音频录制方法、电子设备、介质及程序产品
CN117746849A (zh) * 2022-09-14 2024-03-22 荣耀终端有限公司 一种语音交互方法、装置及终端
CN115567336B (zh) * 2022-09-28 2024-04-16 四川启睿克科技有限公司 一种基于智慧家居的无唤醒语音控制系统及方法

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104284257A (zh) * 2013-07-10 2015-01-14 通用汽车环球科技运作有限责任公司 用于口头对话服务仲裁的系统和方法
CN105325049A (zh) * 2013-07-19 2016-02-10 三星电子株式会社 用于通信的方法和装置
CN105453025A (zh) * 2013-07-31 2016-03-30 谷歌公司 用于已识别语音发起动作的视觉确认
CN105912092A (zh) 2016-04-06 2016-08-31 北京地平线机器人技术研发有限公司 人机交互中的语音唤醒方法及语音识别装置
US20160344567A1 (en) * 2015-05-22 2016-11-24 Avaya Inc. Multi-channel conferencing
CN108182943A (zh) 2017-12-29 2018-06-19 北京奇艺世纪科技有限公司 一种智能设备控制方法、装置及智能设备
CN108604448A (zh) * 2015-11-06 2018-09-28 谷歌有限责任公司 跨装置的话音命令

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100754384B1 (ko) * 2003-10-13 2007-08-31 삼성전자주식회사 잡음에 강인한 화자위치 추정방법 및 장치와 이를 이용한카메라 제어시스템
US10042993B2 (en) * 2010-11-02 2018-08-07 Homayoon Beigi Access control through multifactor authentication with multimodal biometrics
US9129604B2 (en) * 2010-11-16 2015-09-08 Hewlett-Packard Development Company, L.P. System and method for using information from intuitive multimodal interactions for media tagging
US20120259638A1 (en) * 2011-04-08 2012-10-11 Sony Computer Entertainment Inc. Apparatus and method for determining relevance of input speech
US9098467B1 (en) * 2012-12-19 2015-08-04 Rawles Llc Accepting voice commands based on user identity
US9123340B2 (en) * 2013-03-01 2015-09-01 Google Inc. Detecting the end of a user question
US9460715B2 (en) * 2013-03-04 2016-10-04 Amazon Technologies, Inc. Identification using audio signatures and additional characteristics
JP6819988B2 (ja) 2016-07-28 2021-01-27 国立研究開発法人情報通信研究機構 音声対話装置、サーバ装置、音声対話方法、音声処理方法およびプログラム
US9898082B1 (en) * 2016-11-01 2018-02-20 Massachusetts Institute Of Technology Methods and apparatus for eye tracking
US20180293221A1 (en) * 2017-02-14 2018-10-11 Microsoft Technology Licensing, Llc Speech parsing with intelligent assistant
US11250844B2 (en) * 2017-04-12 2022-02-15 Soundhound, Inc. Managing agent engagement in a man-machine dialog
US10950228B1 (en) * 2017-06-28 2021-03-16 Amazon Technologies, Inc. Interactive voice controlled entertainment
TWI704490B (zh) * 2018-06-04 2020-09-11 和碩聯合科技股份有限公司 語音控制裝置及方法
EP4036910A1 (en) * 2018-08-21 2022-08-03 Google LLC Dynamic and/or context-specific hot words to invoke automated assistant

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104284257A (zh) * 2013-07-10 2015-01-14 通用汽车环球科技运作有限责任公司 用于口头对话服务仲裁的系统和方法
CN105325049A (zh) * 2013-07-19 2016-02-10 三星电子株式会社 用于通信的方法和装置
CN105453025A (zh) * 2013-07-31 2016-03-30 谷歌公司 用于已识别语音发起动作的视觉确认
US20160344567A1 (en) * 2015-05-22 2016-11-24 Avaya Inc. Multi-channel conferencing
CN108604448A (zh) * 2015-11-06 2018-09-28 谷歌有限责任公司 跨装置的话音命令
CN105912092A (zh) 2016-04-06 2016-08-31 北京地平线机器人技术研发有限公司 人机交互中的语音唤醒方法及语音识别装置
CN108182943A (zh) 2017-12-29 2018-06-19 北京奇艺世纪科技有限公司 一种智能设备控制方法、装置及智能设备

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2022069534A (ja) * 2021-04-16 2022-05-11 阿波▲羅▼智▲聯▼(北京)科技有限公司 投影シーンの表示制御方法、装置、設備、媒体及びプログラム製品
JP7318043B2 (ja) 2021-04-16 2023-07-31 阿波▲羅▼智▲聯▼(北京)科技有限公司 投影シーンの表示制御方法、装置、設備、媒体及びプログラム製品
US11955039B2 (en) 2021-04-16 2024-04-09 Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Method and apparatus for controlling display in a screen projection scenario, device and program product
WO2023063965A1 (en) * 2021-10-13 2023-04-20 Google Llc Digital signal processor-based continued conversation

Also Published As

Publication number Publication date
US20210327436A1 (en) 2021-10-21
JP2022516491A (ja) 2022-02-28
CN111402900B (zh) 2024-04-23
EP3896691A4 (en) 2022-07-06
CN111402900A (zh) 2020-07-10
JP7348288B2 (ja) 2023-09-20
EP3896691A1 (en) 2021-10-20

Similar Documents

Publication Publication Date Title
WO2020135811A1 (zh) 一种语音交互方法,设备和系统
US10930303B2 (en) System and method for enhancing speech activity detection using facial feature detection
US10733987B1 (en) System and methods for providing unplayed content
US11922095B2 (en) Device selection for providing a response
KR101726945B1 (ko) 수동 시작/종료 포인팅 및 트리거 구문들에 대한 필요성의 저감
US9940949B1 (en) Dynamic adjustment of expression detection criteria
WO2020076779A1 (en) System and method for managing a mute button setting for a conference call
WO2014120291A1 (en) System and method for improving voice communication over a network
WO2021031308A1 (zh) 音频处理方法、装置及存储介质
TW202223877A (zh) 用戶話音輪廓管理
CN113779208A (zh) 用于人机对话的方法和装置
US11909786B2 (en) Systems and methods for improved group communication sessions
US20230282224A1 (en) Systems and methods for improved group communication sessions
JPWO2019093123A1 (ja) 情報処理装置および電子機器
KR102134860B1 (ko) 인공지능 스피커 및 이의 비언어적 요소 기반 동작 활성화 방법
CN108942926B (zh) 一种人机交互的方法、装置和系统
CN111968680A (zh) 一种语音处理方法、装置及存储介质
US20230230587A1 (en) Dynamic adaptation of parameter set used in hot word free adaptation of automated assistant
US11659332B2 (en) Estimating user location in a system including smart audio devices
KR20240036701A (ko) 컨텍스트 신호들에 기초하는 참여 상태 보존
KR20240033006A (ko) 소프트 핫워드로 자동 스피치 인식
Zhang et al. Fusing array microphone and stereo vision for improved computer interfaces

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19905540; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2021537969; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2019905540; Country of ref document: EP; Effective date: 20210714)