WO2016127550A1 - Human-machine voice interaction method and apparatus - Google Patents

Human-machine voice interaction method and apparatus

Info

Publication number
WO2016127550A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
result
server
broadcast
voice recognition
Prior art date
Application number
PCT/CN2015/083207
Other languages
English (en)
French (fr)
Inventor
陈本东
谢文
Original Assignee
百度在线网络技术(北京)有限公司 (Baidu Online Network Technology (Beijing) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百度在线网络技术(北京)有限公司 (Baidu Online Network Technology (Beijing) Co., Ltd.)
Publication of WO2016127550A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 — Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 — Sound input; Sound output

Definitions

  • The present invention relates to the field of Internet technologies, and in particular, to a human-machine voice interaction method and apparatus.
  • Speech recognition and human-computer voice interaction have a long history.
  • Existing voice assistant applications (Application; hereinafter referred to as APP) operate by triggering recording with a button; after recording is completed, the machine broadcasts the answer, and while the answer is being broadcast, recording is not possible. That is to say, existing voice assistant APPs can only perform half-duplex communication: while the machine broadcasts, the user cannot speak, and while the user speaks, the machine cannot broadcast.
  • Some voice assistant APPs provide an automatic answer mode, in which the machine automatically enters the recording state after finishing its broadcast. In this mode, however, the machine sometimes switches automatically and sometimes does not, which leaves the user at a loss.
  • The existing human-machine voice interaction mode is therefore very inconvenient to use: every question and answer requires user intervention, the operation is cumbersome, the interaction is unnatural, and the user experience is poor.
  • The object of the present invention is to solve, at least to some extent, one of the technical problems in the related art.
  • A first object of the present invention is to propose a human-machine voice interaction method. With this method, voice broadcast and the user's voice input can proceed simultaneously, so the human-computer interaction process does not need to switch repeatedly between recording and broadcasting; full-duplex human-computer interaction is realized, which in turn makes multi-round dialogue more coherent.
  • A second object of the present invention is to provide a human-machine voice interaction device.
  • The human-machine voice interaction method of the first aspect of the present invention includes: while the terminal performs voice broadcast of a broadcast result sent by the voice recognition server, receiving a voice recognition result sent by the voice recognition server, the voice recognition result being sent by the voice recognition server after recognizing voice input by a user of the terminal; sending the voice recognition result to the keyword understanding server for context understanding, and receiving and saving the result of the context understanding sent by the keyword understanding server; determining the intent of the voice input by the user according to the saved context understanding result, and generating a broadcast result according to the intent; and sending the broadcast result to the voice recognition server, so that the voice recognition server sends the broadcast result to the terminal for voice broadcast.
  • In the human-machine voice interaction method of this embodiment, while the terminal performs voice broadcast of the broadcast result sent by the voice recognition server, the voice recognition result sent by the voice recognition server may be received, the intent of the voice input by the user determined according to that result, a broadcast result generated according to the intent, and the broadcast result sent to the voice recognition server, which sends it to the terminal for voice broadcast. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously: the two modes of recording and broadcasting need not be switched repeatedly, the full-duplex communication mode of human-computer interaction is realized, and multi-round dialogue becomes more coherent.
  • The human-machine voice interaction method of the second aspect of the present invention includes: while the terminal performs voice broadcast of a broadcast result sent by the voice recognition server, receiving voice sent by the terminal, the voice being input to the terminal by a user of the terminal; recognizing the voice and sending the voice recognition result to the multi-round dialogue server, so that the multi-round dialogue server sends the voice recognition result to the keyword understanding server for context understanding, receives and saves the result of the context understanding sent by the keyword understanding server, determines the intent of the voice input by the user according to the saved context understanding result, and generates a broadcast result according to the intent; and receiving the broadcast result sent by the multi-round dialogue server and sending the broadcast result to the terminal for voice broadcast.
  • In the human-machine voice interaction method of this embodiment, after receiving the voice sent by the terminal, the voice recognition server recognizes the voice and sends the voice recognition result to the multi-round dialogue server, so that the multi-round dialogue server determines the intent of the voice input by the user according to the voice recognition result and generates a broadcast result according to the intent; the voice recognition server then receives the broadcast result sent by the multi-round dialogue server and sends it to the terminal for voice broadcast. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the two states of recording and broadcasting need not be switched repeatedly, the full-duplex communication mode of human-computer interaction is realized, and multi-round dialogue becomes more coherent.
  • A human-machine voice interaction method of a further aspect includes: while the terminal performs voice broadcast of a broadcast result sent by the voice recognition server, receiving voice input by a user of the terminal; sending the voice input by the user to the voice recognition server, so that the voice recognition server recognizes the voice and sends the voice recognition result to the multi-round dialogue server, whereupon the multi-round dialogue server sends the voice recognition result to the keyword understanding server for context understanding, receives and saves the result of the context understanding sent by the keyword understanding server, determines the intent of the voice input by the user according to the saved context understanding result, and generates a broadcast result according to the intent; and receiving and broadcasting the broadcast result sent by the voice recognition server, the broadcast result having been sent to the voice recognition server by the multi-round dialogue server.
  • In the human-machine voice interaction method of this embodiment, while voice broadcast of the broadcast result sent by the voice recognition server is in progress, the terminal receives the voice input by the user and sends it to the voice recognition server, so that the voice recognition server recognizes the voice and sends the voice recognition result to the multi-round dialogue server, which determines the intent of the voice input by the user and generates the broadcast result according to the intent; the terminal then receives and broadcasts the broadcast result sent by the voice recognition server. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the two states of recording and broadcasting need not be switched repeatedly, the full-duplex communication mode of human-computer interaction is realized, and multi-round dialogue becomes more coherent.
  • A human-machine voice interaction apparatus includes: a receiving module, configured to receive, while the terminal performs voice broadcast of a broadcast result sent by the voice recognition server, a voice recognition result sent by the voice recognition server, the voice recognition result being sent by the voice recognition server after recognizing voice input by a user of the terminal, and to receive, after the sending module sends the voice recognition result to the keyword understanding server for context understanding, the result of the context understanding sent by the keyword understanding server; a sending module, configured to send the voice recognition result received by the receiving module to the keyword understanding server for context understanding; a saving module, configured to save the result of the context understanding received by the receiving module; a determining module, configured to determine the intent of the voice input by the user according to the result of the context understanding saved by the saving module; and a generating module, configured to generate a broadcast result according to the intent determined by the determining module; the sending module is further configured to send the broadcast result generated by the generating module to the voice recognition server.
  • In this apparatus, the receiving module may receive the voice recognition result sent by the voice recognition server, the determining module determines the intent of the voice input by the user according to the voice recognition result, the generating module generates a broadcast result according to the intent determined by the determining module, and the sending module then sends the broadcast result to the voice recognition server, which sends it to the terminal for voice broadcast; in this way, during human-machine voice interaction, voice broadcast and the user's voice input are carried out at the same time.
  • A human-machine voice interaction apparatus includes: a receiving module, configured to receive, while the terminal performs voice broadcast of a broadcast result sent by the voice recognition server, voice sent by the terminal, the voice being input to the terminal by a user of the terminal, and to receive, after the sending module sends the voice recognition result to the multi-round dialogue server, the broadcast result sent by the multi-round dialogue server; a recognition module, configured to recognize the voice received by the receiving module; and a sending module, configured to send the voice recognition result obtained by the recognition module to the multi-round dialogue server, so that the multi-round dialogue server sends the voice recognition result to the keyword understanding server for context understanding, receives and saves the result of the context understanding sent by the keyword understanding server, determines the intent of the voice input by the user according to the saved context understanding result, and generates a broadcast result according to the intent; after the receiving module receives the broadcast result sent by the multi-round dialogue server, the sending module sends the broadcast result to the terminal for voice broadcast.
  • In this apparatus, the recognition module recognizes the voice received by the receiving module, and the sending module sends the voice recognition result to the multi-round dialogue server, so that the multi-round dialogue server determines the intent of the voice input by the user according to the voice recognition result and generates a broadcast result according to the intent; the receiving module then receives the broadcast result sent by the multi-round dialogue server, and the sending module sends the broadcast result to the terminal for voice broadcast. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the two states of recording and broadcasting need not be switched repeatedly, the full-duplex communication mode of human-computer interaction is realized, and multi-round dialogue becomes more coherent.
  • A human-machine voice interaction apparatus includes: a receiving module, configured to receive, while the terminal performs voice broadcast of a broadcast result sent by the voice recognition server, voice input by a user of the terminal, and to receive, after the sending module sends the voice to the voice recognition server, the broadcast result sent by the voice recognition server, the broadcast result having been sent to the voice recognition server by the multi-round dialogue server; a sending module, configured to send the voice received by the receiving module to the voice recognition server, so that the voice recognition server recognizes the voice and sends the voice recognition result to the multi-round dialogue server, whereupon the voice recognition result is sent to the keyword understanding server for context understanding, the result of the context understanding sent by the keyword understanding server is received and saved, the intent of the voice input by the user is determined according to the saved context understanding result, and a broadcast result is generated according to the intent; and a broadcast module, configured to broadcast the broadcast result received by the receiving module.
  • In this apparatus, the receiving module receives the voice input by the user of the terminal, the sending module sends the voice input by the user to the voice recognition server, and the receiving module then receives the broadcast result sent by the voice recognition server, which the broadcast module broadcasts. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, there is no need to repeatedly switch between recording and broadcast states, the full-duplex communication mode of human-computer interaction is realized, and multi-round dialogue becomes more coherent.
  • FIG. 1 is a flow chart of an embodiment of a human-machine voice interaction method according to the present invention.
  • FIG. 2 is a flow chart of another embodiment of a human-machine voice interaction method according to the present invention.
  • FIG. 3 is a flow chart of still another embodiment of a human-machine voice interaction method according to the present invention.
  • FIG. 4 is a schematic diagram of an embodiment of a connection relationship in a human-machine voice interaction method according to the present invention.
  • FIG. 5 is a schematic structural diagram of an embodiment of a human-machine voice interaction device according to the present invention.
  • FIG. 6 is a schematic structural diagram of another embodiment of a human-machine voice interaction device according to the present invention.
  • FIG. 7 is a schematic structural diagram of still another embodiment of a human-machine voice interaction device according to the present invention.
  • FIG. 1 is a flowchart of an embodiment of a human-machine voice interaction method according to the present invention. As shown in FIG. 1 , the human-machine voice interaction method may include:
  • Step 101: While the terminal performs voice broadcast of a broadcast result sent by the voice recognition server, receive the voice recognition result sent by the voice recognition server, the voice recognition result being sent by the voice recognition server after recognizing voice input by the user of the terminal.
  • During the voice broadcast, the user of the terminal may continue to input voice; that is, while broadcasting the broadcast result, the terminal still receives the voice input by the user and continuously transmits it to the voice recognition server for voice recognition. The voice recognition server then continuously transmits voice recognition results to the multi-round dialogue server, and the multi-round dialogue server continuously receives them. Therefore, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, and the two states of recording and broadcasting need not be switched repeatedly.
  • Receiving the voice recognition result sent by the voice recognition server may be: receiving a voice recognition result that the voice recognition server sends only after determining that the result has reached a predetermined confidence level.
  • The predetermined confidence level may be set in a specific implementation; its value is not limited in this embodiment.
  • Because the voice recognition server continuously recognizes the voice sent by the terminal while the user inputs voice, when the voice recognition server determines that an obtained voice recognition result has reached the predetermined confidence level, it sends that result to the multi-round dialogue server, so that the multi-round dialogue server performs the subsequent steps 102 to 104 to determine the intent of the voice input by the user and thereby generate an effective broadcast result, which is sent to the terminal for voice broadcast. That is, when the terminal receives the broadcast result, it can interrupt the user's voice input and directly broadcast the obtained result to the user.
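The confidence gate described in step 101 can be sketched as follows. This is an illustrative Python sketch, not code from the patent; the threshold value and the hypothesis fields (`text`, `confidence`) are assumptions.

```python
# Sketch of the confidence gate: the voice recognition server forwards
# a hypothesis to the multi-round dialogue server only once it reaches
# a predetermined confidence level.

CONFIDENCE_THRESHOLD = 0.8  # the "predetermined confidence"; tunable per deployment

def should_forward(hypothesis: dict) -> bool:
    """Return True when a recognition hypothesis is confident enough
    to be sent on for intent determination."""
    return hypothesis["confidence"] >= CONFIDENCE_THRESHOLD

def forward_confident(hypotheses):
    """Filter a stream of partial recognition results, keeping only
    the texts that reached the predetermined confidence level."""
    return [h["text"] for h in hypotheses if should_forward(h)]
```

In a real system the hypotheses would arrive as a stream of partial decoding results, and the gate decides when a partial result is stable enough to act on.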
  • Step 102: Send the voice recognition result to the keyword understanding (Query Understand; hereinafter referred to as QU) server for context understanding, and receive and save the result of the context understanding sent by the QU server.
  • Step 103: Determine the intent of the voice input by the user according to the saved context understanding result, and generate a broadcast result according to the intent.
  • Specifically, the multi-round dialogue server clarifies the intent of the voice input by the user according to the saved context understanding result, and then generates the broadcast result directly according to that intent.
  • Generating the broadcast result according to the intent may be: acquiring information corresponding to the intent from the resource access server, and generating the broadcast result according to the acquired information.
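The intent-to-broadcast step can be illustrated with a minimal sketch. The intent names and the in-memory table standing in for the resource access server are hypothetical, chosen only to show the flow.

```python
# Illustrative sketch of step 103: map the user's determined intent to
# information fetched from a resource access server, then use it as the
# text to be sent to the terminal for voice broadcast.

RESOURCES = {  # stand-in for the resource access server
    "weather": "Sunny, 22 degrees in Beijing today.",
    "time": "It is 3 o'clock in the afternoon.",
}

def generate_broadcast(intent: str) -> str:
    """Fetch information matching the intent and return the broadcast text."""
    info = RESOURCES.get(intent)
    if info is None:
        return "Sorry, I did not understand that."
    return info
```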
  • Step 104: Send the broadcast result to the voice recognition server, so that the voice recognition server sends the broadcast result to the terminal for voice broadcast.
  • Further, content suitable for the user can be obtained according to the user information and the user's current state, a cloud push service can be triggered, and the content can be sent to the terminal through the cloud push service to initiate a dialogue with the terminal. That is, the multi-round dialogue server has learning ability: based on the user's information (for example, the user's schedule and/or songs the user has heard) and the user's current state (for example, the current location and/or the current conversation content), it can analyze the user's thoughts and wishes, obtain content suitable for recommendation to the user, trigger the cloud push service, send that content to the terminal, and initiate a dialogue with the terminal.
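The proactive "cloud push" behaviour can be sketched as a simple rule over user information and current state. The rule set below is invented purely for illustration; a real system would use learned models rather than hand-written conditions.

```python
# Hedged sketch of the cloud-push decision: combine stored user
# information with the user's current state to pick content worth
# pushing to the terminal unprompted, or None when nothing fits.

def pick_push_content(user_info: dict, current_state: dict):
    """Return a recommendation string, or None when nothing fits."""
    if current_state.get("location") == "commuting" and user_info.get("liked_songs"):
        return "Would you like to hear " + user_info["liked_songs"][0] + "?"
    if "meeting" in user_info.get("schedule", []):
        return "Reminder: you have a meeting coming up."
    return None
```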
  • the subsequent dialog process is the same as that described in steps 101 to 104, and will not be described again.
  • In the human-machine voice interaction method of this embodiment, the multi-round dialogue server may receive the voice recognition result sent by the voice recognition server, determine the intent of the voice input by the user according to the voice recognition result, generate a broadcast result according to the intent, and send the broadcast result to the voice recognition server, which sends it to the terminal for voice broadcast. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, there is no need to repeatedly switch between recording and broadcast states, the full-duplex communication mode of human-computer interaction is realized, and multi-round dialogue becomes more coherent.
  • As shown in FIG. 2, the human-machine voice interaction method may include:
  • Step 201: While the terminal performs voice broadcast of a broadcast result sent by the voice recognition server, receive the voice sent by the terminal, the voice being input to the terminal by the user of the terminal.
  • That is, while the terminal performs voice broadcast, the voice recognition server may still receive the voice sent by the terminal; during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, so there is no need to repeatedly switch between the two states of recording and broadcasting.
  • Step 202: Recognize the voice and send the voice recognition result to the multi-round dialogue server, so that the multi-round dialogue server sends the voice recognition result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intent of the voice input by the user according to the saved context understanding result, and generates the broadcast result according to the intent.
  • Recognizing the voice includes determining the start and end of each sentence in the voice by a silence detection technique.
  • That is, the speech recognition server can perform sentence segmentation: it can determine the start and end of each sentence in the speech.
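A minimal energy-based sketch of such silence-detection segmentation is shown below. The patent does not specify the algorithm; the frame-energy threshold and the number of silent frames that end a sentence are illustrative parameters.

```python
# Energy-based silence detection: a frame is "silent" below a threshold,
# and a long-enough run of silent frames ends the current sentence.

SILENCE_THRESHOLD = 0.1   # frame energy below this counts as silence
MIN_SILENCE_FRAMES = 3    # this many silent frames in a row ends a sentence

def segment_sentences(frame_energies):
    """Return (start, end) frame-index pairs, one per detected sentence;
    `end` is exclusive, pointing just past the last voiced frame."""
    sentences, start, silent_run = [], None, 0
    for i, energy in enumerate(frame_energies):
        if energy >= SILENCE_THRESHOLD:
            if start is None:
                start = i        # sentence begins at first voiced frame
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= MIN_SILENCE_FRAMES:
                sentences.append((start, i - silent_run + 1))
                start, silent_run = None, 0
    if start is not None:        # sentence still open at end of audio
        sentences.append((start, len(frame_energies)))
    return sentences
```

Production systems typically use a trained voice-activity detector rather than a fixed energy threshold, but the segmentation logic is the same in spirit.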
  • Sending the voice recognition result to the multi-round dialogue server may be: after determining that an obtained voice recognition result has reached a predetermined confidence level, sending the voice recognition result that reached the predetermined confidence level to the multi-round dialogue server.
  • The predetermined confidence level may be set in a specific implementation; its value is not limited in this embodiment.
  • Because the voice recognition server continuously recognizes the voice sent by the terminal while the user inputs voice, when the voice recognition server determines that an obtained voice recognition result has reached the predetermined confidence level, it sends that result to the multi-round dialogue server, so that the multi-round dialogue server determines the intent of the voice input by the user in the manner described in steps 102 to 104 of the embodiment shown in FIG. 1, generates an effective broadcast result, and sends it to the terminal for voice broadcast. That is, when the terminal receives the broadcast result, it can interrupt the user's voice input and directly broadcast the obtained result to the user.
  • Step 203: Receive the broadcast result sent by the multi-round dialogue server, and send the broadcast result to the terminal for voice broadcast.
  • In the human-machine voice interaction method of this embodiment, while the terminal performs voice broadcast of the broadcast result sent by the voice recognition server, the voice recognition server receives the voice sent by the terminal, recognizes it, and sends the voice recognition result to the multi-round dialogue server, so that the multi-round dialogue server determines the intent of the voice input by the user according to the voice recognition result and generates a broadcast result according to the intent; the voice recognition server then receives the broadcast result sent by the multi-round dialogue server and sends it to the terminal for voice broadcast. Thus, voice broadcast and the user's voice input proceed simultaneously, the two modes of recording and broadcasting need not be switched repeatedly, the full-duplex communication mode of human-computer interaction is realized, and multi-round dialogue becomes more coherent.
  • FIG. 3 is a flowchart of still another embodiment of a human-machine voice interaction method according to the present invention.
  • As shown in FIG. 3, the human-machine voice interaction method may include:
  • Step 301: While the terminal performs voice broadcast of a broadcast result sent by the voice recognition server, receive the voice input by the user of the terminal.
  • Receiving the voice input by the user of the terminal may be: while the terminal used by the user broadcasts the broadcast result sent by the voice recognition server, using echo cancellation technology to remove the terminal's own text-to-speech (Text To Speech; hereinafter referred to as TTS) output from the captured audio, so that only the voice input by the user is received.
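The idea behind this echo cancellation can be sketched very simply: the terminal knows the TTS signal it is playing, so it can subtract an estimate of that reference from the microphone capture and keep only the user's voice. Real acoustic echo cancellers use adaptive filters that track the room's echo path; the fixed echo gain below is purely illustrative.

```python
# Greatly simplified echo-cancellation sketch: subtract a scaled copy
# of the known TTS reference signal from the microphone samples.

ECHO_GAIN = 0.5  # assumed attenuation of the played TTS in the mic signal

def cancel_echo(mic_samples, tts_reference):
    """Subtract the scaled TTS reference from the microphone capture,
    leaving (ideally) only the user's voice."""
    return [m - ECHO_GAIN * t for m, t in zip(mic_samples, tts_reference)]
```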
  • That is, during the voice broadcast, the user can still input voice to the terminal: the user can interrupt the terminal's voice broadcast by speaking, or directly give feedback on the broadcast result, which affects the terminal's next broadcast content. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, and there is no need to repeatedly switch between recording and broadcast states.
  • Step 302: Send the voice input by the user to the voice recognition server, so that the voice recognition server recognizes the voice and sends the voice recognition result to the multi-round dialogue server, whereupon the multi-round dialogue server sends the voice recognition result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intent of the voice input by the user according to the saved context understanding result, and generates the broadcast result according to the intent.
  • Sending the voice input by the user to the voice recognition server may be: sending voice of a predetermined length input by the user to the voice recognition server.
  • The predetermined length may be set in a specific implementation; its size is not limited in this embodiment.
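The predetermined-length option amounts to slicing the user's audio stream into fixed-size chunks before uploading each chunk. The chunk size below is a placeholder; in practice it would correspond to a fixed duration of audio.

```python
# Sketch of sending "voice of a predetermined length": slice the audio
# stream into fixed-size chunks for upload to the recognition server.

CHUNK_SAMPLES = 4  # the "predetermined length", e.g. a few hundred ms of audio

def chunk_audio(samples):
    """Return successive fixed-length chunks; the final chunk may be shorter."""
    return [samples[i:i + CHUNK_SAMPLES] for i in range(0, len(samples), CHUNK_SAMPLES)]
```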
  • Alternatively, sending the voice input by the user to the voice recognition server may be: determining, by silence detection technology, the start and end of each sentence in the voice input by the user, and sending only the recording that contains voice to the voice recognition server.
  • That is, a predetermined length can be set and voice segments of that length sent to the voice recognition server; or, because the user may pause while inputting voice, the start and end of each sentence in the user's voice can be determined by silence detection and only the recording containing voice sent to the voice recognition server. The voice recognition server then recognizes the voice and sends the voice recognition result to the multi-round dialogue server, which sends it to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intent of the voice input by the user according to the saved context understanding result, and generates the broadcast result according to the intent.
  • After that, the multi-round dialogue server sends the broadcast result to the voice recognition server, and the voice recognition server sends it to the terminal. Upon receiving the broadcast result, the terminal can interrupt the user's voice input and perform voice broadcast of the broadcast result.
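The terminal's priority rule here, where a ready broadcast result pre-empts an ongoing recording, can be captured in a tiny decision function. The state model (a `recording` flag plus an optional pending result) is invented for illustration.

```python
# Sketch of the terminal's decision when a broadcast result arrives:
# a ready result takes priority over an ongoing recording.

def next_action(recording: bool, broadcast_result):
    """Decide the terminal's next action: play a result, keep recording,
    or stay idle."""
    if broadcast_result is not None:
        return ("broadcast", broadcast_result)
    if recording:
        return ("record", None)
    return ("idle", None)
```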
  • Step 303: Receive and broadcast the broadcast result sent by the voice recognition server.
  • the broadcast result sent by the voice recognition server is sent by the multi-round dialogue server to the voice recognition server.
  • In the human-machine voice interaction method of this embodiment, the terminal receives the voice input by the user and sends it to the voice recognition server, so that the voice recognition server recognizes the voice and sends the voice recognition result to the multi-round dialogue server, which determines the intent of the voice input by the user according to the voice recognition result and generates the broadcast result according to the intent; the terminal then receives and broadcasts the broadcast result. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the two states of recording and broadcasting need not be switched repeatedly, the full-duplex communication mode of human-computer interaction is realized, and multi-round dialogue becomes more coherent.
  • The connection relationship between the terminal, the voice recognition server, the multi-round dialogue server, the QU server, and the resource access server may be as shown in FIG. 4.
  • FIG. 4 is a schematic diagram of an embodiment of a connection relationship in a human-machine voice interaction method according to the present invention.
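The FIG. 4 topology can be walked through with a toy simulation: terminal to voice recognition server to multi-round dialogue server to QU server, and back out to the terminal as a broadcast result. Every component below is a stub with invented behaviour; only the message flow mirrors the description above.

```python
# Toy end-to-end walk through the component chain. The "recognition"
# is just lower-casing a string, and the QU server guesses a topic by
# keyword; only the hand-off order reflects the patent's architecture.

def qu_server(text, context):
    """Context-understanding stub: save the utterance, guess a topic."""
    context.append(text)
    return "weather" if "weather" in text else "chitchat"

def dialogue_server(recognized_text, context):
    """Determine intent via the QU server and build a broadcast result."""
    intent = qu_server(recognized_text, context)
    return f"[{intent}] response to: {recognized_text}"

def speech_recognition_server(audio, context):
    """Recognition stub: 'decode' the audio, then hand off to the dialogue server."""
    text = audio.lower()
    return dialogue_server(text, context)

def terminal(user_audio, context):
    """Upload the user's voice and return the result that comes back for broadcast."""
    return speech_recognition_server(user_audio, context)
```

Note how the saved `context` list plays the role of the saved context-understanding results: it accumulates across turns, which is what lets the dialogue server resolve intent over multiple rounds.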
  • referring to FIG. 4, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, the terminal receives the voice input by the user using the terminal.
  • the user can still input voice to the terminal during the broadcast; that is, the user can interrupt the terminal's voice broadcast by inputting voice, or directly give feedback on the broadcast result, so that the following two dialogue scenarios can be implemented.
  • Dialogue scenario 1: the user interrupts the terminal's voice broadcast
  • Dialogue scenario 2: the user gives feedback on the terminal's voice broadcast
  • then, the terminal sends the voice input by the user to the voice recognition server, the voice recognition server recognizes the voice and sends the voice recognition result to the multi-round dialogue server, and the multi-round dialogue server sends the voice recognition result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intention of the voice input by the user according to the saved context understanding result, and generates a broadcast result according to the intention.
  • since the user sometimes inputs overly long voice, often a description of details, a predetermined length can be set: when the voice input by the user reaches that length, the predetermined length of voice is sent to the voice recognition server. Alternatively, because the user may pause while inputting voice, the start and end of each sentence in the user's voice may be determined by silence detection, and only the recording containing speech is sent to the voice recognition server.
  • this enables the speech recognition server to recognize the speech and send the speech recognition result to the multi-round dialogue server.
  • alternatively, since the voice recognition server continuously recognizes the voice transmitted by the terminal while the user inputs voice, when the voice recognition server determines that an obtained voice recognition result has reached a predetermined confidence level, it sends that voice recognition result to the multi-round dialogue server.
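The confidence gate described above can be sketched as follows: partial recognition results keep arriving, and only the first one at or above the predetermined confidence is forwarded to the multi-round dialogue server. The function name and the threshold value of 0.8 are illustrative assumptions; the patent leaves the confidence level implementation-defined.

```python
def forward_if_confident(partial_results, threshold=0.8):
    """Return the first recognition result whose confidence reaches the
    predetermined threshold, simulating forwarding to the dialogue server.

    partial_results: iterable of (text, confidence) pairs, ordered as the
    recognizer refines its hypothesis over time.
    """
    for text, confidence in partial_results:
        if confidence >= threshold:
            return text   # forwarded to the multi-round dialogue server
    return None           # keep listening; nothing forwarded yet
```

Gating on confidence is what lets the server send back a "valid" broadcast result that may interrupt the user, rather than reacting to every unstable partial hypothesis.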
  • the multi-round dialogue server then sends the voice recognition result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intention of the voice input by the user according to the saved result, and generates a broadcast result according to that intention.
  • the multi-round dialogue server sends the broadcast result to the voice recognition server, and the voice recognition server sends the broadcast result to the terminal.
  • at this point, the terminal can interrupt the user's voice input and broadcast the broadcast result by voice, so that the following dialogue scenario can be realized.
  • Dialogue scenario 3: the terminal interrupts the user's voice input
  • in addition, the multi-round dialogue server has learning ability: it can analyze the user's thoughts and wishes based on the user's information (for example, the user's schedule and/or songs the user has listened to) and the user's current state (for example, current location and/or current conversation content), obtain content suitable for recommending to the user, and then trigger the cloud push service, which sends the recommended content to the terminal and initiates a dialogue with the terminal, so that the following dialogue scenario can be realized.
  • Dialogue scenario 4: recommending taxi information to the user according to the user's schedule
  • Terminal: You booked a ticket for 4 pm today, and it is now 2 pm. Shall I book a taxi for you?
  • in the present invention, while the terminal voice-broadcasts the broadcast result, the user can still input voice to the terminal; the terminal sends the voice to the voice recognition server for recognition, the voice recognition server sends the voice recognition result to the multi-round dialogue server, and the multi-round dialogue server sends the voice recognition result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intention of the voice input by the user according to the saved result, and then generates a broadcast result according to that intention and returns it to the terminal for voice broadcast. The following five states can be implemented:
  • 1. The terminal keeps broadcasting. In this state, the voice input by the user may be "Aha" or "Interesting".
  • 2. The terminal stops the current broadcast and ends the current topic. In this state, the voice input by the user may be "Got it" or "Enough".
  • 3. The multi-round dialogue server connects to the resource access server to open a new topic. In this state, the voice input by the user may be "Cut in with the weather in Beijing".
  • 4. The multi-round dialogue server connects to the resource access server to go deeper into the topic. In this state, the voice input by the user may be "Beijing weather" followed by "And Shanghai?".
  • 5. An old topic is resumed. In this state, the voice input by the user may be "Finish the earlier joke"; the multi-round dialogue server may also ask proactively, in which case the broadcast result received by the terminal may be "The weather broadcast is over. Do you still want me to finish the earlier joke?".
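The five states above can be pictured as a dispatcher over the understood user reply. This is an illustrative keyword-based sketch only; in the patent the mapping comes from the QU server's context understanding, not from literal string matching, and all names here are invented.

```python
def classify_user_reply(reply):
    """Map an (already transcribed) user reply onto one of the five
    broadcast-control states described above. Keyword lists are stand-ins
    for the QU server's context understanding."""
    if reply in ("Aha", "Interesting"):
        return "keep_broadcasting"        # state 1: backchannel, keep going
    if reply in ("Got it", "Enough"):
        return "stop_and_end_topic"       # state 2: stop current broadcast
    if reply.startswith("Cut in"):
        return "open_new_topic"           # state 3: e.g. "Cut in with the weather in Beijing"
    if reply == "And Shanghai?":
        return "go_deeper_into_topic"     # state 4: follow-up within the topic
    return "resume_previous_topic"        # state 5: e.g. "Finish the earlier joke"
```

In a real system each branch would trigger a different server action (continue TTS, cancel TTS, query the resource access server, and so on).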
  • in this way, the present invention can maintain a dialogue and ensure the chat effect without manual intervention by the user (such as a button operation).
  • FIG. 5 is a schematic structural diagram of an embodiment of the human-machine voice interaction device of the present invention.
  • the human-machine voice interaction device in this embodiment can serve as a multi-round dialogue server, or a part of one, to implement the flow of the embodiment shown in FIG. 1 of the present invention.
  • as shown in FIG. 5, the human-machine voice interaction device may include: a receiving module 51, a sending module 52, a saving module 53, a determining module 54, and a generating module 55.
  • the receiving module 51 is configured to receive, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, the voice recognition result sent by the voice recognition server, where the voice recognition result is sent by the voice recognition server after recognizing the voice input by the user using the terminal; and, after the sending module 52 sends the voice recognition result to the QU server for context understanding, to receive the result of the context understanding sent by the QU server.
  • in this embodiment, while the terminal voice-broadcasts the broadcast result, the user may continue to input voice; that is, the terminal still receives the user's voice during the broadcast and continuously sends it to the voice recognition server for recognition, and the voice recognition server continuously sends the voice recognition results to the multi-round dialogue server, so that the receiving module 51 continuously receives them. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, and there is no need to repeatedly switch between the recording and broadcasting states.
  • the sending module 52 is configured to send the voice recognition result received by the receiving module 51 to the QU server for context understanding.
  • the saving module 53 is configured to save the result of the context understanding received by the receiving module 51.
  • the determining module 54 is configured to determine an intent of the voice input by the user according to the result of the context understanding saved by the saving module 53.
  • the generating module 55 is configured to generate a broadcast result according to the intention determined by the determining module 54.
  • the sending module 52 is further configured to send the broadcast result generated by the generating module 55 to the voice recognition server, so that the voice recognition server sends the broadcast result to the terminal for voice broadcast.
  • the generating module 55 is specifically configured to acquire information corresponding to the intent from the resource access server according to the intent determined by the determining module 54, and generate a broadcast result according to the obtained information.
  • the receiving module 51 is specifically configured to receive, after the voice recognition server determines that the obtained voice recognition result reaches a predetermined confidence level, the voice recognition result that reaches the predetermined confidence level.
  • the predetermined confidence level may be set in a specific implementation; its size is not limited in this embodiment.
  • in this embodiment, while the user inputs voice to the terminal, the voice recognition server continuously recognizes the voice sent by the terminal. When the voice recognition server determines that an obtained voice recognition result has reached the predetermined confidence level, it sends that result to the multi-round dialogue server, so that the determining module 54 determines the intention of the voice input by the user, the generating module 55 generates a valid broadcast result, and the sending module 52 sends the broadcast result to the terminal for voice broadcast; that is, once the terminal receives the broadcast result, it can interrupt the user's voice input and directly broadcast the obtained result to the user.
  • further, the human-machine voice interaction device may also include: an obtaining module 56, configured to obtain content suitable for recommending to the user according to the user's information and current state; the sending module 52 is further configured to trigger the cloud push service, send the recommended content to the terminal through the cloud push service, and initiate a dialogue with the terminal.
  • with the above device, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, the receiving module 51 can receive the voice recognition result sent by the voice recognition server, the determining module 54 determines the intention of the voice input by the user according to the voice recognition result, the generating module 55 generates a broadcast result according to the determined intention, and the sending module 52 sends the broadcast result to the voice recognition server, which forwards it to the terminal for voice broadcast. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the recording and broadcasting states need not be repeatedly switched, and full-duplex human-machine communication is realized, making multi-round dialogue more coherent.
  • FIG. 6 is a schematic structural diagram of another embodiment of the human-machine voice interaction device of the present invention.
  • the human-machine voice interaction device in this embodiment can serve as a voice recognition server, or a part of one, to implement the flow of the embodiment shown in FIG. 2 of the present invention.
  • as shown in FIG. 6, the human-machine voice interaction device may include: a receiving module 61, a sending module 62, and an identifying module 63;
  • the receiving module 61 is configured to receive, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, the voice sent by the terminal, where the voice is input to the terminal by the user using the terminal; and, after the sending module 62 sends the voice recognition result to the multi-round dialogue server, to receive the broadcast result sent by the multi-round dialogue server.
  • in this embodiment, while the terminal voice-broadcasts the broadcast result, the receiving module 61 can still receive the voice sent by the terminal; that is, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, so there is no need to repeatedly switch between the recording and broadcasting states during human-computer interaction.
  • the identifying module 63 is configured to recognize the voice received by the receiving module 61.
  • specifically, the identifying module 63 is configured to determine the start and end of each sentence in the voice by using a silence detection technique.
  • in this embodiment, using silence detection, the identifying module 63 can segment the voice into sentences, that is, determine the start and end of each sentence in the voice.
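The silence-detection segmentation just described can be sketched with a simple energy threshold: a sentence starts at the first frame above the threshold and ends after a run of quiet frames. This is a minimal illustrative sketch; the frame energies, threshold, and silence-run length are invented parameters, and a production recognizer would use a proper voice activity detector.

```python
def segment_utterances(frames, energy_threshold=0.1, min_silence_frames=3):
    """Return (start, end) frame-index pairs for each detected sentence.

    frames: per-frame energy values; end indices are exclusive."""
    utterances, start, silence = [], None, 0
    for i, energy in enumerate(frames):
        if energy >= energy_threshold:
            if start is None:
                start = i                 # sentence begins on first loud frame
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # enough consecutive quiet frames: the sentence ended
                utterances.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:                 # speech ran to the end of the input
        utterances.append((start, len(frames) - silence))
    return utterances
```

Each returned span corresponds to one "recording containing speech" that would be forwarded for recognition, with silent stretches dropped.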
  • the sending module 62 is configured to send the voice recognition result recognized by the identifying module 63 to the multi-round dialogue server, so that the multi-round dialogue server sends the voice recognition result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intention of the voice input by the user according to the saved result, and generates a broadcast result according to the intention; and, after the receiving module 61 receives the broadcast result sent by the multi-round dialogue server, to send the broadcast result to the terminal for voice broadcast.
  • the sending module 62 is specifically configured to send, after determining that the obtained speech recognition result reaches a predetermined confidence level, the speech recognition result that reaches the predetermined confidence level to the multi-round dialogue server.
  • the predetermined confidence level may be set in a specific implementation; its size is not limited in this embodiment.
  • in this embodiment, while the user inputs voice to the terminal, the identifying module 63 continuously recognizes the voice sent by the terminal. When it is determined that an obtained voice recognition result has reached the predetermined confidence level, the sending module 62 sends that result to the multi-round dialogue server, so that the multi-round dialogue server determines the intention of the voice input by the user in the manner described in steps 102 to 104 of the embodiment shown in FIG. 1, generates a valid broadcast result, and sends it to the terminal for voice broadcast; that is, once the terminal receives the broadcast result, it can interrupt the user's voice input and directly broadcast the obtained result to the user.
  • with the above device, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, the receiving module 61 receives the voice sent by the terminal, the identifying module 63 recognizes the voice, and the sending module 62 sends the voice recognition result to the multi-round dialogue server, so that the multi-round dialogue server determines the intention of the voice input by the user according to the voice recognition result and generates a broadcast result according to the intention; the receiving module 61 then receives the broadcast result sent by the multi-round dialogue server, and the sending module 62 sends it to the terminal for voice broadcast. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the recording and broadcasting states need not be repeatedly switched, and full-duplex human-machine communication is realized, making multi-round dialogue more coherent.
  • FIG. 7 is a schematic structural diagram of still another embodiment of the human-machine voice interaction device of the present invention.
  • the human-machine voice interaction device in this embodiment can serve as a terminal, or a part of one, to implement the flow of the embodiment shown in FIG. 3 of the present invention. As shown in FIG. 7, the human-machine voice interaction device may include: a receiving module 71, a sending module 72, and a broadcast module 73;
  • the receiving module 71 is configured to receive, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, the voice input by the user using the terminal; and, after the sending module 72 sends the voice to the voice recognition server, to receive the broadcast result sent by the voice recognition server, where that broadcast result was sent to the voice recognition server by the multi-round dialogue server.
  • specifically, the receiving module 71 is configured to use echo cancellation to remove the played text-to-speech (TTS) audio from the input while the terminal broadcasts the result sent by the voice recognition server, so that only the voice input by the user is received.
  • in this embodiment, while the terminal voice-broadcasts the broadcast result, the user can still input voice to the terminal; that is, the user can interrupt the terminal's voice broadcast by inputting voice, or directly give feedback on the broadcast result, influencing what the terminal broadcasts next. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, and the recording and broadcasting states need not be repeatedly switched.
  • the sending module 72 is configured to send the voice received by the receiving module 71 to the voice recognition server, so that the voice recognition server recognizes the voice and sends the voice recognition result to the multi-round dialogue server, which sends the voice recognition result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intention of the voice input by the user according to the saved result, and generates a broadcast result according to the intention;
  • the broadcast module 73 is configured to broadcast the broadcast result received by the receiving module 71.
  • specifically, the sending module 72 is configured to send a predetermined length of the voice input by the user to the voice recognition server, where the predetermined length may be set in a specific implementation; its size is not limited in this embodiment.
  • the sending module 72 is specifically configured to determine, by using a silence detection technology, the start and end of each sentence in the voice input by the user, and only transmit the voice-containing recording to the voice recognition server.
  • since the user sometimes inputs overly long voice, often a description of details, a predetermined length can be set: when the voice input by the user reaches that length, the sending module 72 sends the predetermined length of voice to the voice recognition server. Alternatively, because the user may pause while inputting voice, the start and end of each sentence in the user's voice may be determined by silence detection, and only the recording containing speech is sent to the voice recognition server. The voice recognition server recognizes the voice and sends the voice recognition result to the multi-round dialogue server, which sends the voice recognition result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intention of the voice input by the user according to the saved result, and generates a broadcast result according to the intention. The multi-round dialogue server then sends the broadcast result to the voice recognition server, which sends it to the terminal; at this point, the terminal can interrupt the user's voice input and broadcast the broadcast result by voice.
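The two client-side send policies above (flush on a predetermined length, or flush on a pause) can be sketched together. This is an illustrative sketch under invented assumptions: samples are abstract tokens, silence is modeled as the value 0, and `max_len` stands in for the implementation-defined predetermined length.

```python
def chunks_to_send(samples, max_len=5, is_silence=lambda s: s == 0):
    """Group input samples into the chunks a terminal would send to the
    voice recognition server: flush on a pause or on reaching max_len;
    silence itself is never transmitted."""
    buf, out = [], []
    for s in samples:
        if is_silence(s):
            if buf:
                out.append(buf)   # pause detected: flush speech so far
                buf = []
            continue              # drop the silent sample
        buf.append(s)
        if len(buf) >= max_len:
            out.append(buf)       # predetermined length reached: flush
            buf = []
    if buf:
        out.append(buf)           # trailing speech at end of input
    return out
```

Either trigger bounds the latency before the recognizer sees audio, which is what lets the server produce a confident result early enough to interrupt the user.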
  • with the above device, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, the receiving module 71 receives the voice input by the user using the terminal, the sending module 72 sends the user's voice to the voice recognition server, and the receiving module 71 then receives the broadcast result sent by the voice recognition server, which the broadcast module 73 broadcasts. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the recording and broadcasting states need not be repeatedly switched, and full-duplex human-machine communication is realized, making multi-round dialogue more coherent.
  • it should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit (ASIC) with suitable combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
  • each functional module in each embodiment of the present invention may be integrated into one processing module, or each module may exist physically separately, or two or more modules may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • the integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer readable storage medium.
  • the above mentioned storage medium may be a read only memory, a magnetic disk or an optical disk or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A human-machine voice interaction method and device. The method includes: while a terminal voice-broadcasts a broadcast result, receiving a voice recognition result sent by a voice recognition server (101); sending the voice recognition result to a QU server for context understanding, and receiving and saving the result of the context understanding (102); determining the intention of the voice input by the user according to the saved context understanding result, and generating a broadcast result according to the intention (103); and sending the broadcast result to the voice recognition server so that the voice recognition server sends it to the terminal for voice broadcast (104). During human-machine voice interaction, voice broadcast and the user's voice input can proceed simultaneously, so there is no need to repeatedly switch between the recording and broadcasting states, making multi-round dialogue more coherent.

Description

Human-Machine Voice Interaction Method and Device
Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 201510080163.X, entitled "Human-Machine Voice Interaction Method and Device", filed by Baidu Online Network Technology (Beijing) Co., Ltd. on February 13, 2015.
Technical Field
The present invention relates to the field of Internet technologies, and in particular to a human-machine voice interaction method and device.
Background
Speech recognition and human-machine voice interaction have a long history. In existing voice assistant applications (Application; hereinafter: APP), recording is triggered by a button; after recording finishes, the machine broadcasts an answer, and while the answer is being broadcast, no recording is possible. That is, existing voice assistant APPs support only half-duplex communication: while the machine broadcasts, the user cannot speak, and while the user speaks, the machine cannot broadcast.
This forces the machine to switch constantly between the recording and broadcasting states, which usually requires user intervention and is very inconvenient. Some voice assistant APPs now provide an auto-answer mode, in which the machine automatically enters the recording state after finishing a broadcast; but in this mode the machine sometimes switches automatically and sometimes does not, leaving the user at a loss.
In summary, the existing human-machine voice interaction mode is very inconvenient: every question-and-answer round requires user intervention, the operation is cumbersome, the interaction is unnatural, and the user experience is poor.
Summary of the Invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, a first object of the present invention is to propose a human-machine voice interaction method. With this method, voice broadcast and the user's voice input can proceed simultaneously during human-machine voice interaction, so that the recording and broadcasting states need not be repeatedly switched, full-duplex human-machine communication is realized, and multi-round dialogue becomes more coherent.
A second object of the present invention is to propose a human-machine voice interaction device.
To achieve the above objects, the human-machine voice interaction method of the embodiment of the first aspect of the present invention includes: while a terminal voice-broadcasts a broadcast result sent by a voice recognition server, receiving a voice recognition result sent by the voice recognition server, the voice recognition result being sent by the voice recognition server after recognizing the voice input by the user using the terminal; sending the voice recognition result to a query understanding (QU) server for context understanding, and receiving and saving the result of the context understanding sent by the QU server; determining the intention of the voice input by the user according to the saved context understanding result, and generating a broadcast result according to the intention; and sending the broadcast result to the voice recognition server so that the voice recognition server sends it to the terminal for voice broadcast.
In the human-machine voice interaction method of this embodiment, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, the voice recognition result sent by the voice recognition server can be received, the intention of the voice input by the user is determined according to the voice recognition result, a broadcast result is generated according to that intention and sent to the voice recognition server, which forwards it to the terminal for voice broadcast. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the recording and broadcasting states need not be repeatedly switched, full-duplex human-machine communication is realized, and multi-round dialogue becomes more coherent.
To achieve the above objects, the human-machine voice interaction method of the embodiment of the second aspect of the present invention includes: while a terminal voice-broadcasts a broadcast result sent by a voice recognition server, receiving the voice sent by the terminal, the voice being input to the terminal by the user using the terminal; recognizing the voice and sending the voice recognition result to a multi-round dialogue server, so that the multi-round dialogue server sends the voice recognition result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intention of the voice input by the user according to the saved result, and generates a broadcast result according to the intention; and receiving the broadcast result sent by the multi-round dialogue server, and sending the broadcast result to the terminal for voice broadcast.
In the human-machine voice interaction method of this embodiment, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, after receiving the voice sent by the terminal, the voice is recognized and the voice recognition result is sent to the multi-round dialogue server, so that the multi-round dialogue server determines the intention of the voice input by the user according to the voice recognition result and generates a broadcast result according to that intention; the voice recognition server then receives the broadcast result sent by the multi-round dialogue server and sends it to the terminal for voice broadcast. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the recording and broadcasting states need not be repeatedly switched, full-duplex human-machine communication is realized, and multi-round dialogue becomes more coherent.
To achieve the above objects, the human-machine voice interaction method of the embodiment of the third aspect of the present invention includes: while a terminal voice-broadcasts a broadcast result sent by a voice recognition server, receiving the voice input by the user using the terminal; sending the user's voice to the voice recognition server, so that the voice recognition server recognizes the voice and sends the voice recognition result to a multi-round dialogue server, which sends the voice recognition result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intention of the voice input by the user according to the saved result, and generates a broadcast result according to the intention; and receiving and broadcasting the broadcast result sent by the voice recognition server, where that broadcast result was sent to the voice recognition server by the multi-round dialogue server.
In the human-machine voice interaction method of this embodiment, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, the voice input by the user using the terminal is received and sent to the voice recognition server, so that the voice recognition server recognizes the voice and sends the voice recognition result to the multi-round dialogue server, which determines the intention of the voice input by the user according to the result and generates a broadcast result according to that intention; the terminal then receives and broadcasts the broadcast result sent by the voice recognition server. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the recording and broadcasting states need not be repeatedly switched, full-duplex human-machine communication is realized, and multi-round dialogue becomes more coherent.
To achieve the above objects, the human-machine voice interaction device of the embodiment of the fourth aspect of the present invention includes: a receiving module, configured to receive, while a terminal voice-broadcasts a broadcast result sent by a voice recognition server, the voice recognition result sent by the voice recognition server, the voice recognition result being sent by the voice recognition server after recognizing the voice input by the user using the terminal, and, after a sending module sends the voice recognition result to the QU server for context understanding, to receive the result of the context understanding sent by the QU server; the sending module, configured to send the voice recognition result received by the receiving module to the QU server for context understanding; a saving module, configured to save the result of the context understanding received by the receiving module; a determining module, configured to determine the intention of the voice input by the user according to the context understanding result saved by the saving module; and a generating module, configured to generate a broadcast result according to the intention determined by the determining module; the sending module is further configured to send the broadcast result generated by the generating module to the voice recognition server, so that the voice recognition server sends it to the terminal for voice broadcast.
In the human-machine voice interaction device of this embodiment, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, the receiving module can receive the voice recognition result sent by the voice recognition server, the determining module determines the intention of the voice input by the user according to the voice recognition result, the generating module generates a broadcast result according to the determined intention, and the sending module sends the broadcast result to the voice recognition server, which forwards it to the terminal for voice broadcast. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the recording and broadcasting states need not be repeatedly switched, full-duplex human-machine communication is realized, and multi-round dialogue becomes more coherent.
To achieve the above objects, the human-machine voice interaction device of the embodiment of the fifth aspect of the present invention includes: a receiving module, configured to receive, while a terminal voice-broadcasts a broadcast result sent by a voice recognition server, the voice sent by the terminal, the voice being input to the terminal by the user using the terminal, and, after a sending module sends the voice recognition result to a multi-round dialogue server, to receive the broadcast result sent by the multi-round dialogue server; an identifying module, configured to recognize the voice received by the receiving module; the sending module, configured to send the voice recognition result recognized by the identifying module to the multi-round dialogue server, so that the multi-round dialogue server sends the voice recognition result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intention of the voice input by the user according to the saved result, and generates a broadcast result according to the intention, and, after the receiving module receives the broadcast result sent by the multi-round dialogue server, to send the broadcast result to the terminal for voice broadcast.
In the human-machine voice interaction device of this embodiment, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, after the receiving module receives the voice sent by the terminal, the identifying module recognizes the voice and the sending module sends the voice recognition result to the multi-round dialogue server, so that the multi-round dialogue server determines the intention of the voice input by the user according to the result and generates a broadcast result according to the intention; the receiving module then receives the broadcast result sent by the multi-round dialogue server, and the sending module sends it to the terminal for voice broadcast. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the recording and broadcasting states need not be repeatedly switched, full-duplex human-machine communication is realized, and multi-round dialogue becomes more coherent.
To achieve the above objects, the human-machine voice interaction device of the embodiment of the sixth aspect of the present invention includes: a receiving module, configured to receive, while a terminal voice-broadcasts a broadcast result sent by a voice recognition server, the voice input by the user using the terminal, and, after a sending module sends the voice to the voice recognition server, to receive the broadcast result sent by the voice recognition server, where that broadcast result was sent to the voice recognition server by the multi-round dialogue server; the sending module, configured to send the voice received by the receiving module to the voice recognition server, so that the voice recognition server recognizes the voice and sends the voice recognition result to the multi-round dialogue server, which sends the voice recognition result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intention of the voice input by the user according to the saved result, and generates a broadcast result according to the intention; and a broadcast module, configured to broadcast the broadcast result received by the receiving module.
In the human-machine voice interaction device of this embodiment, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, the receiving module receives the voice input by the user using the terminal, the sending module sends the user's voice to the voice recognition server, and the receiving module then receives the broadcast result sent by the voice recognition server, which the broadcast module broadcasts. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the recording and broadcasting states need not be repeatedly switched, full-duplex human-machine communication is realized, and multi-round dialogue becomes more coherent.
Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart of an embodiment of the human-machine voice interaction method of the present invention;
FIG. 2 is a flowchart of another embodiment of the human-machine voice interaction method of the present invention;
FIG. 3 is a flowchart of still another embodiment of the human-machine voice interaction method of the present invention;
FIG. 4 is a schematic diagram of an embodiment of the connection relationship in the human-machine voice interaction method of the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of the human-machine voice interaction device of the present invention;
FIG. 6 is a schematic structural diagram of another embodiment of the human-machine voice interaction device of the present invention;
FIG. 7 is a schematic structural diagram of still another embodiment of the human-machine voice interaction device of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, where identical or similar reference numerals throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, intended only to explain the present invention, and should not be construed as limiting it. On the contrary, the embodiments of the present invention include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.
FIG. 1 is a flowchart of an embodiment of the human-machine voice interaction method of the present invention. As shown in FIG. 1, the method may include:
Step 101: While the terminal voice-broadcasts the broadcast result sent by the voice recognition server, receive the voice recognition result sent by the voice recognition server, the voice recognition result being sent by the voice recognition server after recognizing the voice input by the user using the terminal.
In this embodiment, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, the user of the terminal can still continue to input voice; that is, during the broadcast the terminal keeps receiving the user's voice and continuously sends it to the voice recognition server for recognition, and the voice recognition server continuously sends the voice recognition results to the multi-round dialogue server, which continuously receives them. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, and the recording and broadcasting states need not be repeatedly switched.
Specifically, receiving the voice recognition result sent by the voice recognition server may be: receiving the voice recognition result that the voice recognition server sends after determining that an obtained voice recognition result has reached a predetermined confidence level. The predetermined confidence level may be set in a specific implementation; its size is not limited in this embodiment.
In this embodiment, while the user inputs voice to the terminal, the voice recognition server continuously recognizes the voice sent by the terminal. When the voice recognition server determines that an obtained voice recognition result has reached the predetermined confidence level, it sends that result to the multi-round dialogue server, so that the multi-round dialogue server performs subsequent steps 102 to 104, determines the intention of the voice input by the user, generates a valid broadcast result, and sends it to the terminal for voice broadcast; that is, once the terminal receives the broadcast result, it can interrupt the user's voice input and directly broadcast the obtained result to the user.
Step 102: Send the voice recognition result to the query understanding (Query Understand; hereinafter: QU) server for context understanding, and receive and save the result of the context understanding sent by the QU server.
Step 103: Determine the intention of the voice input by the user according to the saved context understanding result, and generate a broadcast result according to the intention.
In this embodiment, the multi-round dialogue server clarifies the intention of the voice input by the user according to the saved context understanding result, and may then generate a broadcast result directly according to that intention;
alternatively, generating the broadcast result according to the intention may be: acquiring information corresponding to the intention from the resource access server according to the intention, and generating the broadcast result according to the acquired information.
Step 104: Send the broadcast result to the voice recognition server so that the voice recognition server sends it to the terminal for voice broadcast.
In this embodiment, content suitable for recommending to the user may further be obtained according to the user's information and current state, and the cloud push service may be triggered to send that content to the terminal and initiate a dialogue with the terminal.
That is, in this embodiment the multi-round dialogue server has learning ability: it can analyze the user's thoughts and wishes based on the user's information (for example, the user's schedule and/or songs the user has listened to) and current state (for example, current location and/or current conversation content), obtain content suitable for recommending to the user, trigger the cloud push service to send that content to the terminal, and initiate a dialogue with the terminal. The subsequent dialogue process is the same as described in steps 101 to 104 and is not repeated here.
In the above embodiment, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, the voice recognition result sent by the voice recognition server can be received, the intention of the voice input by the user is determined according to it, and a broadcast result is generated according to that intention and sent to the voice recognition server, which forwards it to the terminal for voice broadcast. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the recording and broadcasting states need not be repeatedly switched, full-duplex human-machine communication is realized, and multi-round dialogue becomes more coherent.
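Steps 101 to 104 on the multi-round dialogue server can be sketched as a single handler. This is an illustrative sketch only: `qu_understand`, `fetch_resource`, and `tts_send` stand in for the QU server, the resource access server, and the path back through the voice recognition server, and their signatures are assumptions invented for the example.

```python
def handle_recognition_result(result, context_store, qu_understand,
                              fetch_resource, tts_send):
    """Sketch of steps 101-104: context understanding, intent, resource
    lookup, and returning a broadcast result toward the terminal."""
    context = qu_understand(result)      # step 102: QU server context understanding
    context_store.append(context)        # save the context understanding result
    intent = context.get("intent")       # step 103: determine the user's intention
    info = fetch_resource(intent)        # optional resource access server lookup
    broadcast = f"{intent}: {info}"      # generate the broadcast result
    tts_send(broadcast)                  # step 104: back via the recognition server
    return broadcast
```

Keeping `context_store` across calls is what makes the dialogue "multi-round": each new recognition result is interpreted against the saved context of earlier turns.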
FIG. 2 is a flowchart of another embodiment of the human-machine voice interaction method of the present invention. As shown in FIG. 2, the method may include:
Step 201: While the terminal voice-broadcasts the broadcast result sent by the voice recognition server, receive the voice sent by the terminal, the voice being input to the terminal by the user using the terminal.
In this embodiment, while the terminal voice-broadcasts the broadcast result, the voice recognition server can still receive the voice sent by the terminal; that is, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, so the recording and broadcasting states need not be repeatedly switched.
Step 202: Recognize the voice and send the voice recognition result to the multi-round dialogue server, so that the multi-round dialogue server sends the voice recognition result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intention of the voice input by the user according to the saved result, and generates a broadcast result according to the intention.
Specifically, recognizing the voice includes: determining the start and end of each sentence in the voice by silence detection.
In this embodiment, using silence detection, the voice recognition server can segment the voice into sentences, that is, determine the start and end of each sentence in the voice.
Specifically, sending the voice recognition result to the multi-round dialogue server may be: after determining that an obtained voice recognition result has reached a predetermined confidence level, sending that result to the multi-round dialogue server. The predetermined confidence level may be set in a specific implementation; its size is not limited in this embodiment.
In this embodiment, while the user inputs voice to the terminal, the voice recognition server continuously recognizes the voice sent by the terminal. When it determines that an obtained voice recognition result has reached the predetermined confidence level, it sends that result to the multi-round dialogue server, so that the multi-round dialogue server determines the intention of the voice input by the user in the manner described in steps 102 to 104 of the embodiment shown in FIG. 1, generates a valid broadcast result, and sends it to the terminal for voice broadcast; that is, once the terminal receives the broadcast result, it can interrupt the user's voice input and directly broadcast the obtained result to the user.
Step 203: Receive the broadcast result sent by the multi-round dialogue server, and send the broadcast result to the terminal for voice broadcast.
In the above embodiment, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, after receiving the voice sent by the terminal, the voice is recognized and the voice recognition result is sent to the multi-round dialogue server, so that the multi-round dialogue server determines the intention of the voice input by the user according to the result and generates a broadcast result according to the intention; the voice recognition server then receives the broadcast result sent by the multi-round dialogue server and sends it to the terminal for voice broadcast. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the recording and broadcasting states need not be repeatedly switched, full-duplex human-machine communication is realized, and multi-round dialogue becomes more coherent.
FIG. 3 is a flowchart of still another embodiment of the human-machine voice interaction method of the present invention. As shown in FIG. 3, the method may include:
Step 301: While the terminal voice-broadcasts the broadcast result sent by the voice recognition server, receive the voice input by the user using the terminal.
Specifically, receiving the voice input by the user during the broadcast may be: while the terminal broadcasts the result sent by the voice recognition server, using echo cancellation to remove the played text-to-speech (Text to Speech; hereinafter: TTS) audio from the input, so that only the voice input by the user is received.
In this embodiment, while the terminal voice-broadcasts the broadcast result, the user can still input voice to the terminal; that is, the user can interrupt the terminal's voice broadcast by inputting voice, or directly give feedback on the broadcast result, influencing what the terminal broadcasts next. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, and the recording and broadcasting states need not be repeatedly switched.
Step 302: Send the voice input by the user to the voice recognition server, so that the voice recognition server recognizes the voice and sends the voice recognition result to the multi-round dialogue server, which sends the voice recognition result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intention of the voice input by the user according to the saved result, and generates a broadcast result according to the intention.
Specifically, sending the user's voice to the voice recognition server may be: sending a predetermined length of the voice input by the user to the voice recognition server. The predetermined length may be set in a specific implementation; its size is not limited in this embodiment.
Alternatively, sending the user's voice to the voice recognition server may be: determining the start and end of each sentence in the user's voice by silence detection, and sending only the recording containing speech to the voice recognition server.
Since the user sometimes inputs overly long voice, often a description of details, a predetermined length can be set: when the voice input by the user reaches that length, the predetermined length of voice is sent to the voice recognition server. Alternatively, because the user may pause while inputting voice, the start and end of each sentence in the user's voice may be determined by silence detection, and only the recording containing speech is sent to the voice recognition server, so that the voice recognition server recognizes the voice and sends the voice recognition result to the multi-round dialogue server, which sends the result to the QU server for context understanding, receives and saves the result of the context understanding sent by the QU server, determines the intention of the voice input by the user according to the saved result, and generates a broadcast result according to the intention. The multi-round dialogue server then sends the broadcast result to the voice recognition server, which sends it to the terminal; at this point, the terminal can interrupt the user's voice input and broadcast the broadcast result by voice.
Step 303: Receive and broadcast the broadcast result sent by the voice recognition server, where the broadcast result sent by the voice recognition server was sent to it by the multi-round dialogue server.
In the above embodiment, while the terminal voice-broadcasts the broadcast result sent by the voice recognition server, the terminal receives the voice input by the user using the terminal and sends it to the voice recognition server, so that the voice recognition server recognizes the voice and sends the voice recognition result to the multi-round dialogue server, which determines the intention of the voice input by the user according to the result and generates a broadcast result according to the intention; the terminal then receives and broadcasts the broadcast result sent by the voice recognition server. Thus, during human-machine voice interaction, voice broadcast and the user's voice input proceed simultaneously, the recording and broadcasting states need not be repeatedly switched, full-duplex human-machine communication is realized, and multi-round dialogue becomes more coherent.
本发明图1、图2和图3所示实施例提供的人机语音交互方法中,终端、语音识别服务器、多轮对话服务器、QU服务器和资源接入服务器之间的连接关系可以如图4所示,图4为本发明人机语音交互方法中的连接关系一个实施例的示意图。
参见图4,在终端对语音识别服务器发送的播报结果进行语音播报的过程中,终端接收使用上述终端的用户输入的语音。本发明中,在终端对语音识别服务器发送的播报结果进行语音播报的过程中,用户仍然可以向终端输入语音,也就是说,用户可以通过向终端输入语音打断终端的语音播报,也可以直接对终端播报的播报结果进行反馈,从而可以实现以下两种对话场景。
对话场景一:用户打断终端的语音播报
用户:点餐
终端:你需要些什么?
用户:宫保鸡丁,北京烤鸭。
终端:好的,准备为你下单,宫保鸡丁一份…
用户:宫保鸡丁不要了,换成辣子鸡丁。
终端:好的,准备为你下单,辣子鸡丁一份,北京烤鸭一份。
对话场景二:用户反馈终端的语音播报
人:这几天天气如何?
机器:略好,今天天气…
人:嗯
机器(不停顿):明天天气…
人:嗯,继续
机器(不停顿):后天天气…
人:好了
机器:播报完毕。
然后,终端将上述用户输入的语音发送给上述语音识别服务器,语音识别服务器对上述语音进行识别,并将语音识别结果发送给多轮对话服务器,由多轮对话服务器将上述语音识别结果发送给QU服务器进行上下文理解,接收并保存上述QU服务器发送的上下文理解的结果,以及根据保存的上下文理解的结果确定上述用户输入的语音的意图,并根据上述意图生成播报结果。
这里由于用户有时输入语音过长,并且往往是对细节的描述,于是可以设置预定长度,当用户输入的语音达到该预定长度了,就将用户输入的预定长度的语音发送给上述语音识别服务器;或者,有时用户在输入语音的过程中会有停顿,于是可以通过静音检测技术确定上述用户输入的语音中每句话的起始和结束,只将包含语音的录音发送给上述语音识别服务器,以使语音识别服务器对上述语音进行识别,并将语音识别结果发送给多轮对话服务器。或者,由于用户在向上述终端输入语音的时候,语音识别服务器也在不断地对终端发送来的语音进行识别,因此当语音识别服务器确定已获得的语音识别结果已达到预定的置信度时,语音识别服务器将达到上述预定的置信度的语音识别结果发送给多轮对话服务器。
然后,由多轮对话服务器将上述语音识别结果发送给QU服务器进行上下文理解,接收并保存上述QU服务器发送的上下文理解的结果,以及根据保存的上下文理解的结果确定上述用户输入的语音的意图,并根据上述意图生成播报结果。然后多轮对话服务器将播报结果发送给语音识别服务器,语音识别服务器将播报结果发送给终端,这时终端就可以打断用户的语音输入,对上述播报结果进行语音播报,从而可以实现如下对话场景。
对话场景三:终端打断用户的语音输入。
用户:去哪儿玩比较好呢,最近挺无聊的想…
终端(打断):我清楚你的需求了,工体今晚有邓紫棋的演唱会,目前门票有优惠,可以考虑。
用户:好的,下单吧。
终端:已为你购买今晚9点邓紫棋演唱会门票,票价xxx元。
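上述"终端 → 语音识别服务器 → 多轮对话服务器 → QU 服务器 → 终端"的一轮消息流转,可用如下示意代码串联起来。三个回调分别代表三类服务器的处理,接口形式均为假设:

```python
def dialog_round(audio, recognize, understand_context, generate_broadcast):
    """一轮人机交互的消息流转示意:
    recognize           -- 语音识别服务器:语音转文本
    understand_context  -- QU 服务器:上下文理解
    generate_broadcast  -- 多轮对话服务器:按意图生成播报结果
    返回最终下发给终端播报的结果。"""
    text = recognize(audio)
    context = understand_context(text)
    return generate_broadcast(context)
```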
另外,多轮对话服务器有学习能力,可以根据用户的用户信息(例如:用户的日程安排和/或听过的歌曲等)和用户的当前状态(例如:当前位置和/或当前对话内容等),分析用户的想法和意愿,获得适合推荐给用户的内容,然后多轮对话服务器可以触发云推送服务,可以通过上述云推送服务将适合推荐给用户的内容发送给上述终端,并发起与上述终端的对话,从而可以实现以下的对话场景。
对话场景四:根据用户的日程安排向用户推荐出租车信息
终端:你订了今天下午4点的机票,目前时间是下午2点,是否为你订一辆出租车?
用户:不用了,我自己开车去。
终端:你的车今天限行。
用户:OK,那帮我叫辆专车吧。
终端:好的,请稍等(….),王师傅已接单,车牌号是xxxx,预计3分钟到达。
用户:感谢。
本发明中,当终端对播报结果进行语音播报的时候,用户仍然可以向终端输入语音,然后终端将语音发送给语音识别服务器进行识别,语音识别服务器将语音识别结果发送给多轮对话服务器,多轮对话服务器将语音识别结果发送给QU服务器进行上下文理解,然后接收并保存上述QU服务器发送的上下文理解的结果,并根据保存的上下文理解的结果确定上述用户输入的语音的意图,然后根据上述意图生成播报结果返回给终端进行语音播报,可以实现以下5种状态:
1、终端保持语音播报,这种状态下,用户输入的语音可能是"啊哈"或者"有意思";
2、终端停止当前的播报,结束当前话题,这种状态下,用户输入的语音可能是"知道了"或者"够了";
3、多轮对话服务器连接资源接入服务器开启新话题,这种状态下,用户输入的语音可能是“插播下北京天气”;
4、多轮对话服务器连接资源接入服务器深入话题,这种状态下,用户输入的语音可能是“北京天气”和“上海呢”;
5、回到之前话题,这种状态下,用户输入的语音可能是"之前的笑话讲完了";也可以由多轮对话服务器主动询问,终端接收到的播报结果可能是"天气播报完了,还需要把之前段子讲完吗"。
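上述 5 种状态的判定,可用如下基于关键词的示意规则表达。实际判定由多轮对话服务器结合保存的上下文理解结果完成,这里的简单规则仅为演示假设:

```python
# 示意:将用户插入的语音粗略映射到正文列举的 5 种状态。
def classify_barge_in(text):
    if text in ("啊哈", "有意思"):
        return "保持播报"          # 状态1:终端保持语音播报
    if text in ("知道了", "够了"):
        return "结束当前话题"      # 状态2:停止当前播报
    if text.startswith("插播"):
        return "开启新话题"        # 状态3:连接资源接入服务器开启新话题
    if text.endswith("呢"):
        return "深入话题"          # 状态4:在当前话题上追问
    return "回到之前话题"          # 状态5:回到之前话题
```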
综上所述,本发明可以在不需要用户手工干预(按键等操作)的情况下,维持对话,保证聊天效果。
图5为本发明人机语音交互装置一个实施例的结构示意图,本实施例中的人机语音交互装置可以作为多轮对话服务器,或者多轮对话服务器的一部分实现本发明图1所示实施例的流程,如图5所示,该人机语音交互装置可以包括:接收模块51、发送模块52、保存模块53、确定模块54和生成模块55。
其中,接收模块51,用于在终端对语音识别服务器发送的播报结果进行语音播报的过程中,接收上述语音识别服务器发送的语音识别结果,上述语音识别结果是语音识别服务器对使用上述终端的用户输入的语音进行识别后发送的;以及在发送模块52将上述语音识别结果发送给QU服务器进行上下文理解之后,接收上述QU服务器发送的上下文理解的结果。
本实施例中,在终端对语音识别服务器发送的播报结果进行语音播报的过程中,使用上述终端的用户仍然可以继续输入语音,也就是说,该终端在对播报结果进行语音播报的过程中,仍在继续接收用户输入的语音,并持续地将用户输入的语音发送给语音识别服务器进行语音识别,然后语音识别服务器持续地将语音识别结果发送给多轮对话服务器,于是接收模块51持续地接收语音识别服务器发送的语音识别结果。从而可以实现在人机语音交互的过程中,语音播报和用户的语音输入同时进行,进而可以实现人机交互过程中不需要反复切换录音和播报两种状态。
发送模块52,用于将接收模块51接收的语音识别结果发送给QU服务器进行上下文理解。
保存模块53,用于保存接收模块51接收的上下文理解的结果。
确定模块54,用于根据保存模块53保存的上下文理解的结果确定上述用户输入的语音的意图。
生成模块55,用于根据确定模块54确定的意图生成播报结果。
发送模块52,还用于将生成模块55生成的播报结果发送给语音识别服务器,以便语音识别服务器将上述播报结果发送给终端进行语音播报。
本实施例中,生成模块55,具体用于根据确定模块54确定的意图从资源接入服务器获取与上述意图对应的信息,根据获取的信息生成播报结果。
本实施例中,接收模块51,具体用于接收上述语音识别服务器在确定获得的语音识别结果达到预定的置信度之后,发送的达到上述预定的置信度的语音识别结果。其中,该预定的置信度可以在具体实现时自行设定,本实施例对上述预定的置信度的大小不作限定。
本实施例中,用户在向上述终端输入语音的时候,语音识别服务器也在不断地对终端发送来的语音进行识别,当语音识别服务器确定已获得的语音识别结果已达到预定的置信度时,语音识别服务器将达到上述预定的置信度的语音识别结果发送给多轮对话服务器,以便确定模块54确定用户输入的语音的意图,进而由生成模块55生成有效的播报结果,发送模块52将该播报结果发送给上述终端进行语音播报,也就是说,如果终端接收到了播报结果,就可以打断用户的语音输入,直接向用户播报获得的播报结果。
本实施例中,进一步地,上述人机语音交互装置还可以包括:获取模块56,用于根据上述用户的用户信息和当前状态,获得适合推荐给上述用户的内容;发送模块52,还用于触发云推送服务,通过上述云推送服务将适合推荐给上述用户的内容发送给上述终端,并发起与上述终端的对话。
上述人机语音交互装置中,在终端对语音识别服务器发送的播报结果进行语音播报的过程中,接收模块51可以接收语音识别服务器发送的语音识别结果,确定模块54根据上述语音识别结果确定用户输入的语音的意图,生成模块55根据确定的意图生成播报结果,然后发送模块52将播报结果发送给语音识别服务器,由语音识别服务器将上述播报结果发送给终端进行语音播报,从而可以实现在人机语音交互的过程中,语音播报和用户的语音输入同时进行,实现人机交互过程中不需要反复切换录音和播报两种状态,实现人机交互全双工的通信方式,进而可以使得多轮对话更连贯。
图6为本发明人机语音交互装置另一个实施例的结构示意图,本实施例中的人机语音交互装置可以作为语音识别服务器,或者语音识别服务器的一部分实现本发明图2所示实施例的流程,如图6所示,该人机语音交互装置可以包括:接收模块61、发送模块62和识别模块63;
其中,接收模块61,用于在终端对语音识别服务器发送的播报结果进行语音播报的过程中,接收终端发送的语音,上述语音是使用上述终端的用户输入给上述终端的;以及在发送模块62将语音识别结果发送给多轮对话服务器之后,接收多轮对话服务器发送的播报结果。
本实施例中,在终端对语音识别服务器发送的播报结果进行语音播报的过程中,接收模块61还可以接收上述终端发送的语音,也就是说,在人机语音交互的过程中,语音播报和用户的语音输入同时进行,从而可以实现人机交互过程中不需要反复切换录音和播报两种状态。
识别模块63,用于对接收模块61接收的语音进行识别。其中,识别模块63,具体用于通过静音检测技术确定上述语音中每句话的起始和结束。本实施例中,运用静音检测技术,识别模块63能够实现对句子的切分,即识别模块63能够确定上述语音中每句话的起始和结束。
发送模块62,用于将识别模块63识别的语音识别结果发送给多轮对话服务器,以便多轮对话服务器将上述语音识别结果发送给QU服务器进行上下文理解,接收并保存上述QU服务器发送的上下文理解的结果,以及根据保存的上下文理解的结果确定用户输入的语音的意图,并根据上述意图生成播报结果;以及在接收模块61接收多轮对话服务器发送的播报结果之后,将上述播报结果发送给终端进行语音播报。
其中,发送模块62,具体用于在确定获得的语音识别结果达到预定的置信度之后,将达到上述预定的置信度的语音识别结果发送给多轮对话服务器。其中,该预定的置信度可以在具体实现时自行设定,本实施例对上述预定的置信度的大小不作限定。本实施例中,用户在向上述终端输入语音的时候,识别模块63也在不断地对终端发送来的语音进行识别,当确定已获得的语音识别结果已达到预定的置信度时,发送模块62将达到上述预定的置信度的语音识别结果发送给多轮对话服务器,以便多轮对话服务器按照本发明图1所示实施例步骤102~步骤104描述的方式,确定用户输入的语音的意图,进而生成有效的播报结果,发送给上述终端进行语音播报,也就是说,如果终端接收到了播报结果,就可以打断用户的语音输入,直接向用户播报获得的播报结果。
上述人机语音交互装置中,在终端对语音识别服务器发送的播报结果进行语音播报的过程中,接收模块61接收终端发送的语音之后,识别模块63对上述语音进行识别,然后发送模块62将语音识别结果发送给多轮对话服务器,以便多轮对话服务器根据上述语音识别结果确定用户输入的语音的意图,并根据上述意图生成播报结果,然后接收模块61接收多轮对话服务器发送的播报结果,并由发送模块62将上述播报结果发送给终端进行语音播报;从而可以实现在人机语音交互的过程中,语音播报和用户的语音输入同时进行,实现人机交互过程中不需要反复切换录音和播报两种状态,实现人机交互全双工的通信方式,进而可以使得多轮对话更连贯。
图7为本发明人机语音交互装置再一个实施例的结构示意图,本实施例中的人机语音交互装置可以作为终端,或者终端的一部分实现本发明图3所示实施例的流程,如图7所示,该人机语音交互装置可以包括:接收模块71、发送模块72和播报模块73;
接收模块71,用于在终端对语音识别服务器发送的播报结果进行语音播报的过程中,接收使用上述终端的用户输入的语音;以及在发送模块72将上述语音发送给语音识别服务器之后,接收上述语音识别服务器发送的播报结果,上述语音识别服务器发送的播报结果是多轮对话服务器发送给上述语音识别服务器的;本实施例中,接收模块71,具体用于在上述终端播报语音识别服务器发送的播报结果的过程中,通过回声消除技术,消除播放的TTS语音的输入,仅接收上述用户输入的语音。
本实施例中,在终端对语音识别服务器发送的播报结果进行语音播报的过程中,用户仍然可以向终端输入语音,也就是说,用户可以通过向终端输入语音打断终端的语音播报,也可以直接对终端播报的播报结果进行反馈,影响终端接下来的播报内容,从而可以实现在人机语音交互的过程中,语音播报和用户的语音输入同时进行,进而可以实现人机交互过程中不需要反复切换录音和播报两种状态。
发送模块72,用于将接收模块71接收的语音发送给上述语音识别服务器,以使上述语音识别服务器对上述语音进行识别,并将语音识别结果发送给多轮对话服务器,由多轮对话服务器将上述语音识别结果发送给QU服务器进行上下文理解,接收并保存QU服务器发送的上下文理解的结果,以及根据保存的上下文理解的结果确定上述用户输入的语音的意图,并根据上述意图生成播报结果;
播报模块73,用于播报接收模块71接收的播报结果。
本实施例的一种实现方式中,发送模块72,具体用于将上述用户输入的预定长度的语音发送给上述语音识别服务器。其中,上述预定长度可以在具体实现时自行设定,本实施例对上述预定长度的大小不作限定。
本实施例的另一种实现方式中,发送模块72,具体用于通过静音检测技术确定上述用户输入的语音中每句话的起始和结束,只将包含语音的录音发送给语音识别服务器。
由于用户有时输入语音过长,并且往往是对细节的描述,于是可以设置预定长度,当用户输入的语音达到该预定长度了,发送模块72就将用户输入的预定长度的语音发送给上述语音识别服务器;或者,有时用户在输入语音的过程中会有停顿,于是可以通过静音检测技术确定上述用户输入的语音中每句话的起始和结束,只将包含语音的录音发送给上述语音识别服务器,以使语音识别服务器对上述语音进行识别,并将语音识别结果发送给多轮对话服务器,由多轮对话服务器将上述语音识别结果发送给QU服务器进行上下文理解,接收并保存上述QU服务器发送的上下文理解的结果,以及根据保存的上下文理解的结果确定上述用户输入的语音的意图,并根据上述意图生成播报结果。然后多轮对话服务器将播报结果发送给语音识别服务器,语音识别服务器将播报结果发送给终端,这时终端就可以打断用户的语音输入,对上述播报结果进行语音播报。
上述人机语音交互装置,在终端对语音识别服务器发送的播报结果进行语音播报的过程中,接收模块71接收使用上述终端的用户输入的语音,然后发送模块72将上述用户输入的语音发送给语音识别服务器,以使语音识别服务器对上述语音进行识别,并将语音识别结果发送给多轮对话服务器,由多轮对话服务器根据该语音识别结果确定用户输入的语音的意图,进而根据上述意图生成播报结果;然后,接收模块71接收并由播报模块73播报语音识别服务器发送的播报结果;从而可以实现在人机语音交互的过程中,语音播报和用户的语音输入同时进行,实现人机交互过程中不需要反复切换录音和播报两种状态,实现人机交互全双工的通信方式,进而可以使得多轮对话更连贯。
需要说明的是,在本发明的描述中,术语“第一”、“第二”等仅用于描述目的,而不能理解为指示或暗示相对重要性。此外,在本发明的描述中,除非另有说明,“多个”的含义是两个或两个以上。
流程图中或在此以其他方式描述的任何过程或方法描述可以被理解为,表示包括一个或更多个用于实现特定逻辑功能或过程的步骤的可执行指令的代码的模块、片段或部分,并且本发明的优选实施方式的范围包括另外的实现,其中可以不按所示出或讨论的顺序,包括根据所涉及的功能按基本同时的方式或按相反的顺序,来执行功能,这应被本发明的实施例所属技术领域的技术人员所理解。
应当理解,本发明的各部分可以用硬件、软件、固件或它们的组合来实现。在上述实施方式中,多个步骤或方法可以用存储在存储器中且由合适的指令执行系统执行的软件或固件来实现。例如,如果用硬件来实现,和在另一实施方式中一样,可用本领域公知的下列技术中的任一项或他们的组合来实现:具有用于对数据信号实现逻辑功能的逻辑门电路的离散逻辑电路,具有合适的组合逻辑门电路的专用集成电路,可编程门阵列(Programmable Gate Array;以下简称:PGA),现场可编程门阵列(Field Programmable Gate Array;以下简称:FPGA)等。
本技术领域的普通技术人员可以理解实现上述实施例方法携带的全部或部分步骤是可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,该程序在执行时,包括方法实施例的步骤之一或其组合。
此外,本发明各个实施例中的各功能模块可以集成在一个处理模块中,也可以是各个模块单独物理存在,也可以两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。所述集成的模块如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。
上述提到的存储介质可以是只读存储器,磁盘或光盘等。
在本说明书的描述中,参考术语“一个实施例”、“一些实施例”、“示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本发明的至少一个实施例或示例中。在本说明书中,对上述术语的示意性表述不一定指的是相同的实施例或示例。而且,描述的具体特征、结构、材料或者特点可以在任何的一个或多个实施例或示例中以合适的方式结合。
尽管上面已经示出和描述了本发明的实施例,可以理解的是,上述实施例是示例性的,不能理解为对本发明的限制,本领域的普通技术人员在本发明的范围内可以对上述实施例进行变化、修改、替换和变型。

Claims (22)

  1. 一种人机语音交互方法,其特征在于,包括:
    在终端对语音识别服务器发送的播报结果进行语音播报的过程中,接收所述语音识别服务器发送的语音识别结果,所述语音识别结果是所述语音识别服务器对使用所述终端的用户输入的语音进行识别后发送的;
    将所述语音识别结果发送给关键词理解服务器进行上下文理解,接收并保存所述关键词理解服务器发送的上下文理解的结果;
    根据保存的上下文理解的结果确定所述用户输入的语音的意图,根据所述意图生成播报结果;
    将所述播报结果发送给所述语音识别服务器,以便所述语音识别服务器将所述播报结果发送给所述终端进行语音播报。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述意图生成播报结果包括:
    根据所述意图从资源接入服务器获取与所述意图对应的信息,根据获取的信息生成播报结果。
  3. 根据权利要求1所述的方法,其特征在于,所述接收所述语音识别服务器发送的语音识别结果包括:
    接收所述语音识别服务器在确定获得的语音识别结果达到预定的置信度之后,发送的达到所述预定的置信度的语音识别结果。
  4. 根据权利要求1-3任意一项所述的方法,其特征在于,还包括:
    根据所述用户的用户信息和当前状态,获得适合推荐给所述用户的内容,并触发云推送服务,通过所述云推送服务将所述适合推荐给所述用户的内容发送给所述终端,并发起与所述终端的对话。
  5. 一种人机语音交互方法,其特征在于,包括:
    在终端对语音识别服务器发送的播报结果进行语音播报的过程中,接收所述终端发送的语音,所述语音是使用所述终端的用户输入给所述终端的;
    对所述语音进行识别,将语音识别结果发送给多轮对话服务器,以便所述多轮对话服务器将所述语音识别结果发送给关键词理解服务器进行上下文理解,接收并保存所述关键词理解服务器发送的上下文理解的结果,以及根据保存的上下文理解的结果确定所述用户输入的语音的意图,并根据所述意图生成播报结果;
    接收所述多轮对话服务器发送的播报结果,将所述播报结果发送给所述终端进行语音播报。
  6. 根据权利要求5所述的方法,其特征在于,所述对所述语音进行识别包括:
    通过静音检测技术确定所述语音中每句话的起始和结束。
  7. 根据权利要求5或6所述的方法,其特征在于,所述将语音识别结果发送给多轮对话服务器包括:
    在确定获得的语音识别结果达到预定的置信度之后,将达到所述预定的置信度的语音识别结果发送给多轮对话服务器。
  8. 一种人机语音交互方法,其特征在于,包括:
    在终端对语音识别服务器发送的播报结果进行语音播报的过程中,接收使用所述终端的用户输入的语音;
    将所述用户输入的语音发送给所述语音识别服务器,以使所述语音识别服务器对所述语音进行识别,并将语音识别结果发送给多轮对话服务器,由所述多轮对话服务器将所述语音识别结果发送给关键词理解服务器进行上下文理解,接收并保存所述关键词理解服务器发送的上下文理解的结果,以及根据保存的上下文理解的结果确定所述用户输入的语音的意图,并根据所述意图生成播报结果;
    接收并播报所述语音识别服务器发送的播报结果,所述语音识别服务器发送的播报结果是所述多轮对话服务器发送给所述语音识别服务器的。
  9. 根据权利要求8所述的方法,其特征在于,所述在终端对语音识别服务器发送的播报结果进行语音播报的过程中,接收使用所述终端的用户输入的语音包括:
    在所述终端播报语音识别服务器发送的播报结果的过程中,通过回声消除技术,消除播放的从文本到语音TTS语音的输入,仅接收所述用户输入的语音。
  10. 根据权利要求8或9所述的方法,其特征在于,所述将所述用户输入的语音发送给所述语音识别服务器包括:
    将所述用户输入的预定长度的语音发送给所述语音识别服务器。
  11. 根据权利要求8或9所述的方法,其特征在于,所述将所述用户输入的语音发送给所述语音识别服务器包括:
    通过静音检测技术确定所述用户输入的语音中每句话的起始和结束,只将包含语音的录音发送给所述语音识别服务器。
  12. 一种人机语音交互装置,其特征在于,包括:
    接收模块,用于在终端对语音识别服务器发送的播报结果进行语音播报的过程中,接收所述语音识别服务器发送的语音识别结果,所述语音识别结果是所述语音识别服务器对使用所述终端的用户输入的语音进行识别后发送的;以及在发送模块将所述语音识别结果发送给关键词理解服务器进行上下文理解之后,接收所述关键词理解服务器发送的上下文理解的结果;
    所述发送模块,用于将所述接收模块接收的语音识别结果发送给关键词理解服务器进行上下文理解;
    保存模块,用于保存所述接收模块接收的上下文理解的结果;
    确定模块,用于根据所述保存模块保存的上下文理解的结果确定所述用户输入的语音的意图;
    生成模块,用于根据所述确定模块确定的意图生成播报结果;
    所述发送模块,还用于将所述生成模块生成的播报结果发送给所述语音识别服务器,以便所述语音识别服务器将所述播报结果发送给所述终端进行语音播报。
  13. 根据权利要求12所述的装置,其特征在于,
    所述生成模块,具体用于根据所述确定模块确定的意图从资源接入服务器获取与所述意图对应的信息,根据获取的信息生成播报结果。
  14. 根据权利要求12所述的装置,其特征在于,
    所述接收模块,具体用于接收所述语音识别服务器在确定获得的语音识别结果达到预定的置信度之后,发送的达到所述预定的置信度的语音识别结果。
  15. 根据权利要求12-14任意一项所述的装置,其特征在于,还包括:
    获取模块,用于根据所述用户的用户信息和当前状态,获得适合推荐给所述用户的内容;
    所述发送模块,还用于触发云推送服务,通过所述云推送服务将所述适合推荐给所述用户的内容发送给所述终端,并发起与所述终端的对话。
  16. 一种人机语音交互装置,其特征在于,包括:
    接收模块,用于在终端对语音识别服务器发送的播报结果进行语音播报的过程中,接收所述终端发送的语音,所述语音是使用所述终端的用户输入给所述终端的;以及在发送模块将语音识别结果发送给多轮对话服务器之后,接收所述多轮对话服务器发送的播报结果;
    识别模块,用于对所述接收模块接收的语音进行识别;
    所述发送模块,用于将所述识别模块识别的语音识别结果发送给多轮对话服务器,以便所述多轮对话服务器将所述语音识别结果发送给关键词理解服务器进行上下文理解,接收并保存所述关键词理解服务器发送的上下文理解的结果,以及根据保存的上下文理解的结果确定所述用户输入的语音的意图,并根据所述意图生成播报结果;以及在所述接收模块接收所述多轮对话服务器发送的播报结果之后,将所述播报结果发送给所述终端进行语音播报。
  17. 根据权利要求16所述的装置,其特征在于,
    所述识别模块,具体用于通过静音检测技术确定所述语音中每句话的起始和结束。
  18. 根据权利要求16或17所述的装置,其特征在于,
    所述发送模块,具体用于在确定获得的语音识别结果达到预定的置信度之后,将达到所述预定的置信度的语音识别结果发送给多轮对话服务器。
  19. 一种人机语音交互装置,其特征在于,包括:
    接收模块,用于在终端对语音识别服务器发送的播报结果进行语音播报的过程中,接收使用所述终端的用户输入的语音;以及在发送模块将所述语音发送给所述语音识别服务器之后,接收所述语音识别服务器发送的播报结果,所述语音识别服务器发送的播报结果是所述多轮对话服务器发送给所述语音识别服务器的;
    所述发送模块,用于将所述接收模块接收的语音发送给所述语音识别服务器,以使所述语音识别服务器对所述语音进行识别,并将语音识别结果发送给多轮对话服务器,由所述多轮对话服务器将所述语音识别结果发送给关键词理解服务器进行上下文理解,接收并保存所述关键词理解服务器发送的上下文理解的结果,以及根据保存的上下文理解的结果确定所述用户输入的语音的意图,并根据所述意图生成播报结果;
    播报模块,用于播报所述接收模块接收的播报结果。
  20. 根据权利要求19所述的装置,其特征在于,
    所述接收模块,具体用于在所述终端播报语音识别服务器发送的播报结果的过程中,通过回声消除技术,消除播放的从文本到语音TTS语音的输入,仅接收所述用户输入的语音。
  21. 根据权利要求19或20所述的装置,其特征在于,
    所述发送模块,具体用于将所述用户输入的预定长度的语音发送给所述语音识别服务器。
  22. 根据权利要求19或20所述的装置,其特征在于,
    所述发送模块,具体用于通过静音检测技术确定所述用户输入的语音中每句话的起始和结束,只将包含语音的录音发送给所述语音识别服务器。
PCT/CN2015/083207 2015-02-13 2015-07-02 人机语音交互方法和装置 WO2016127550A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510080163.X 2015-02-13
CN201510080163.XA CN104679472A (zh) 2015-02-13 2015-02-13 人机语音交互方法和装置

Publications (1)

Publication Number Publication Date
WO2016127550A1 true WO2016127550A1 (zh) 2016-08-18

Family ID: 53314597

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/083207 WO2016127550A1 (zh) 2015-02-13 2015-07-02 人机语音交互方法和装置

Country Status (2)

Country Link
CN (1) CN104679472A (zh)
WO (1) WO2016127550A1 (zh)



Also Published As

Publication number Publication date
CN104679472A (zh) 2015-06-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15881711

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15881711

Country of ref document: EP

Kind code of ref document: A1