WO2020248524A1 - Human-machine dialogue method and electronic device - Google Patents

Human-machine dialogue method and electronic device

Info

Publication number
WO2020248524A1
WO2020248524A1 (PCT/CN2019/120607)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
answer instruction
client
sentence
time
Prior art date
Application number
PCT/CN2019/120607
Other languages
English (en)
French (fr)
Inventor
宋洪博
朱成亚
石韡斯
樊帅
Original Assignee
苏州思必驰信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州思必驰信息科技有限公司
Priority to EP19932635.6A (EP3985661B1)
Priority to JP2021572940A (JP7108799B2)
Priority to US17/616,969 (US11551693B2)
Publication of WO2020248524A1

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 40/00 Handling natural language data
            • G06F 40/30 Semantic analysis
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L 15/00 Speech recognition
            • G10L 15/08 Speech classification or search
              • G10L 15/18 Speech classification or search using natural language modelling
                • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
                • G10L 15/1822 Parsing for meaning understanding
            • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
              • G10L 2015/223 Execution procedure of a spoken command
              • G10L 2015/225 Feedback of the input speech
            • G10L 15/28 Constructional details of speech recognition systems
              • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H ELECTRICITY
      • H04 ELECTRIC COMMUNICATION TECHNIQUE
        • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
          • H04L 67/00 Network arrangements or protocols for supporting network services or applications
            • H04L 67/01 Protocols
              • H04L 67/06 Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]

Definitions

  • The invention relates to the field of intelligent voice dialogue, and in particular to a human-machine dialogue method and an electronic device.
  • Unreasonable sentence segmentation: on the one hand, a user's speaking rhythm differs across scenarios. Segmenting sentences by acoustic features alone means the system may respond before the user has finished speaking, and may also force a user who has clearly finished to wait a long time. On the other hand, the uploaded audio is not continuous, so the server cannot accurately determine the actual interval between two utterances, nor judge whether a long gap between them was caused by network delay, and therefore cannot reasonably decide on the response content.
  • An embodiment of the present invention provides a human-machine dialogue method applied to a server, including:
  • receiving first audio uploaded by a user through a client, marking the start time point and end time point of the first audio, and generating a first recognition result of the first audio with an audio decoder;
  • determining, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, if second audio uploaded by the client is received within a preset heartbeat protection time range, generating a second recognition result of the second audio with the audio decoder;
  • sending at least the combination of the first recognition result and the second recognition result to a language prediction model and judging whether the combined sentence is a single sentence;
  • when it is a single sentence, generating an answer instruction corresponding to the combined sentence and sending the answer instruction, together with its feedback time mark, to the client to complete the human-machine dialogue through the client, wherein the feedback time mark includes the start time point and end time point of the sentence corresponding to the answer instruction.
  • An embodiment of the present invention provides a human-machine dialogue method applied to a client, including:
  • continuously uploading the first audio and second audio input by the user to the server, using the start time point and end time point of each audio as input time marks;
  • receiving, in turn, the answer instructions sent by the server and their corresponding feedback time marks, and determining the user-input audio corresponding to each answer instruction by matching the input time marks against the feedback time marks;
  • judging whether an answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time;
  • when the answer instruction has timed out, discarding the answer instruction, and when it has not timed out, feeding it back to the user to complete the human-machine dialogue.
  • An embodiment of the present invention provides a human-machine dialogue method applied to a voice dialogue platform, the platform including a server and a client, the method including:
  • the client continuously uploads the first audio and second audio input by the user to the server, using the start and end time points of each audio as input time marks;
  • the server receives the first audio uploaded by the user through the client, marks the start and end time points of the first audio, and generates a first recognition result of the first audio with an audio decoder;
  • the server determines, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, if the second audio continuously uploaded by the client is received within the preset heartbeat protection time range, generates a second recognition result of the second audio with the audio decoder;
  • the server sends at least the combination of the first recognition result and the second recognition result to a language prediction model and judges whether the combined sentence is a single sentence;
  • when it is a single sentence, the server generates an answer instruction corresponding to the combined sentence and sends it, together with its feedback time mark, to the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction;
  • the client receives the answer instruction sent by the server and its corresponding feedback time mark, and determines the user-input audio corresponding to the answer instruction by matching the input time mark against the feedback time mark;
  • the client judges whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time;
  • when the answer instruction has timed out, the client discards it, and when it has not timed out, the client feeds it back to the user to complete the human-machine dialogue.
  • An embodiment of the present invention provides a human-machine dialogue system applied to a server, including:
  • a recognition and decoding program module, configured to receive the first audio uploaded by the user through the client, mark the start and end time points of the first audio, and generate a first recognition result of the first audio with an audio decoder;
  • a short-sentence determination program module, configured to determine, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, to generate a second recognition result of the second audio with the audio decoder if the second audio uploaded by the client is received within the preset heartbeat protection time range;
  • a sentence judgment program module, configured to send at least the combination of the first recognition result and the second recognition result to a language prediction model, to judge whether the combined sentence is a single sentence, and, when it is a single sentence, to generate an answer instruction corresponding to the combined sentence and send it, together with its feedback time mark, to the client to complete the human-machine dialogue through the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction.
  • An embodiment of the present invention provides a human-machine dialogue system applied to a client, including:
  • an audio upload program module, configured to continuously upload the first audio and second audio input by the user to the server, using the start and end time points of each audio as input time marks;
  • an audio matching program module, configured to receive, in turn, the answer instructions sent by the server and their corresponding feedback time marks, and to determine the user-input audio corresponding to each answer instruction by matching the input time marks against the feedback time marks;
  • a human-machine dialogue program module, configured to judge whether an answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time, to discard the answer instruction when it has timed out, and to feed it back to the user when it has not, completing the human-machine dialogue.
  • An embodiment of the present invention provides a human-machine dialogue system applied to a voice dialogue platform, the platform including a server and a client, the system including:
  • an audio upload program module, used by the client to continuously upload the first audio and second audio input by the user to the server, using the start and end time points of each audio as input time marks;
  • a recognition and decoding program module, used by the server to receive the first audio uploaded by the user through the client, mark the start and end time points of the first audio, and generate a first recognition result of the first audio with an audio decoder;
  • a short-sentence determination program module, used by the server to determine, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, to generate a second recognition result of the second audio with the audio decoder if the second audio continuously uploaded by the client is received within the preset heartbeat protection time range;
  • a sentence judgment program module, used by the server to send at least the combination of the first recognition result and the second recognition result to a language prediction model and judge whether the combined sentence is a single sentence, and, when it is a single sentence, to generate an answer instruction corresponding to the combined sentence and send it, together with its feedback time mark, to the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction;
  • an audio matching program module, used by the client to receive the answer instruction sent by the server and its corresponding feedback time mark, and to determine the user-input audio corresponding to the answer instruction by matching the input time mark against the feedback time mark;
  • a human-machine dialogue program module, used by the client to judge whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time, to discard the answer instruction when it has timed out, and to feed it back to the user when it has not, completing the human-machine dialogue.
  • An embodiment of the present invention provides an electronic device, including at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the human-machine dialogue method of any embodiment of the present invention.
  • An embodiment of the present invention provides a storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the human-machine dialogue method of any embodiment of the present invention.
  • The beneficial effects of the embodiments of the present invention are as follows: a heartbeat event is used to handle the time interval between two utterances while ensuring that the sentence the user speaks first is a short sentence, and, once the two utterances are confirmed to combine into a complete sentence, unreasonable sentence segmentation in full-duplex dialogue scenarios is resolved. Recording the start and end time points of the audio allows the user-input audio to be matched against the answer instructions returned by the server, which guarantees that users receive the correct replies. On this basis, different situations in the interaction between the user and the intelligent voice device can be handled by setting different time offsets, which solves the problem of redundant replies in full-duplex dialogue.
  • FIG. 1 is a flowchart of a human-machine dialogue method applied to a server according to an embodiment of the present invention;
  • FIG. 2 is a flowchart of a human-machine dialogue method applied to a client according to an embodiment of the present invention;
  • FIG. 3 is a flowchart of a human-machine dialogue method applied to a voice dialogue platform according to an embodiment of the present invention;
  • FIG. 4 is a schematic structural diagram of a human-machine dialogue system applied to a server according to an embodiment of the present invention;
  • FIG. 5 is a schematic structural diagram of a human-machine dialogue system applied to a client according to an embodiment of the present invention;
  • FIG. 6 is a schematic structural diagram of a human-machine dialogue system applied to a voice dialogue platform according to an embodiment of the present invention.
  • FIG. 1 is a flowchart of a human-machine dialogue method provided by an embodiment of the present invention, applied to a server and including the following steps:
  • S11: receiving the first audio uploaded by the user through the client, marking the start and end time points of the first audio, and generating a first recognition result of the first audio with an audio decoder;
  • S12: determining, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, if the second audio uploaded by the client is received within the preset heartbeat protection time range, generating a second recognition result of the second audio with the audio decoder;
  • S13: sending at least the combination of the first recognition result and the second recognition result to a language prediction model, and judging whether the combined sentence is a single sentence;
  • when it is a single sentence, generating an answer instruction corresponding to the combined sentence and sending it, together with its feedback time mark, to the client to complete the human-machine dialogue through the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction.
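Taken together, steps S11-S13 amount to the following server-side decision flow. The sketch below is a minimal illustration, not the patent's implementation: the wait_for_second, is_complete_sentence, and make_answer callables and the numeric thresholds are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

SHORT_SENTENCE_MAX_S = 1.5   # assumed threshold for a "short sentence"
HEARTBEAT_WINDOW_S = 3.0     # assumed heartbeat protection time range

@dataclass
class Utterance:
    start: float  # start time point of the audio
    end: float    # end time point of the audio
    text: str     # recognition result produced by the audio decoder

def answer_first_audio(
    first: Utterance,
    wait_for_second: Callable[[float], Optional[Utterance]],
    is_complete_sentence: Callable[[str], bool],
    make_answer: Callable[[str], str],
):
    """Sketch of S11-S13: hold a short utterance and try to merge it."""
    # S12: a short utterance may be unfinished, so wait within the
    # heartbeat protection window for a continuation from the client.
    if first.end - first.start <= SHORT_SENTENCE_MAX_S:
        second = wait_for_second(HEARTBEAT_WINDOW_S)
        if second is not None:
            combined = first.text + " " + second.text
            # S13: the language prediction model judges completeness.
            if is_complete_sentence(combined):
                # The feedback time mark spans the combined sentence.
                return make_answer(combined), (first.start, second.end)
    # Otherwise the first utterance is answered on its own.
    return make_answer(first.text), (first.start, first.end)
```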
  • In this embodiment, an existing full-duplex dialogue between a user and a smart device can go as follows:
  • User: I want to listen to (short pause) Jay Chou's songs
  • Smart voice device: What do you want to listen to?
  • Smart voice device: Now playing Jay Chou's "Dao Xiang" for you.
  • "I want to listen to" is an incomplete sentence, yet the smart voice device replies to it, adding a round of meaningless dialogue. This method is intended to keep the intelligent voice device from making meaningless replies to briefly paused, incomplete utterances such as "I want to listen to".
  • For step S11: when the user says "I want to listen to (short pause) Jay Chou's songs", the short pause after "I want to listen to" causes that fragment to be taken as the first audio and "Jay Chou's songs" as the second audio. The server receives the first audio, "I want to listen to", uploaded by the user through the smart voice device's client, marks its start and end time points, and generates the first recognition result with an audio decoder.
  • For step S12: whether the first audio is a short sentence is determined from its start and end time points. For example, since recording length is proportional to duration, the relative duration of the audio can be computed from the size of the received audio, and audio with a short speaking time is then classified as a short sentence; "I want to listen to" is such a short sentence. When the first audio is determined to be a short sentence, receiving the second audio from the client within the preset heartbeat protection time range further indicates that the first audio was not a finished utterance. The heartbeat protection time comes from heartbeat detection, which is commonly used in network programs: when there is temporarily no data exchange between the client and the server, a heartbeat is needed to detect whether the other side is still alive. Heartbeat detection can be initiated by either the client or the server.
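The size-to-duration relationship is straightforward for uncompressed audio. A minimal sketch, assuming 16 kHz, 16-bit, mono PCM (the actual audio format is not specified in the patent):

```python
SAMPLE_RATE = 16_000   # assumed: 16 kHz sampling
SAMPLE_WIDTH = 2       # assumed: 16-bit samples (2 bytes)
CHANNELS = 1           # assumed: mono

def audio_duration_seconds(num_bytes: int) -> float:
    """Recording length is proportional to time for uncompressed PCM."""
    return num_bytes / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)

def is_short_sentence(num_bytes: int, threshold_s: float = 1.5) -> bool:
    # e.g. ~1 s of 16 kHz/16-bit mono audio is 32,000 bytes.
    return audio_duration_seconds(num_bytes) <= threshold_s
```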
  • For step S13: at least the combination of the first recognition result "I want to listen to" and the second recognition result "Jay Chou's songs", namely "I want to listen to Jay Chou's songs", is sent to the language model to judge whether the combined text forms one complete sentence.
  • The language model determines that "I want to listen to Jay Chou's songs" is a complete sentence, so an answer instruction corresponding to it is generated and sent, together with its feedback time mark, to the client, completing the human-machine dialogue through the client. (The feedback time mark addresses the mismatched-answer problem and is explained in the embodiments below.)
  • As this embodiment shows, a heartbeat event is used to handle the time interval between two utterances while ensuring that the utterance the user speaks first is a short sentence; once the two utterances are confirmed to combine into a complete sentence, unreasonable sentence segmentation in full-duplex dialogue scenarios is resolved.
  • As an implementation, after judging whether the combined sentence is a single sentence, the method further includes: when it is not a single sentence, generating a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result, and sending both answer instructions, together with their respective feedback time marks, to the client. If the two recognition results cannot be combined into one sentence, the contents of the two utterances are unrelated, so unreasonable segmentation does not arise; answering each utterance separately ensures every user utterance receives a reply and keeps the full-duplex dialogue running stably.
  • In some embodiments, the present application further provides a server, including at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following steps:
  • receiving the first audio uploaded by the user through the client, marking the start and end time points of the first audio, and generating a first recognition result of the first audio with an audio decoder;
  • determining, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, if the second audio uploaded by the client is received within the preset heartbeat protection time range, generating a second recognition result of the second audio with the audio decoder;
  • sending at least the combination of the first recognition result and the second recognition result to a language prediction model and judging whether the combined sentence is a single sentence;
  • when it is a single sentence, generating an answer instruction corresponding to the combined sentence and sending it, together with its feedback time mark, to the client to complete the human-machine dialogue through the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction.
  • The at least one processor of the server provided in this application is further configured to: when the combination is not a single sentence, generate a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result, and send both answer instructions, together with their respective feedback time marks, to the client.
  • FIG. 2 is a flowchart of a human-machine dialogue method provided by an embodiment of the present invention, applied to a client and including the following steps:
  • S21: continuously uploading the first audio and second audio input by the user to the server, using the start and end time points of each audio as input time marks;
  • S22: receiving, in turn, the answer instructions sent by the server and their corresponding feedback time marks, and determining the user-input audio corresponding to each answer instruction by matching the input time marks against the feedback time marks;
  • S23: judging whether an answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time; when the answer instruction has timed out, discarding it, and when it has not timed out, feeding it back to the user to complete the human-machine dialogue.
  • In this embodiment, an existing full-duplex dialogue between a user and a smart device can also go as follows:
  • User: I want to hear a song
  • User: Jay Chou's "Dao Xiang"
  • Smart voice device: Whose song do you want to hear?
  • Smart voice device: OK, now playing Jay Chou's "Dao Xiang" for you.
  • The user adds a second utterance before the first is answered; because replies are delivered in input order and the user's inputs come quickly, the second utterance already answers the question the device later asks about the first. The device's first reply is therefore redundant, and this method adjusts for that situation.
  • For step S21: when the user says "I want to hear a song" and then "Jay Chou's 'Dao Xiang'", the client transmits them to the server in turn and, at the same time, locally records the start and end time points of each audio as input time marks.
  • For step S22: since "I want to hear a song" and "Jay Chou's 'Dao Xiang'" are both complete sentences, the client receives two answer instructions from the server, each with its feedback time mark. In this embodiment the input is two whole sentences, so two instructions arrive; if the utterance of embodiment 1 were used with this method, only one instruction would arrive. Because the dialogue is full duplex, the client must know which input utterance each answer instruction returned by the server corresponds to, and it therefore matches them using the previously recorded time marks.
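A minimal sketch of that matching step, keyed on the (start, end) input time marks; the message fields and payloads are hypothetical:

```python
# Input time marks recorded at upload time: (start, end) -> audio payload.
pending_inputs = {
    (10.0, 11.2): b"<audio: I want to hear a song>",
    (11.8, 13.5): b"<audio: Jay Chou's Dao Xiang>",
}

def match_answer(feedback_time_mark, pending):
    """Return the user-input audio that an answer instruction belongs to.

    feedback_time_mark is the (start, end) pair of the answered sentence,
    echoed back by the server alongside the answer instruction.
    """
    return pending.pop(feedback_time_mark, None)

audio = match_answer((10.0, 11.2), pending_inputs)  # -> first utterance
```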
  • For step S23: the answer instruction is judged against the time offset between the input time mark of the user-input audio and the client's current time. The offset relative to the client's current time can be adjusted to the specific situation; in a full-duplex dialogue, for example, there are two cases.
  • In the first case, as in the example above, the user's immediately following second utterance already implies the content of the device's first reply, making the first reply meaningless: once the second utterance has been entered and the first has not yet been answered, there is no longer any need to reply to the first. In this case the time offset is set relative to the input time of the second utterance.
  • In the second case, the user's two consecutive utterances are unrelated, such as "What time is it?" and "Order me a meal"; the smart voice device replies to each in turn, and the first and second replies do not affect each other. On top of this, when the user's request is complex for the server to process and takes a long time, or when network fluctuations delay the delivery of an already-computed answer instruction for a long time (for example, two minutes; in a full-duplex dialogue such a delayed reply severely hurts the user experience), these long-delayed answer instructions have also become meaningless. In this case the time offset is set relative to a preset response waiting time (this case is fairly common, so its details are not repeated here).
  • The offset relative to the client's current time can therefore be configured differently for these different situations. With the offset configured for the first case, the answer instruction for the first utterance is determined, from the time offset, to have timed out, and it is discarded; in this way redundant replies are avoided.
  • Recording the start and end time points of the audio allows the user-input audio to be matched against the answer instructions returned by the server, guaranteeing that users receive the correct replies; on this basis, the time offset is used to handle the different situations in the user's interaction with the intelligent voice device, solving the problem of redundant replies in full-duplex dialogue.
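A sketch of the client-side timeout check covering both offset policies described above; the threshold and the policy-selection logic are illustrative assumptions, not values from the patent:

```python
import time
from typing import Optional

RESPONSE_WAIT_S = 5.0  # assumed preset response waiting time (second case)

def answer_timed_out(input_end: float, now: float,
                     next_input_start: Optional[float] = None) -> bool:
    """Decide whether an answer instruction is stale and should be dropped."""
    if next_input_start is not None and next_input_start > input_end:
        # First case: the user's next utterance arrived before this answer
        # was delivered, so the answer to the earlier utterance is redundant.
        return True
    # Second case: the reply simply took too long (server load or network).
    return now - input_end > RESPONSE_WAIT_S

def deliver(answer: str, input_end: float,
            next_input_start: Optional[float] = None) -> Optional[str]:
    if answer_timed_out(input_end, time.time(), next_input_start):
        return None   # discard: never play a stale reply to the user
    return answer     # feed the answer back to the user
```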
  • In some embodiments, the present application further provides a client, including at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following steps:
  • continuously uploading the first audio and second audio input by the user to the server, using the start and end time points of each audio as input time marks;
  • receiving, in turn, the answer instructions sent by the server and their corresponding feedback time marks, and determining the user-input audio corresponding to each answer instruction by matching the input time marks against the feedback time marks;
  • judging whether an answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time;
  • when the answer instruction has timed out, discarding it, and when it has not timed out, feeding it back to the user to complete the human-machine dialogue.
  • FIG. 3 is a flowchart of a human-machine dialogue method provided by an embodiment of the present invention, applied to a voice dialogue platform and including the following steps:
  • S31: the client continuously uploads the first audio and second audio input by the user to the server, using the start and end time points of each audio as input time marks;
  • S32: the server receives the first audio uploaded by the user through the client, marks the start and end time points of the first audio, and generates a first recognition result of the first audio with an audio decoder;
  • S33: the server determines, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, if the second audio continuously uploaded by the client is received within the preset heartbeat protection time range, generates a second recognition result of the second audio with the audio decoder;
  • S34: the server sends at least the combination of the first recognition result and the second recognition result to a language prediction model and judges whether the combined sentence is a single sentence; when it is a single sentence, the server generates an answer instruction corresponding to the combined sentence and sends it, together with its feedback time mark, to the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction;
  • S35: the client receives the answer instruction sent by the server and its corresponding feedback time mark, and determines the user-input audio corresponding to the answer instruction by matching the input time mark against the feedback time mark;
  • S36: the client judges whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time; when the answer instruction has timed out, the client discards it, and when it has not timed out, the client feeds it back to the user to complete the human-machine dialogue.
  • As an implementation, after judging whether the combination is a single sentence, the method further includes:
  • when it is not a single sentence, the server generates a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result, and sends both answer instructions, together with their respective feedback time marks, to the client;
  • the client receives the first and second answer instructions sent by the server and their corresponding feedback time marks, and determines the user-input audio corresponding to each answer instruction by matching the input time marks against the feedback time marks;
  • the client judges whether each answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time;
  • when an answer instruction has timed out, the client discards it, and when it has not timed out, the client feeds it back to the user to complete the human-machine dialogue.
  • In this embodiment, the client and server are deployed on the voice dialogue platform as one integrated implementation. The specific implementation steps have been described in the above embodiments and are not repeated here.
  • As this embodiment shows, a heartbeat event is used to handle the time interval between two utterances while ensuring that the utterance the user speaks first is a short sentence, and unreasonable sentence segmentation in full-duplex dialogue is resolved once the two utterances are confirmed to combine into a complete sentence. Recording the start and end time points of the audio allows the user-input audio to be matched against the answer instructions returned by the server, guaranteeing that users receive the correct replies; on this basis, setting different time offsets handles the different situations in the user's interaction with the intelligent voice device and solves the problem of redundant replies in full-duplex dialogue.
  • In some embodiments, the present application also provides a voice dialogue platform, the platform including a server and a client, which includes at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following steps:
  • the client continuously uploads the first audio and second audio input by the user to the server, using the start and end time points of each audio as input time marks;
  • the server receives the first audio uploaded by the user through the client, marks the start and end time points of the first audio, and generates a first recognition result of the first audio with an audio decoder;
  • the server determines, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, if the second audio continuously uploaded by the client is received within the preset heartbeat protection time range, generates a second recognition result of the second audio with the audio decoder;
  • the server sends at least the combination of the first recognition result and the second recognition result to a language prediction model and judges whether the combined sentence is a single sentence;
  • when it is a single sentence, the server generates an answer instruction corresponding to the combined sentence and sends it, together with its feedback time mark, to the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction;
  • the client receives the answer instruction sent by the server and its corresponding feedback time mark, and determines the user-input audio corresponding to the answer instruction by matching the input time mark against the feedback time mark;
  • the client judges whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time;
  • when the answer instruction has timed out, the client discards it, and when it has not timed out, the client feeds it back to the user to complete the human-machine dialogue.
  • The at least one processor of the voice dialogue platform provided by the present application is further configured to:
  • when the combination is not a single sentence, have the server generate a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result and send both, together with their respective feedback time marks, to the client;
  • have the client receive the first and second answer instructions sent by the server and their corresponding feedback time marks, and determine the user-input audio corresponding to each answer instruction by matching the input time marks against the feedback time marks;
  • have the client judge whether each answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time, discarding the answer instruction when it has timed out and feeding it back to the user when it has not, to complete the human-machine dialogue.
  • FIG. 4 is a schematic structural diagram of a human-machine dialogue system provided by an embodiment of the present invention; the system can execute the human-machine dialogue method described in any of the foregoing embodiments and is configured in a terminal.
  • The human-machine dialogue system provided in this embodiment is applied to a server and includes a recognition and decoding program module 11, a short-sentence determination program module 12, and a sentence judgment program module 13.
  • The recognition and decoding program module 11 receives the first audio uploaded by the user through the client, marks the start and end time points of the first audio, and generates a first recognition result of the first audio with an audio decoder. The short-sentence determination program module 12 determines, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, generates a second recognition result of the second audio with the audio decoder if the second audio uploaded by the client is received within the preset heartbeat protection time range. The sentence judgment program module 13 sends at least the combination of the first recognition result and the second recognition result to a language prediction model, judges whether the combined sentence is a single sentence, and, when it is a single sentence, generates an answer instruction corresponding to the combined sentence and sends it, together with its feedback time mark, to the client to complete the human-machine dialogue through the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction.
  • Further, after judging whether the combined sentence is a single sentence, the sentence judgment program module is also configured to: when it is not a single sentence, generate a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result, and send both answer instructions, together with their respective feedback time marks, to the client.
  • FIG. 5 is a schematic structural diagram of a human-machine dialogue system provided by an embodiment of the present invention; the system can execute the human-machine dialogue method described in any of the foregoing embodiments and is configured in a terminal.
  • The human-machine dialogue system provided in this embodiment is applied to a client and includes an audio upload program module 21, an audio matching program module 22, and a human-machine dialogue program module 23.
  • The audio upload program module 21 continuously uploads the first audio and second audio input by the user to the server, using the start and end time points of each audio as input time marks. The audio matching program module 22 receives, in turn, the answer instructions sent by the server and their corresponding feedback time marks, and determines the user-input audio corresponding to each answer instruction by matching the input time marks against the feedback time marks. The human-machine dialogue program module 23 judges whether an answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time, discards the answer instruction when it has timed out, and feeds it back to the user when it has not, completing the human-machine dialogue.
  • FIG. 6 is a schematic structural diagram of a human-machine dialogue system provided by an embodiment of the present invention; the system can execute the human-machine dialogue method described in any of the foregoing embodiments and is configured in a terminal.
  • The human-machine dialogue system provided by this embodiment is applied to a voice dialogue platform including a server and a client, and comprises an audio upload program module 31, a recognition and decoding program module 32, a short-sentence determination program module 33, a sentence judgment program module 34, an audio matching program module 35, and a human-machine dialogue program module 36.
  • The audio upload program module 31 is used by the client to continuously upload the first audio and second audio input by the user to the server, using the start and end time points of each audio as input time marks. The recognition and decoding program module 32 is used by the server to receive the first audio uploaded by the user through the client, mark the start and end time points of the first audio, and generate a first recognition result of the first audio with an audio decoder. The short-sentence determination program module 33 is used by the server to determine, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, to generate a second recognition result of the second audio with the audio decoder if the second audio continuously uploaded by the client is received within the preset heartbeat protection time range. The sentence judgment program module 34 is used by the server to send at least the combination of the first recognition result and the second recognition result to a language prediction model and judge whether the combined sentence is a single sentence; when it is a single sentence, the server generates an answer instruction corresponding to the combined sentence and sends it, together with its feedback time mark, to the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction. The audio matching program module 35 is used by the client to receive the answer instruction sent by the server and its corresponding feedback time mark, and to determine the user-input audio corresponding to the answer instruction by matching the input time mark against the feedback time mark. The human-machine dialogue program module 36 is used by the client to judge whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time; when the answer instruction has timed out it is discarded, and when it has not timed out it is fed back to the user to complete the human-machine dialogue.
  • Further, the short-sentence determination program module is also configured so that, when the combination is not a single sentence, the server generates a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result and sends both, together with their respective feedback time marks, to the client. The audio matching program module is then used by the client to receive the first and second answer instructions sent by the server and their corresponding feedback time marks, and to determine the user-input audio corresponding to each answer instruction by matching the input time marks against the feedback time marks. The human-machine dialogue program module is used by the client to judge whether each answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time; when an answer instruction has timed out it is discarded, and when it has not timed out it is fed back to the user to complete the human-machine dialogue.
  • An embodiment of the present invention also provides a non-volatile computer storage medium storing computer-executable instructions that can execute the human-machine dialogue method in any of the foregoing method embodiments.
  • As an implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured so that:
  • the client continuously uploads the first audio and second audio input by the user to the server, using the start and end time points of each audio as input time marks;
  • the server receives the first audio uploaded by the user through the client, marks the start and end time points of the first audio, and generates a first recognition result of the first audio with an audio decoder;
  • the server determines, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, if the second audio continuously uploaded by the client is received within the preset heartbeat protection time range, generates a second recognition result of the second audio with the audio decoder;
  • the server sends at least the combination of the first recognition result and the second recognition result to a language prediction model and judges whether the combined sentence is a single sentence;
  • when it is a single sentence, the server generates an answer instruction corresponding to the combined sentence and sends it, together with its feedback time mark, to the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction;
  • the client receives the answer instruction sent by the server and its corresponding feedback time mark, and determines the user-input audio corresponding to the answer instruction by matching the input time mark against the feedback time mark;
  • the client judges whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time;
  • when the answer instruction has timed out, the client discards it, and when it has not timed out, the client feeds it back to the user to complete the human-machine dialogue.
  • As a non-volatile computer-readable storage medium, it can store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the human-machine dialogue method in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the human-machine dialogue method in any of the foregoing method embodiments.
  • The non-volatile computer-readable storage medium may include a program storage area and a data storage area, where the program storage area can store the operating system and the applications required by at least one function, and the data storage area can store data created according to the use of the device, and the like. Furthermore, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium may optionally include memories remotely located relative to the processor, and these remote memories may be connected to the human-machine dialogue device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • An embodiment of the present invention further provides an electronic device, including at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the human-machine dialogue method of any embodiment of the present invention.
  • The clients in the embodiments of the present application exist in various forms, including but not limited to:
  • (1) Mobile communication devices: these devices are characterized by mobile communication functions, with voice and data communication as their main goal. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones.
  • (2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
  • (3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players, handheld game consoles, e-book readers, as well as smart toys and portable in-car navigation devices.
  • The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units: they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement them without creative work.
  • Through the above description of the implementations, those skilled in the art can clearly understand that each implementation can be realized by software plus a necessary general hardware platform, and of course also by hardware. Based on this understanding, the above technical solutions can be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and which includes a number of instructions to make a computer device (which may be a personal computer, a server, a network device, or the like) execute the methods described in each embodiment or in some parts of an embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A human-machine dialogue method and system. The method includes: receiving first audio uploaded by a user through a client, marking the start time point and end time point of the first audio, and generating a first recognition result of the first audio with an audio decoder (S11); determining, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, if second audio uploaded by the client is received within a preset heartbeat protection time range, generating a second recognition result of the second audio with the audio decoder (S12); and sending at least the combination of the first recognition result and the second recognition result to a language prediction model, and, when the combination is a single sentence, generating an answer instruction corresponding to the combined sentence and sending the answer instruction, together with its feedback time mark, to the client (S13). The method solves unreasonable sentence segmentation in full-duplex dialogue scenarios and the problem of redundant replies in the dialogue.

Description

Human-machine dialogue method and electronic device

Technical field

The present invention relates to the field of intelligent voice dialogue, and in particular to a human-machine dialogue method and an electronic device.

Background

In a typical question-answering system, interaction is usually one question followed by one answer, or several rounds of turns. Full-duplex interaction aims for an effect similar to a phone call between people: not just one question and one answer, but possibly the user speaking several utterances before the robot replies; the robot may even ask questions proactively to help the interaction along, using rhythm-control techniques to adjust how much it says according to the volume and content of the user's speech.

In the course of implementing the present invention, the inventors found at least the following problems in the related art:

1. Answers that do not match the questions. With the response behavior of existing devices, utterances reaching the client are broadcast consecutively; when network delay or server-side processing delay is large, a response arriving at the client has already lost its timeliness. Because response measurement for the client is implemented entirely on the server, the client has no strategy for relative time alignment and cannot selectively discard certain responses to stay in the same session state as the server. If the user has already begun the next round of input while the client is still broadcasting several replies to earlier inputs, inputs and outputs no longer correspond, that is, the answers no longer match the questions, and the user experience suffers.

2. Unreasonable sentence segmentation. On the one hand, a user's speaking rhythm differs across scenarios; segmenting sentences by acoustic features alone means the system may respond before the user has finished speaking, and may also make a user who has clearly finished wait a long time. On the other hand, the uploaded audio is not continuous, so the server cannot accurately determine the actual interval between two utterances, nor judge whether a long gap between them is caused by network delay, and therefore cannot reasonably decide on the response content.
Summary of the invention

The present invention is intended to solve at least the problems in the prior art that replies lose their timeliness, so that input and output content no longer correspond, and that discontinuity between successive audio segments causes unreasonable sentence segmentation, so that the response content cannot be reasonably decided.

In a first aspect, an embodiment of the present invention provides a human-machine dialogue method applied to a server, including:

receiving first audio uploaded by a user through a client, marking the start time point and end time point of the first audio, and generating a first recognition result of the first audio with an audio decoder;

determining, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, if second audio uploaded by the client is received within a preset heartbeat protection time range, generating a second recognition result of the second audio with the audio decoder;

sending at least the combination of the first recognition result and the second recognition result to a language prediction model, and judging whether the combined sentence is a single sentence;

when it is a single sentence, generating an answer instruction corresponding to the combined sentence, and sending the answer instruction, together with its feedback time mark, to the client to complete the human-machine dialogue through the client, wherein the feedback time mark includes the start time point and end time point of the sentence corresponding to the answer instruction.

In a second aspect, an embodiment of the present invention provides a human-machine dialogue method applied to a client, including:

continuously uploading the first audio and second audio input by the user to the server, using the start time point and end time point of each audio as input time marks;

receiving, in turn, the answer instructions sent by the server and their corresponding feedback time marks, and determining the user-input audio corresponding to each answer instruction by matching the input time marks against the feedback time marks;

judging whether an answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time;

when the answer instruction has timed out, discarding the answer instruction, and when it has not timed out, feeding it back to the user to complete the human-machine dialogue.
In a third aspect, an embodiment of the present invention provides a human-machine dialogue method applied to a voice dialogue platform, the voice dialogue platform including a server and a client, the method including:

the client continuously uploads the first audio and second audio input by the user to the server, using the start and end time points of each audio as input time marks;

the server receives the first audio uploaded by the user through the client, marks the start and end time points of the first audio, and generates a first recognition result of the first audio with an audio decoder;

the server determines, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, if the second audio continuously uploaded by the client is received within the preset heartbeat protection time range, generates a second recognition result of the second audio with the audio decoder;

the server sends at least the combination of the first recognition result and the second recognition result to a language prediction model and judges whether the combined sentence is a single sentence;

when it is a single sentence, the server generates an answer instruction corresponding to the combined sentence and sends it, together with its feedback time mark, to the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction;

the client receives the answer instruction sent by the server and its corresponding feedback time mark, and determines the user-input audio corresponding to the answer instruction by matching the input time mark against the feedback time mark;

the client judges whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time;

when the answer instruction has timed out, the client discards it, and when it has not timed out, the client feeds it back to the user to complete the human-machine dialogue.
In a fourth aspect, an embodiment of the present invention provides a human-machine dialogue system applied to a server, including:

a recognition and decoding program module, configured to receive the first audio uploaded by the user through the client, mark the start and end time points of the first audio, and generate a first recognition result of the first audio with an audio decoder;

a short-sentence determination program module, configured to determine, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, to generate a second recognition result of the second audio with the audio decoder if the second audio uploaded by the client is received within the preset heartbeat protection time range; and

a sentence judgment program module, configured to send at least the combination of the first recognition result and the second recognition result to a language prediction model, to judge whether the combined sentence is a single sentence, and, when it is a single sentence, to generate an answer instruction corresponding to the combined sentence and send it, together with its feedback time mark, to the client to complete the human-machine dialogue through the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction.

In a fifth aspect, an embodiment of the present invention provides a human-machine dialogue system applied to a client, including:

an audio upload program module, configured to continuously upload the first audio and second audio input by the user to the server, using the start and end time points of each audio as input time marks;

an audio matching program module, configured to receive, in turn, the answer instructions sent by the server and their corresponding feedback time marks, and to determine the user-input audio corresponding to each answer instruction by matching the input time marks against the feedback time marks; and

a human-machine dialogue program module, configured to judge whether an answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time, to discard the answer instruction when it has timed out, and to feed it back to the user when it has not, completing the human-machine dialogue.
In a sixth aspect, an embodiment of the present invention provides a human-machine dialogue system applied to a voice dialogue platform, the voice dialogue platform including a server and a client, the system including:

an audio upload program module, used by the client to continuously upload the first audio and second audio input by the user to the server, using the start and end time points of each audio as input time marks;

a recognition and decoding program module, used by the server to receive the first audio uploaded by the user through the client, mark the start and end time points of the first audio, and generate a first recognition result of the first audio with an audio decoder;

a short-sentence determination program module, used by the server to determine, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, to generate a second recognition result of the second audio with the audio decoder if the second audio continuously uploaded by the client is received within the preset heartbeat protection time range;

a sentence judgment program module, used by the server to send at least the combination of the first recognition result and the second recognition result to a language prediction model and judge whether the combined sentence is a single sentence, and, when it is a single sentence, to generate an answer instruction corresponding to the combined sentence and send it, together with its feedback time mark, to the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction;

an audio matching program module, used by the client to receive the answer instruction sent by the server and its corresponding feedback time mark, and to determine the user-input audio corresponding to the answer instruction by matching the input time mark against the feedback time mark; and

a human-machine dialogue program module, used by the client to judge whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time, to discard the answer instruction when it has timed out, and to feed it back to the user when it has not, completing the human-machine dialogue.
In a seventh aspect, an electronic device is provided, including at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the human-machine dialogue method of any embodiment of the present invention.

In an eighth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the human-machine dialogue method of any embodiment of the present invention.

The beneficial effects of the embodiments of the present invention are as follows: a heartbeat event is used to handle the time interval between two utterances while ensuring that the sentence the user speaks first is a short sentence, and, once the two utterances are confirmed to combine into a complete sentence, unreasonable sentence segmentation in full-duplex dialogue scenarios is resolved. Recording the start and end time points of the audio allows the user-input audio to be matched against the answer instructions returned by the server, guaranteeing that users receive the correct replies; on this basis, setting different time offsets handles the different situations in the user's interaction with the intelligent voice device and solves the problem of redundant replies in full-duplex dialogue.
Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative work.

FIG. 1 is a flowchart of a human-machine dialogue method applied to a server according to an embodiment of the present invention;

FIG. 2 is a flowchart of a human-machine dialogue method applied to a client according to an embodiment of the present invention;

FIG. 3 is a flowchart of a human-machine dialogue method applied to a voice dialogue platform according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a human-machine dialogue system applied to a server according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a human-machine dialogue system applied to a client according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a human-machine dialogue system applied to a voice dialogue platform according to an embodiment of the present invention.
Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work fall within the protection scope of the present invention.

FIG. 1 is a flowchart of a human-machine dialogue method provided by an embodiment of the present invention, applied to a server and including the following steps:

S11: receiving the first audio uploaded by the user through the client, marking the start time point and end time point of the first audio, and generating a first recognition result of the first audio with an audio decoder;

S12: determining, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, if the second audio uploaded by the client is received within the preset heartbeat protection time range, generating a second recognition result of the second audio with the audio decoder;

S13: sending at least the combination of the first recognition result and the second recognition result to a language prediction model, and judging whether the combined sentence is a single sentence;

when it is a single sentence, generating an answer instruction corresponding to the combined sentence and sending the answer instruction, together with its feedback time mark, to the client to complete the human-machine dialogue through the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction.
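The patent does not fix a wire format for the answer instruction and its feedback time mark; as a purely illustrative sketch, such a server-to-client message could be rendered in Python as follows (all field names are assumptions):

```python
# Hypothetical shape of a server-to-client message for step S13;
# the field names are illustrative, not taken from the patent.
answer_message = {
    "answer_instruction": {
        "action": "play",
        "text": "Now playing Jay Chou's 'Dao Xiang' for you.",
    },
    "feedback_time_mark": {
        "start": 10.0,  # start time point of the answered sentence
        "end": 13.5,    # end time point of the answered sentence
    },
}
```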
In this embodiment, an existing full-duplex dialogue between a user and a smart device can go as follows:

User: I want to listen to (short pause) Jay Chou's songs

Smart voice device: What do you want to listen to?

Smart voice device: Now playing Jay Chou's "Dao Xiang" for you.

"I want to listen to" is an incomplete sentence, yet the smart voice device replies to it, adding a round of meaningless dialogue. This method is intended to keep the intelligent voice device from making meaningless replies to briefly paused, incomplete utterances such as "I want to listen to".
For step S11: likewise, when the user says "I want to listen to (short pause) Jay Chou's songs", the short pause after "I want to listen to" causes that fragment to be taken as the first audio, and "Jay Chou's songs" as the second audio. The server receives the first audio, "I want to listen to", uploaded by the user through the smart voice device's client, marks the start and end time points of the first audio, and generates the first recognition result with an audio decoder.

For step S12: whether the first audio is a short sentence is determined from its start and end time points. For example, since recording length is proportional to duration, the relative duration of the audio can be computed from the size of the received audio, and audio with a short speaking time is then classified as a short sentence; "I want to listen to" is such a short sentence. When the first audio is determined to be a short sentence, receiving the second audio from the client within the preset heartbeat protection time range further indicates that the first audio was not a finished utterance. The heartbeat protection time comes from heartbeat detection, which is commonly used in network programs: when there is temporarily no data exchange between the client and the server, a heartbeat is needed to detect whether the other side is still alive. Heartbeat detection can be initiated by either the client or the server.
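The heartbeat mechanism referred to here is the ordinary keep-alive pattern from network programming. A minimal client-initiated sketch, in which the interval and message shape are assumptions:

```python
import threading

HEARTBEAT_INTERVAL_S = 1.0  # assumed ping interval

def start_heartbeat(send, stop_event: threading.Event) -> None:
    """Ping the peer while no audio is flowing, so that a silent-but-alive
    client can be told apart from a dead connection."""
    def loop():
        # Event.wait returns False on timeout (keep pinging) and True
        # once the event is set (stop).
        while not stop_event.wait(HEARTBEAT_INTERVAL_S):
            send({"type": "heartbeat"})  # hypothetical message shape
    threading.Thread(target=loop, daemon=True).start()
```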
For step S13: at least the combination of the first recognition result "I want to listen to" and the second recognition result "Jay Chou's songs", namely "I want to listen to Jay Chou's songs", is sent to the language model to judge whether the combined text is one complete sentence.

The language model determines that "I want to listen to Jay Chou's songs" is a complete sentence, so an answer instruction corresponding to it is generated and sent, together with its feedback time mark, to the client, and the human-machine dialogue is completed through the client. (The feedback time mark addresses the mismatched-answer problem and is explained in the embodiments below.)

As this embodiment shows, a heartbeat event is used to handle the time interval between two utterances while ensuring that the utterance the user speaks first is a short sentence, and unreasonable sentence segmentation in full-duplex dialogue scenarios is resolved once the two utterances are confirmed to combine into a complete sentence.
As an implementation, in this embodiment, after judging whether the combined sentence is a single sentence, the method further includes:

when it is not a single sentence, generating a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result, and sending the first answer instruction and the second answer instruction, together with their respective feedback time marks, to the client.

In this implementation, if the first and second recognition results cannot be combined into one sentence, the contents of the two utterances are unrelated, so the problem of unreasonable segmentation does not arise. A first answer instruction for the first recognition result and a second answer instruction for the second recognition result are therefore generated separately and sent, with their respective feedback time marks, to the client.

As this implementation shows, when the two utterances are unrelated, each user utterance receives its own reply, which keeps the full-duplex dialogue running stably.
In some embodiments, the present application further provides a server, including at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following steps:

receiving the first audio uploaded by the user through the client, marking the start and end time points of the first audio, and generating a first recognition result of the first audio with an audio decoder;

determining, according to the start and end time points of the first audio, whether the first audio is a short sentence, and, when it is a short sentence, if the second audio uploaded by the client is received within the preset heartbeat protection time range, generating a second recognition result of the second audio with the audio decoder;

sending at least the combination of the first recognition result and the second recognition result to a language prediction model and judging whether the combined sentence is a single sentence;

when it is a single sentence, generating an answer instruction corresponding to the combined sentence and sending it, together with its feedback time mark, to the client to complete the human-machine dialogue through the client, wherein the feedback time mark includes the start and end time points of the sentence corresponding to the answer instruction.

In some embodiments, the at least one processor of the server provided in this application is further configured to: when the combination is not a single sentence, generate a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result, and send both answer instructions, together with their respective feedback time marks, to the client.
FIG. 2 is a flowchart of a human-machine dialogue method provided by an embodiment of the present invention, applied to a client and including the following steps:

S21: continuously uploading the first audio and second audio input by the user to the server, using the start and end time points of each audio as input time marks;

S22: receiving, in turn, the answer instructions sent by the server and their corresponding feedback time marks, and determining the user-input audio corresponding to each answer instruction by matching the input time marks against the feedback time marks;

S23: judging whether an answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time;

when the answer instruction has timed out, discarding the answer instruction, and when it has not timed out, feeding it back to the user to complete the human-machine dialogue.
In this embodiment, an existing full-duplex dialogue between a user and a smart device can also go as follows:

User: I want to hear a song

User: Jay Chou's "Dao Xiang"

Smart voice device: Whose song do you want to hear?

Smart voice device: OK, now playing Jay Chou's "Dao Xiang" for you.

After the first utterance, the user adds a second one; but because replies are delivered in input order and the user's inputs come quickly, by the time the user enters the second utterance it already answers the question that the device will later ask about the first. The first sentence output by the smart voice device is therefore a redundant reply, and this method is adjusted to avoid that situation.
For step S21: likewise, when the user says "I want to hear a song" and then "Jay Chou's 'Dao Xiang'", the client transmits them to the server in turn and, at the same time, locally records the start and end time points of each audio as input time marks.

For step S22: since "I want to hear a song" and "Jay Chou's 'Dao Xiang'" are both complete sentences, the client receives two answer instructions from the server, each with its feedback time mark. In this embodiment the input is two whole sentences, so two instructions arrive; if the utterance of embodiment 1 were used with this method, only one instruction would arrive. Because the dialogue is full duplex, the client must know which input utterance each answer instruction returned by the server corresponds to, and it therefore matches them using the previously recorded time marks.
For step S23: the judgment uses the time offset between the input time mark of the user-input audio and the client's current time. The offset relative to the client's current time can be adjusted to the specific situation; in a full-duplex dialogue, for example, there are two cases.

In the first case, as in the example above, the user's immediately following second utterance already implies the content of the device's first reply, making the first reply meaningless: once the second utterance has been entered and the first has not yet been answered, there is no longer any need to reply to the first. In this case, the time offset is set relative to the input time of the second utterance.

In the second case, the two consecutive utterances are unrelated, such as "What time is it?" and "Order me a meal"; the smart voice device replies to each in turn, and the first and second replies do not affect each other. On top of this, when the user's request is complex for the server to process and takes a long time, or when network fluctuations delay the delivery of an already-computed answer instruction for a long time (for example, two minutes; in a full-duplex dialogue such a delayed reply severely hurts the user experience), these long-delayed answer instructions have also become meaningless. In this case, the time offset is set relative to a preset response waiting time (this case is fairly common, so its details are not repeated here).

The offset relative to the client's current time can therefore be set differently for these different cases, to adapt to each situation.

When the offset is set for the first case, the answer instruction for the first utterance is determined, from the time offset, to have timed out, and it is discarded; in this way, redundant replies are avoided:
1、用户:我想听首歌
2、用户:周杰伦的稻香
智能语音设备:你想听谁的歌?(丢弃,不向用户输出)
3、智能语音设备:好的,为您播放周杰伦的稻香。
通过该实施方式可以看出,记录音频的开始时间点和结束时间点将用户输入的音频和服务器返回的回答指令进行匹配,保证了答复用户的准确性,在此基础上,通过设定不同的时间偏移,来处理用户与智能语音设备交互中的不同状况,解决了全双工对话中回复出现冗余的问题。
In some embodiments, the present application further provides a client, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the following steps:
continuously uploading a first audio and a second audio input by the user to the server, and using the start time points and the end time points of the audios as input time marks;
successively receiving the answer instructions sent by the server and the feedback time marks corresponding to the answer instructions, and determining the user-input audio to which each answer instruction corresponds by matching the input time marks with the feedback time marks;
judging whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time;
when the answer instruction has timed out, discarding the answer instruction; when the answer instruction has not timed out, feeding the answer instruction back to the user to complete the man-machine dialogue.
Fig. 3 is a flowchart of a man-machine dialogue method provided by an embodiment of the present invention. The method is applied to a voice dialogue platform and comprises the following steps:
S31: the client continuously uploads a first audio and a second audio input by the user to the server side, using the start time points and the end time points of the audios as input time marks;
S32: the server side receives the first audio uploaded by the user through the client, marks the start time point and the end time point of the first audio, and generates a first recognition result of the first audio by using an audio decoder;
S33: the server side determines, according to the start time point and the end time point of the first audio, whether the first audio is a short sentence, and when it is a short sentence, if the server side receives, within a preset heartbeat protection time range, the second audio continuously uploaded by the client, a second recognition result of the second audio is generated by using the audio decoder;
S34: the server side sends at least the combination of the first recognition result and the second recognition result to a language prediction model and judges whether the combined sentence is one sentence; when it is one sentence, the server side generates an answer instruction corresponding to the combined sentence and sends the answer instruction together with the feedback time mark of the answer instruction to the client, wherein the feedback time mark includes the start time point and the end time point of the sentence to which the answer instruction corresponds;
S35: the client receives the answer instruction sent by the server side and the feedback time mark corresponding to the answer instruction, and determines the user-input audio to which the answer instruction corresponds by matching the input time marks with the feedback time mark;
S36: the client judges whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time; when the answer instruction has timed out, the answer instruction is discarded; when the answer instruction has not timed out, the answer instruction is fed back to the user to complete the man-machine dialogue.
As an implementation, in this embodiment, after the judging whether the combination is one sentence, the method further comprises:
when it is not one sentence, the server side separately generates a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result, and sends the first answer instruction and the second answer instruction, together with their respective feedback time marks, to the client;
the client separately receives the first and second answer instructions sent by the server side and the feedback time marks corresponding to the answer instructions, and determines the user-input audio to which each answer instruction corresponds by matching the input time marks with the feedback time marks;
the client judges whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time;
when the answer instruction has timed out, the answer instruction is discarded; when the answer instruction has not timed out, the answer instruction is fed back to the user to complete the man-machine dialogue.
In this embodiment, the client and the server are applied to a voice dialogue platform as one implementation whole. The specific implementation steps have been described in the above embodiments and are not repeated here.
It can be seen from this embodiment that a heartbeat event is used to handle the time interval between two utterances while ensuring that the utterance the user speaks first is a short sentence, and ensuring that the two utterances can be combined into a complete sentence resolves the unreasonable sentence segmentation in full-duplex dialogue scenarios. Recording the start time points and end time points of the audio allows the user-input audio to be matched with the answer instructions returned by the server, which guarantees that the user is answered accurately; on this basis, setting different time offsets handles the different situations in the interaction between the user and the smart voice device, solving the problem of redundant replies in full-duplex dialogue.
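Tying the sketches above together, a hypothetical end-to-end run over the example dialogue of the first embodiment might look as follows (all names come from the earlier sketches; the timestamps and the trivial model/answer lambdas are illustrative only).

```python
# Server side: two audios separated by a short pause are judged as one sentence.
first = ("我想听", 0.0, 0.8)        # 0.8 s, a short sentence
second = ("周杰伦的歌", 1.2, 2.4)    # arrives within the heartbeat window
if is_short_sentence(first[1], first[2]) and within_heartbeat_window(first[2], second[1]):
    answers = answer_instructions(
        first, second,
        lm_is_complete_sentence=lambda s: True,            # stand-in judgment
        generate_answer=lambda s: "下面为您播放周杰伦的稻香")  # stand-in reply

# Client side: match each reply to its input marks, then apply the timeout check.
input_marks = [(0.0, 0.8), (1.2, 2.4)]
for ans in answers:
    if match_answer_to_input(ans, input_marks) and not should_discard(ans, now=3.0):
        print(ans.text)  # feed the reply back to the user
```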
In some embodiments, the present application further provides a voice dialogue platform, the voice dialogue platform comprising a server side and a client, and comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the following steps:
the client continuously uploads a first audio and a second audio input by the user to the server side, using the start time points and the end time points of the audios as input time marks;
the server side receives the first audio uploaded by the user through the client, marks the start time point and the end time point of the first audio, and generates a first recognition result of the first audio by using an audio decoder;
the server side determines, according to the start time point and the end time point of the first audio, whether the first audio is a short sentence, and when it is a short sentence, if the server side receives, within a preset heartbeat protection time range, the second audio continuously uploaded by the client, a second recognition result of the second audio is generated by using the audio decoder;
the server side sends at least the combination of the first recognition result and the second recognition result to a language prediction model and judges whether the combined sentence is one sentence;
when it is one sentence, the server side generates an answer instruction corresponding to the combined sentence and sends the answer instruction together with the feedback time mark of the answer instruction to the client, wherein the feedback time mark includes the start time point and the end time point of the sentence to which the answer instruction corresponds;
the client receives the answer instruction sent by the server side and the feedback time mark corresponding to the answer instruction, and determines the user-input audio to which the answer instruction corresponds by matching the input time marks with the feedback time mark;
the client judges whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time;
when the answer instruction has timed out, the answer instruction is discarded; when the answer instruction has not timed out, the answer instruction is fed back to the user to complete the man-machine dialogue.
In some embodiments, the at least one processor of the voice dialogue platform provided by the present application is further configured such that:
when it is not one sentence, the server side separately generates a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result, and sends the first answer instruction and the second answer instruction, together with their respective feedback time marks, to the client;
the client separately receives the first and second answer instructions sent by the server side and the feedback time marks corresponding to the answer instructions, and determines the user-input audio to which each answer instruction corresponds by matching the input time marks with the feedback time marks;
the client judges whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time;
when the answer instruction has timed out, the answer instruction is discarded; when the answer instruction has not timed out, the answer instruction is fed back to the user to complete the man-machine dialogue.
Fig. 4 is a schematic structural diagram of a man-machine dialogue system provided by an embodiment of the present invention. The system can execute the man-machine dialogue method described in any of the above embodiments and is configured in a terminal.
The man-machine dialogue system provided by this embodiment is applied to a server and comprises: a recognition decoding program module 11, a short-sentence determination program module 12, and a sentence judgment program module 13.
The recognition decoding program module 11 is used for receiving a first audio uploaded by a user through a client, marking the start time point and the end time point of the first audio, and generating a first recognition result of the first audio by using an audio decoder; the short-sentence determination program module 12 is used for determining, according to the start time point and the end time point of the first audio, whether the first audio is a short sentence, and when it is a short sentence, if a second audio uploaded by the client is received within a preset heartbeat protection time range, generating a second recognition result of the second audio by using the audio decoder; the sentence judgment program module 13 is used for sending at least the combination of the first recognition result and the second recognition result to a language prediction model, judging whether the combined sentence is one sentence, and, when it is one sentence, generating an answer instruction corresponding to the combined sentence and sending the answer instruction together with the feedback time mark of the answer instruction to the client, so as to complete the man-machine dialogue through the client, wherein the feedback time mark includes the start time point and the end time point of the sentence to which the answer instruction corresponds.
Further, after judging whether the combined sentence is one sentence, the sentence judgment program module is further used for:
when it is not one sentence, separately generating a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result, and sending the first answer instruction and the second answer instruction, together with their respective feedback time marks, to the client.
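Purely as a structural sketch, the three program modules of this server-side system could be composed as below; the class and method names are illustrative, not taken from the disclosure.

```python
class RecognitionDecodingModule:        # program module 11
    def decode(self, audio):
        """Mark the start/end time points and run the audio decoder."""
        ...

class ShortSentenceModule:              # program module 12
    def is_short(self, start_ts, end_ts):
        """Apply the duration-based short-sentence test."""
        ...

class SentenceJudgmentModule:           # program module 13
    def judge_and_answer(self, first_result, second_result):
        """Combine results, consult the language prediction model, build answers."""
        ...

class ManMachineDialogueServerSystem:
    """Wires program modules 11-13 together in the order of steps S11-S13."""
    def __init__(self):
        self.recognition = RecognitionDecodingModule()
        self.short_sentence = ShortSentenceModule()
        self.sentence_judgment = SentenceJudgmentModule()
```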
Fig. 5 is a schematic structural diagram of a man-machine dialogue system provided by an embodiment of the present invention. The system can execute the man-machine dialogue method described in any of the above embodiments and is configured in a terminal.
The man-machine dialogue system provided by this embodiment is applied to a client and comprises: an audio upload program module 21, an audio matching program module 22, and a man-machine dialogue program module 23.
The audio upload program module 21 is used for continuously uploading a first audio and a second audio input by the user to the server, using the start time points and the end time points of the audios as input time marks; the audio matching program module 22 is used for successively receiving the answer instructions sent by the server and the feedback time marks corresponding to the answer instructions, and determining the user-input audio to which each answer instruction corresponds by matching the input time marks with the feedback time marks; the man-machine dialogue program module 23 is used for judging whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time, discarding the answer instruction when it has timed out, and feeding the answer instruction back to the user when it has not timed out, to complete the man-machine dialogue.
Fig. 6 is a schematic structural diagram of a man-machine dialogue system provided by an embodiment of the present invention. The system can execute the man-machine dialogue method described in any of the above embodiments and is configured in a terminal.
The man-machine dialogue system provided by this embodiment is applied to a voice dialogue platform, the voice dialogue platform comprising a server side and a client, and comprises: an audio upload program module 31, a recognition decoding program module 32, a short-sentence determination program module 33, a sentence judgment program module 34, an audio matching program module 35, and a man-machine dialogue program module 36.
The audio upload program module 31 is used for the client to continuously upload a first audio and a second audio input by the user to the server side, using the start time points and the end time points of the audios as input time marks; the recognition decoding program module 32 is used for the server side to receive the first audio uploaded by the user through the client, mark the start time point and the end time point of the first audio, and generate a first recognition result of the first audio by using an audio decoder; the short-sentence determination program module 33 is used for the server side to determine, according to the start time point and the end time point of the first audio, whether the first audio is a short sentence, and when it is a short sentence, if the server side receives, within a preset heartbeat protection time range, the second audio continuously uploaded by the client, to generate a second recognition result of the second audio by using the audio decoder; the sentence judgment program module 34 is used for the server side to send at least the combination of the first recognition result and the second recognition result to a language prediction model, judge whether the combined sentence is one sentence, and, when it is one sentence, generate an answer instruction corresponding to the combined sentence and send the answer instruction together with the feedback time mark of the answer instruction to the client, wherein the feedback time mark includes the start time point and the end time point of the sentence to which the answer instruction corresponds; the audio matching program module 35 is used for the client to receive the answer instruction sent by the server side and the feedback time mark corresponding to the answer instruction, and to determine the user-input audio to which the answer instruction corresponds by matching the input time marks with the feedback time mark; the man-machine dialogue program module 36 is used for the client to judge whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time, to discard the answer instruction when it has timed out, and to feed the answer instruction back to the user when it has not timed out, to complete the man-machine dialogue.
Further, after the judging whether the combination is one sentence, the short-sentence determination program module is further used such that: when it is not one sentence, the server side separately generates a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result, and sends the first answer instruction and the second answer instruction, together with their respective feedback time marks, to the client;
the audio matching program module is used for the client to separately receive the first and second answer instructions sent by the server side and the feedback time marks corresponding to the answer instructions, and to determine the user-input audio to which each answer instruction corresponds by matching the input time marks with the feedback time marks;
the man-machine dialogue program module is used for the client to judge whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time, to discard the answer instruction when it has timed out, and to feed the answer instruction back to the user when it has not timed out, to complete the man-machine dialogue.
An embodiment of the present invention further provides a non-volatile computer storage medium, the computer storage medium storing computer-executable instructions which can execute the man-machine dialogue method in any of the above method embodiments.
As an implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions, the computer-executable instructions being configured to:
cause the client to continuously upload a first audio and a second audio input by the user to the server side, using the start time points and the end time points of the audios as input time marks;
cause the server side to receive the first audio uploaded by the user through the client, mark the start time point and the end time point of the first audio, and generate a first recognition result of the first audio by using an audio decoder;
cause the server side to determine, according to the start time point and the end time point of the first audio, whether the first audio is a short sentence, and when it is a short sentence, if the server side receives, within a preset heartbeat protection time range, the second audio continuously uploaded by the client, to generate a second recognition result of the second audio by using the audio decoder;
cause the server side to send at least the combination of the first recognition result and the second recognition result to a language prediction model and judge whether the combined sentence is one sentence;
when it is one sentence, cause the server side to generate an answer instruction corresponding to the combined sentence and send the answer instruction together with the feedback time mark of the answer instruction to the client, wherein the feedback time mark includes the start time point and the end time point of the sentence to which the answer instruction corresponds;
cause the client to receive the answer instruction sent by the server side and the feedback time mark corresponding to the answer instruction, and to determine the user-input audio to which the answer instruction corresponds by matching the input time marks with the feedback time mark;
cause the client to judge whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time;
when the answer instruction has timed out, discard the answer instruction, and when the answer instruction has not timed out, feed the answer instruction back to the user to complete the man-machine dialogue.
As a non-volatile computer-readable storage medium, it can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the man-machine dialogue method in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the man-machine dialogue method in any of the above method embodiments.
The non-volatile computer-readable storage medium may include a program storage area and a data storage area, wherein the program storage area can store an operating system and an application required by at least one function, and the data storage area can store data created according to the use of the man-machine dialogue device, and the like. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the man-machine dialogue device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the man-machine dialogue method of any embodiment of the present invention.
The client of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: these devices are characterized by mobile communication functions, with voice and data communication as their main goal. This class of terminals includes smartphones, multimedia phones, feature phones, low-end phones, and the like.
(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. This class of terminals includes PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices: these devices can display and play multimedia content. This class includes audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with a voice dialogue function.
In this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprising" and "including" cover not only the listed elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement them without creative effort.
From the description of the above embodiments, a person skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, or, of course, by hardware. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of the technical features therein can be equivalently replaced, and these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

  1. A man-machine dialogue method, applied to a server, comprising:
    receiving a first audio uploaded by a user through a client, marking the start time point and the end time point of the first audio, and generating a first recognition result of the first audio by using an audio decoder;
    determining, according to the start time point and the end time point of the first audio, whether the first audio is a short sentence, and when it is a short sentence, if a second audio uploaded by the client is received within a preset heartbeat protection time range, generating a second recognition result of the second audio by using the audio decoder;
    sending at least the combination of the first recognition result and the second recognition result to a language prediction model, and judging whether the combined sentence is one sentence,
    when it is one sentence, generating an answer instruction corresponding to the combined sentence, and sending the answer instruction together with the feedback time mark of the answer instruction to the client so as to complete the man-machine dialogue through the client, wherein the feedback time mark includes the start time point and the end time point of the sentence to which the answer instruction corresponds.
  2. The method according to claim 1, wherein after the judging whether the combined sentence is one sentence, the method further comprises:
    when it is not one sentence, separately generating a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result, and sending the first answer instruction and the second answer instruction, together with their respective feedback time marks, to the client.
  3. A man-machine dialogue method, applied to a client, comprising:
    continuously uploading a first audio and a second audio input by a user to a server, and using the start time points and the end time points of the audios as input time marks;
    successively receiving answer instructions sent by the server and the feedback time marks corresponding to the answer instructions, and determining the user-input audio to which each answer instruction corresponds by matching the input time marks with the feedback time marks;
    judging whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time,
    when the answer instruction has timed out, discarding the answer instruction, and when the answer instruction has not timed out, feeding the answer instruction back to the user to complete the man-machine dialogue.
  4. A man-machine dialogue method, applied to a voice dialogue platform, the voice dialogue platform comprising a server side and a client, the method comprising:
    the client continuously uploading a first audio and a second audio input by a user to the server side, and using the start time points and the end time points of the audios as input time marks;
    the server side receiving the first audio uploaded by the user through the client, marking the start time point and the end time point of the first audio, and generating a first recognition result of the first audio by using an audio decoder;
    the server side determining, according to the start time point and the end time point of the first audio, whether the first audio is a short sentence, and when it is a short sentence, if the server side receives, within a preset heartbeat protection time range, the second audio continuously uploaded by the client, generating a second recognition result of the second audio by using the audio decoder;
    the server side sending at least the combination of the first recognition result and the second recognition result to a language prediction model, and judging whether the combined sentence is one sentence,
    when it is one sentence, the server side generating an answer instruction corresponding to the combined sentence, and sending the answer instruction together with the feedback time mark of the answer instruction to the client, wherein the feedback time mark includes the start time point and the end time point of the sentence to which the answer instruction corresponds;
    the client receiving the answer instruction sent by the server side and the feedback time mark corresponding to the answer instruction, and determining the user-input audio to which the answer instruction corresponds by matching the input time marks with the feedback time mark;
    the client judging whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time,
    when the answer instruction has timed out, discarding the answer instruction, and when the answer instruction has not timed out, feeding the answer instruction back to the user to complete the man-machine dialogue.
  5. The method according to claim 4, wherein after the judging whether the combination is one sentence, the method further comprises:
    when it is not one sentence, the server side separately generating a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result, and sending the first answer instruction and the second answer instruction, together with their respective feedback time marks, to the client;
    the client separately receiving the first and second answer instructions sent by the server side and the feedback time marks corresponding to the answer instructions, and determining the user-input audio to which each answer instruction corresponds by matching the input time marks with the feedback time marks;
    the client judging whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time,
    when the answer instruction has timed out, discarding the answer instruction, and when the answer instruction has not timed out, feeding the answer instruction back to the user to complete the man-machine dialogue.
  6. A server, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the following steps:
    receiving a first audio uploaded by a user through a client, marking the start time point and the end time point of the first audio, and generating a first recognition result of the first audio by using an audio decoder;
    determining, according to the start time point and the end time point of the first audio, whether the first audio is a short sentence, and when it is a short sentence, if a second audio uploaded by the client is received within a preset heartbeat protection time range, generating a second recognition result of the second audio by using the audio decoder;
    sending at least the combination of the first recognition result and the second recognition result to a language prediction model, and judging whether the combined sentence is one sentence,
    when it is one sentence, generating an answer instruction corresponding to the combined sentence, and sending the answer instruction together with the feedback time mark of the answer instruction to the client so as to complete the man-machine dialogue through the client, wherein the feedback time mark includes the start time point and the end time point of the sentence to which the answer instruction corresponds.
  7. The server according to claim 6, wherein the at least one processor is further configured to:
    when it is not one sentence, separately generate a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result, and send the first answer instruction and the second answer instruction, together with their respective feedback time marks, to the client.
  8. A client, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the following steps:
    continuously uploading a first audio and a second audio input by a user to a server, and using the start time points and the end time points of the audios as input time marks;
    successively receiving answer instructions sent by the server and the feedback time marks corresponding to the answer instructions, and determining the user-input audio to which each answer instruction corresponds by matching the input time marks with the feedback time marks;
    judging whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time,
    when the answer instruction has timed out, discarding the answer instruction, and when the answer instruction has not timed out, feeding the answer instruction back to the user to complete the man-machine dialogue.
  9. A voice dialogue platform, the voice dialogue platform comprising a server side and a client, and comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the following steps:
    the client continuously uploads a first audio and a second audio input by a user to the server side, using the start time points and the end time points of the audios as input time marks;
    the server side receives the first audio uploaded by the user through the client, marks the start time point and the end time point of the first audio, and generates a first recognition result of the first audio by using an audio decoder;
    the server side determines, according to the start time point and the end time point of the first audio, whether the first audio is a short sentence, and when it is a short sentence, if the server side receives, within a preset heartbeat protection time range, the second audio continuously uploaded by the client, a second recognition result of the second audio is generated by using the audio decoder;
    the server side sends at least the combination of the first recognition result and the second recognition result to a language prediction model, and judges whether the combined sentence is one sentence,
    when it is one sentence, the server side generates an answer instruction corresponding to the combined sentence and sends the answer instruction together with the feedback time mark of the answer instruction to the client, wherein the feedback time mark includes the start time point and the end time point of the sentence to which the answer instruction corresponds;
    the client receives the answer instruction sent by the server side and the feedback time mark corresponding to the answer instruction, and determines the user-input audio to which the answer instruction corresponds by matching the input time marks with the feedback time mark;
    the client judges whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time,
    when the answer instruction has timed out, the answer instruction is discarded, and when the answer instruction has not timed out, the answer instruction is fed back to the user to complete the man-machine dialogue.
  10. The voice dialogue platform according to claim 9, wherein the at least one processor is further configured such that:
    when it is not one sentence, the server side separately generates a first answer instruction corresponding to the first recognition result and a second answer instruction corresponding to the second recognition result, and sends the first answer instruction and the second answer instruction, together with their respective feedback time marks, to the client;
    the client separately receives the first and second answer instructions sent by the server side and the feedback time marks corresponding to the answer instructions, and determines the user-input audio to which each answer instruction corresponds by matching the input time marks with the feedback time marks;
    the client judges whether the answer instruction has timed out according to the time offset between the input time mark of the user-input audio and the client's current time,
    when the answer instruction has timed out, the answer instruction is discarded, and when the answer instruction has not timed out, the answer instruction is fed back to the user to complete the man-machine dialogue.
PCT/CN2019/120607 2019-06-13 2019-11-25 人机对话方法及电子设备 WO2020248524A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP19932635.6A EP3985661B1 (en) 2019-06-13 2019-11-25 Method of man-machine interaction and voice dialogue platform
JP2021572940A JP7108799B2 (ja) 2019-06-13 2019-11-25 ヒューマンマシン対話方法及び電子デバイス
US17/616,969 US11551693B2 (en) 2019-06-13 2019-11-25 Method of man-machine interaction and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910510000.9 2019-06-13
CN201910510000.9A CN110223697B (zh) 2019-06-13 2019-06-13 人机对话方法及系统

Publications (1)

Publication Number Publication Date
WO2020248524A1 true WO2020248524A1 (zh) 2020-12-17

Family

ID=67816846

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/120607 WO2020248524A1 (zh) 2019-06-13 2019-11-25 人机对话方法及电子设备

Country Status (5)

Country Link
US (1) US11551693B2 (zh)
EP (1) EP3985661B1 (zh)
JP (1) JP7108799B2 (zh)
CN (1) CN110223697B (zh)
WO (1) WO2020248524A1 (zh)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223697B (zh) 2019-06-13 2022-04-22 思必驰科技股份有限公司 人机对话方法及系统
CN112786031B (zh) * 2019-11-01 2022-05-13 思必驰科技股份有限公司 人机对话方法及系统
CN112992136A (zh) * 2020-12-16 2021-06-18 呼唤(上海)云计算股份有限公司 智能婴儿监护系统及方法
CN114141236B (zh) * 2021-10-28 2023-01-06 北京百度网讯科技有限公司 语言模型更新方法、装置、电子设备及存储介质


Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20000045128A (ko) * 1998-12-30 2000-07-15 김영환 단문,음성,영상 서비스가 가능한 코드분할다중접속 방식의통신 단말기
JP2004309631A (ja) * 2003-04-03 2004-11-04 Nippon Telegr & Teleph Corp <Ntt> 対話練習支援装置、方法及びプログラム
CN105070290A (zh) * 2015-07-08 2015-11-18 苏州思必驰信息科技有限公司 人机语音交互方法及系统
KR101942521B1 (ko) 2015-10-19 2019-01-28 구글 엘엘씨 음성 엔드포인팅
CN105845129A (zh) * 2016-03-25 2016-08-10 乐视控股(北京)有限公司 一种在音频中切分句子的方法和系统及视频文件的字幕自动生成方法和系统
CN108237616B (zh) 2016-12-24 2024-01-23 广东明泰盛陶瓷有限公司 一种陶瓷注模装置
CN107066568A (zh) * 2017-04-06 2017-08-18 竹间智能科技(上海)有限公司 基于用户意图预测的人机对话方法及装置
WO2019031268A1 (ja) 2017-08-09 2019-02-14 ソニー株式会社 情報処理装置、及び情報処理方法
CN110730952B (zh) * 2017-11-03 2021-08-31 腾讯科技(深圳)有限公司 处理网络上的音频通信的方法和系统
CN107920120A (zh) * 2017-11-22 2018-04-17 北京小米移动软件有限公司 业务处理方法、装置及计算机可读存储介质
US10897432B2 (en) * 2017-12-04 2021-01-19 Microsoft Technology Licensing, Llc Chat-enabled messaging
CN108257616A (zh) * 2017-12-05 2018-07-06 苏州车萝卜汽车电子科技有限公司 人机对话的检测方法以及装置
JP7096707B2 (ja) 2018-05-29 2022-07-06 シャープ株式会社 電子機器、電子機器を制御する制御装置、制御プログラムおよび制御方法
CN109584876B (zh) * 2018-12-26 2020-07-14 珠海格力电器股份有限公司 语音数据的处理方法、装置和语音空调
CN109741753B (zh) * 2019-01-11 2020-07-28 百度在线网络技术(北京)有限公司 一种语音交互方法、装置、终端及服务器

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5774859A (en) * 1995-01-03 1998-06-30 Scientific-Atlanta, Inc. Information system having a speech interface
CN103413549A (zh) * 2013-07-31 2013-11-27 深圳创维-Rgb电子有限公司 语音交互的方法、系统以及交互终端
CN106469212A (zh) * 2016-09-05 2017-03-01 北京百度网讯科技有限公司 基于人工智能的人机交互方法和装置
CN109215642A (zh) * 2017-07-04 2019-01-15 阿里巴巴集团控股有限公司 人机会话的处理方法、装置及电子设备
CN108882111A (zh) * 2018-06-01 2018-11-23 四川斐讯信息技术有限公司 一种基于智能音箱的交互方法及系统
CN108920604A (zh) * 2018-06-27 2018-11-30 百度在线网络技术(北京)有限公司 语音交互方法及设备
CN109147779A (zh) * 2018-08-14 2019-01-04 苏州思必驰信息科技有限公司 语音数据处理方法和装置
CN109147831A (zh) * 2018-09-26 2019-01-04 深圳壹账通智能科技有限公司 一种语音连接播放方法、终端设备及计算机可读存储介质
CN110223697A (zh) * 2019-06-13 2019-09-10 苏州思必驰信息科技有限公司 人机对话方法及系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3985661A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112783324A (zh) * 2021-01-14 2021-05-11 科大讯飞股份有限公司 人机交互方法及设备、计算机存储介质
CN112783324B (zh) * 2021-01-14 2023-12-01 科大讯飞股份有限公司 人机交互方法及设备、计算机存储介质
CN112995419A (zh) * 2021-02-05 2021-06-18 支付宝(杭州)信息技术有限公司 一种语音对话的处理方法和系统
CN113705250A (zh) * 2021-10-29 2021-11-26 北京明略昭辉科技有限公司 会话内容识别方法、装置、设备及计算机可读介质
CN113705250B (zh) * 2021-10-29 2022-02-22 北京明略昭辉科技有限公司 会话内容识别方法、装置、设备及计算机可读介质

Also Published As

Publication number Publication date
US20220165269A1 (en) 2022-05-26
CN110223697B (zh) 2022-04-22
EP3985661B1 (en) 2024-02-28
US11551693B2 (en) 2023-01-10
EP3985661A1 (en) 2022-04-20
CN110223697A (zh) 2019-09-10
JP2022528582A (ja) 2022-06-14
JP7108799B2 (ja) 2022-07-28
EP3985661A4 (en) 2022-08-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19932635

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021572940

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2019932635

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2019932635

Country of ref document: EP

Effective date: 20220113