CN114999470A - Control method and device for man-machine voice conversation and electronic equipment


Info

Publication number: CN114999470A
Application number: CN202110229744.0A
Authority: CN (China)
Prior art keywords: control instruction, machine, machine end, voice, state
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 陈克寒, 李泽中, 戴苏洋, 刘小明
Current Assignee: Alibaba Innovation Co
Original Assignee: Alibaba Singapore Holdings Pte Ltd
Application filed by Alibaba Singapore Holdings Pte Ltd


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command


Abstract

The application discloses a control method for a man-machine voice conversation, which comprises the following steps: receiving a first voice stream in which a user end carries out the man-machine voice conversation, and monitoring a second voice stream in which a machine end carries out the man-machine voice conversation; obtaining a first state feature of the first voice stream in a first time slice and a second state feature of the second voice stream in the first time slice; selecting a corresponding control instruction from a set control instruction set according to the first state feature and the second state feature, the control instruction set comprising instructions for controlling the machine end to broadcast and instructions for controlling the machine end to mute; and, after the first time slice, controlling the machine end to carry out the man-machine voice conversation according to the selected control instruction. The method enables the machine end to respond timely and accurately, at any moment, to the voice stream sent by the user, reducing response delay and improving user experience.

Description

Control method and device for man-machine voice conversation and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for controlling a human-computer voice conversation, an electronic device, and a computer-readable storage medium.
Background
With the continuous development of computer technology, man-machine conversation technology, especially man-machine voice conversation technology, is widely applied in various fields and brings great convenience to people's lives.
At present, man-machine voice conversation is usually implemented on the basis of the man-machine text conversation mode. For example, when a user interacts with a smart speaker, the user must first wake up the device and then utter the voice, after which the smart speaker responds based on the user's voice; for the device to respond again, the user must wake it up once more and speak again.
In the process of implementing the present application, the inventors found that, unlike text conversation, voice conversation is often continuous and exclusive: while one party is conveying voice information, the other party can simultaneously understand that information and interrupt in time to reply. Because the current man-machine voice conversation method is implemented on the basis of the man-machine text conversation mode, it cannot provide conversation feedback timely and accurately during the conversation.
Disclosure of Invention
It is an object of the disclosed embodiments to provide a new solution for controlling human-machine voice dialogues.
In a first aspect of the present disclosure, a method for controlling a man-machine voice conversation is provided, the method including:
receiving a first voice stream in which a user end carries out the man-machine voice conversation, and monitoring a second voice stream in which a machine end carries out the man-machine voice conversation;
obtaining a first state feature of the first voice stream in a first time slice and a second state feature of the second voice stream in the first time slice;
selecting a corresponding control instruction from a set control instruction set according to the first state characteristic and the second state characteristic; the control instruction set comprises an instruction for controlling the machine end to broadcast and an instruction for controlling the machine end to mute;
and after the first time slice, controlling the machine end to carry out the man-machine voice conversation according to the control instruction.
Optionally, the instructions for controlling the machine end to broadcast include at least one of: a first control instruction for continuing the current broadcast, a second control instruction for starting a new broadcast, a third control instruction for broadcasting set sentence-holding content, a fourth control instruction for broadcasting set first-turn question-and-answer content, and a fifth control instruction for broadcasting set silence prompt content; and/or,
the instructions for controlling the machine end to mute include at least one of a sixth control instruction for stopping the current broadcast and a seventh control instruction for keeping the machine end muted.
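For illustration only, the control instruction set described above might be represented as the following enumeration. The numbering follows the first through seventh control instructions listed in this disclosure, while the Python form and the identifier names are assumptions rather than part of the patent:

```python
from enum import Enum

class ControlInstruction(Enum):
    """A sketch of the set control instruction set (identifier names are illustrative)."""
    # instructions for controlling the machine end to broadcast
    CONTINUE_CURRENT_BROADCAST = 1   # first control instruction
    START_NEW_BROADCAST = 2          # second control instruction
    BROADCAST_SENTENCE_HOLDING = 3   # third: set sentence-holding (backchannel) content
    BROADCAST_FIRST_TURN_QA = 4      # fourth: set first-turn question-and-answer content
    BROADCAST_SILENCE_PROMPT = 5     # fifth: set silence prompt content
    # instructions for controlling the machine end to mute
    STOP_CURRENT_BROADCAST = 6       # sixth control instruction
    KEEP_MUTED = 7                   # seventh control instruction
```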
Optionally, the obtaining a first state characteristic of the first voice stream in a first time slice and a second state characteristic of the second voice stream in the first time slice includes:
detecting an occurrence of a triggering event;
according to the detected trigger event, acquiring a first state characteristic of the first voice stream in a first time slice before the trigger event is detected and a second state characteristic of the second voice stream in the first time slice.
Optionally, the trigger event includes at least one of an event of starting the man-machine voice conversation, an event of a non-silent segment appearing in the first voice stream, an event of a non-silent segment appearing in the second voice stream, an event of a silent segment appearing in the second voice stream, and a set trigger time.
Optionally, the triggering event includes an event that a non-silent section occurs in the first voice stream, and the step of detecting the event that the non-silent section occurs in the first voice stream includes:
splitting the first voice stream to obtain a first silent section and a second silent section which are adjacent, wherein the first silent section is earlier than the second silent section; and
in the case that the first silent section and the second silent section are not consecutive in time, extracting the voice section between the first silent section and the second silent section as a non-silent section, and determining that an event of a non-silent section appearing in the first voice stream has occurred.
Optionally, the selecting of the control instruction corresponding to the first state feature and the second state feature from the set control instruction set includes:
determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice, to obtain a determination result; and
selecting, according to the determination result, the control instruction corresponding to the first state feature and the second state feature from the control instruction set.
Optionally, the determining, according to the first state feature and the second state feature, whether the machine end has the right to speak after the first time slice to obtain a determination result includes:
determining that the machine end has the speaking right after the first time slice if the second state feature is that the machine end remains non-silent or that the machine end changes from silent to non-silent;
the instructions for controlling the machine end to broadcast include the first control instruction for continuing the current broadcast, and the selecting, according to the determination result, of the control instruction corresponding to the first state feature and the second state feature from the control instruction set includes:
and under the condition that the machine end has the speaking right, selecting the first control instruction as the corresponding control instruction.
Optionally, the determining, according to the first state feature and the second state feature, whether the machine end has the right to speak after the first time slice to obtain a determination result includes:
determining that the machine end does not have speaking right after the first time slice if the first state characteristic indicates that a non-silent segment of the first voice stream occurs and the second state characteristic is that the machine end remains non-silent or the machine end changes from silent to non-silent;
the instruction for controlling the mute of the machine end comprises a sixth control instruction for stopping the current broadcast, and the step of selecting the control instruction corresponding to the first state characteristic and the second state characteristic in the control instruction set according to the judgment result comprises the following steps:
and in the case that the machine end does not have the speaking right, selecting the sixth control instruction as the corresponding control instruction.
Optionally, the first state feature representing that a non-silent section appears in the first voice stream includes: the user end changes from silent to non-silent and/or the user end changes from non-silent to silent.
Optionally, the determining, according to the first state feature and the second state feature, whether the machine end has the right to speak after the first time slice to obtain a determination result includes:
determining that the machine end has the speaking right after the first time slice if the first state feature and the second state feature are both the conversation start state;
the instructions for controlling the machine end to broadcast include the fourth control instruction for broadcasting the set first-turn question-and-answer content, and the selecting, according to the determination result, of the control instruction corresponding to the first state feature and the second state feature from the control instruction set includes:
and in the case that the machine end has the speaking right, selecting the fourth control instruction as the corresponding control instruction.
Optionally, the determining, according to the first state feature and the second state feature, whether the machine end has the right to speak after the first time slice to obtain a determination result includes:
determining that the machine end has speaking right after the first time slice if the first state characteristic is that the user end is changed from non-silent to silent and the second state characteristic is that the machine end remains silent;
the instructions for controlling the machine end to broadcast include the second control instruction for starting a new broadcast and/or the third control instruction for broadcasting the set sentence-holding content, and the selecting, according to the determination result, of the control instruction corresponding to the first state feature and the second state feature from the control instruction set includes:
and in the case that the machine side has the speaking right, selecting the second control instruction or the third control instruction as the corresponding control instruction.
Optionally, the determining, according to the first state feature and the second state feature, whether the machine end has the right to speak after the first time slice to obtain a determination result includes:
determining that the machine end does not have speaking right after the first time slice if the first status characteristic is that the user end is changed from non-silent to silent and the second status characteristic is that the machine end remains silent;
the instruction for controlling the mute of the machine end comprises a seventh control instruction for keeping the mute of the machine end, and the step of selecting the control instruction corresponding to the first state characteristic and the second state characteristic in the control instruction set according to the judgment result comprises the following steps:
and in the case that the machine end does not have the speaking right, selecting the seventh control instruction as the corresponding control instruction.
Optionally, the determining, according to the first state feature and the second state feature, whether the machine end has the right to speak after the first time slice to obtain a determination result includes:
determining that the machine end has speaking right after the first time slice if the first state characteristic is that the user end remains silent and the second state characteristic is that the machine end changes from non-silent to silent;
the instructions for controlling the machine end to broadcast include the fifth control instruction for broadcasting the set silence prompt content, and the selecting, according to the determination result, of the control instruction corresponding to the first state feature and the second state feature from the control instruction set includes:
and in the case that the machine side has the speaking right, selecting the fifth control instruction as the corresponding control instruction.
Optionally, the determining, according to the first state feature and the second state feature, whether the machine end has the right to speak after the first time slice to obtain a determination result includes:
determining that the machine end does not have speaking right after the first time slice if the first state characteristic is that the user end remains silent and the second state characteristic is that the machine end transitions from non-silent to silent;
the instruction for controlling the mute of the machine end comprises a seventh control instruction for keeping the mute of the machine end, and the step of selecting the control instruction corresponding to the first state characteristic and the second state characteristic in the control instruction set according to the judgment result comprises the following steps:
and in the case that the machine end does not have the speaking right, selecting the seventh control instruction as the corresponding control instruction.
Optionally, the controlling the machine end to perform the man-machine voice conversation according to the control instruction includes:
and sending the control instruction to the machine end so that the machine end carries out the man-machine voice conversation according to the control instruction.
Optionally, the performing, by the machine end, the man-machine voice conversation according to the corresponding control instruction includes:
the machine end obtains response information corresponding to the corresponding control instruction according to prestored mapping data, wherein the mapping data reflects the corresponding relation between each control instruction in the control instruction set and each set response information;
and carrying out man-machine voice conversation according to the obtained response information.
In a second aspect of the present disclosure, there is also provided a control device for a man-machine voice conversation, including:
the voice stream receiving module is used for receiving a first voice stream in which a user end carries out the man-machine voice conversation;
the voice stream monitoring module is used for monitoring a second voice stream in which a machine end carries out the man-machine voice conversation;
the state obtaining module is used for obtaining a first state feature of the first voice stream in a first time slice and a second state feature of the second voice stream in the first time slice;
the decision module is used for selecting, from a set control instruction set, a control instruction corresponding to the first state feature and the second state feature, wherein the control instruction set comprises instructions for controlling the machine end to broadcast and instructions for controlling the machine end to mute; and
and the execution module is used for controlling the machine end to carry out the man-machine voice conversation according to the control instruction after the first time slice.
According to a third aspect of the present disclosure, there is also provided an electronic device comprising the apparatus according to the second aspect of the present disclosure; alternatively, it comprises:
a memory for storing executable instructions;
a processor for executing the executable instructions so as to control the electronic device to perform the method according to the first aspect of the present disclosure.
According to a fourth aspect of the present disclosure, there is also provided a computer-readable storage medium storing a computer program readable and executable by a computer, the computer program being adapted to perform the method according to the first aspect of the present disclosure when read and executed by the computer.
According to the embodiments of the present disclosure, during a man-machine voice conversation the electronic device receives a first voice stream in which the user end carries out the conversation and monitors a second voice stream in which the machine end carries out the conversation. It therefore does not have to wait for the user to finish a full turn of voice before controlling the machine end to respond. Instead, while receiving the first voice stream, it obtains a first state feature of the first voice stream in a first time slice and a second state feature of the second voice stream in the same time slice, selects according to these features a control instruction for controlling the machine end to carry out the conversation after the first time slice, and, according to that control instruction, controls the machine end to respond timely and accurately to the voice output by the user end after the first time slice. Because the conversation is conducted in a voice-duplex manner, the electronic device can control the machine end to respond to the voice stream sent by the user at any moment, reducing response delay and improving user experience.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic diagram of data processing in an existing human-computer voice conversation process provided by an embodiment of the present disclosure.
Fig. 2 is a scene schematic diagram of a control method of a man-machine voice conversation provided by an embodiment of the present disclosure.
Fig. 3 is a hardware configuration structural diagram of a human-machine voice conversation control system that can be used to implement the control method of human-machine voice conversation of the embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating a control method of a man-machine voice conversation according to an embodiment of the present disclosure.
Fig. 5 is a schematic diagram of acquiring a non-silent segment according to an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of an architecture for controlling a human-machine voice conversation according to an embodiment of the present disclosure.
Fig. 7 is a schematic block diagram of a control device for human-machine voice dialog according to an embodiment of the present disclosure.
Fig. 8a is a schematic functional block diagram of an electronic device according to one embodiment of the present disclosure.
Fig. 8b is a schematic functional block diagram of an electronic device according to another embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In the process of implementing the present application, the inventors found that current control of man-machine voice conversations is generally implemented by adding automatic speech recognition (ASR) and text-to-speech (TTS) synthesis technologies on top of the turn-based conversation (TBC) interaction mode of existing text conversation. As shown in fig. 1, in the existing method the electronic device typically recognizes the text of a full turn of voice uttered by the user through built-in automatic speech recognition; obtains semantic information of the text through natural language understanding (NLU); obtains response semantic information through a conversation control module; obtains a response text corresponding to the response semantic information through natural language generation (NLG); and then synthesizes the response voice for the response text through text-to-speech synthesis and controls its output.
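The turn-based pipeline of fig. 1 can be summarized by the sketch below. The stage callables are hypothetical placeholders for the ASR, NLU, conversation-control, NLG, and TTS components rather than any concrete library API:

```python
def turn_based_reply(user_turn_audio: bytes,
                     asr, nlu, conversation_control, nlg, tts) -> bytes:
    """Sketch of the existing turn-based (TBC) pipeline: the machine cannot
    respond until the user's whole turn of voice has been received."""
    text = asr(user_turn_audio)                        # ASR: audio -> text
    semantics = nlu(text)                              # NLU: text -> semantic information
    reply_semantics = conversation_control(semantics)  # conversation control module
    reply_text = nlg(reply_semantics)                  # NLG: semantics -> response text
    return tts(reply_text)                             # TTS: response text -> response audio
```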
As described in the background art, the existing method for controlling a man-machine voice conversation does not handle the continuous and exclusive characteristics of voice conversation, so the machine end may fail to give conversation feedback timely and accurately during the man-machine conversation, resulting in high response delay and poor user experience.
In view of the above problems, an embodiment of the present disclosure provides a control method for a man-machine voice conversation that implements voice duplex; that is, during the man-machine voice conversation, the two (or more) parties to the conversation are not blocked by the conversation flow and can speak at any time. Fig. 2 is a scene schematic diagram of the control method provided by this embodiment. In practice, the method may be applied to an enterprise voice conversation robot, for example a hotline intelligent customer service: an electronic device that provides pre-sale, after-sale, and similar product services to users by voice, such as the server 1100 shown in fig. 2. In a specific implementation, the user establishes a connection with the server 1100 through the terminal device 1200 and sends a first voice stream to the server 1100, for example a voice stream consulting about a product-usage problem. The server 1100 receives the first voice stream and monitors the second voice stream with which it carries out the man-machine voice conversation; it then obtains a first state feature of the first voice stream in a first time slice and a second state feature of the second voice stream in the first time slice, and selects according to these features a control instruction for controlling the man-machine voice conversation after the first time slice. After the first time slice, the server obtains response information according to the control instruction and continues outputting the second voice stream according to that response information, so as to respond timely and accurately to the voice content in the first voice stream sent by the user.
For example, suppose the user is describing a long problem while the server 1100 continuously remains silent. If, in a certain time slice, the server 1100 detects that a silent section appears in the first voice stream, that is, the first state feature of the first voice stream indicates that the user end turned from non-silent to silent while the second state feature of the second voice stream indicates that the machine end continuously remained silent, the control instruction for the period after that time slice can be selected as the instruction for broadcasting the set sentence-holding content. According to this instruction, after the time slice the server 1100 utters a sentence-holding word, for example "yes", "right", or "please continue", to indicate that the current conversation connection is normal. Alternatively, if within that time slice the server 1100 can already understand the user's problem from the voice the user sent before the time slice, it may determine that the control instruction after the time slice is the instruction for starting a new broadcast; according to this instruction it interrupts the user's current speech after the time slice and broadcasts the response to the user's problem, so that the problem is answered in time, response delay is reduced, and user experience is improved.
It should be noted that the above is only one scenario in which the method may be implemented. The method may also be applied in other scenarios, for example in the Internet of Things (IoT) field, where it can be used in an intelligent voice interaction device so that the device interacts with the user timely and accurately. For a smart speaker, for instance, unlike the turn-based conversation interaction mode, the speaker may, after receiving the user's wake-up word, receive the first voice stream sent by the user while monitoring the second voice stream it sends itself, and in this process select a corresponding control instruction according to the first state feature of the first voice stream in a certain time slice and the second state feature of the second voice stream in that time slice, then interact with the user according to that instruction. For example, when the user asks the smart speaker "what is the weather today", the speaker may interrupt the user as soon as the words "the weather today" have been uttered and play the current weather information, reducing response delay and improving user experience. Of course, if the user is consulting not today's weather but "whether the weather today is suitable for going out to travel", the user can continue uttering "suitable for going out to travel" while the smart speaker is playing the current weather information triggered by "the weather today"; the speaker can then, according to this real-time voice, interrupt the weather information it is playing and output a response voice similar to "today's weather is suitable for visiting indoor scenic spots; based on distance, the xxx museum is recommended".
Fig. 3 is a hardware configuration structural diagram of a control system of a man-machine voice conversation that can be used to implement the control method of a man-machine voice conversation according to the embodiment of the present disclosure.
As shown in fig. 3, the control system 1000 of a man-machine voice conversation of the present embodiment includes a server 1100, a terminal apparatus 1200, and a communication network 1300.
The server 1100 may be, for example, a blade server, a rack server, or the like, and the server 1100 may also be a server cluster deployed in a cloud, which is not limited herein.
As shown in FIG. 3, server 1100 may include a processor 1110, a memory 1120, an interface device 1130, a communication device 1140, a display device 1150, and an input device 1160. The processor 1110 may be, for example, a central processing unit CPU or the like. The memory 1120 includes, for example, a ROM (read only memory), a RAM (random access memory), a nonvolatile memory such as a hard disk, and the like. The interface device 1130 includes, for example, a USB interface, a serial interface, and the like. The communication device 1140 is capable of wired or wireless communication, for example. The display device 1150 is, for example, a liquid crystal display panel. Input devices 1160 may include, for example, a touch screen, a keyboard, and the like.
In this embodiment, the server 1100 may be used to participate in implementing a method according to any embodiment of the present disclosure.
In any embodiment of the present disclosure, the memory 1120 of the server 1100 is configured to store instructions for controlling the processor 1110 to operate so as to support implementing a method according to any embodiment of the present disclosure. The skilled person can design the instructions according to the disclosed solution of the present disclosure. How the instructions control the operation of the processor is well known in the art and will not be described in detail herein.
Those skilled in the art will appreciate that although a number of devices are shown for the server 1100 in fig. 3, the server 1100 of embodiments of the present disclosure may refer to only some of the devices therein, for example, only the processor 1110 and the memory 1120.
As shown in fig. 3, the terminal apparatus 1200 may include a processor 1210, a memory 1220, an interface device 1230, a communication device 1240, a display device 1250, an input device 1260, an audio output device 1270, an audio input device 1280, and the like. The processor 1210 may be a central processing unit (CPU), a microprocessor (MCU), or the like. The memory 1220 includes, for example, a ROM (read-only memory), a RAM (random access memory), and a nonvolatile memory such as a hard disk. The interface device 1230 includes, for example, a USB interface and a headphone interface. The communication device 1240 can, for example, perform wired or wireless communication. The display device 1250 is, for example, a liquid crystal display or a touch display. The input device 1260 may include, for example, a touch screen and a keyboard. The terminal apparatus 1200 may output audio information through the audio output device 1270, which includes, for example, a speaker, and may pick up voice information input by the user through the audio input device 1280, which includes, for example, a microphone.
The terminal device 1200 may be a smartphone, laptop, desktop computer, tablet, wearable device, or the like.
It should be understood by those skilled in the art that although a plurality of means of the terminal device 1200 are shown in fig. 3, the terminal device 1200 of the embodiment of the present disclosure may refer to only some of the means therein, for example, only the processor 1210, the memory 1220 and the like.
The communication network 1300 may be a wireless network, a wired network, a local area network, or a wide area network. The terminal apparatus 1200 can communicate with the server 1100 through the communication network 1300.
The control system 1000 of a human-machine voice conversation shown in fig. 3 is merely illustrative and is in no way intended to limit the disclosure, its application, or uses. For example, although fig. 3 shows only one server 1100 and one terminal device 1200, it is not meant to limit the respective numbers, and multiple servers 1100 and/or multiple terminal devices 1200 may be included in the system 1000.
It should be noted that the method provided in any embodiment of the present disclosure may be used in the server 1100, and of course, when the method is implemented specifically, the method may also be applied to the terminal device 1200 according to needs, and is not limited specifically here.
Fig. 4 is a flowchart illustrating a control method for a man-machine voice conversation according to an embodiment of the present disclosure. The method provided by this embodiment may be applied to an electronic device, for example the server 1100 shown in fig. 3. Unless otherwise specified, this embodiment is described taking as an example the scenario in which the man-machine voice conversation is an interaction between a user and a server through a terminal device: the user sends a first voice stream to the server through the terminal device, and the server generates response information in real time according to the first voice stream and sends a second voice stream to the terminal device according to the response information.
As shown in fig. 4, the method for controlling a man-machine voice dialog of the present embodiment may include the following steps S4100-S4400, which are described in detail below.
Step S4100: receiving a first voice stream in which the user end carries out the man-machine voice conversation, and monitoring a second voice stream in which the machine end carries out the man-machine voice conversation.
In this embodiment, the first voice stream is the data stream formed by the voice uttered by the user while the user end carries out the man-machine voice conversation; it may be generated by an audio pickup device of the user's terminal equipment, for example a microphone, collecting the voice uttered by the user.
The second voice stream is the data stream formed by the voice sent by the machine end during the man-machine voice conversation, that is, the data stream formed by the machine end responding to the voice content in the first voice stream. In a specific implementation, the server may recognize the speech in the first voice stream to obtain corresponding response semantics, obtain a corresponding response text through natural language generation, obtain a corresponding response voice through text-to-speech synthesis, and output that response voice to form the second voice stream.
In step S4200, a first state feature of the first voice stream in a first time slice and a second state feature of the second voice stream in the first time slice are obtained.
In practice, when a man-machine voice conversation is started or under way, short silences often occur between the words, sentences, or passages the user utters. For example, when consulting a hotline intelligent robot about a product problem, after uttering the voice corresponding to "Hello, I want …", the user will generally pause for a period of silence, for example about 300 ms, before continuing to speak.
Therefore, in order to enable the machine end to respond timely and accurately to the real-time voice content sent by the user end, the electronic device may trigger the machine end's response by detecting specific trigger events. In this embodiment, a trigger event may be, for example, at least one of: an event of starting the man-machine voice conversation, an event of a non-silent section appearing in the first voice stream, an event of a silent section appearing in the first voice stream, an event of a non-silent section appearing in the second voice stream, an event of a silent section appearing in the second voice stream, and a set trigger time.
A silent section may be a voice section in which silence lasts for a preset duration, where the preset duration may be, for example, 200 ms to 400 ms; of course, the preset duration may also be set as needed and is not specifically limited here. Correspondingly, a non-silent section may be a voice section containing speech content.
Specifically, the obtaining a first state characteristic of the first voice stream at a first time slice and a second state characteristic of the second voice stream at the first time slice includes: detecting an occurrence of a triggering event; according to the detected trigger event, acquiring a first state characteristic of the first voice stream in a first time slice before the trigger event is detected and a second state characteristic of the second voice stream in the first time slice.
In this embodiment, a time slice may be a slice whose start and stop times are the occurrence times of two adjacent trigger events. For example, suppose that in a voice stream event 1 is a non-silent section, event 2 is a silent section, and event 1 and event 2 are adjacent in time. When event 2 is detected, a time slice may be delimited with the start time of event 1 and the start time of event 2 as its start and stop times respectively, and this time slice can be regarded as the time slice before event 2. Of course, in a specific implementation, time slices may be divided as needed and are not specifically limited here.
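As a minimal illustration of this slicing rule (the class and field names below are assumptions for illustration, not terms defined by the disclosure):

```python
from dataclasses import dataclass

@dataclass
class TriggerEvent:
    kind: str          # e.g. "dialog_start", "non_silent_section", "silent_section"
    start_time: float  # occurrence time of the event, in seconds

def time_slice_before(prev_event: TriggerEvent, new_event: TriggerEvent) -> tuple[float, float]:
    """Delimit the time slice before `new_event`: its start and stop times are
    the occurrence times of the two adjacent trigger events."""
    return (prev_event.start_time, new_event.start_time)
```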
The first state feature may be a feature representing the state change of the user end's first voice stream within a time slice, for example: the first voice stream changed from silent to non-silent, changed from non-silent to silent, continuously remained silent, or continuously remained non-silent. Correspondingly, the second state feature may be a feature representing the state change of the second voice stream in the corresponding time slice.
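These four per-slice state changes might be modelled as the following enumeration (an illustrative sketch; the names are assumptions):

```python
from enum import Enum

class StateFeature(Enum):
    """State change of one voice stream within a time slice (illustrative names)."""
    SILENT_TO_NON_SILENT = "silent -> non-silent"
    NON_SILENT_TO_SILENT = "non-silent -> silent"
    STAYS_SILENT = "stays silent"
    STAYS_NON_SILENT = "stays non-silent"
```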
In one embodiment, when the trigger events include an event that a non-silent section appears in the first voice stream, the step of detecting that event includes: splitting the first voice stream to obtain a first silent section and a second silent section which are adjacent, the first silent section being earlier than the second silent section; and, in the case that the first silent section and the second silent section are not consecutive in time, extracting the voice section between them as a non-silent section and determining that an event of a non-silent section appearing in the first voice stream has occurred.
That is, in a specific implementation, an event of a non-silent section appearing in a voice stream can be determined by identifying the silent sections in the voice stream sent by the user end, using each silent section as a voice boundary in the voice stream, and taking the voice section between two adjacent but non-consecutive silent sections as a non-silent section. The silent sections in the voice stream can be detected by voice activity detection (VAD) technology, whose detailed processing procedure is not repeated here.
Fig. 5 is a schematic diagram of acquiring non-silent sections according to an embodiment of the present disclosure. As shown in fig. 5, by recognizing each silent section in a voice stream as a voice boundary, the voice section between two adjacent, non-consecutive voice boundaries can be taken as a non-silent section; such a non-silent section is treated as the minimum processing unit in the man-machine voice conversation, i.e., a micro-turn.
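A sketch of this micro-turn extraction is given below. It assumes frame-level VAD decisions are already available (the `is_speech` input); the segmentation follows the boundary rule described above, and all names are illustrative:

```python
def extract_non_silent_sections(is_speech: list[bool],
                                frame_ms: int = 20,
                                min_silence_ms: int = 300) -> list[tuple[int, int]]:
    """Return (start_ms, end_ms) micro-turns: runs of speech frames bounded by
    silences lasting at least `min_silence_ms` (the preset duration)."""
    min_silence_frames = min_silence_ms // frame_ms
    sections: list[tuple[int, int]] = []
    run_start, silence_run = None, 0
    for i, speech in enumerate(is_speech):
        if speech:
            if run_start is None:
                run_start = i                   # a non-silent section begins
            silence_run = 0
        else:
            silence_run += 1
            # a long-enough silence acts as a voice boundary and closes the section
            if run_start is not None and silence_run >= min_silence_frames:
                end = i - silence_run + 1       # first frame of the closing silence
                sections.append((run_start * frame_ms, end * frame_ms))
                run_start = None
    if run_start is not None:                   # stream ended while speech was ongoing
        sections.append((run_start * frame_ms, len(is_speech) * frame_ms))
    return sections
```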
In summary, after a trigger event is detected, the first state feature of the first voice stream in the first time slice before the trigger event and the second state feature of the second voice stream in that time slice are obtained. The control instruction for controlling the machine end's response after the time slice is then determined according to the real-time conversation state represented by the first state feature and the second state feature, as described in detail below.
Step S4300, selecting a corresponding control instruction from a set control instruction set according to the first state characteristic and the second state characteristic; the control instruction set comprises an instruction for controlling the machine end to broadcast and an instruction for controlling the machine end to mute.
In the process of the man-machine voice conversation, when a trigger event is detected, the machine end can make different responses to different state changes of the voice streams. For example, when the user end has been uttering voice for a long time, the machine end may, to signal that the conversation connection is normal, promptly broadcast sentence-holding words such as "uh-huh" or "right", or it may keep silent; or, depending on the situation, it may interrupt the user end's current speech and directly broadcast the response voice. Conversely, while the machine end is broadcasting voice, the user end may barge in, that is, interrupt the machine end's current broadcast; in this case the machine end may choose to continue the current broadcast, to keep silent depending on the situation, or to re-broadcast a new response voice according to the voice the user uttered when interrupting, and so on.
Therefore, in this embodiment, to enable the electronic device to issue accurate control instructions according to the state changes of the voice streams respectively sent by the user end and the machine end in the man-machine voice conversation, the control instructions in the set control instruction set may be at least one of those in table 1 below:
table 1:
Control instruction | Corresponding machine end behavior
first control instruction | continue the current broadcast
second control instruction | start a new broadcast
third control instruction | broadcast the set sentence-holding content
fourth control instruction | broadcast the set first-turn question-and-answer content
fifth control instruction | broadcast the set silence prompt content
sixth control instruction | stop the current broadcast
seventh control instruction | keep the machine end muted
In this embodiment, the instructions for controlling the machine end to broadcast may include at least one of: a first control instruction for continuing the current broadcast, a second control instruction for starting a new broadcast, a third control instruction for broadcasting set sentence-holding content, a fourth control instruction for broadcasting set first-turn question-and-answer content, and a fifth control instruction for broadcasting set silence prompt content; and/or the instructions for controlling the machine end to mute may include at least one of a sixth control instruction for stopping the current broadcast and a seventh control instruction for keeping the machine end muted. Of course, in a specific implementation, other control instructions may be set as needed to control the machine end in the man-machine voice conversation, which is not repeated here.
In one embodiment, the selecting the control instruction corresponding to the first status feature and the second status feature in the set of control instructions comprises: judging whether the machine end has speaking right after the first time slicing according to the first state characteristic and the second state characteristic to obtain a judgment result; and selecting a control instruction corresponding to the first state characteristic and the second state characteristic in the control instruction set according to the judgment result.
In implementation, according to the voice changes of the first voice stream and the second voice stream in the first time slice, represented by the first state feature and the second state feature respectively, it may be determined whether the machine end has the speaking right after the first time slice, that is, whether it needs to utter a response voice. If it has the speaking right, the machine end may keep the current broadcast, or interrupt the current broadcast and make a new broadcast in response to the user's voice, and so on; if it does not have the speaking right, the machine end may be controlled, according to the state represented by the second state feature, to interrupt the current broadcast and keep silent, or to continue to keep silent, and so on.
That is, in this embodiment, the selecting of a control instruction corresponding to the first state feature and the second state feature from the set control instruction set includes: determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice, to obtain a determination result; and selecting, according to the determination result, the control instruction corresponding to the first state feature and the second state feature from the control instruction set. The different situations are described in detail below.
For convenience of description, table 2 below summarizes how, according to the first state feature and the second state feature, it is determined whether the machine end has the speaking right after the first time slice and which control instruction is selected. The embodiments are described below with reference to table 2, in which the first state feature is denoted S_1 and the second state feature is denoted S_2.
Table 2:
S_1 (first voice stream) | S_2 (second voice stream) | Speaking right after the first time slice | Selected control instruction
any | machine end remains non-silent, or silent to non-silent | yes | first (continue current broadcast)
non-silent section appears | machine end remains non-silent, or silent to non-silent | no | sixth (stop current broadcast)
conversation start state | conversation start state | yes | fourth (first-turn question-and-answer content)
user end non-silent to silent | machine end remains silent | yes | second or third (new broadcast / sentence-holding content)
user end non-silent to silent | machine end remains silent | no | seventh (keep muted)
user end remains silent | machine end non-silent to silent | yes | fifth (silence prompt content)
user end remains silent | machine end non-silent to silent | no | seventh (keep muted)
As shown in table 2, in an embodiment, the determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice to obtain a determination result includes: determining that the machine end has the speaking right after the first time slice if the second state feature is that the machine end remains non-silent or that the machine end changes from silent to non-silent. The instructions for controlling the machine end to broadcast include the first control instruction for continuing the current broadcast, and the selecting, according to the determination result, of the control instruction corresponding to the first state feature and the second state feature from the control instruction set includes: selecting the first control instruction as the corresponding control instruction in the case that the machine end has the speaking right.
That is, when the second state feature of the second voice stream uttered by the machine end in the first time slice indicates that the machine end was continuously uttering the response voice up to the current moment, or had just started responding to the user's voice, it may be determined that at the next moment the machine end has the speaking right; that is, the machine end may be controlled to continue uttering the current response voice.
Continuing with table 2, in an embodiment, the determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice to obtain a determination result includes: determining that the machine end does not have the speaking right after the first time slice if the first state feature indicates that a non-silent section appears in the first voice stream and the second state feature is that the machine end remains non-silent or changes from silent to non-silent. The instructions for controlling the machine end to mute include the sixth control instruction for stopping the current broadcast, and the selecting, according to the determination result, of the control instruction corresponding to the first state feature and the second state feature from the control instruction set includes: selecting the sixth control instruction as the corresponding control instruction in the case that the machine end does not have the speaking right. Here, the first state feature indicating that a non-silent section appears in the first voice stream includes: the user end changes from silent to non-silent and/or the user end changes from non-silent to silent.
That is, when both the machine end and the user end are speaking, the machine end's current speech content may be wrong and the user end may be describing the problem again. When such a situation exists, it can be determined that the machine end does not have the speaking right after the time slice; by issuing the sixth control instruction to stop the current broadcast, the machine end is controlled to stop speaking after the time slice and keep silent, so as to re-understand what the user is saying.
Continuing with table 2, in an embodiment, the determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice to obtain a determination result includes: determining that the machine end has the speaking right after the first time slice if the first state feature and the second state feature are both the conversation start state. The instructions for controlling the machine end to broadcast include the fourth control instruction for broadcasting the set first-turn question-and-answer content, and the selecting, according to the determination result, of the control instruction corresponding to the first state feature and the second state feature from the control instruction set includes: selecting the fourth control instruction as the corresponding control instruction in the case that the machine end has the speaking right.
That is, at the start of the man-machine voice conversation, to improve user experience, the machine end may be controlled to speak first so as to determine what question the user wants to consult about; for example, it may first ask, according to the user's latest order, whether the user is consulting about product usage or about returning goods, etc.
Continuing with table 2, in an embodiment, the determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice to obtain a determination result includes: determining that the machine end has the speaking right after the first time slice if the first state feature is that the user end changes from non-silent to silent and the second state feature is that the machine end remains silent. The instructions for controlling the machine end to broadcast include the second control instruction for starting a new broadcast and/or the third control instruction for broadcasting the set sentence-holding content, and the selecting, according to the determination result, of the control instruction corresponding to the first state feature and the second state feature from the control instruction set includes: selecting the second control instruction or the third control instruction as the corresponding control instruction in the case that the machine end has the speaking right.
That is, when the first state feature indicates that the user end has stopped speaking, it may be that the user has finished describing the problem and is waiting for the machine end's response; or the problem may be long and the user is taking a short pause. To signal that the connection of the current man-machine voice conversation is normal, it can be determined that the machine end has the speaking right after the time slice, so that the machine end can either directly start responding to what the user said or utter a sentence-holding word to indicate that it is still listening.
Continuing with table 2, in an embodiment, the determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice to obtain a determination result includes: determining that the machine end does not have the speaking right after the first time slice if the first state feature is that the user end changes from non-silent to silent and the second state feature is that the machine end remains silent. The instructions for controlling the machine end to mute include the seventh control instruction for keeping the machine end muted, and the selecting, according to the determination result, of the control instruction corresponding to the first state feature and the second state feature from the control instruction set includes: selecting the seventh control instruction as the corresponding control instruction in the case that the machine end does not have the speaking right.
That is, when the user end stops speaking and turns silent, the user may not yet have finished describing the problem. The electronic device may therefore determine, according to the user end's context, whether the user's problem has been fully described; if it has not, the machine end needs to stay muted so that the user can continue describing the problem.
Continuing with table 2, in an embodiment, the determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice to obtain a determination result includes: determining that the machine end has the speaking right after the first time slice if the first state feature is that the user end remains silent and the second state feature is that the machine end changes from non-silent to silent. The instructions for controlling the machine end to broadcast include the fifth control instruction for broadcasting the set silence prompt content, and the selecting, according to the determination result, of the control instruction corresponding to the first state feature and the second state feature from the control instruction set includes: selecting the fifth control instruction as the corresponding control instruction in the case that the machine end has the speaking right.
That is, when the user end keeps silent and the machine end, having finished playing the response voice to the user's speech, also turns silent, it can be inferred that the machine end is preparing to release the speaking right and is waiting for the user end to take it over again to consult a new question. At this time the electronic device may determine that the machine end has the speaking right and issue to it the control instruction for broadcasting the set silence prompt content, where the set silence prompt content may be, for example, "you may ask another question", without special limitation here.
Continuing to refer to table 2, in an embodiment, the determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice to obtain a determination result includes: determining that the machine end does not have the speaking right after the first time slice in the case that the first state feature is that the user end remains silent and the second state feature is that the machine end changes from non-silent to silent. The instructions for controlling the machine end to mute include a seventh control instruction for keeping the machine end muted, and the selecting, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature includes: in the case that the machine end does not have the speaking right, selecting the seventh control instruction as the corresponding control instruction.
That is, when the user end keeps silent and the machine end, having played the response voice for the user's speech, also switches to the silent state, it can alternatively be determined that the machine end has released the speaking right and is waiting for the user end to retake it to ask a new question. At this time, the electronic device can determine that the machine end no longer has the speaking right and issue to the machine end a control instruction for keeping silent.
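Taken together, the embodiments above (table 2) amount to a lookup from the pair of state features to a control instruction. A minimal sketch of that lookup follows; the enum members, instruction identifiers, and the particular pairs listed are assumptions reconstructed from the text, and an implementation may instead realize the same decision with the pre-trained model described next.

from enum import Enum, auto

class UserState(Enum):           # first state feature (user end)
    DIALOG_START = auto()
    NON_SILENT_SEGMENT = auto()  # a non-silent section appeared
    TO_SILENT = auto()           # changed from non-silent to silent
    KEEPS_SILENT = auto()

class MachineState(Enum):        # second state feature (machine end)
    DIALOG_START = auto()
    NON_SILENT = auto()          # remains non-silent or became non-silent
    KEEPS_SILENT = auto()
    TO_SILENT = auto()           # changed from non-silent to silent

# Instruction ids named after the disclosure; the pairs listed are
# assumptions reconstructed from the embodiments above.
RULES = {
    (UserState.DIALOG_START, MachineState.DIALOG_START): "C4_FIRST_ROUND_QA",
    (UserState.NON_SILENT_SEGMENT, MachineState.NON_SILENT): "C6_STOP_BROADCAST",
    (UserState.TO_SILENT, MachineState.KEEPS_SILENT): "C2_NEW_BROADCAST",   # or C3/C7
    (UserState.KEEPS_SILENT, MachineState.TO_SILENT): "C5_SILENCE_PROMPT",  # or C7
}

def select_instruction(user: UserState, machine: MachineState) -> str:
    # Default to keeping the machine end muted when no rule grants it
    # the speaking right.
    return RULES.get((user, machine), "C7_KEEP_SILENT")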
The above describes in detail how the electronic device selects the corresponding control instruction according to the first state feature and the second state feature. After the control instruction is determined, the machine end can be controlled, after the first time slice, to continue the man-machine voice conversation according to the control instruction.
It should be noted that, in a specific implementation, after the first state feature and the second state feature are obtained, the corresponding control instruction may be decided by inputting the first state feature and the second state feature into an instruction decision model obtained by pre-training. When the decision model is trained, at least one of the total duration of the machine end's real-time response speech (currentTtsDuration), the playing start time of the real-time response speech (currentTtsPlayStartTime), the total real-time duration of the first voice stream (currentSayTime), the total silence duration of the user end (silenceTime), the total dialog duration (session), the dialog history context information (microTurnContext), and the user's Query text (Query) may additionally be obtained as feature information to improve the accuracy of the decision model; the training process of the model is not described here again.
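As a sketch of how these features might be assembled for such a decision model, the snippet below groups them into a dataclass; the field names follow the identifiers named above (with casing normalized where the text is garbled), their exact semantics are assumptions, and the model.predict interface is hypothetical.

from dataclasses import dataclass, asdict

@dataclass
class DecisionFeatures:
    # Field names follow the identifiers named in the text; their exact
    # semantics and units are assumptions made for this sketch.
    currentTtsDuration: float       # total duration of the machine end's response speech
    currentTtsPlayStartTime: float  # start time of the currently playing response
    currentSayTime: float           # total real-time duration of the first voice stream
    silenceTime: float              # total silence duration of the user end
    sessionTime: float              # total dialog duration ("session" in the text)
    microTurnContext: str           # dialog history context information
    query: str                      # user's Query text

def decide(model, first_state: str, second_state: str, feats: DecisionFeatures) -> str:
    """Combine the two state features with the auxiliary features and ask a
    (hypothetical) pre-trained decision model for a control instruction id."""
    features = {**asdict(feats), "first_state": first_state, "second_state": second_state}
    return model.predict(features)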
After step S4300, step S4400 is executed: after the first time slice, the machine end is controlled to perform the man-machine voice conversation according to the control instruction.
In a specific implementation, the controlling the machine end to perform the man-machine voice conversation according to the control instruction includes: sending the control instruction to the machine end, so that the machine end performs the man-machine voice conversation according to the control instruction.
The machine end performing the man-machine voice conversation according to the corresponding control instruction includes: the machine end obtaining response information corresponding to the control instruction according to prestored mapping data, where the mapping data reflects the correspondence between each control instruction in the control instruction set and each item of set response information; and performing the man-machine voice conversation according to the obtained response information.
Please refer to Table 3, which is a schematic table of the correspondence between the control instructions and different response information:
Table 3: (the table is reproduced as an image in the original publication)
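Since Table 3 survives only as an image, the sketch below shows one plausible shape for the prestored mapping data; the instruction identifiers and response entries are invented placeholders rather than the table's actual contents.

# Hypothetical mapping data on the machine end: control instruction id ->
# response information. The actual contents of Table 3 are not reproduced
# here; every entry is an illustrative example.
RESPONSE_MAPPING = {
    "C1_CONTINUE": {"action": "continue_tts"},                  # keep current broadcast
    "C2_NEW_BROADCAST": {"action": "tts", "text_source": "qa_answer"},
    "C3_HOLDING_PHRASE": {"action": "tts", "text": "OK, let me see..."},
    "C4_FIRST_ROUND_QA": {"action": "tts", "text": "Hi, how can I help you?"},
    "C5_SILENCE_PROMPT": {"action": "tts", "text": "Please ask another question."},
    "C6_STOP_BROADCAST": {"action": "stop_tts"},
    "C7_KEEP_SILENT": {"action": "noop"},
}

def handle_instruction(instruction_id: str) -> dict:
    """Look up the response information for a received control instruction."""
    return RESPONSE_MAPPING.get(instruction_id, {"action": "noop"})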
Please refer to fig. 6, which is a schematic diagram of an architecture for a man-machine voice dialog according to an embodiment of the present disclosure. As shown in fig. 6, for the first voice stream sent by the user end and the second voice stream obtained by monitoring the machine end, whether a specific trigger event occurs is detected through trigger event detection processing. If a trigger event is detected, the first state feature of the first voice stream and the second state feature of the second voice stream in the first time slice before the trigger event are obtained respectively, and a corresponding control instruction is then obtained through the instruction decision model. According to the instruction type of the control instruction, the machine end is either controlled to stop the current broadcast, or a response text is obtained through the question-answer processing module and subjected to text-to-speech synthesis, so that the machine end is controlled to continue sending response voice for the real-time speech in the first voice stream.
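The flow of fig. 6 can be read as a loop: detect a trigger, extract the two state features for the preceding time slice, decide, then act. A minimal sketch of that loop follows; every callable is a placeholder for a component described above rather than a disclosed API.

from typing import Callable, Optional, Tuple

def dialog_control_loop(
    detect_trigger: Callable[[], Optional[Tuple[str, str]]],
    decide: Callable[[str, str], str],
    question_answering: Callable[[], str],
    play_tts: Callable[[str], None],
    stop_tts: Callable[[], None],
) -> None:
    """Duplex control loop sketched from fig. 6; every callable stands in
    for a component described in the disclosure."""
    while True:
        # Trigger detection plus state-feature extraction for the time
        # slice before the trigger (bundled into one callable for brevity).
        states = detect_trigger()
        if states is None:
            continue
        first_state, second_state = states
        instruction = decide(first_state, second_state)
        if instruction == "C6_STOP_BROADCAST":
            stop_tts()  # cut off the current broadcast
        elif instruction != "C7_KEEP_SILENT":
            # Obtain a response text via question answering, then
            # synthesize and play it.
            play_tts(question_answering())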
In summary, with the control method for man-machine voice conversation provided by this embodiment, during the man-machine voice conversation the electronic device can receive the first voice stream of the user end and monitor the second voice stream of the machine end, so it does not need to wait for the user to finish a whole round of speech before controlling the machine end to respond. Instead, while receiving the first voice stream, it obtains the first state feature of the first voice stream in a first time slice and the second state feature of the second voice stream in the same time slice, selects, according to the two state features, a control instruction for controlling the machine end to conduct the conversation after the first time slice, and controls the machine end, according to that instruction, to respond timely and accurately to the output voice of the user end after the first time slice. The method enables the conversation to proceed in a voice duplex mode, so that the machine end can respond timely and accurately at any time to the voice stream sent by the user end, reducing response delay and improving user experience.
Corresponding to the above embodiments, this embodiment further provides a control device for man-machine voice conversation; fig. 7 is a schematic block diagram of the control device for man-machine voice conversation provided by an embodiment of the present disclosure.
As shown in fig. 7, the control device 7000 for man-machine voice conversation according to this embodiment includes a voice stream receiving module 7100, a voice stream monitoring module 7200, a decision module 7300, and an execution module 7400.
The voice stream receiving module 7100 is configured to receive a first voice stream of a human-computer voice conversation performed by a user side.
The voice stream monitoring module 7200 is configured to monitor a second voice stream of the man-machine voice conversation performed by the machine end.
In one embodiment, the voice stream monitoring module 7200, in obtaining the first state characteristic of the first voice stream in a first time slice and the second state characteristic of the second voice stream in the first time slice, may be configured to: detecting an occurrence of a trigger event; according to the detected trigger event, acquiring a first state characteristic of the first voice stream in a first time slice before the trigger event is detected and a second state characteristic of the second voice stream in the first time slice.
In one embodiment, the trigger event includes an event that a non-silent section occurs in the first voice stream, and when detecting that event, the voice stream monitoring module 7200 may be configured to: split the first voice stream to obtain a first mute section and a second mute section that are adjacent, the first mute section being earlier than the second mute section; and, in the case that the two mute sections are not contiguous in time sequence, extract the voice section between them as a non-silent section and determine that the event of a non-silent section occurring in the first voice stream has occurred, as sketched below.
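A minimal sketch of this splitting step, assuming an upstream voice activity detector has already labeled each fixed-size frame of the first voice stream as silent or voiced; the frame representation is an assumption of the sketch, not the disclosure.

# Minimal sketch: detect a non-silent section lying between two mute
# sections. Assumes an upstream VAD has labeled each fixed-size frame of
# the first voice stream as voiced (True) or silent (False).
def find_non_silent_section(frames: list[bool]) -> tuple[int, int] | None:
    """Return (start, end) frame indices of the voiced section, or None if
    the two mute sections adjoin (no non-silent section occurred)."""
    voiced_indices = [i for i, voiced in enumerate(frames) if voiced]
    if not voiced_indices:
        return None  # first and second mute sections are contiguous
    return voiced_indices[0], voiced_indices[-1] + 1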
The decision module 7300 is configured to select, in a set control instruction set, a control instruction corresponding to the first state feature and the second state feature; the control instruction set includes instructions for controlling the machine end to broadcast and instructions for controlling the machine end to mute.
In an embodiment, when selecting the control instruction corresponding to the first state feature and the second state feature in the control instruction set, the decision module 7300 may be configured to: determine, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice, to obtain a determination result; and select, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature.
In an embodiment, the instructions for controlling the machine end to broadcast include a first control instruction for continuing the current broadcast, and when determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice, the decision module 7300 may be configured to: determine that the machine end has the speaking right after the first time slice in the case that the second state feature is that the machine end remains non-silent or changes from silent to non-silent. When selecting the control instruction corresponding to the first state feature and the second state feature in the control instruction set, the decision module 7300 may be configured to: in the case that the machine end has the speaking right, select the first control instruction as the corresponding control instruction.
In an embodiment, the instructions for controlling the machine end to mute include a sixth control instruction for stopping the current broadcast, and when determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice, the decision module 7300 may be configured to: determine that the machine end does not have the speaking right after the first time slice in the case that the first state feature indicates that a non-silent section occurs in the first voice stream and the second state feature is that the machine end remains non-silent or changes from silent to non-silent. When selecting the control instruction corresponding to the first state feature and the second state feature in the control instruction set, the decision module 7300 may be configured to: in the case that the machine end does not have the speaking right, select the sixth control instruction as the corresponding control instruction.
In an embodiment, the instructions for controlling the machine end to broadcast include a fourth control instruction for broadcasting set first-round question-and-answer content, and when determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice, the decision module 7300 may be configured to: determine that the machine end has the speaking right after the first time slice in the case that the first state feature and the second state feature are both the conversation onset state. When selecting, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature, the decision module 7300 may be configured to: in the case that the machine end has the speaking right, select the fourth control instruction as the corresponding control instruction.
In an embodiment, the instructions for controlling the machine end to broadcast include a second control instruction for starting a new broadcast and/or a third control instruction for broadcasting set sentence-holding content, and when determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice, the decision module 7300 may be configured to: determine that the machine end has the speaking right after the first time slice in the case that the first state feature is that the user end changes from non-silent to silent and the second state feature is that the machine end remains silent. When selecting, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature, the decision module 7300 may be configured to: in the case that the machine end has the speaking right, select the second control instruction or the third control instruction as the corresponding control instruction.
In an embodiment, the instructions for controlling the machine end to mute may include a seventh control instruction for keeping the machine end muted, and when determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice, the decision module 7300 may be configured to: determine that the machine end does not have the speaking right after the first time slice in the case that the first state feature is that the user end changes from non-silent to silent and the second state feature is that the machine end remains silent. When selecting, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature, the decision module 7300 may be configured to: in the case that the machine end does not have the speaking right, select the seventh control instruction as the corresponding control instruction.
In an embodiment, the instructions for controlling the machine end to broadcast include a fifth control instruction for broadcasting set silence prompt content, and when determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice, the decision module 7300 may be configured to: determine that the machine end has the speaking right after the first time slice in the case that the first state feature is that the user end remains silent and the second state feature is that the machine end changes from non-silent to silent. When selecting, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature, the decision module 7300 may be configured to: in the case that the machine end has the speaking right, select the fifth control instruction as the corresponding control instruction.
In an embodiment, the instructions for controlling the machine end to mute may include a seventh control instruction for keeping the machine end muted, and when determining, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice, the decision module 7300 may be configured to: determine that the machine end does not have the speaking right after the first time slice in the case that the first state feature is that the user end remains silent and the second state feature is that the machine end changes from non-silent to silent. When selecting, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature, the decision module 7300 may be configured to: in the case that the machine end does not have the speaking right, select the seventh control instruction as the corresponding control instruction.
The execution module 7400 is configured to control, after the first time slice, the machine end to perform the man-machine voice conversation according to the control instruction.
In an embodiment, when controlling the machine end to perform the man-machine voice conversation according to the control instruction, the executing module 7400 may be configured to: and sending the control instruction to the machine end so that the machine end carries out the man-machine voice conversation according to the control instruction.
In this embodiment, when controlling the machine end to perform the man-machine voice conversation according to the corresponding control instruction, the execution module 7400 may be configured so that: the machine end obtains response information corresponding to the control instruction according to prestored mapping data, where the mapping data reflects the correspondence between each control instruction in the control instruction set and each item of set response information; and the man-machine voice conversation is performed according to the obtained response information.
Corresponding to the above embodiments, the present embodiment provides an electronic device, as shown in fig. 8a, the electronic device 100 includes a control apparatus 7000 for man-machine voice conversation according to any embodiment of the present disclosure.
In another embodiment, as shown in FIG. 8b, the electronic device 100 may include a memory 110 and a processor 120, the memory 110 being configured to store executable instructions; the processor 120 is configured to perform a method according to any of the method embodiments of the present disclosure under the control of the executable instructions.
Corresponding to the above embodiments, this embodiment further provides a computer-readable storage medium storing a computer program that can be read and executed by a computer, the computer program, when read and executed by the computer, performing the method according to any of the above embodiments of the present disclosure.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized with state information of the computer-readable program instructions and can execute the computer-readable program instructions, thereby implementing aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the present disclosure is defined by the appended claims.

Claims (19)

1. A control method of man-machine voice conversation comprises the following steps:
receiving a first voice stream of a user end conducting a man-machine voice conversation, and monitoring a second voice stream of a machine end conducting the man-machine voice conversation;
obtaining a first state feature of the first voice stream in a first time slice and a second state feature of the second voice stream in the first time slice;
selecting a corresponding control instruction from a set control instruction set according to the first state characteristic and the second state characteristic; the control instruction set comprises an instruction for controlling the machine end to broadcast and an instruction for controlling the machine end to mute;
and after the first time slice, controlling the machine end to carry out the man-machine voice conversation according to the control instruction.
2. The method according to claim 1, wherein the instructions for controlling the machine end to broadcast comprise at least one of a first control instruction for continuing the current broadcast, a second control instruction for starting a new broadcast, a third control instruction for broadcasting set sentence-holding content, a fourth control instruction for broadcasting set first-round question-and-answer content, and a fifth control instruction for broadcasting set silence prompt content; and/or,
the instructions for controlling the machine end to mute comprise at least one of a sixth control instruction for stopping the current broadcast and a seventh control instruction for keeping the machine end muted.
3. The method according to claim 1, wherein the obtaining a first state characteristic of the first voice stream in a first time slice and a second state characteristic of the second voice stream in the first time slice comprises:
detecting an occurrence of a triggering event;
according to the detected trigger event, acquiring a first state characteristic of the first voice stream in a first time slice before the trigger event is detected and a second state characteristic of the second voice stream in the first time slice.
4. The method according to claim 3, wherein the trigger event comprises at least one of an event of turning on the man-machine voice conversation, an event of a non-silent segment of the first voice stream, an event of a non-silent segment of the second voice stream, an event of a silent segment of the second voice stream, and a set trigger time.
5. The method according to claim 3, wherein the triggering event includes an event of occurrence of a non-silent segment of the first voice stream, and the step of detecting an event of occurrence of a non-silent segment of the first voice stream includes:
splitting the first voice stream to obtain a first mute section and a second mute section which are adjacent, wherein the first mute section is earlier than the second mute section;
in the case that the first mute section and the second mute section are not contiguous in time sequence, extracting the voice section between the first mute section and the second mute section as a non-mute section, and determining that the event of a non-mute section occurring in the first voice stream has occurred.
6. The method of claim 1, wherein said selecting a control instruction corresponding to the first and second status characteristics in a set of control instructions comprises:
judging, according to the first state feature and the second state feature, whether the machine end has the speaking right after the first time slice, to obtain a determination result;
and selecting, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature.
7. The method of claim 6, wherein said determining whether the machine end has the right to speak after the first time slice according to the first state feature and the second state feature, resulting in a determination, comprises:
determining that the machine end has the right to speak after the first time slice in the case that the second state characteristic is that the machine end remains unmuted or the machine end changes from muted to unmuted;
the instructions for controlling the machine end to broadcast comprise a first control instruction for continuing the current broadcast, and the selecting, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature comprises:
and under the condition that the machine end has the speaking right, selecting the first control instruction as the corresponding control instruction.
8. The method of claim 6, wherein the determining whether the machine side has the right to speak after the first time slice according to the first state feature and the second state feature, resulting in a determination result, comprises:
determining that the machine end does not have speaking right after the first time slice if the first state characteristic indicates that a non-silent segment of the first voice stream occurs and the second state characteristic is that the machine end remains non-silent or the machine end changes from silent to non-silent;
the instructions for controlling the machine end to mute comprise a sixth control instruction for stopping the current broadcast, and the selecting, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature comprises:
and under the condition that the machine side does not have speaking right, selecting the sixth control instruction as the corresponding control instruction.
9. The method of claim 8, wherein the first state feature representing the occurrence of a non-silent segment of the first voice stream comprises: the user end changing from silent to non-silent and/or the user end changing from non-silent to silent.
10. The method of claim 6, wherein the determining whether the machine side has the right to speak after the first time slice according to the first state feature and the second state feature, resulting in a determination result, comprises:
determining that the machine side has speaking rights after the first time slice if the first state feature and the second state feature are both conversation onset states;
the instructions for controlling the machine end to broadcast comprise a fourth control instruction for broadcasting set first-round question-and-answer content, and the selecting, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature comprises:
and in the case that the machine end has the speaking right, selecting the fourth control instruction as the corresponding control instruction.
11. The method of claim 6, wherein the determining whether the machine side has the right to speak after the first time slice according to the first state feature and the second state feature, resulting in a determination result, comprises:
determining that the machine end has speaking right after the first time slice if the first state characteristic is that the user end is changed from non-silent to silent and the second state characteristic is that the machine end remains silent;
the instructions for controlling the machine end to broadcast comprise a second control instruction for starting a new broadcast and/or a third control instruction for broadcasting set sentence-holding content, and the selecting, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature comprises:
and in the case that the machine side has the speaking right, selecting the second control instruction or the third control instruction as the corresponding control instruction.
12. The method of claim 6, wherein the determining whether the machine side has the right to speak after the first time slice according to the first state feature and the second state feature, resulting in a determination result, comprises:
determining that the machine end does not have speaking right after the first time slice if the first state characteristic is that the user end is changed from non-silent to silent and the second state characteristic is that the machine end remains silent;
the instructions for controlling the machine end to mute comprise a seventh control instruction for keeping the machine end muted, and the selecting, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature comprises:
and in the case that the machine end does not have the speaking right, selecting the seventh control instruction as the corresponding control instruction.
13. The method of claim 6, wherein the determining whether the machine side has the right to speak after the first time slice according to the first state feature and the second state feature, resulting in a determination result, comprises:
determining that the machine end has speaking right after the first time slice if the first state characteristic is that the user end remains silent and the second state characteristic is that the machine end changes from non-silent to silent;
the instructions for controlling the machine end to broadcast comprise a fifth control instruction for broadcasting set silence prompt content, and the selecting, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature comprises:
and in the case that the machine end has the speaking right, selecting the fifth control instruction as the corresponding control instruction.
14. The method of claim 6, wherein the determining whether the machine side has the right to speak after the first time slice according to the first state feature and the second state feature, resulting in a determination result, comprises:
determining that the machine end does not have speaking right after the first time slice if the first state characteristic is that the user end remains silent and the second state characteristic is that the machine end transitions from non-silent to silent;
the instructions for controlling the machine end to mute comprise a seventh control instruction for keeping the machine end muted, and the selecting, in the control instruction set according to the determination result, the control instruction corresponding to the first state feature and the second state feature comprises:
and in the case that the machine end does not have the speaking right, selecting the seventh control instruction as the corresponding control instruction.
15. The method of claim 1, wherein the controlling the machine end to perform the human-machine voice conversation according to the control instruction comprises:
and sending the control instruction to the machine end so that the machine end carries out the man-machine voice conversation according to the control instruction.
16. The method of claim 15, wherein the machine end conducting the human-machine voice conversation according to the corresponding control instruction comprises:
the machine end obtains response information corresponding to the corresponding control instruction according to prestored mapping data, wherein the mapping data reflects the corresponding relation between each control instruction in the control instruction set and each set response information;
and carrying out man-machine voice conversation according to the obtained response information.
17. A control device for a human-machine voice conversation, comprising:
the voice stream receiving module is used for receiving a first voice stream of a man-machine voice conversation carried out by a user side;
the voice stream monitoring module is used for monitoring a second voice stream of the man-machine voice conversation carried out by the machine end;
a state obtaining module, configured to obtain a first state feature of the first voice stream in a first time slice and a second state feature of the second voice stream in the first time slice;
the decision module is configured to select, in a set control instruction set, a control instruction corresponding to the first state feature and the second state feature, wherein the control instruction set comprises instructions for controlling the machine end to broadcast and instructions for controlling the machine end to mute; and
and the execution module is used for controlling the machine end to carry out the man-machine voice conversation according to the control instruction after the first time slice.
18. An electronic device comprising the control apparatus of claim 17; alternatively, it comprises:
a memory for storing executable instructions;
a processor configured to execute, under the control of the executable instructions, the method according to any one of claims 1 to 16.
19. A computer-readable storage medium, in which a computer program is stored which is readable and executable by a computer, the computer program being adapted, when being read and executed by the computer, to carry out the method according to any one of claims 1 to 16.
CN202110229744.0A 2021-03-02 2021-03-02 Control method and device for man-machine voice conversation and electronic equipment Pending CN114999470A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110229744.0A CN114999470A (en) 2021-03-02 2021-03-02 Control method and device for man-machine voice conversation and electronic equipment


Publications (1)

Publication Number Publication Date
CN114999470A true CN114999470A (en) 2022-09-02

Family

ID=83018040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110229744.0A Pending CN114999470A (en) 2021-03-02 2021-03-02 Control method and device for man-machine voice conversation and electronic equipment

Country Status (1)

Country Link
CN (1) CN114999470A (en)


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20240307
Address after: 51 Belarusian Pasha Road, Singapore, Lai Zan Da Building 1 # 03-06, Postal Code 189554
Applicant after: Alibaba Innovation Co.
Country or region after: Singapore
Address before: Room 01, 45th Floor, AXA Building, 8 Shanton Road, Singapore
Applicant before: Alibaba Singapore Holdings Ltd.
Country or region before: Singapore