CN110299152A - Interactive output control method, device, electronic equipment and storage medium
- Publication number: CN110299152A
- Application number: CN201910579362.3A
- Authority: CN (China)
- Prior art keywords: response data, tag, audio stream data, execution sequence
- Legal status: Pending
Classifications
- G06F40/30: Handling natural language data; Semantic analysis
- G10L25/48: Speech or voice analysis techniques specially adapted for particular use
- G10L25/78: Detection of presence or absence of voice signals
Abstract
The present invention relates to the field of artificial intelligence and discloses an interactive output control method, apparatus, electronic device, and storage medium. The method comprises: performing speech processing in real time on audio stream data collected by a smart device; determining, according to the speech processing result, response data for the audio stream data; determining an execution sequence tag corresponding to the response data; and controlling the smart device to execute the response data according to the execution sequence tag. With the technical solution provided by the embodiments of the present invention, the manner of controlling the execution of response data is more flexible, enabling the smart device to execute response data in a way that approaches natural human interaction, so that the human-computer interaction process is more natural.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to an output control method and apparatus for human-machine conversation, an electronic device, and a storage medium.
Background
With the rapid development of science and technology, smart devices now have strong processing capabilities, so that they can understand natural language to a certain extent, much as human beings do. For example, based on real-time speech transcription (ASR), natural language processing (NLP), and other technologies, speech input by users can be processed in real time and response data conforming to natural human language can be output, thereby realizing human-computer interaction.
However, in existing interaction methods, the manner of controlling the output of response data is rigid and formulaic, so the smart device cannot hold a natural, fluent conversation the way a human does, which degrades the user experience.
Disclosure of Invention
Embodiments of the present invention provide a method and an apparatus for controlling the output of a human-computer conversation, an electronic device, and a storage medium, so as to solve the problem in the prior art that the manner of controlling the output of response data is rigid and formulaic.
In a first aspect, an embodiment of the present invention provides a method for controlling output of a human-computer conversation, including:
carrying out voice processing on audio stream data acquired by intelligent equipment in real time;
determining response data for the audio stream data according to a voice processing result;
determining an execution sequence tag corresponding to the response data;
and controlling the intelligent equipment to execute the response data according to the execution sequence tag.
Optionally, the determining the execution sequence tag corresponding to the response data specifically includes:
determining a priority of the response data;
and determining an execution sequence tag corresponding to the response data based on the priority of the response data and the priority of the response data determined before the response data.
Optionally, the determining, based on the priority of the response data and the priority of the response data determined before the response data, an execution sequence tag corresponding to the response data specifically includes:
if the audio stream data corresponding to the response data and the determined audio stream data corresponding to the response data belong to the same audio stream data obtained by VAD detection, determining the arrangement position of the response data among the determined response data according to the priority of the response data and the priority of the determined response data and the sequence from high priority to low priority;
and determining an insertion tag corresponding to the response data according to the arrangement position of the response data, and using the insertion tag as an execution sequence tag corresponding to the response data, wherein the insertion tag is used for indicating the execution sequence of the response data identified by the insertion tag among the response data received by the intelligent device.
Optionally, the determining the priority of the response data specifically includes:
determining the priority of the response data based on the semantic recognition result of the audio stream data corresponding to the response data and/or the visual information acquired by the intelligent equipment; or determining the priority of the response data according to the audio stream data corresponding to the response data and the determined time information of the audio stream data corresponding to the response data.
Optionally, the determining the execution sequence tag corresponding to the response data specifically includes:
and if the audio stream data corresponding to the response data and the audio stream data corresponding to the response data determined last time belong to audio stream data obtained by different VAD detections, determining that the execution sequence tag corresponding to the response data is an interruption tag, wherein the interruption tag is used for indicating that the response data identified by the interruption tag can interrupt the response data currently executed by the intelligent device.
Optionally, the determining the execution sequence tag corresponding to the response data specifically includes:
and determining an execution sequence tag corresponding to the response data based on the response data and the response data determined before the response data.
Optionally, the determining, based on the response data and response data determined before the response data, an execution sequence tag corresponding to the response data specifically includes:
if the audio stream data corresponding to the response data and the audio stream data corresponding to the determined response data belong to the same audio stream data obtained by VAD detection, and the response data is the same as at least one response data in the determined response data, determining that the execution sequence tag corresponding to the response data is a skip tag, wherein the skip tag is used for indicating the intelligent device to skip the response data identified by the skip tag when executing the response data.
Optionally, the determining, based on the response data and response data determined before the response data, an execution sequence tag corresponding to the response data specifically includes:
if it is determined that the semantic recognition result of the audio stream data corresponding to the response data is supplement or correction of the semantic recognition result of the audio stream data corresponding to any response data in the determined response data, determining that the execution sequence tag corresponding to the response data is a replacement tag, where the replacement tag is used to instruct the smart device to replace the response data corresponding to the information identifier in the replacement tag with the response data identified by the replacement tag.
Optionally, the method further comprises:
if the slot position item with the slot position value in the semantic identification result of the audio stream data corresponding to the response data is the same as the slot position item with the slot position value missing in the semantic identification result of the audio stream data corresponding to any response data in the determined response data, determining that the semantic identification result of the audio stream data corresponding to the response data is a supplement to the semantic identification result of the audio stream data corresponding to any response data;
or if the semantic recognition result of the audio stream data corresponding to the response data contains a negative intention, and the semantic recognition result of the audio stream data corresponding to the response data is different from the slot value of the same slot item in the semantic recognition result of the audio stream data corresponding to any response data in the determined response data, determining that the semantic recognition result of the audio stream data corresponding to the response data is the correction of the semantic recognition result of the audio stream data corresponding to any response data.
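To make the supplement/correction tests concrete, the following is a minimal sketch of the slot comparison described above, assuming each semantic recognition result is reduced to a dict mapping slot items to slot values (with None for a missing value) plus a negative-intent flag; this data shape is an illustrative assumption, not the patent's actual model.

```python
def is_supplement(new_slots: dict, old_slots: dict) -> bool:
    """New result fills a slot item whose value is missing in the old result."""
    return any(
        value is not None and item in old_slots and old_slots[item] is None
        for item, value in new_slots.items()
    )

def is_correction(new_slots: dict, old_slots: dict, negative_intent: bool) -> bool:
    """New result carries a negative intent and changes a shared slot item's value."""
    if not negative_intent:
        return False
    return any(
        item in old_slots
        and old_slots[item] is not None
        and value is not None
        and value != old_slots[item]
        for item, value in new_slots.items()
    )

# "Book a ticket to Beijing" (date missing), then "Tomorrow" -> supplement.
assert is_supplement({"date": "tomorrow"}, {"destination": "Beijing", "date": None})
# "No, to Shanghai" (negative intent, new destination value) -> correction.
assert is_correction({"destination": "Shanghai"}, {"destination": "Beijing"}, True)
```

Either outcome would mark the new response data with a replacement tag pointing at the earlier response.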
In a second aspect, an embodiment of the present invention provides a method for controlling output of a human-computer conversation, including:
sending the collected audio stream data to a server;
receiving response data obtained based on the audio stream data and an execution sequence tag corresponding to the response data, wherein the response data are sent by the server;
and executing the response data according to the execution sequence tag.
Optionally, the executing the response data according to the execution sequence tag specifically includes:
and if the execution sequence tag is an insertion tag, executing the response data according to the execution sequence of the response data indicated by the insertion tag between the response data already received by the intelligent device.
Optionally, the executing the response data according to the execution sequence tag specifically includes:
if the execution sequence tag is an interruption tag, stopping the response data currently executed by the intelligent equipment, and executing the response data identified by the interruption tag.
Optionally, the executing the response data according to the execution sequence tag specifically includes:
and if the execution sequence tag is a skip tag, skipping the response data identified by the skip tag when executing the response data.
Optionally, the executing the response data according to the execution sequence tag specifically includes:
and if the execution sequence tag is a replacement tag, replacing response data corresponding to the information identifier in the replacement tag in the received response data with the response data identified by the replacement tag.
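Taken together, the four tag types suggest a small dispatch routine on the smart device side. The sketch below is one illustrative reading of the steps above, assuming a list-based output queue and tags carried as a type string plus an optional reference to an already-queued response; none of these names come from the patent itself.

```python
from collections import deque

class ResponseExecutor:
    def __init__(self):
        self.queue = deque()   # received response items not yet executed
        self.current = None    # response item currently being executed, if any

    def on_receive(self, response: dict, tag_type: str, ref_id: str = None):
        if tag_type == "insert":
            # ref_id names the queued response to insert after; fall back to
            # the head of the queue if that response is no longer present.
            idx = self._find(ref_id)
            if idx is None:
                self.queue.appendleft(response)
            else:
                self.queue.insert(idx + 1, response)
        elif tag_type == "interrupt":
            self.current = None             # stop the response being executed
            self.queue.appendleft(response)
        elif tag_type == "skip":
            pass                            # duplicate response: do not queue it
        elif tag_type == "replace":
            idx = self._find(ref_id)
            if idx is None:
                self.queue.append(response)
            else:
                self.queue[idx] = response  # overwrite the outdated response

    def _find(self, response_id):
        for i, item in enumerate(self.queue):
            if item.get("id") == response_id:
                return i
        return None
```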
In a third aspect, an embodiment of the present invention provides an output control apparatus for man-machine interaction, including:
the voice processing module is used for carrying out voice processing on the audio stream data acquired by the intelligent equipment in real time;
a response data determination module for determining response data for the audio stream data according to a voice processing result;
the label determining module is used for determining an execution sequence label corresponding to the response data;
and the control module is used for controlling the intelligent equipment to execute the response data according to the execution sequence tag.
Optionally, the tag determination module is specifically configured to:
determining a priority of the response data;
and determining an execution sequence tag corresponding to the response data based on the priority of the response data and the priority of the response data determined before the response data.
Optionally, the tag determination module is specifically configured to:
if the audio stream data corresponding to the response data and the determined audio stream data corresponding to the response data belong to the same audio stream data obtained by VAD detection, determining the arrangement position of the response data among the determined response data according to the priority of the response data and the priority of the determined response data and the sequence from high priority to low priority;
and determining an insertion tag corresponding to the response data according to the arrangement position of the response data, and using the insertion tag as an execution sequence tag corresponding to the response data, wherein the insertion tag is used for indicating the execution sequence of the response data identified by the insertion tag among the response data received by the intelligent device.
Optionally, the tag determination module is specifically configured to:
determining the priority of the response data based on the semantic recognition result of the audio stream data corresponding to the response data and/or the visual information acquired by the intelligent equipment;
or determining the priority of the response data according to the audio stream data corresponding to the response data and the determined time information of the audio stream data corresponding to the response data.
Optionally, the tag determination module is specifically configured to:
and if the audio stream data corresponding to the response data and the audio stream data corresponding to the response data determined last time belong to audio stream data obtained by different VAD detections, determining that the execution sequence tag corresponding to the response data is an interruption tag, wherein the interruption tag is used for indicating that the response data identified by the interruption tag can interrupt the response data currently executed by the intelligent device.
Optionally, the tag determination module is specifically configured to:
and determining an execution sequence tag corresponding to the response data based on the response data and the response data determined before the response data.
Optionally, the tag determination module is specifically configured to:
if the audio stream data corresponding to the response data and the audio stream data corresponding to the determined response data belong to the same audio stream data obtained by VAD detection, and the response data is the same as at least one response data in the determined response data, determining that the execution sequence tag corresponding to the response data is a skip tag, wherein the skip tag is used for indicating the intelligent device to skip the response data identified by the skip tag when executing the response data.
Optionally, the tag determination module is specifically configured to:
if it is determined that the semantic recognition result of the audio stream data corresponding to the response data is supplement or correction of the semantic recognition result of the audio stream data corresponding to any response data in the determined response data, determining that the execution sequence tag corresponding to the response data is a replacement tag, where the replacement tag is used to instruct the smart device to replace the response data corresponding to the information identifier in the replacement tag with the response data identified by the replacement tag.
Optionally, the tag determination module is further configured to:
if the slot position item with the slot position value in the semantic identification result of the audio stream data corresponding to the response data is the same as the slot position item with the slot position value missing in the semantic identification result of the audio stream data corresponding to any response data in the determined response data, determining that the semantic identification result of the audio stream data corresponding to the response data is a supplement to the semantic identification result of the audio stream data corresponding to any response data; or,
and if the semantic recognition result of the audio stream data corresponding to the response data contains a negative intention, and the semantic recognition result of the audio stream data corresponding to the response data is different from the slot value of the same slot item in the semantic recognition result of the audio stream data corresponding to any response data in the determined response data, determining that the semantic recognition result of the audio stream data corresponding to the response data is the correction of the semantic recognition result of the audio stream data corresponding to any response data.
In a fourth aspect, an embodiment of the present invention provides an output control apparatus for human-computer interaction, including:
the data sending module is used for sending the collected audio stream data to the server;
the data receiving module is used for receiving response data which is sent by the server and obtained based on the audio stream data and an execution sequence tag corresponding to the response data;
and the execution module is used for executing the response data according to the execution sequence label.
Optionally, the execution module is specifically configured to: and if the execution sequence tag is an insertion tag, executing the response data according to the execution sequence of the response data indicated by the insertion tag between the response data already received by the intelligent device.
Optionally, the execution module is specifically configured to: if the execution sequence tag is an interruption tag, stopping the response data currently executed by the intelligent equipment, and executing the response data identified by the interruption tag.
Optionally, the execution module is specifically configured to: and if the execution sequence tag is a skip tag, skipping the response data identified by the skip tag when executing the response data.
Optionally, the execution module is specifically configured to: and if the execution sequence tag is a replacement tag, replacing response data corresponding to the information identifier in the replacement tag in the received response data with the response data identified by the replacement tag.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any one of the methods in the first or second aspects when executing the computer program.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which computer program instructions are stored, which, when executed by a processor, implement the steps of any one of the methods of the first or second aspects described above.
In a seventh aspect, an embodiment of the present invention provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions that, when executed by a processor, implement the steps of any of the methods of the first or second aspects.
According to the technical scheme provided by the embodiment of the invention, voice processing is carried out on audio stream data acquired by intelligent equipment, response data aiming at the audio stream data is determined according to a voice processing result, and then an execution sequence tag corresponding to the response data is determined, so that the intelligent equipment can be conveniently controlled to execute the response data sent by a server according to the execution sequence indicated by the execution sequence tag, the control mode of executing the response data becomes more flexible, the sequence of executing the response data corresponding to a plurality of sentences contained in the audio stream data can be selectively adjusted according to the audio stream data continuously input by a user, the intelligent equipment can respond to the input of the user in a mode close to human-computer natural interaction, and the human-computer interaction process is more natural.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of an output control method of a human-computer conversation according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating a method for controlling output of a human-machine interaction according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for determining an execution sequence tag based on priority of response data according to an embodiment of the present invention;
fig. 4 is a flowchart illustrating an output control method of a man-machine interaction according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an output control apparatus for man-machine interaction according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an output control apparatus for human-machine interaction according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
For convenience of understanding, terms referred to in the embodiments of the present invention are explained below:
Modality: colloquially, a modality is a "sense", and multimodality is the fusion of multiple senses. A robot operating system defines the interaction between robot and human as multimodal interaction, that is, human-machine interaction carried out through text, speech, vision, action, environment, and other modalities, fully simulating the way humans interact with one another.
Domain refers to the same type of data or resources, together with the services provided around them, such as guidance, encyclopedia, chat, weather, music, train tickets, and so on.
Voice Activity Detection (VAD), also called voice endpoint detection, refers to detecting the presence of speech in a noisy environment. It is generally used in speech processing systems such as speech coding and speech enhancement, where it helps reduce the speech coding rate, save communication bandwidth, lower the energy consumption of mobile devices, and improve recognition rates. A representative prior-art VAD method is ITU-T G.729 Annex B. Voice activity detection is now widely applied in speech recognition: the portion of a segment of audio that actually contains the user's speech is detected, the silent portions are eliminated, and only the speech-bearing portion is recognized.
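As a toy illustration of the endpointing idea (not the G.729 Annex B algorithm itself), an energy-threshold VAD over fixed-size frames might look like the sketch below; the frame length and threshold are arbitrary assumptions.

```python
import numpy as np

def naive_vad(samples: np.ndarray, frame_len: int = 320, threshold: float = 1e-3):
    """Yield (start, end) sample ranges whose mean frame energy exceeds the threshold.

    A real system adds hangover smoothing and noise-floor tracking; this only
    shows the basic split between speech-bearing audio and silence.
    """
    voiced, start = False, 0
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        energy = float(np.mean(samples[i:i + frame_len] ** 2))
        if energy >= threshold and not voiced:
            voiced, start = True, i
        elif energy < threshold and voiced:
            voiced = False
            yield (start, i)
    if voiced:
        yield (start, len(samples))
```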
A morpheme is the smallest combination of sound and meaning in a language; that is, a language unit must simultaneously satisfy three conditions, being "smallest", "voiced", and "meaningful", to be called a morpheme, with "smallest" and "meaningful" being the essential ones.
Real-time speech transcription (Real-time ASR) is based on a deep full-sequence convolutional neural network framework. A long-lived connection between the application and the transcription engine is established over the WebSocket protocol, so audio stream data can be converted into a text stream in real time, producing text while the user is still speaking; temporary recognition results are generally output with the morpheme as the smallest unit. For example, for a captured audio stream corresponding to the Chinese sentence "今天天气怎么样" ("How is the weather today"), the characters are recognized one by one in stream order: first the temporary result "今天" ("today") is output, then "今天天气" ("the weather today"), and so on until the whole audio stream has been recognized and the final result "How is the weather today" is obtained. Real-time transcription can also intelligently correct previously output temporary results based on the subsequent audio stream and semantic understanding of the context, ensuring the accuracy of the final result; that is, the temporary results output in real time keep changing over time. For example, the first temporary result may be the homophone "金" ("gold"), which the second output corrects to "今" ("today"); the third output may be "今天天", which the fourth corrects to "今天天气" ("the weather today"); and so on, with continuous recognition and correction yielding an accurate final result.
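The evolving temporary results can be mimicked with a toy decoder that revises its running transcript as each new chunk is decoded. The correction table below merely stands in for the engine's context-based semantic rescoring and is purely illustrative.

```python
# Toy stand-in for homophone/context correction (金天 -> 今天, 今天天汽 -> 今天天气).
CORRECTIONS = {"金天": "今天", "今天天汽": "今天天气"}

def incremental_transcribe(chunks):
    """Yield the temporary recognition result after each decoded chunk."""
    text = ""
    for chunk in chunks:
        text += chunk
        for wrong, right in CORRECTIONS.items():
            text = text.replace(wrong, right)
        yield text

for hypothesis in incremental_transcribe(["金", "天", "天", "汽", "怎么样"]):
    print(hypothesis)   # 金 -> 今天 -> 今天天 -> 今天天气 -> 今天天气怎么样
```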
A generative model is a model that can randomly generate observed data, especially given some latent parameters; it assigns a joint probability distribution to sequences of observations and labels. In machine learning, generative models can be used to model data directly (e.g., to sample data according to a variable's probability density function) or to establish conditional probability distributions among variables; the conditional probability distribution can be obtained from the generative model via Bayes' theorem.
Any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
However, a user's spoken expression may lack standard form: the same meaning may be expressed in several consecutive sentences, or the user may correct what was just said, as in "How old are you? What's your age?", "Where is the toilet? How do I get there?", or "Where is the toilet? ... never mind, let's go to the café". If the response data corresponding to each sentence were executed one after another, the smart device would appear long-winded and waste time, giving the user a mechanical, stiff impression and making the device insufficiently anthropomorphic.
For this reason, the inventor of the present invention proposes that the server perform speech processing on the audio stream data collected by the smart device, determine the response data for the audio stream data according to the speech processing result, and then determine the execution sequence tag corresponding to the response data. The smart device can thus be controlled to execute the response data sent by the server in the order indicated by the execution sequence tags, making the control over how response data is executed more flexible: the order in which the response data corresponding to the several sentences contained in the audio stream data are executed can be selectively adjusted according to the audio stream data continuously input by the user, so that the smart device responds to the user's input in a way that approaches natural human interaction, and the human-computer interaction process becomes more natural.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Fig. 1 is a schematic view of an application scenario of an output control method for a human-computer conversation according to an embodiment of the present invention. During the interaction between the user 10 and the smart device 11, the smart device 11 continuously collects ambient sounds and continuously reports the ambient sounds to the server 12 in the form of audio stream data, where the audio stream data may include ambient sounds around the smart device 11 or speech sounds of other users in addition to the speech sounds of the user 10. The server 12 sequentially performs voice recognition processing and semantic recognition processing on the audio stream data continuously reported by the intelligent device 11, determines corresponding response data according to a semantic recognition result, and controls the intelligent device 11 to execute the response data so as to give feedback to the user.
In this application scenario, the smart device 11 and the server 12 are communicatively connected through a network, which may be a local area network, a wide area network, or the like. The smart device 11 may be a smart speaker, a robot, or the like, a portable device (e.g., a mobile phone, a tablet, a notebook, or the like), or a Personal Computer (PC). The server 12 may be any server, a server cluster composed of several servers, or a cloud computing center capable of providing voice recognition services.
The following describes a technical solution provided by an embodiment of the present invention with reference to an application scenario shown in fig. 1.
Referring to fig. 2, an embodiment of the present invention provides a method for controlling output of a man-machine conversation, which is applied to the server side shown in fig. 1, and includes the following steps:
s201, voice processing is carried out on audio stream data collected by the intelligent equipment in real time.
In the embodiment of the invention, after a user starts to talk with the intelligent device, the intelligent device can continuously collect the sound around the intelligent device, convert the sound into audio stream data and send the audio stream data to the server.
In specific implementation, the server can perform speech recognition on the continuous audio stream data using technologies such as real-time speech transcription to obtain a speech recognition result, and then perform prediction, semantic recognition, and the like on that result to obtain the speech processing result. Specifically, the speech recognition result may be predicted against a preset corpus to obtain a predicted text corresponding to the speech recognition result, and a semantic recognition result may be obtained from the predicted text as the final speech processing result. A large number of texts with complete semantics (i.e., corpora) are stored in the corpus in advance, such as "how is the weather today", "which movies are showing recently", "introduce blue-and-white porcelain", and the like.
It should be noted that the application scenario of the method according to the embodiment of the present invention is not limited to the real-time-prediction speech processing scenario described above; it can also be applied to any existing speech processing scenario. For example, the server may acquire a complete section of audio stream data collected and sent by the smart device, perform speech recognition on it to obtain a speech recognition result, and then perform semantic recognition and other processing on that result to obtain the speech processing result.
S202, according to the voice processing result, response data aiming at the audio stream data is determined.
The response data in the embodiment of the present invention includes, but is not limited to, text data, audio data, image data, video data, voice broadcasts, or control instructions, where the control instructions include but are not limited to: instructions for controlling the smart device to display expressions, instructions for controlling the motion of the smart device's actuating components (for guiding, navigation, photographing, dancing, and the like), and so on.
In specific implementation, at least one preset response data may be configured for each corpus in the corpus in advance, when the response data needs to be determined according to the predicted text, the preset response data corresponding to the predicted text only needs to be acquired according to the corresponding relationship, and the preset response data is used as the response data of the voice processing result corresponding to the predicted text, that is, the response data for the audio stream data.
In specific implementation, semantic recognition can be performed on the predicted text to obtain a semantic recognition result of the predicted text, and response data is determined according to the semantic recognition result of the predicted text and serves as the response data for the audio stream data.
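As a rough illustration of how S201 and S202 fit together, the sketch below pairs prefix prediction against a preset corpus with a lookup of preconfigured response data. The corpus entries, response table, and function names are invented; a production system would use a trained language model rather than literal prefix matching.

```python
CORPUS = [
    "how is the weather today",
    "which movies are showing recently",
    "introduce blue-and-white porcelain",
]
PRESET_RESPONSES = {
    "how is the weather today": {"type": "tts", "text": "Sunny today, 22 degrees."},
    "introduce blue-and-white porcelain": {"type": "tts", "text": "Blue-and-white porcelain is..."},
}

def predict_text(partial: str):
    """Return the first corpus entry that the partial recognition result prefixes."""
    for sentence in CORPUS:
        if sentence.startswith(partial):
            return sentence
    return None

def determine_response(partial_recognition: str):
    predicted = predict_text(partial_recognition)
    return PRESET_RESPONSES.get(predicted) if predicted else None

print(determine_response("how is the wea"))   # resolves to the weather response
```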
And S203, determining an execution sequence label corresponding to the response data.
The execution sequence tag in the embodiment of the present invention is used to indicate an execution sequence of the response data identified by the execution sequence tag among the response data already received by the smart device, that is, the smart device executes the received response data according to the execution sequence indicated by the execution sequence tag corresponding to the received response data.
And S204, controlling the intelligent equipment to execute the response data according to the execution sequence tag.
In specific implementation, the server sends the response data and the execution sequence tags corresponding to the response data to the intelligent device, and the intelligent device executes the received response data according to the execution sequence indicated by the execution sequence tags.
According to the method, after the response data aiming at the audio stream data input by the user in real time are determined, the execution sequence tag corresponding to the response data is determined, so that the intelligent device can be conveniently controlled to execute the response data sent by the server according to the execution sequence indicated by the execution sequence tag. Compared with the prior art, the control mode for executing the response data is more flexible, the sequence of the response data corresponding to the sentences contained in the executed audio stream data can be selectively adjusted according to the audio stream data continuously input by the user, so that the intelligent device can respond to the input of the user in a mode close to human natural interaction, and the human-computer interaction process is more natural.
As a possible implementation method, as shown in fig. 3, the step S203 specifically includes the following steps:
s301, determining the priority of the response data.
S302, determining an execution sequence label corresponding to the response data based on the priority of the response data and the priority of the response data determined before the response data.
In particular, the priority of the response data can be determined as follows:
in the first mode, the priority of the response data is determined based on the semantic recognition result of the audio stream data corresponding to the response data.
Specifically, the priority of the response data can be determined based on the domain information corresponding to the response data. Suppose the user's input is "What is this? Tell me about blue-and-white porcelain." The domain information corresponding to "What is this?" is identified as the query domain, and the domain information corresponding to "Tell me about blue-and-white porcelain" as the explanation domain. Since the explanation domain has a higher priority than the query domain, the response data corresponding to "Tell me about blue-and-white porcelain" is determined to have the higher priority, and that response data is output first.
In specific implementation, the priority corresponding to the domain information may be configured in advance. For example, a priority may be configured for each domain according to the function of the smart device: for a guide robot, the directions domain may be set highest, while for an interpretation robot, the explanation domain may be set highest.
In specific implementation, the priority corresponding to the domain information may also be dynamically adjusted, for example, the priority of each domain may be determined according to a current mode in which the intelligent device is located, and if the intelligent device has entered an interpreter mode, the priority of the interpretation domain is determined to be the highest.
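A minimal sketch of such a static-plus-dynamic priority table follows; the domain names, numeric priorities, and mode rule are invented for illustration.

```python
# Static defaults, e.g. for a museum guide robot.
DEFAULT_DOMAIN_PRIORITY = {"explanation": 3, "directions": 2, "query": 1, "chat": 0}

def domain_priority(domain: str, device_mode: str = None) -> int:
    """Dynamic adjustment: a device already in interpreter mode boosts explanation."""
    if device_mode == "interpreter" and domain == "explanation":
        return 10
    return DEFAULT_DOMAIN_PRIORITY.get(domain, 0)

# "What is this? Tell me about blue-and-white porcelain."
assert domain_priority("explanation") > domain_priority("query")
```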
And in the second mode, the priority of the response data is determined based on the visual information acquired by the intelligent equipment.
The visual information in the embodiment of the invention refers to information acquired by intelligent equipment through a camera, an optical sensor and other devices, and further, the visual information can comprise face information, expression information, action information, scene information, iris information, light sensation information and the like by combining technologies such as image processing, face recognition, iris recognition and the like.
The priority of response data may also be determined based on user information derived from the visual information. For example, suppose the user asks "Where is the toilet?". If the user is determined to be male based on the multimodal input information, the priority of the response data giving directions to the men's restroom is determined to be higher than that of the response data giving directions to the women's restroom.
Information such as the number of users in the current scene and the interaction intention of each user can also be identified based on the visual information. Specifically, the number of users contained in the acquired image can be analyzed by using a face recognition technology, when a plurality of users exist, the interaction intention of each user can be analyzed, the priority of the response data is determined according to the intensity of the interaction intention of the user, and the stronger the interaction intention is, the higher the priority of the response data determined based on the audio stream data corresponding to the user is, so that the multi-user scene or the noisy environment can be well dealt with.
In specific implementation, the interaction intention may be determined by integrating the face information, expression information, and motion information. For example, when the user faces the smart device and the user's lips are moving, the user most likely desires to interact with the smart device; when the user's face is turned in another direction or the lips are not moving, that desire is low; and when the user gazes at the screen of the smart device for a long time, the desire to interact is again high. On this basis, the interaction intention can also be determined in combination with the interaction distance: when the user is far from the smart device, the expectation of interaction is low, and when the user is close, it is high. An expectation value for interaction between the user and the smart device is determined by integrating the above information; when the expectation value is higher than a preset threshold, it is determined that the user desires to interact with the smart device, and otherwise that the user does not. In specific implementation, the users in the acquired images can be analyzed one by one in this way, so that in a scene containing multiple users it can be accurately located which users desire to interact with the smart device; only the audio stream data input by those users is then processed, and the voices of other users are filtered out.
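The fusion of these visual cues into an expectation value could be as simple as a weighted sum, as in the sketch below; the features, weights, and threshold are assumptions standing in for whatever model an implementer would actually tune.

```python
EXPECT_THRESHOLD = 0.5

def interaction_expectation(face_toward_device: bool, lips_moving: bool,
                            gazing_at_screen: bool, distance_m: float) -> float:
    """Toy weighted-sum fusion of visual cues into an interaction-desire score."""
    score = 0.0
    score += 0.4 if face_toward_device else 0.0
    score += 0.3 if lips_moving else 0.0
    score += 0.2 if gazing_at_screen else 0.0
    score += 0.1 if distance_m < 1.5 else 0.0   # nearby users are likelier talkers
    return score

def wants_to_interact(**cues) -> bool:
    return interaction_expectation(**cues) > EXPECT_THRESHOLD

# A nearby user facing the device with moving lips clears the threshold ...
assert wants_to_interact(face_toward_device=True, lips_moving=True,
                         gazing_at_screen=False, distance_m=1.0)
# ... while a distant user looking away does not, so their audio is filtered out.
assert not wants_to_interact(face_toward_device=False, lips_moving=False,
                             gazing_at_screen=False, distance_m=4.0)
```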
In specific implementation, the priority of the response data may also be determined by combining the first manner and the second manner, and details of the specific implementation are not repeated.
And in the third mode, the priority of the response data is determined according to the audio stream data corresponding to the response data and the determined time information of the audio stream data corresponding to the response data.
In the embodiment of the present invention, the time information of the audio stream data may be the time when the server receives the audio stream data. Specifically, the earlier the audio stream data is received, the higher the priority of the response data determined based on the audio stream data is, whereas the later the audio stream data is received, the lower the priority of the response data determined based on the audio stream data is.
In specific implementation, when the priority of the response data cannot be determined through the first mode and the second mode, the priority of the response data is determined according to the sequence of the audio stream data corresponding to the response data.
In specific implementation, for multiple response data determined based on audio stream data from the same VAD detection, the priorities of those response data may be determined according to the order of their time information: response data determined earlier is executed first, and response data determined later is executed afterwards. In a specific implementation, this way of determining the priority of response data may be set as the default, i.e., it is selected when neither the first manner nor the second manner is chosen.
In specific implementation, the execution sequence tag corresponding to the response data may be determined, based on the priority of the response data and the priority of the response data determined before it, as follows: if the audio stream data corresponding to the response data and the audio stream data corresponding to the determined response data belong to the same audio stream data obtained by VAD detection, the arrangement position of the response data among the determined response data is determined according to the priority of the response data and the priorities of the determined response data, in order of priority from high to low; an insertion tag corresponding to the response data is then determined according to that arrangement position and used as the execution sequence tag corresponding to the response data, the insertion tag indicating the execution order of the response data it identifies among the response data already received by the smart device.
It should be noted that, if the audio stream data corresponding to the response data and the audio stream data corresponding to the determined response data belong to the same audio stream data obtained by VAD detection, both belong to a stretch of audio continuously input by the user. For example, when the user asks in one breath "Who directed the movie Interstellar? Who stars in it? What is it mainly about?", the response data determined from this stretch of audio only need to be executed one by one in a suitable order, so that the smart device can answer the user's inputs one by one, as in a natural conversation.
In an implementation, the determined response data may include a preset number of response data determined before the current response data, or response data determined within a preset time period before the current response data.
The preset number in the embodiment of the present invention may be determined according to actual situations, for example, the preset number may be 1, 2, or 5. In practical applications, because the smart device executes the received response data in time, the number of the response data in the output queue of the smart device is usually small, for example, only 1 or 2 response data waiting for execution are usually available, and therefore, the preset number may be 4 or 5, and the like.
The server in the embodiment of the present invention may store historical response data in a preset time period, where the historical response data in the embodiment of the present invention is response data sent to the intelligent device.
The preset time period in the embodiment of the present invention may be set in advance according to the actual situation, for example, 20 seconds, 30 seconds, or 1 minute. If the preset time period is 20 seconds, only the response data determined within the last 20 seconds is obtained, and it is determined whether the audio stream data corresponding to the obtained response data and the audio stream data corresponding to the current response data belong to the same audio stream data obtained by VAD detection. In specific implementation, the preset time period may also be determined dynamically from the determined response data; for example, the time the smart device will take to execute the determined response data can be estimated, and the preset time period determined from that estimate.
In specific implementation, each time a piece of response data is determined, it may be added to the sent list, where response data are ranked from high to low by priority, i.e., response data with higher priority appear nearer the front of the sent list. After the response data is added to the sent list, the insertion tag corresponding to it is determined according to its position in the list. For example, the insertion tag may contain the identification information of the response data that precedes it in the sent list, so that the smart device can add the response data identified by the insertion tag after the response data corresponding to that identification information in its output queue; the identification information uniquely identifies the corresponding response data. Alternatively, the insertion tag may contain the identification information of the response data that follows it in the sent list, so that the smart device adds the response data identified by the insertion tag before the response data corresponding to that identification information in the output queue.
For example, as shown in Table 1.1, response data A, B, and C are already stored in the sent list with priorities "4", "2", and "1" respectively, and the newly generated response data D has priority "3". Response data D is therefore inserted after response data A in the sent list, giving the sent list shown in Table 1.2; the identification information of response data A, which precedes D in Table 1.2, is then obtained and added to the insertion tag corresponding to D. After receiving response data D and its insertion tag, the smart device finds the position of response data A in the output queue according to A's identification information in the insertion tag, and inserts D into the output queue after A.
TABLE 1.1

| Priority | Response data |
| --- | --- |
| 4 | A |
| 2 | B |
| 1 | C |

TABLE 1.2

| Priority | Response data |
| --- | --- |
| 4 | A |
| 3 | D |
| 2 | B |
| 1 | C |
In specific implementation, when several response data in the sent list have the same priority, they may be ordered by the time information of their corresponding audio stream data: response data with earlier time information is ranked ahead of response data with later time information.
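Server-side, the sent-list bookkeeping and the Table 1.1 to Table 1.2 example can be sketched as follows, with response records reduced to (priority, id) pairs and the insertion tag carrying the id of the predecessor; all names here are illustrative assumptions.

```python
import bisect

class SentList:
    """Responses for one VAD segment, kept sorted by priority, highest first."""
    def __init__(self):
        self.items = []   # list of (priority, response_id)

    def add(self, response_id: str, priority: int) -> dict:
        keys = [-p for p, _ in self.items]           # negate so bisect sorts ascending
        pos = bisect.bisect_right(keys, -priority)   # equal priorities keep arrival order
        self.items.insert(pos, (priority, response_id))
        predecessor = self.items[pos - 1][1] if pos > 0 else None
        # The insertion tag tells the device which queued response to insert after.
        return {"type": "insert", "after": predecessor}

sent = SentList()
for rid, prio in [("A", 4), ("B", 2), ("C", 1)]:
    sent.add(rid, prio)
print(sent.add("D", 3))   # {'type': 'insert', 'after': 'A'}, matching Table 1.2
```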
In practical applications, the output queue in the smart device stores only received response data that has not yet been executed; response data that is being executed or has already been executed is not kept in the output queue. To reduce the amount of data transmitted between the server and the smart device, the smart device does not synchronize the execution progress of the response data in its output queue to the server, so the output queue cannot be guaranteed to be completely consistent with the sent list on the server. If the response data corresponding to the identification information in an insertion tag cannot be found in the output queue, the smart device may simply insert the response data at the head of the output queue.
With the method provided by the embodiment of the invention, the priority of response data can be determined from the semantic recognition result of the corresponding audio stream data, the visual information collected by the smart device, and the time information of the corresponding audio stream data. The position of the response data among the determined response data can thus be fixed accurately and the corresponding insertion tag determined, so that the smart device executes the received response data in the order indicated by the insertion tags. The way the smart device executes response data then better matches the habits of human conversation, making the human-machine dialogue more natural and fluent.
In practical applications, if the audio stream data corresponding to the response data and the audio stream data corresponding to the last determined response data belong to audio stream data obtained by different VAD detections, the user input the current audio some time after the previous one. Based on human speaking habits, a user generally expects timely feedback, especially for an utterance separated from the preceding one by an interval.
For this reason, as a possible implementation manner, step S203 specifically includes: and if the audio stream data corresponding to the response data and the audio stream data corresponding to the response data determined last time belong to audio stream data obtained by different VAD detections, determining that the execution sequence tag corresponding to the response data is an interruption tag, wherein the interruption tag is used for indicating that the response data identified by the interruption tag can interrupt the response data currently executed by the intelligent device.
For example, suppose the audio stream data corresponding to the currently determined response data is "What is Interstellar about?" and the audio stream data corresponding to the last determined response data is "Any good movies to recommend?". These two inputs belong to audio stream data obtained by different VAD detections, so the server determines that the execution sequence tag of the response data A1 corresponding to "What is Interstellar about?" is an interruption tag, attaches the interruption tag to A1, and sends it to the smart device. On receiving A1 with the interruption tag, the smart device immediately interrupts the response data it is currently executing and executes A1.
According to the method provided by this embodiment of the invention, by attaching an interruption tag to response data, the smart device can, upon receiving response data carrying the interruption tag, directly break off the response data currently being executed and then execute the response data identified by the tag. Like a human, the smart device can thus judge from what the other party is currently saying whether the ongoing reply should be cut short in favour of a new conversation, making its responses better match the user's expectations and more anthropomorphic.
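A sketch of the VAD-based rule just described, reusing the hypothetical Response structure and insertion_tag helper from the earlier sketch: audio from a different VAD detection than the previous response yields an interruption tag.

```python
def execution_order_tag(new: Response, last: Response | None,
                        pending: list[Response]) -> dict:
    """If the new utterance comes from a different VAD detection than the
    previous one, tag it as an interruption; otherwise fall back to the
    priority-based insertion tag sketched earlier."""
    if last is None or new.vad_session != last.vad_session:
        return {"type": "interrupt"}  # device cuts off its current output
    return insertion_tag(new, pending)
```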
As another possible implementation manner, step S203 specifically includes: and determining an execution sequence label corresponding to the response data based on the response data and the response data determined before the response data.
In practical applications, a user may express the same meaning repeatedly over several consecutive sentences, for example, "How old are you? What is your age?" or "Where is the toilet? How do I get there?". If corresponding response data were executed for every sentence input by the user, the interaction would feel stiff and time-consuming.
For this reason, in specific implementation, based on the response data and the response data determined before the response data, the execution sequence tag corresponding to the response data may be determined as follows: and if the audio stream data corresponding to the response data and the determined audio stream data corresponding to the response data belong to the same audio stream data obtained by VAD detection, and the response data is the same as at least one response data in the determined response data, determining that the execution sequence tag corresponding to the response data is a skip tag, wherein the skip tag is used for indicating the intelligent device to skip the response data identified by the skip tag when executing the response data.
For example, suppose the user inputs "Where is the toilet?" followed by "How do I get there?". The response data determined for "Where is the toilet?" is "Go right 50 meters, then turn left to reach the toilet". The semantic recognition result of "How do I get there?" is resolved as "How do I get to the toilet?", so the response data determined for it is likewise "Go right 50 meters, then turn left to reach the toilet". Since this response data is identical to that of the previous sentence "Where is the toilet?", the execution sequence tag of the response data corresponding to "How do I get there?" is determined to be a skip tag.
In particular, since multiple sentences expressing the same meaning are usually input consecutively, processing efficiency can be improved by comparing the current response data only with the most recently determined response data: if the corresponding audio stream data belong to the same audio stream data obtained by VAD detection and the two response data are identical, the execution sequence tag of the current response data is determined to be a skip tag.
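As a sketch of this shortcut (again using the illustrative Response structure from above), only the most recently determined response is compared:

```python
def skip_tag_if_duplicate(new: Response, last: Response | None) -> dict | None:
    """Same VAD detection segment plus an identical reply means the user
    repeated themselves, so the new response gets a skip tag."""
    if (last is not None
            and new.vad_session == last.vad_session
            and new.payload == last.payload):
        return {"type": "skip"}
    return None  # no duplicate; other tag rules apply
```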
According to the method provided by this embodiment of the invention, attaching a skip tag to duplicate response data lets the smart device skip the tagged response data directly upon receipt, avoiding repeated output of the same reply. Like a human, the smart device can recognize that the other party has input several consecutive sentences expressing the same meaning, whether out of loose spoken habits or for emphasis, and output the response to only one of them, which makes its interaction style closer to human habits.
In practical application, because spoken language is often loosely organized, a user may fail to express a clear and complete meaning in the first sentence and then supplement it with a second one. For example, the user first says "Take me to the rest room", realizes that which rest room was not specified, and immediately adds "The one on the first floor", indicating the rest room on the first floor. If, after recognizing the first sentence, the smart device cannot determine which rest room is meant, it would output response data such as "Which rest room do you mean?" to ask the user. But since the user has already supplied the second sentence, continuing to output that question would feel incongruous to the user, or force the user to answer the inquiry, reducing interaction efficiency.
For this reason, in specific implementation, based on the response data and the response data determined before the response data, the execution sequence tag corresponding to the response data may be determined as follows: if the semantic recognition result of the audio stream data corresponding to the response data is determined to be complementary to the semantic recognition result of the audio stream data corresponding to any response data in the determined response data, determining that the execution sequence tag corresponding to the response data is a replacement tag, wherein the replacement tag is used for indicating the intelligent device to replace the response data corresponding to the information identifier in the replacement tag with the response data identified by the replacement tag.
In specific implementation, it may be determined that the semantic recognition result of the audio stream data corresponding to the response data is complementary to the semantic recognition result of the audio stream data corresponding to any one of the determined response data by: and if the slot position item with the slot position value in the semantic recognition result of the audio stream data corresponding to the response data is the same as the slot position item with the missing slot position value in the semantic recognition result of the audio stream data corresponding to any response data in the determined response data, determining that the semantic recognition result of the audio stream data corresponding to the response data is complementary to the semantic recognition result of the audio stream data corresponding to any response data.
For example, from the semantic recognition result of the user's first sentence "Take me to the rest room", it is determined that the slot value for the slot item "floor" is missing, so it cannot be decided which rest room the user wants to go to; from the second sentence "The one on the first floor", the slot value "first floor" for the slot item "floor" is obtained, and "The one on the first floor" is determined to be a supplement to "Take me to the rest room". Assuming response data A corresponds to "Take me to the rest room" and response data B corresponds to "The one on the first floor", a replacement tag containing the information identifier of response data A is attached to response data B. After receiving response data B and its replacement tag, the smart device finds response data A in the output queue according to that identifier and replaces it with response data B, so that the smart device outputs response data B instead of response data A.
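The slot-complement test can be sketched as follows; the dictionary schema (slot item mapped to value, with None marking a missing value) is an assumption made for illustration.

```python
def is_supplement(new_slots: dict, old_slots: dict) -> bool:
    """True if the new utterance fills only slot items whose values were
    missing in the earlier utterance, i.e. it complements it."""
    filled = {k for k, v in new_slots.items() if v is not None}
    missing = {k for k, v in old_slots.items() if v is None}
    return bool(filled) and filled <= missing

# "Take me to the rest room"   -> {"place": "rest room", "floor": None}
# "The one on the first floor" -> {"floor": "first floor"}
assert is_supplement({"floor": "first floor"},
                     {"place": "rest room", "floor": None})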
In practical applications, slips of the tongue also occur when a user speaks, in which case the user immediately follows up to correct the error in the previous utterance. For example, the user says "Take me to the lobby" and then immediately corrects it with "No, to the coffee shop". Without special handling, the smart device would output the response data for both sentences, first taking the user to the lobby and then to the coffee shop, although the user's true purpose is clearly the coffee shop.
For this reason, in specific implementation, based on the response data and the response data determined before the response data, the execution sequence tag corresponding to the response data may be determined as follows: and if the semantic recognition result of the audio stream data corresponding to the response data is determined to be the correction of the semantic recognition result of the audio stream data corresponding to any response data in the determined response data, determining that the execution sequence tag corresponding to the response data is a replacement tag, wherein the replacement tag is used for indicating the intelligent device to replace the response data corresponding to the information identifier in the replacement tag with the response data identified by the replacement tag.
In specific implementation, it may be determined that the semantic recognition result of the audio stream data corresponding to the response data is a correction of the semantic recognition result of the audio stream data corresponding to any one of the determined response data by: and if the semantic recognition result of the audio stream data corresponding to the response data contains negative intention and the semantic recognition result of the audio stream data corresponding to the response data is different from the slot value of the same slot item in the semantic recognition result of the audio stream data corresponding to any response data in the determined response data, determining that the semantic recognition result of the audio stream data corresponding to the response data is the correction of the semantic recognition result of the audio stream data corresponding to any response data.
In specific implementation, whether the semantic recognition result of the audio stream data corresponding to the response data contains a negative intention may be recognized through technologies such as semantic recognition and semantic understanding, or determined through preset keywords, for example words such as "no", "not", and "wrong". When a negative intention is recognized, the slot information required by the response data of the previous sentence is matched against the determined response data, from which it can be decided whether the current semantic recognition result is a correction of the previous one.
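A hedged sketch of this correction test, approximating negative-intention detection with the keyword approach mentioned above; the keyword list and slot schema are illustrative assumptions.

```python
NEGATION_KEYWORDS = ("no", "not", "wrong")  # illustrative preset keywords

def is_correction(new_text: str, new_slots: dict, old_slots: dict) -> bool:
    """True if the new utterance carries a negative intention (here
    approximated by keyword spotting) and assigns a different value to a
    slot item that the earlier utterance also filled."""
    words = new_text.lower().replace(",", " ").replace(".", " ").split()
    if not any(k in words for k in NEGATION_KEYWORDS):
        return False
    shared = set(new_slots) & set(old_slots)
    return any(new_slots[k] is not None and old_slots[k] is not None
               and new_slots[k] != old_slots[k] for k in shared)

# "No, to the coffee shop": "place" changes from "lobby" to "coffee shop"
assert is_correction("No, to the coffee shop",
                     {"place": "coffee shop"}, {"place": "lobby"})
```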
For example, if the user's first sentence is "Take me to the lobby" and the second is "No, to the coffee shop", it is recognized that "No, to the coffee shop" contains a negative intention; since the slot value of the slot item "place" in the semantic recognition result of "Take me to the lobby" is "lobby" while the second sentence gives "coffee shop", it is determined that "No, to the coffee shop" corrects the slot item "place" of "Take me to the lobby". Assuming response data A corresponds to "Take me to the lobby" and response data B corresponds to "No, to the coffee shop", a replacement tag containing the information identifier of response data A is attached to response data B. After receiving response data B and its replacement tag, the smart device finds response data A in the output queue according to that identifier and replaces it with response data B, so that the smart device can output response data B instead of response data A.
In particular, the user generally supplements or corrects the last input sentence in time, so to improve the processing efficiency, the current response data may be only compared with the response data determined last time, and if it is determined that the semantic recognition result of the audio stream data corresponding to the current response data is the correction or supplement of the semantic recognition result of the audio stream data of the response data determined last time, the execution sequence tag corresponding to the current response data is determined to be the replacement tag.
In specific implementation, the execution sequence tag may be determined as follows: if the audio stream data corresponding to the current response data and that corresponding to the most recently determined response data belong to the same audio stream data obtained by VAD detection, or the difference between their time information is smaller than a preset time difference, and the semantic recognition result of the current audio stream data is a correction or supplement of that of the most recently determined response data, the execution sequence tag of the current response data is determined to be a replacement tag.
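Combining these conditions into one decision, reusing the hypothetical helpers from the earlier sketches (the 5-second threshold is an assumed value; the patent leaves the preset time difference open):

```python
MAX_GAP_SECONDS = 5.0  # assumed preset time difference

def replacement_tag(new: Response, last: Response, gap_seconds: float,
                    new_text: str, new_slots: dict,
                    last_slots: dict) -> dict | None:
    """Same VAD detection, or a small time gap, plus a supplement or a
    correction of the previous utterance, yields a replacement tag that
    names the response data to be replaced."""
    close = (new.vad_session == last.vad_session
             or gap_seconds < MAX_GAP_SECONDS)
    related = (is_supplement(new_slots, last_slots)
               or is_correction(new_text, new_slots, last_slots))
    if close and related:
        return {"type": "replace", "target_sid": last.sid}
    return None
```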
Therefore, the method provided by the embodiment of the invention can effectively handle the various irregularities of users' spoken expression, making the smart device more intelligent during interaction.
Referring to fig. 4, an embodiment of the present invention provides a method for controlling output of a man-machine conversation, which is applied to the intelligent device side shown in fig. 1, and specifically includes the following steps:
S401, sending the collected audio stream data to a server.
S402, receiving response data obtained based on the audio stream data and an execution sequence tag corresponding to the response data, both sent by the server.
The method executed by the server side may refer to the method shown in fig. 2, and is not described again.
And S403, executing the response data according to the execution sequence tag.
In specific implementation, the intelligent device acquires response data of the head of the queue from the output queue and outputs the response data. Further, the response data is deleted from the output queue.
As a possible implementation manner, step S403 specifically includes: and if the execution sequence tag is an insertion tag, executing the response data according to the execution sequence of the response data indicated by the insertion tag among the response data received by the intelligent equipment.
In specific implementation, when the execution sequence tag is an insertion tag, the response data identified by the insertion tag is inserted into the output queue at a position behind the response data corresponding to the identification information in the insertion tag, and if the response data corresponding to the identification information in the insertion tag does not exist in the output queue, the response data is added to the head of the output queue.
For example, after receiving the response data D and its corresponding insertion tag, the smart device finds the storage location of response data A in the output queue shown in Table 2.1 according to the identification information SID-A in the insertion tag, and inserts response data D after response data A, obtaining the output queue shown in Table 2.2.
TABLE 2.1
Identification information | Response data |
SID-A | A |
SID-B | B |
SID-C | C |
TABLE 2.2
Identification information | Response data |
SID-A | A |
SID-D | D |
SID-B | B |
SID-C | C |
In practical applications, the output queue in the smart device stores only received response data that has not yet been executed; that is, response data that has already been executed or is currently being executed is not kept in the queue. If the identification information in the insertion tag is not found in the output queue, the response data can be directly inserted at the head of the output queue.
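On the device side, this insertion behaviour (including the head-of-queue fallback) might look like the following minimal sketch; the deque-based output queue and the tag shape are assumptions carried over from the earlier sketches.

```python
from collections import deque

def apply_insert_tag(queue: deque, tag: dict, new: Response) -> None:
    """Insert `new` right after the response the tag points at; if that
    response is no longer queued (already executed or executing), put
    `new` at the head of the output queue instead."""
    for i, r in enumerate(queue):
        if r.sid == tag.get("after_sid"):
            queue.insert(i + 1, new)
            return
    queue.appendleft(new)
```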
As a possible implementation manner, step S403 specifically includes: if the execution sequence tag is an interruption tag, stopping the response data currently executed by the intelligent equipment, and executing the response data identified by the interruption tag.
In specific implementation, after receiving response data with an interruption tag, the smart device directly terminates the currently executing response data and immediately executes the response data identified by the tag, without adding it to the output queue; once that response data finishes executing, the device resumes with the response data at the head of the output queue.
For example, while the smart device is executing the response data corresponding to "Introduce the blue-and-white porcelain", it receives the response data corresponding to "Take me around the museum". Because an interruption tag is attached to the latter, the smart device stops executing the response data for "Introduce the blue-and-white porcelain" and directly executes the response data for "Take me around the museum".
It should be noted that response data carrying an interruption tag does not need to be added to the output queue; it is output directly by the smart device.
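A sketch of this device-side interruption handling; `player` stands for a hypothetical output component with stop()/play() methods, since the patent does not name the device's playback interface.

```python
def apply_interrupt_tag(player, new: Response) -> None:
    """Interrupting response data bypasses the output queue entirely:
    cut off the current output, play the new response immediately, and
    let the queue head resume once it finishes."""
    player.stop()
    player.play(new)
```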
As a possible implementation manner, step S403 specifically includes: if the execution sequence tag is a skip tag, the response data identified by the skip tag is skipped when the response data is executed.
In specific implementation, when the execution sequence tag is a skip tag, the smart device may either delete the response data identified by the skip tag directly or add it to the tail of the output queue. If the device chooses to add it to the tail, then when that response data's turn comes during execution, it is not executed but deleted directly.
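Both options can be sketched as follows, using the `skip` flag assumed on the Response structure earlier:

```python
from collections import deque

def apply_skip_tag(queue: deque, new: Response, defer: bool = False) -> None:
    """Either drop the skip-tagged response immediately (default), or
    park it at the queue tail and discard it unexecuted when its turn
    comes (the output loop checks the `skip` flag before playing)."""
    if defer:
        new.skip = True
        queue.append(new)
    # otherwise: do nothing, i.e. the response is never enqueued

def pop_next(queue: deque) -> Response | None:
    """Output-loop helper: deletes skip-tagged entries without playing."""
    while queue:
        r = queue.popleft()
        if not r.skip:
            return r
    return None
```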
As a possible implementation manner, step S403 specifically includes: and if the execution sequence tag is a replacement tag, replacing response data corresponding to the information identifier in the replacement tag in the received response data with the response data identified by the replacement tag.
In specific implementation, when the execution sequence tag is a replacement tag, the intelligent device replaces response data corresponding to the information identifier in the replacement tag in the output queue with the response data identified by the replacement tag.
For example, after receiving response data B with a replacement tag, the smart device finds response data A corresponding to the information identifier in the replacement tag in the output queue, and replaces response data A with response data B, so that the smart device executes response data B instead of response data A. In specific implementation, if no response data corresponding to the information identifier exists in the output queue, response data A has either already been executed or is currently being executed. If the smart device is currently executing response data A, it can interrupt the execution and directly execute response data B; if response data A has already finished executing, response data B can be stored at the head of the output queue and executed once the currently executing response data completes.
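The replacement handling, including both fallback branches, might be sketched like this; `player.current_sid()` is a hypothetical query for the response currently being output.

```python
def apply_replace_tag(queue: deque, player, tag: dict, new: Response) -> None:
    """Swap the targeted response for the new one in the output queue;
    if it is no longer queued, either interrupt it (still executing) or
    schedule the new response next (already executed)."""
    for i, r in enumerate(queue):
        if r.sid == tag["target_sid"]:
            queue[i] = new          # replace A with B in place
            return
    if player.current_sid() == tag["target_sid"]:
        player.stop()               # A is being output: cut it off
        player.play(new)            # and execute B directly
    else:
        queue.appendleft(new)       # A already finished: run B next
```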
In practical applications, many situations can prevent response data from being determined in time: speech recognition, semantic recognition, or semantic understanding may take a long time, or non-semantic content may be recognized, such as filler expressions like "um", "uh", or "hold on". In these cases the smart device cannot receive response data and therefore cannot reply to the user.
To address this situation, on the basis of any of the above embodiments, the method according to an embodiment of the present invention further includes the step: if no response data returned by the server is received within the timeout duration and no response data exists in the output queue, outputting preset broadcast information. The preset broadcast information may be speech such as "Okay", "Got it", "Going to XX right away", or "Let me think about it". For example, if the average time from the smart device sending audio stream data to receiving the corresponding response data from the server is 5 seconds, the timeout duration may be set to 30 seconds; if the corresponding response data has still not arrived after 30 seconds and no response data is waiting to be executed in the output queue, the smart device may output the preset broadcast information to reassure the user.
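A sketch of this timeout fallback under the stated numbers; `receive` and `speak` are hypothetical device callbacks, and the phrase list is illustrative.

```python
import random
import time
from collections import deque

TIMEOUT_SECONDS = 30          # ~6x the 5-second average latency above
FALLBACK_PHRASES = ["Okay.", "Got it.", "Let me think about that..."]

def await_response(receive, queue: deque, speak) -> None:
    """Poll the server connection until the timeout; if nothing arrived
    and the output queue is empty, soothe the user with a canned line."""
    deadline = time.monotonic() + TIMEOUT_SECONDS
    while time.monotonic() < deadline:
        if receive(timeout=0.5) is not None:
            return                 # normal path: response data arrived
    if not queue:
        speak(random.choice(FALLBACK_PHRASES))
```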
According to the output control method of the man-machine conversation, disclosed by the embodiment of the invention, the sequence of outputting the response data corresponding to each sentence can be selectively adjusted according to the input of a plurality of sentences by the user, so that the intelligent equipment can output the response data in a way close to the natural interaction of human beings, and the man-machine interaction process is more natural.
As shown in fig. 5, based on the same inventive concept as the above-mentioned output control method of human-computer interaction, an embodiment of the present invention further provides an output control device 50 of human-computer interaction, including: a speech processing module 501, a response data determination module 502, a tag determination module 503, and a control module 504.
The voice processing module 501 is configured to perform voice processing on audio stream data acquired by the intelligent device in real time.
The response data determination module 502 is configured to determine response data for the audio stream data according to the voice processing result.
The tag determination module 503 is configured to determine an execution sequence tag corresponding to the response data.
The control module 504 is configured to control the smart device to execute the response data according to the execution sequence tag.
Optionally, the tag determination module is specifically configured to: determining a priority of the response data; and determining the execution sequence label corresponding to the response data based on the priority of the response data and the priority of the response data determined before the response data.
Further, the tag determination module 503 is specifically configured to: if the audio stream data corresponding to the response data and the determined audio stream data corresponding to the response data belong to the same audio stream data obtained by VAD detection, determine the arrangement position of the response data among the determined response data according to the priority of the response data and the priority of the determined response data, in order from high priority to low; and determine an insertion tag corresponding to the response data according to the arrangement position of the response data, as the execution sequence tag corresponding to the response data, wherein the insertion tag is used for indicating the execution sequence of the response data identified by the insertion tag among the response data received by the intelligent device.
Further, the tag determination module 503 is specifically configured to: determining the priority of the response data based on the semantic recognition result of the audio stream data corresponding to the response data and/or the visual information acquired by the intelligent equipment; or determining the priority of the response data according to the audio stream data corresponding to the response data and the determined time information of the audio stream data corresponding to the response data.
Optionally, the tag determining module 503 is specifically configured to: and if the audio stream data corresponding to the response data and the audio stream data corresponding to the response data determined last time belong to audio stream data obtained by different VAD detections, determining that the execution sequence tag corresponding to the response data is an interruption tag, wherein the interruption tag is used for indicating that the response data identified by the interruption tag can interrupt the response data currently executed by the intelligent device.
Optionally, the tag determining module 503 is specifically configured to: and determining an execution sequence label corresponding to the response data based on the response data and the response data determined before the response data.
Further, the tag determination module 503 is specifically configured to: and if the audio stream data corresponding to the response data and the determined audio stream data corresponding to the response data belong to the same audio stream data obtained by VAD detection, and the response data is the same as at least one response data in the determined response data, determining that the execution sequence tag corresponding to the response data is a skip tag, wherein the skip tag is used for indicating the intelligent device to skip the response data identified by the skip tag when executing the response data.
Further, the tag determination module 503 is specifically configured to: and if the semantic recognition result of the audio stream data corresponding to the response data is determined to be supplement or correction of the semantic recognition result of the audio stream data corresponding to any response data in the determined response data, determining that the execution sequence tag corresponding to the response data is a replacement tag, wherein the replacement tag is used for indicating the intelligent device to replace the response data corresponding to the information identifier in the replacement tag with the response data identified by the replacement tag.
Further, the label determination module 503 is further configured to: if the slot position item with the slot position value in the semantic recognition result of the audio stream data corresponding to the response data is the same as the slot position item with the slot position value missing in the semantic recognition result of the audio stream data corresponding to any response data in the determined response data, determining that the semantic recognition result of the audio stream data corresponding to the response data is complementary to the semantic recognition result of the audio stream data corresponding to any response data; or if the semantic recognition result of the audio stream data corresponding to the response data contains a negative intention and the semantic recognition result of the audio stream data corresponding to the response data is different from the slot value of the same slot item in the semantic recognition result of the audio stream data corresponding to any response data in the determined response data, determining that the semantic recognition result of the audio stream data corresponding to the response data is the correction of the semantic recognition result of the audio stream data corresponding to any response data.
The human-computer conversation output control device and the human-computer conversation output control method provided by the embodiment of the invention adopt the same inventive concept, can obtain the same beneficial effects, and are not repeated herein.
As shown in fig. 6, based on the same inventive concept as the above-mentioned output control method of human-computer interaction, an embodiment of the present invention further provides an output control device 60 of human-computer interaction, including: a data sending module 601, a data receiving module 602 and an executing module 603.
The data sending module 601 is configured to send the collected audio stream data to the server.
The data receiving module 602 is configured to receive response data obtained based on the audio stream data and an execution sequence tag corresponding to the response data, where the response data is sent by the server.
The execution module 603 is configured to execute the response data according to the execution order tag.
Optionally, the executing module 603 is specifically configured to: and if the execution sequence tag is an insertion tag, executing the response data according to the execution sequence of the response data indicated by the insertion tag among the response data received by the intelligent equipment.
Optionally, the executing module 603 is specifically configured to: if the execution sequence tag is an interruption tag, stopping the response data currently executed by the intelligent equipment, and executing the response data identified by the interruption tag.
Optionally, the executing module 603 is specifically configured to: if the execution sequence tag is a skip tag, the response data identified by the skip tag is skipped when the response data is executed.
Optionally, the executing module 603 is specifically configured to: and if the execution sequence tag is a replacement tag, replacing response data corresponding to the information identifier in the replacement tag in the received response data with the response data identified by the replacement tag.
Optionally, the device further comprises a timeout broadcast module, configured to output preset broadcast information if no response data returned by the server is received within the timeout duration and no response data exists in the output queue.
The human-computer conversation output control device and the human-computer conversation output control method provided by the embodiment of the invention adopt the same inventive concept, can obtain the same beneficial effects, and are not repeated herein.
Based on the same inventive concept as the above-mentioned man-machine conversation output control method, an embodiment of the present invention further provides an electronic device, which may be specifically a control device or a control system inside an intelligent device, or an external device communicating with the intelligent device, such as a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a server, and the like. As shown in fig. 7, the electronic device 70 may include a processor 701 and a memory 702.
Memory 702 may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides the processor with program instructions and data stored in the memory. In the embodiment of the present invention, the memory may be used to store a program of an output control method of a man-machine conversation.
The processor 701 may be a CPU (central processing unit), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or a CPLD (Complex Programmable Logic Device), and implements the output control method of the man-machine interaction in any of the above embodiments according to an obtained program instruction by calling a program instruction stored in a memory.
An embodiment of the present invention provides a computer-readable storage medium for storing computer program instructions for the electronic device, which includes a program for executing the output control method of human-machine interaction.
The computer storage media may be any available media or data storage device that can be accessed by a computer, including but not limited to magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
Based on the same inventive concept as the output control method of the human-computer conversation, an embodiment of the present invention provides a computer program product including a computer program stored on a computer-readable storage medium, the computer program including program instructions that, when executed by a processor, implement the output control method of the human-computer conversation in any of the above embodiments.
The above embodiments are used only to describe the technical solutions of the present application in detail and to help understand the method of the embodiments of the present invention; they should not be construed as limiting the embodiments of the present invention. Variations or substitutions readily apparent to those skilled in the art are intended to fall within the scope of the embodiments of the present invention.
Claims (10)
1. An output control method for a man-machine conversation, comprising:
carrying out voice processing on audio stream data acquired by intelligent equipment in real time;
determining response data for the audio stream data according to a voice processing result;
determining an execution sequence tag corresponding to the response data;
and controlling the intelligent equipment to execute the response data according to the execution sequence tag.
2. The method according to claim 1, wherein the determining the execution sequence tag corresponding to the response data specifically includes:
determining a priority of the response data;
and determining an execution sequence tag corresponding to the response data based on the priority of the response data and the priority of the response data determined before the response data.
3. The method according to claim 2, wherein the determining the execution sequence tag corresponding to the response data based on the priority of the response data and the priority of the response data determined before the response data specifically includes:
if the audio stream data corresponding to the response data and the determined audio stream data corresponding to the response data belong to the same audio stream data obtained by VAD detection, determining the arrangement position of the response data among the determined response data according to the priority of the response data and the priority of the determined response data and the sequence from high priority to low priority;
and determining an insertion tag corresponding to the response data according to the arrangement position of the response data, and using the insertion tag as an execution sequence tag corresponding to the response data, wherein the insertion tag is used for indicating the execution sequence of the response data identified by the insertion tag among the response data received by the intelligent device.
4. The method according to claim 1, wherein the determining the execution sequence tag corresponding to the response data specifically includes:
and if the audio stream data corresponding to the response data and the audio stream data corresponding to the response data determined last time belong to audio stream data obtained by different VAD detections, determining that the execution sequence tag corresponding to the response data is an interruption tag, wherein the interruption tag is used for indicating that the response data identified by the interruption tag can interrupt the response data currently executed by the intelligent device.
5. The method according to claim 1, wherein the determining the execution sequence tag corresponding to the response data specifically includes:
and determining an execution sequence tag corresponding to the response data based on the response data and the response data determined before the response data.
6. The method according to claim 5, wherein the determining, based on the response data and response data determined before the response data, an execution sequence tag corresponding to the response data specifically includes:
if the audio stream data corresponding to the response data and the audio stream data corresponding to the determined response data belong to the same audio stream data obtained by VAD detection, and the response data is the same as at least one response data in the determined response data, determining that the execution sequence tag corresponding to the response data is a skip tag, wherein the skip tag is used for indicating the intelligent device to skip the response data identified by the skip tag when executing the response data.
7. The method according to claim 5, wherein the determining, based on the response data and response data determined before the response data, an execution sequence tag corresponding to the response data specifically includes:
if it is determined that the semantic recognition result of the audio stream data corresponding to the response data is supplement or correction of the semantic recognition result of the audio stream data corresponding to any response data in the determined response data, determining that the execution sequence tag corresponding to the response data is a replacement tag, where the replacement tag is used to instruct the smart device to replace the response data corresponding to the information identifier in the replacement tag with the response data identified by the replacement tag.
8. An output control device for man-machine interaction, comprising:
the voice processing module is used for carrying out voice processing on the audio stream data acquired by the intelligent equipment in real time;
a response data determination module for determining response data for the audio stream data according to a voice processing result;
the label determining module is used for determining an execution sequence label corresponding to the response data;
and the control module is used for controlling the intelligent equipment to execute the response data according to the execution sequence tag.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910579362.3A CN110299152A (en) | 2019-06-28 | 2019-06-28 | Interactive output control method, device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110299152A true CN110299152A (en) | 2019-10-01 |
Family
ID=68029538
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910579362.3A Pending CN110299152A (en) | 2019-06-28 | 2019-06-28 | Interactive output control method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110299152A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102831894A (en) * | 2012-08-09 | 2012-12-19 | 华为终端有限公司 | Command processing method, command processing device and command processing system |
CN108520747A (en) * | 2018-03-29 | 2018-09-11 | 浙江吉利汽车研究院有限公司 | A kind of on-vehicle control apparatus with speech identifying function |
CN108877792A (en) * | 2018-05-30 | 2018-11-23 | 北京百度网讯科技有限公司 | For handling method, apparatus, electronic equipment and the computer readable storage medium of voice dialogue |
CN108847225A (en) * | 2018-06-04 | 2018-11-20 | 上海木木机器人技术有限公司 | A kind of robot and its method of the service of airport multi-person speech |
CN109671427A (en) * | 2018-12-10 | 2019-04-23 | 珠海格力电器股份有限公司 | Voice control method and device, storage medium and air conditioner |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110853639B (en) * | 2019-10-23 | 2023-09-01 | 天津讯飞极智科技有限公司 | Voice transcription method and related device |
CN110853639A (en) * | 2019-10-23 | 2020-02-28 | 天津讯飞极智科技有限公司 | Voice transcription method and related device |
US11967152B2 (en) | 2019-11-19 | 2024-04-23 | Tencent Technology (Shenzhen) Company Limited | Video classification model construction method and apparatus, video classification method and apparatus, device, and medium |
CN110971685A (en) * | 2019-11-29 | 2020-04-07 | 腾讯科技(深圳)有限公司 | Content processing method, content processing device, computer equipment and storage medium |
CN110971685B (en) * | 2019-11-29 | 2021-01-01 | 腾讯科技(深圳)有限公司 | Content processing method, content processing device, computer equipment and storage medium |
WO2021103741A1 (en) * | 2019-11-29 | 2021-06-03 | 腾讯科技(深圳)有限公司 | Content processing method and apparatus, computer device, and storage medium |
US12073820B2 (en) | 2019-11-29 | 2024-08-27 | Tencent Technology (Shenzhen) Company Limited | Content processing method and apparatus, computer device, and storage medium |
CN111443801A (en) * | 2020-03-25 | 2020-07-24 | 北京百度网讯科技有限公司 | Man-machine interaction method, device, equipment and storage medium |
CN111443801B (en) * | 2020-03-25 | 2023-10-13 | 北京百度网讯科技有限公司 | Man-machine interaction method, device, equipment and storage medium |
CN112185392A (en) * | 2020-09-30 | 2021-01-05 | 深圳供电局有限公司 | Voice recognition processing system for power supply intelligent client |
CN112185393A (en) * | 2020-09-30 | 2021-01-05 | 深圳供电局有限公司 | Voice recognition processing method for power supply intelligent client |
CN113284404A (en) * | 2021-04-26 | 2021-08-20 | 广州九舞数字科技有限公司 | Electronic sand table display method and device based on user actions |
CN113921006A (en) * | 2021-09-30 | 2022-01-11 | 安徽有声电子科技有限公司 | Voice signal amplifier in voice control system |
CN114490090B (en) * | 2022-04-02 | 2022-07-01 | 广东茉莉数字科技集团股份有限公司 | Internet data center demand response optimization method based on multi-objective evolutionary algorithm |
CN114490090A (en) * | 2022-04-02 | 2022-05-13 | 广东茉莉数字科技集团股份有限公司 | Internet data center demand response optimization method based on multi-objective evolutionary algorithm |
CN115408510B (en) * | 2022-11-02 | 2023-01-17 | 深圳市人马互动科技有限公司 | Plot interaction node-based skipping method and assembly and dialogue development system |
CN115408510A (en) * | 2022-11-02 | 2022-11-29 | 深圳市人马互动科技有限公司 | Plot interaction node-based skipping method and assembly and dialogue development system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191001 |