CN112466302B - Voice interaction method and device, electronic equipment and storage medium - Google Patents

Voice interaction method and device, electronic equipment and storage medium

Info

Publication number
CN112466302B
CN112466302B (application CN202011324401.4A)
Authority
CN
China
Prior art keywords
recognition result
query statement
determining
preset time
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011324401.4A
Other languages
Chinese (zh)
Other versions
CN112466302A (en)
Inventor
吴震
周茂仁
刘兵
崔亚峰
革家象
郭启行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011324401.4A
Publication of CN112466302A
Application granted
Publication of CN112466302B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • G10L2015/225 Feedback of the input speech

Abstract

The application discloses a voice interaction method and apparatus, an electronic device, and a storage medium, relating to artificial intelligence fields such as speech technology and natural language processing. The specific implementation scheme is as follows: a first recognition result corresponding to a query statement is determined within a first preset time after the query statement is acquired; a first reply sentence is determined according to the first recognition result; a second recognition result corresponding to the query statement is determined within a second preset time after the query statement is acquired, wherein the second preset time is longer than the first preset time; and the first reply sentence is played in a case where the second recognition result is consistent with the first recognition result. The method and apparatus can increase the speed of voice interaction while ensuring the accuracy and reliability of the voice output sentences.

Description

Voice interaction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, in particular to artificial intelligence technologies such as speech technology and natural language processing, and more particularly to a voice interaction method and apparatus, an electronic device, and a storage medium.
Background
With the vigorous development of computer technology, artificial intelligence has also advanced rapidly, and smart devices such as smart televisions, smart speakers, and VR glasses are used ever more widely. Voice interaction is an indispensable part of such devices, so performing voice interaction quickly and accurately is vital.
Disclosure of Invention
The application provides a voice interaction method and device, electronic equipment and a storage medium.
According to an aspect of the present application, there is provided a method comprising:
determining a first recognition result corresponding to a query statement within a first preset time after the query statement is obtained;
determining a first reply sentence according to the first recognition result;
determining a second recognition result corresponding to the query statement within a second preset time after the query statement is obtained, wherein the second preset time is longer than the first preset time;
playing the first reply sentence in a case where the second recognition result is consistent with the first recognition result.
According to another aspect of the application, there is provided an apparatus comprising:
the first recognition module is used for determining a first recognition result corresponding to the query statement within a first preset time after the query statement is obtained;
the first reply module is used for determining a first reply sentence according to the first recognition result;
the second recognition module is used for determining a second recognition result corresponding to the query statement within a second preset time after the query statement is acquired, wherein the second preset time is longer than the first preset time;
and the playing module is used for playing the first reply sentence under the condition that the second recognition result is consistent with the first recognition result.
According to another aspect of the present application, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of voice interaction as described in embodiments of the above-described aspect.
According to another aspect of the present application, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of voice interaction described in the embodiments of the above aspect.
The method, the device, the electronic equipment and the storage medium for voice interaction have the following beneficial effects:
determining a first recognition result corresponding to the query statement within a first preset time after the query statement is acquired; determining a first reply sentence according to the first recognition result; determining a second recognition result corresponding to the query statement within a second preset time after the query statement is acquired, wherein the second preset time is longer than the first preset time; and playing the first reply sentence in a case where the second recognition result is consistent with the first recognition result. The method and the device can thus improve the speed of voice interaction while ensuring the accuracy and reliability of the voice output sentences.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
FIG. 1A is a schematic flowchart of a voice interaction method according to an embodiment of the present application;
FIG. 1B is a schematic diagram illustrating a process of a voice interaction method according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating a method of voice interaction according to another embodiment of the present application;
FIG. 3 is a flow chart illustrating a method of voice interaction according to yet another embodiment of the present application;
FIG. 4 is a schematic diagram of an apparatus for voice interaction according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus for voice interaction according to another embodiment of the present application;
FIG. 6 is a schematic diagram of an apparatus for voice interaction according to another embodiment of the present application;
FIG. 7 is a block diagram of an electronic device for implementing a method of voice interaction of an embodiment of the present application.
Detailed Description
The following describes exemplary embodiments of the present application with reference to the accompanying drawings; various details of the embodiments are included to aid understanding and should be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are likewise omitted below for clarity and conciseness.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), spanning both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning, deep learning, big data processing, knowledge graph technology, and the like.
The key speech technologies in the computer field are automatic speech recognition (ASR) and text-to-speech synthesis (TTS). Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most important human-computer interaction modes, with advantages over other interaction modes.
Natural language processing is the use of computers to process, understand, and employ human languages (such as Chinese and English); it is an interdisciplinary field between computer science and linguistics, often referred to as computational linguistics. Natural language is a fundamental hallmark that distinguishes humans from other animals, and human thinking is inseparable from language, so natural language processing embodies one of the highest goals of artificial intelligence: only when a computer can process natural language can a machine be said to achieve real intelligence.
Methods, apparatuses, electronic devices, and storage media for voice interaction according to embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1A is a schematic flowchart of a voice interaction method according to an embodiment of the present application.
The voice interaction method of the embodiment of the present application can be executed by the voice interaction apparatus provided in the embodiment of the present application, and the apparatus can be configured in an electronic device to perform voice interaction.
As shown in fig. 1A, the method of voice interaction includes:
step 101, determining a first identification result corresponding to the query statement within a first preset time after the query statement is acquired.
The query statement is a statement input by the user to which the electronic device is expected to reply. For example, it may express different requirements such as "play a song", "open the APP", or "I want to order takeout", which is not limited in this embodiment.
The first preset time is a period that begins when the tail sound of the user's utterance ends; in other words, it is the duration of a continuous mute state detected by the electronic device. The first recognition result is obtained by performing speech recognition on the query statement within the first preset time after the user's voice ends.
The length of the first preset time can be set as required. In practice, to reduce the user's waiting time, it may be set to, for example, 5 ms, 10 ms, or 20 ms; these values are merely illustrative and do not limit how the first preset time is set in this application.
Step 102, determining a first reply sentence according to the first recognition result.
The first reply sentence is a reply sentence obtained according to the first recognition result. It can be determined in various ways, for example:
Various recognition results and their corresponding reply sentences may be stored in advance. For example, when the recognition result is "listen to a song", the electronic device selects a corresponding reply sentence, which may be "OK, playing the song"; when the recognition result is "order takeout", the selected reply sentence may be "opening the takeout APP".
It should be noted that there are many possible recognition results and correspondingly many reply sentences; the above is only an example, and this embodiment does not limit them.
Alternatively, the electronic device may call a corresponding dialogue service according to the first recognition result; the dialogue service judges the user's intention from the current text request and the user's dialogue state, and pulls the corresponding resources to generate the first reply sentence to be broadcast to the user.
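For illustration, the following minimal Python sketch shows both reply-determination variants described above. The table contents, helper names, and the dialogue-service stub are assumptions made for the example, not part of the claimed method.

    # Hypothetical mapping from recognition results to pre-stored reply sentences.
    PRESET_REPLIES = {
        "listen to a song": "OK, playing the song.",
        "order takeout": "Opening the takeout APP.",
    }

    def call_dialogue_service(text_request: str) -> str:
        # Placeholder: a real dialogue service would judge the user's intent
        # from the text request and dialogue state, then pull the corresponding
        # resources to generate the reply to broadcast.
        return "Sorry, I did not understand: " + text_request

    def determine_reply(recognition_result: str) -> str:
        """Look up a pre-stored reply; otherwise fall back to the dialogue service."""
        reply = PRESET_REPLIES.get(recognition_result)
        return reply if reply is not None else call_dialogue_service(recognition_result)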
Step 103, determining a second recognition result corresponding to the query statement within a second preset time after the query statement is acquired, wherein the length of the second preset time is greater than that of the first preset time.
The second preset time is a period counted from the end of the user's utterance, and its length is greater than that of the first preset time. It is understood that the second preset time is likewise a duration of the mute state detected by the electronic device.
In the embodiment of the application, in order to reduce the user's waiting time, the electronic device may perform speech recognition on the query statement within the first preset time after it is obtained to get the first recognition result, and then determine the first reply sentence according to that result. However, since the first preset time is short, the mute state within it may merely be a pause by the user; that is, the first recognition result may correspond to a fragmentary query statement, so the accuracy of the first reply sentence determined from it is low.
In this embodiment, to guard against the first recognition result corresponding to a fragmentary query statement, the acquired voice signal may continue to be recognized after the first recognition result is determined, so that a second recognition result corresponding to the query statement is determined within the second preset time after the query statement is obtained.
It can be understood that, because the second recognition result is determined only after a sufficiently long mute state following the query statement, the second recognition result can be regarded as the recognition result corresponding to the complete query statement.
In practical use, the second preset time may be set to be long enough within a range acceptable by a user, so that it is ensured that the second recognition result obtained after the second preset time is accurate.
Step 104, playing the first reply sentence in the case where the second recognition result is consistent with the first recognition result.
Specifically, the obtained first recognition result and the second recognition result are compared, and whether the two recognition results are consistent or not is judged. If the two recognition results are consistent, the first recognition result is complete and accurate, that is, the first reply sentence determined by the first recognition result is accurate, and the first reply sentence obtained before can be directly played.
It should be noted that, in the specific implementation, the second preset time may be long or short, and the specific operations performed may be different in order to improve the efficiency.
For example, when the second preset time is long enough, the electronic device may already have generated the voice data to be played from the first recognition result before the second preset time ends; that is, the first reply sentence is already voice data. In that case, when the second recognition result is obtained, the two recognition results are compared, and the voice data is played directly when they are consistent.
Alternatively, the second preset time may be short, so that the electronic device obtains the second recognition result before voice data to be played has been generated from the first recognition result; that is, the first reply sentence is still text data. In this case, the two recognition results are compared: if they are consistent, the first reply sentence is converted to speech to obtain the voice data; if they are inconsistent, no speech conversion is needed.
It should be noted that, in actual use, if the electronic device determines the first recognition result within the first preset time after acquiring the query statement but then detects a voice signal within the second preset time, it may directly determine that the first recognition result is unavailable. The query statement can then be determined anew and speech recognition performed on it again.
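To make the timing relationships above concrete, here is a hedged Python sketch of the two-window flow; the window lengths and the recognize/generate_reply/synthesize/play callables are illustrative assumptions, since the embodiments leave them to the implementer.

    import concurrent.futures

    FIRST_PRESET_TIME = 0.5   # seconds of silence; illustrative value only
    SECOND_PRESET_TIME = 1.0  # need only be longer than the first window

    def voice_interaction(recognize, generate_reply, synthesize, play):
        """Recognize after the short silence window, prepare the reply in
        parallel with the longer confirmation window, and play the early
        reply only when the two recognition results are consistent."""
        first_result = recognize(silence_window=FIRST_PRESET_TIME)

        with concurrent.futures.ThreadPoolExecutor() as pool:
            # The dialogue call and reply generation overlap with the second,
            # longer recognition window instead of running after it.
            reply_future = pool.submit(generate_reply, first_result)
            second_result = recognize(silence_window=SECOND_PRESET_TIME)
            first_reply = reply_future.result()

        if second_result == first_result:
            play(synthesize(first_reply))  # first result confirmed
        else:
            # Results differ: discard the early reply and answer the
            # complete query statement instead.
            play(synthesize(generate_reply(second_result)))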
The following describes an embodiment of the present application with reference to fig. 1B by taking a specific voice interaction process as an example.
Fig. 1B is a schematic process diagram of a voice interaction method according to an embodiment of the present application.
As shown in fig. 1B, during voice activity detection, the electronic device detects that the user utters the voice signal "I want to hear Rice Fragrance", whose tail sound ends at time 1; a mute state is then detected continuously for the first preset duration, i.e., the mute state holds from time 1 to time 1.5, so the first recognition result is obtained at time 1.5. The electronic device then performs a dialogue call and text-to-speech synthesis according to the first recognition result, obtaining the first reply sentence "good" at time 5. Meanwhile, every signal detected from time 1 until the second preset time elapses is a mute signal, so when the second preset time ends, i.e., at time 2, the electronic device obtains the second recognition result. The first and second recognition results can then be compared, and when they are consistent, the first reply sentence "good" is played.
In a specific implementation, it is also possible that, because the second preset time is relatively short, the electronic device obtains the second recognition result first and then obtains the first reply sentence, i.e. time 2 is before time 5. The above examples should not be construed as limiting the embodiments of the present application.
In this embodiment, the first recognition result and the second recognition result are respectively determined within the first preset time and the second preset time after the query statement is acquired; the two recognition results are then compared, and when they are consistent, the first reply sentence determined from the first recognition result is played. Because the generation of the reply sentence runs in parallel with the recognition of the query statement, the accuracy of the reply sentence is ensured while the waiting time is shortened, so the electronic device can give a reply within the optimal response time, improving the speed of voice interaction.
Since different users have different speaking characteristics, if the electronic device sets the same first preset time for all query statements, the accuracy of the obtained first recognition result may be reduced. In the embodiment of the application, to improve the accuracy of the recognition result, the first preset time may be set according to the attributes of the current interacting user, and voice interaction may then be performed with the set first preset time. The following further describes the voice interaction method provided by the present application with reference to fig. 2, taking the analysis of user attributes from the query statement as an example.
Step 201, determining the user attribute corresponding to the query statement.
The user attribute is used for representing the speaking speed of the user. For example, the user attribute may be an age, a gender, an occupation, and the like of the user, which is not limited in this embodiment.
Since the audio features of users of different ages or genders are usually different, in the embodiment of the present application, the user attribute may be determined according to the obtained audio feature of the query statement.
For example, elderly people may speak more slowly with a lower pitch, while female voices tend to have a higher pitch, so users may be classified as "elderly", "female", and the like.
Alternatively, the user attribute corresponding to the current query statement may be determined according to the degree of matching between the audio features of the current query statement and the known audio features of each user.
For example, if three users interact with the current electronic device, the device can determine each user's audio features from their historical interactions; when a query is received again, it compares the query's audio features with the previously determined features and judges, by the degree of matching, which user attribute the query belongs to.
Alternatively, attribute division may be performed according to the content of the query statement.
For example, if the query statement is "I want to listen to a little story before sleep", it can be inferred that the current user is a "child". If the query statement is "how to get eight-pack abs", it can be inferred that the current user is "male". If the query statement is "how can a mom communicate with her child efficiently", it can be inferred that the current user is "female".
It should be noted that the above methods for determining the user attribute are only examples, and are not intended to be limitations of the method for determining the user attribute in the present application.
Step 202, determining the length of the first preset time according to the user attribute.
Users with different attributes have different speaking characteristics, so different first preset times can be set for them.
For example, a longer first preset time may be set for "elderly" and "child" users, and a shorter one for "female" users. Setting different first preset times according to different user attributes ensures, on the one hand, that the window is long enough for "elderly" and "child" users and, on the other hand, relatively improves efficiency for "female" users.
It should be noted that the user attributes are only examples, and this embodiment does not limit this.
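As one possible realization of steps 201-202, the Python sketch below classifies the user attribute from the query content and maps it to a first preset time. The attribute labels, keywords, and window lengths are assumed values; the embodiments do not prescribe them.

    # Assumed keyword table for content-based attribute division.
    ATTRIBUTE_KEYWORDS = {
        "child": ["story before sleep"],
        "male": ["eight-pack abs"],
        "female": ["mom communicate"],
    }
    # Assumed mapping from user attribute to the first preset time (seconds).
    FIRST_WINDOW_BY_ATTRIBUTE = {
        "elderly": 0.8,  # slower speech: allow a longer pause
        "child": 0.8,
        "female": 0.4,   # shorter window for relatively fast speakers
    }
    DEFAULT_FIRST_WINDOW = 0.5

    def classify_user_attribute(query: str) -> str:
        """Content-based attribute division, as in the examples above."""
        for attribute, keywords in ATTRIBUTE_KEYWORDS.items():
            if any(keyword in query for keyword in keywords):
                return attribute
        return "unknown"

    def first_preset_time(query: str) -> float:
        attribute = classify_user_attribute(query)
        return FIRST_WINDOW_BY_ATTRIBUTE.get(attribute, DEFAULT_FIRST_WINDOW)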
Step 203, determining a first recognition result corresponding to the query statement within a first preset time after the query statement is acquired.
Step 204, determining a first reply sentence according to the first recognition result.
Step 205, determining a second recognition result corresponding to the query statement within a second preset time after the query statement is obtained, wherein the second preset time is longer than the first preset time.
Step 206, determining a second reply sentence according to the second recognition result in a case where the second recognition result is inconsistent with the first recognition result.
The second reply sentence is a reply sentence obtained from the second recognition result. The first and second recognition results are compared; if they are inconsistent, the first recognition result is inaccurate and the second recognition result is the more accurate one. Because the first preset time is shorter than the second preset time, the information contained in the first recognition result may be incomplete, whereas the second recognition result, determined over the longer second preset time, is more complete and accurate. In this case, the second recognition result is used to determine the second reply sentence.
For example, when the query statement input by the user is recognized, the first and second recognition results may differ because the query statement obtained by the electronic device within the first preset time was incomplete. The user may speak intermittently and pause for a long time, causing the electronic device to mistakenly conclude that the utterance has ended, perform recognition within the first preset time, and obtain a first recognition result that is not a complete sentence. The comparison then shows the two recognition results to be inconsistent; the incomplete first recognition result is discarded, and the second recognition result is analyzed to obtain the second reply sentence.
Step 207, playing the second reply sentence.
Specifically, a second reply sentence is determined by analysis according to the second recognition result, and the second reply sentence is taken as a finally output sentence. Thus, after the second reply sentence is determined, the second reply sentence can be played.
In the voice interaction method provided by this embodiment, the user attribute is determined from the query statement, and when the two recognition results are inconsistent, the second recognition result corresponding to the longer second preset time is used to obtain the corresponding second reply sentence as the final output sentence, ensuring the accuracy and reliability of the output reply sentence.
As can be seen from the above embodiments, the accuracy of the first recognition result may not be high enough. To ensure accuracy while improving efficiency, semantic integrity detection may be performed on the first recognition result after it is obtained, before other operations are carried out. The following describes in detail the voice interaction method provided by the present application, taking semantic integrity analysis of the first recognition result as an example, with reference to fig. 3.
Step 301, performing voice activity detection on the detected voice signal.
The voice activity detection is to detect whether the acquired voice signal contains valid voice or silence within the current time period.
The electronic device continuously performs voice activity detection to determine whether the current signal contains valid voice.
Step 302, under the condition that silent segments with a first preset time length are continuously acquired, determining a first recognition result corresponding to the acquired voice signal.
When silent segments lasting the first preset time are continuously acquired, the signals received within that period contain only silence and no valid voice information, so the electronic device can recognize the voice signal acquired before the silence to obtain the first recognition result. That earlier voice signal is the query statement input by the user.
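A minimal sketch of step 302's silence check over framed audio follows; the frame size and energy threshold are assumptions, and a production detector would typically use a trained VAD model instead of a fixed energy threshold.

    FRAME_MS = 10             # assumed frame length
    ENERGY_THRESHOLD = 1e-4   # assumed silence threshold

    def is_silent(frame) -> bool:
        # Treat a frame as silent when its mean energy falls below the threshold.
        return sum(s * s for s in frame) / len(frame) < ENERGY_THRESHOLD

    def silence_window_elapsed(frames, window_ms: int) -> bool:
        """Return True as soon as window_ms of consecutive silent frames has
        been observed, i.e. the preset silence length has been reached."""
        needed = window_ms // FRAME_MS
        run = 0
        for frame in frames:
            run = run + 1 if is_silent(frame) else 0
            if run >= needed:
                return True
        return False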
Step 303, performing semantic integrity detection on the first recognition result to determine a score of the first recognition result.
Performing semantic integrity detection on the first recognition result means detecting whether it is a complete query statement whose semantics are complete and whose expressed meaning is accurate. According to the detection result, the first recognition result is scored, yielding a score that reflects its integrity.
Step 304, determining a first reply sentence according to the first recognition result in the case where the score is greater than the threshold.
Wherein the threshold is a preset value.
If the score is greater than the threshold, the semantic integrity of the first recognition result is high, its expressed meaning is accurate, and it is a complete sentence; it can therefore be treated as the complete query statement, and the first reply sentence is determined according to it.
Step 305, a first reply sentence is played.
Once the first recognition result is determined to be accurate, there is no need to wait for the longer second preset time, and hence no need to obtain the corresponding second recognition result, which reduces the waiting time. The first reply sentence determined according to the first recognition result can be used as the final output sentence and therefore played directly.
Step 306, determining a second recognition result corresponding to the query statement within a second preset time after the query statement is acquired in the case where the score is less than or equal to the threshold, wherein the second preset time is longer than the first preset time.
When the score is less than or equal to the threshold, the semantic integrity of the first recognition result is low: it may not be a complete sentence, its expressed meaning is not accurate enough, and it is unreliable. In this case, the first recognition result is discarded, and the device waits until the second preset time elapses to obtain the corresponding second recognition result.
Step 307, according to the second recognition result, determining a second reply sentence, and playing the second reply sentence.
The second recognition result is accurate and reliable, so that the second reply sentence determined according to the second recognition result is also accurate and reliable, and the second reply sentence can be played as a final output sentence.
It should be noted that, in practical use, by the time the semantic integrity score of the first recognition result exceeds the threshold and the first reply sentence has been determined, the electronic device may already have obtained the second recognition result; in that case, the two recognition results may be compared, and the first reply sentence is played directly when they are consistent.
In this embodiment, voice activity detection is performed on the detected voice signal, and semantic integrity detection is then performed on the first recognition result to determine its score; if the score is greater than the threshold, the first reply sentence is played, and if the score is less than or equal to the threshold, the second reply sentence is determined according to the second recognition result and played. That is, by performing voice activity detection, checking the semantic integrity of the first recognition result, and selecting the corresponding reply sentence as the output according to the detection results, this embodiment not only increases the speed of voice interaction but also improves the accuracy and reliability of the output sentence.
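The decision logic of this embodiment can be summarized in the following hedged Python sketch; completeness_score stands in for whatever semantic integrity model is used, and the threshold is an assumed value.

    SCORE_THRESHOLD = 0.8  # assumed value; the embodiment only requires some threshold

    def interact_with_integrity_check(recognize_first, recognize_second,
                                      completeness_score, generate_reply, play):
        """Play the early reply immediately when the first recognition result
        is semantically complete; otherwise wait for the longer second window
        and answer the complete query statement."""
        first_result = recognize_first()
        if completeness_score(first_result) > SCORE_THRESHOLD:
            # Complete sentence: skip the second recognition window entirely.
            play(generate_reply(first_result))
        else:
            # Possibly fragmentary: discard it and use the result obtained
            # after the second, longer silence window.
            second_result = recognize_second()
            play(generate_reply(second_result))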
In order to implement the foregoing embodiments, an apparatus for voice interaction is further provided in the embodiments of the present application. Fig. 4 is a schematic structural diagram of a voice interaction apparatus according to an embodiment of the present application.
As shown in fig. 4, the voice interaction apparatus 400 includes: a first recognition module 410, a first reply module 420, a second recognition module 430 and a first play module 440.
The first recognition module 410 is configured to determine a first recognition result corresponding to the query statement within a first preset time after the query statement is obtained.
A first reply module 420, configured to determine a first reply sentence according to the first recognition result.
The second recognition module 430 is configured to determine a second recognition result corresponding to the query statement within a second preset time after the query statement is obtained, wherein the second preset time is longer than the first preset time.
A first playing module 440, configured to play the first reply sentence if the second recognition result is consistent with the first recognition result.
Fig. 5 is a schematic structural diagram of a voice interaction device according to another embodiment of the present application. As shown in fig. 5, the voice interaction apparatus 500 includes: a first recognition module 510, a first reply module 520, a second recognition module 530, a first determination module 540, a second determination module 550, a second reply module 560, and a second playing module 570.
The first recognition module 510 is configured to determine a first recognition result corresponding to the query statement within a first preset time after the query statement is obtained.
It is understood that the first recognition module 510 in this embodiment may have the same function and structure as the first recognition module in the above embodiment.
A first reply module 520, configured to determine a first reply sentence according to the first recognition result.
It is understood that the first reply module 520 in the present embodiment may have the same function and structure as the first reply module in the above-described embodiment.
The second recognition module 530 is configured to determine a second recognition result corresponding to the query statement within a second preset time after the query statement is obtained, wherein the second preset time is longer than the first preset time.
It is understood that the second recognition module 530 in this embodiment may have the same function and structure as the second recognition module in the above embodiment.
The first determining module 540 is configured to determine a user attribute corresponding to the query statement.
A second determining module 550, configured to determine the length of the first preset time according to the user attribute.
A second reply module 560, configured to determine a second reply sentence according to the second recognition result if the second recognition result is inconsistent with the first recognition result.
A second playing module 570, configured to play the second reply sentence.
Fig. 6 is a schematic structural diagram of a voice interaction device according to another embodiment of the present application. As shown in fig. 6, the voice interaction apparatus 600 includes: a first recognition module 610, a third determination module 620, a first reply module 630, a first playing module 640, a second reply module 650, and a second playing module 660.
The first recognition module 610 includes:
a voice detection unit 6110, configured to perform voice activity detection on the detected voice signal;
the determining unit 6120 is configured to determine, when a silence segment of a first preset time length is continuously acquired, a first recognition result corresponding to the acquired voice signal.
It is understood that the first recognition module 610 in this embodiment may have the same function and structure as the first recognition module in any of the above embodiments.
A third determining module 620, configured to perform semantic integrity detection on the first recognition result to determine a score of the first recognition result.
A first reply module 630, configured to determine a first reply sentence according to the first recognition result.
It is understood that the first reply module 630 in this embodiment may have the same function and structure as the first reply module in any of the above embodiments.
The first playing module 640 is further configured to, when the score is greater than the threshold, end the operation of obtaining the second recognition result corresponding to the query statement and play the first reply sentence. The first playing module 640 further includes:
a voice conversion unit 6410, configured to perform voice conversion on the first reply sentence to acquire voice data to be played, if the second recognition result is consistent with the first recognition result;
a playing unit 6420, configured to play the voice data.
It is understood that the first playing module 640 in this embodiment may have the same function and structure as the first playing module in any of the above embodiments.
The second reply module 650 is further configured to determine a second reply sentence according to the second recognition result if the score is less than or equal to the threshold.
It is understood that the second reply module 650 in the present embodiment may have the same function and structure as the second reply module in the above-described embodiment.
A second playing module 660, configured to play the second reply sentence.
It is understood that the second playing module 660 in this embodiment may have the same function and structure as the second playing module in the above embodiment.
It should be noted that the explanation of the foregoing voice interaction method embodiment is also applicable to the voice interaction apparatus of the embodiment, and therefore, the explanation is not repeated here.
The voice interaction device of the embodiment of the application determines a first recognition result corresponding to a query statement within a first preset time after the query statement is acquired; determines a first reply sentence according to the first recognition result; determines a second recognition result corresponding to the query statement within a second preset time after the query statement is acquired, wherein the second preset time is longer than the first preset time; and plays the first reply sentence in a case where the second recognition result is consistent with the first recognition result. The device can thus improve the speed of voice interaction while ensuring the accuracy and reliability of the voice output sentences.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 7 is a block diagram of an electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 7, the electronic apparatus includes: one or more processors 701, a memory 702, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing some of the necessary operations (e.g., as an array of servers, a group of blade servers, or a multi-processor system). One processor 701 is illustrated in fig. 7.
The memory 702 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the method of voice interaction provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of voice interaction provided herein.
The memory 702, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the method of voice interaction in the embodiments of the present application (e.g., the first recognition module 410, the first reply module 420, the second recognition module 430, and the first playing module 440 shown in fig. 4). The processor 701 executes various functional applications of the server and performs data processing, i.e., implements the method of voice interaction in the above-described method embodiments, by running the non-transitory software programs, instructions, and modules stored in the memory 702.
The memory 702 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the electronic device for voice interaction, and the like. Further, the memory 702 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 702 may optionally include memory located remotely from the processor 701, which may be connected over a network to an electronic device for methods of voice interaction. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method of voice interaction may further comprise: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or other means, and fig. 7 illustrates an example of a connection by a bus.
The input device 703 may receive input numeric or character information, and generate key signal inputs related to user settings and function control of the electronic apparatus of the method of voice interaction, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 704 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, a host product in the cloud computing service system that addresses the defects of difficult management and weak service scalability in traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
According to the technical solution of the embodiment of the application, a first recognition result corresponding to a query statement is determined within a first preset time after the query statement is obtained; a first reply sentence is determined according to the first recognition result; a second recognition result corresponding to the query statement is determined within a second preset time after the query statement is acquired, wherein the second preset time is longer than the first preset time; and the first reply sentence is played in a case where the second recognition result is consistent with the first recognition result. This can improve the speed of voice interaction while ensuring the accuracy and reliability of the voice output sentences.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved, which is not limited herein.
The above-described embodiments are not intended to limit the scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (8)

1. A method of voice interaction, comprising:
determining a first recognition result corresponding to a query statement within a first preset time after the query statement is acquired;
determining a first reply sentence according to the first recognition result, wherein the first reply sentence is text data;
determining a second recognition result corresponding to the query statement within a second preset time after the query statement is acquired, wherein the second preset time is longer than the first preset time;
under the condition that the second recognition result is consistent with the first recognition result, converting the first reply sentence into voice data, and playing the voice data;
before the determining the first recognition result corresponding to the query statement, the method further includes:
determining a user attribute corresponding to the query statement according to the audio feature corresponding to the query statement or the content corresponding to the query statement, wherein the user attribute comprises: the gender of the user and the age of the user;
determining the length of the first preset time according to the user attribute;
determining a first recognition result corresponding to the query statement within a first preset time after the query statement is obtained, where the determining comprises:
performing voice activity detection on the detected voice signal, comprising: detecting whether the query statement contains effective voice and whether the query statement contains silence in the current time period;
under the condition of continuously acquiring silent segments with a first preset time length, determining a first recognition result corresponding to the acquired voice signal;
after the determining the first recognition result corresponding to the query statement, the method further includes:
performing semantic integrity detection on the first recognition result to determine a score of the first recognition result;
under the condition that the score is greater than a threshold, ending the operation of obtaining a second recognition result corresponding to the query statement, and playing the first reply sentence;
determining a second reply sentence according to the second recognition result in a case where the score is less than or equal to a threshold;
and playing the second reply sentence.
2. The method of claim 1, further comprising:
determining a second reply sentence according to the second recognition result in the case that the second recognition result is inconsistent with the first recognition result;
and playing the second reply sentence.
3. The method according to any one of claims 1-2, wherein the first reply sentence is text data, and the playing the first reply sentence in the case where the second recognition result is consistent with the first recognition result comprises:
under the condition that the second recognition result is consistent with the first recognition result, performing voice conversion on the first reply sentence to obtain voice data to be played;
and playing the voice data.
4. An apparatus for voice interaction, comprising:
the first recognition module is used for determining a first recognition result corresponding to the query statement within a first preset time after the query statement is obtained;
the first reply module is used for determining a first reply sentence according to the first recognition result, wherein the first reply sentence is text data;
the second recognition module is used for determining a second recognition result corresponding to the query statement within a second preset time after the query statement is obtained, wherein the second preset time is longer than the first preset time;
a first playing module, configured to convert the first reply sentence into voice data and play the voice data when the second recognition result is consistent with the first recognition result;
the device further comprises:
a first determining module, configured to determine, according to the audio feature corresponding to the query statement or the content corresponding to the query statement, a user attribute corresponding to the query statement, where the user attribute includes: the gender of the user and the age of the user;
the second determining module is used for determining the length of the first preset time according to the user attribute;
the first identification module comprises:
the voice detection unit is used for carrying out voice activity detection on the detected voice signals and comprises the following steps: detecting whether the query statement contains effective voice and whether the query statement contains silence in the current time period;
the determining unit is used for determining a first recognition result corresponding to the acquired voice signal under the condition that the silent segments with the first preset time length are continuously acquired;
a third determining module, configured to perform semantic integrity detection on the first recognition result to determine a score of the first recognition result;
the first playing module is further configured to, when the score is greater than a threshold, end the operation of obtaining the second recognition result corresponding to the query sentence, and play the first reply sentence;
a second reply module for determining a second reply sentence according to the second recognition result in the case where the score is less than or equal to a threshold value;
and the second playing module is used for playing the second reply sentence.
5. The apparatus for voice interaction according to claim 4,
the second reply module is further configured to determine a second reply sentence according to the second recognition result if the second recognition result is inconsistent with the first recognition result;
the second playing module is further configured to play the second reply sentence.
6. The apparatus of any one of claims 4-5, wherein the first playing module comprises:
a voice conversion unit, configured to perform voice conversion on the first reply sentence to obtain voice data to be played when the second recognition result is consistent with the first recognition result;
and the playing unit is used for playing the voice data.
7. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of voice interaction of any of claims 1-3.
8. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of voice interaction of any of claims 1-3.
CN202011324401.4A 2020-11-23 2020-11-23 Voice interaction method and device, electronic equipment and storage medium Active CN112466302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011324401.4A CN112466302B (en) 2020-11-23 2020-11-23 Voice interaction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011324401.4A CN112466302B (en) 2020-11-23 2020-11-23 Voice interaction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112466302A CN112466302A (en) 2021-03-09
CN112466302B true CN112466302B (en) 2022-09-23

Family

ID=74798550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011324401.4A Active CN112466302B (en) 2020-11-23 2020-11-23 Voice interaction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112466302B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192502A (en) * 2021-04-27 2021-07-30 北京小米移动软件有限公司 Audio processing method, device and storage medium
CN114299955B (en) * 2021-11-12 2022-09-23 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and storage medium
CN114078478B (en) * 2021-11-12 2022-09-23 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and storage medium
CN114171016B (en) * 2021-11-12 2022-11-25 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9437186B1 (en) * 2013-06-19 2016-09-06 Amazon Technologies, Inc. Enhanced endpoint detection for speech recognition
CN105139849B (en) * 2015-07-22 2017-05-10 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
US11475884B2 (en) * 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
CN110517673B (en) * 2019-07-18 2023-08-18 平安科技(深圳)有限公司 Speech recognition method, device, computer equipment and storage medium
CN111261161B (en) * 2020-02-24 2021-12-14 腾讯科技(深圳)有限公司 Voice recognition method, device and storage medium

Also Published As

Publication number Publication date
CN112466302A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
CN112466302B (en) Voice interaction method and device, electronic equipment and storage medium
CN110462730B (en) Facilitating end-to-end communication with automated assistants in multiple languages
CN110800046B (en) Speech recognition and translation method and translation device
EP3183728B1 (en) Orphaned utterance detection system and method
US11217230B2 (en) Information processing device and information processing method for determining presence or absence of a response to speech of a user on a basis of a learning result corresponding to a use situation of the user
US20150279366A1 (en) Voice driven operating system for interfacing with electronic devices: system, method, and architecture
CN112262430A (en) Automatically determining language for speech recognition of a spoken utterance received via an automated assistant interface
JP7328265B2 (en) VOICE INTERACTION CONTROL METHOD, APPARATUS, ELECTRONIC DEVICE, STORAGE MEDIUM AND SYSTEM
KR102628211B1 (en) Electronic apparatus and thereof control method
US20210151039A1 (en) Method and apparatus for speech interaction, and computer storage medium
TW201606750A (en) Speech recognition using a foreign word grammar
CN112151015B (en) Keyword detection method, keyword detection device, electronic equipment and storage medium
US11532301B1 (en) Natural language processing
CN113674746B (en) Man-machine interaction method, device, equipment and storage medium
JP2019124952A (en) Information processing device, information processing method, and program
CN112767916A (en) Voice interaction method, device, equipment, medium and product of intelligent voice equipment
CN112133307A (en) Man-machine interaction method and device, electronic equipment and storage medium
US11626107B1 (en) Natural language processing
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN114758649B (en) Voice recognition method, device, equipment and medium
CN113205810A (en) Voice signal processing method, device, medium, remote controller and server
CN114078478B (en) Voice interaction method and device, electronic equipment and storage medium
EP4099320A2 (en) Method and apparatus of processing speech, electronic device, storage medium, and program product
JP2019109424A (en) Computer, language analysis method, and program
US11694684B1 (en) Generation of computing functionality using devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant