CN113129896B - Voice interaction method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN113129896B CN113129896B CN201911402606.7A CN201911402606A CN113129896B CN 113129896 B CN113129896 B CN 113129896B CN 201911402606 A CN201911402606 A CN 201911402606A CN 113129896 B CN113129896 B CN 113129896B
- Authority
- CN
- China
- Prior art keywords
- voice
- recognized
- server
- offline
- recognition result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/34—Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Theoretical Computer Science (AREA)
- Telephonic Communication Services (AREA)
Abstract
An embodiment of the invention provides a voice interaction method and device, an electronic device, and a storage medium, relating to the technical field of data processing, and including the following steps: after detecting the start of the voice to be recognized, requesting to establish a connection with a server; if the connection with the server is not successfully established within a first preset duration, recognizing the collected voice to be recognized based on an offline voice recognition model, and obtaining and displaying an offline recognition result; if the connection with the server is successfully established after the first preset duration, sending the voice to be recognized to the server, and receiving the cloud recognition result of the voice to be recognized sent by the server; after detecting that the voice to be recognized has ended, if the update requirement is met, updating the displayed recognition result according to the received cloud recognition result. Applying the scheme provided by the embodiment of the invention can improve voice interaction efficiency.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a voice interaction method and device, an electronic device, and a storage medium.
Background
With the rapid development of artificial intelligence technology, various intelligent devices are increasingly widely used. In order to facilitate the use of users, some smart devices have a voice interaction function, so that users can interact with the smart devices through voice. The intelligent device recognizes the voice of the user in the process of voice interaction with the user, and responds to the voice based on the recognition result.
Since recognizing speech and obtaining response information for it takes a certain amount of time, the recognition result is generally displayed so that the user does not conclude that the smart device has failed to respond to the speech.
For example, when the intelligent device is a robot with a navigation function placed in a mall, the robot captures the user's spoken request "take me to the meeting room" and sends the voice to a server. The server recognizes the received voice to obtain the recognition-result text "take me to the meeting room" and feeds the result back to the robot, and the robot displays the text "take me to the meeting room".
Although this approach enables interaction between the robot and the user, when the network is poor the voice captured by the robot cannot be sent to the server in time, the server's recognition result cannot be returned to the robot in time, and the robot therefore cannot display the voice recognition result promptly. Interaction efficiency is low, giving the user a poor experience.
Disclosure of Invention
The embodiment of the invention aims to provide a voice interaction method, a voice interaction device, electronic equipment and a storage medium, so as to improve voice interaction efficiency. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a voice interaction method, where the method includes:
after detecting the start of the voice to be recognized, requesting to establish connection with a server;
if the connection with the server is not successfully established within a first preset duration, recognizing the collected voice to be recognized based on an offline voice recognition model, and obtaining and displaying an offline recognition result;
if the connection with the server is successfully established after the first preset duration, sending the voice to be recognized to the server, and receiving the cloud recognition result of the voice to be recognized sent by the server;
after detecting that the voice to be recognized has ended, if the update requirement is met, updating the displayed recognition result according to the received cloud recognition result.
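For intuition, the four steps above can be sketched as control flow. This is a minimal, hypothetical sketch: `try_connect`, `recognize_offline`, `recognize_cloud`, `display`, and `update_ok` are placeholder callables standing in for the device's real components, and the timeout value is illustrative, not part of the claimed method.

```python
def voice_interaction(try_connect, recognize_offline, recognize_cloud,
                      display, update_ok, first_preset=0.8):
    """Sketch of the claimed flow; first_preset is the 'first preset
    duration' in seconds (illustrative value)."""
    shown = None
    # Step 1 happens before this call: the start of the voice to be
    # recognized was detected, so we request a connection with the server.
    if not try_connect(first_preset):
        # Step 2: no connection within the first preset duration ->
        # recognize offline and display the offline result.
        shown = recognize_offline()
        display(shown)
        if not try_connect(None):       # keep trying after the window
            return shown
    # Step 3: connection established -> send voice, receive cloud result.
    cloud = recognize_cloud()
    # Step 4: after voice end, update the display if the update
    # requirement is met.
    if update_ok():
        shown = cloud
        display(shown)
    return shown
```

The sketch returns whatever result ends up displayed, which mirrors the patent's point that the offline result is shown first and later replaced by the cloud result only when the update requirement is met.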
In one embodiment of the present invention, after detecting that the voice to be recognized is finished, if the update requirement is satisfied, updating the displayed recognition result according to the received cloud recognition result includes:
If the cloud recognition result of the voice fragment returned by the server is received within a second preset duration after the voice to be recognized is detected to have ended, it is determined that the update requirement is met, and the displayed recognition result is updated according to the received cloud recognition result of the voice fragment, where the voice fragment is: the voice collected from the detected start of the voice to be recognized to its detected end.
In one embodiment of the present invention, if the cloud recognition result of the voice segment returned by the server is received within a second preset time period after the voice to be recognized is detected to be ended, it is determined that the update requirement is met, and the displayed recognition result is updated according to the received cloud recognition result of the voice segment, including:
if the cloud recognition result of the voice to be recognized returned by the server is received before the voice to be recognized is detected to be finished, and the cloud recognition result of the voice fragment returned by the server is received within a second preset time period after the voice to be recognized is finished, the updating requirement is determined to be met, and the displayed recognition result is updated according to the received cloud recognition result.
In one embodiment of the invention, the method further comprises:
determining, according to the offline recognition result of the voice fragment, the offline recognition success rate with which the offline voice recognition model correctly recognizes the voice fragment, where the voice fragment is: the voice collected from the detected start of the voice to be recognized to its detected end;
and if the offline recognition success rate reaches a preset threshold, carrying out semantic analysis on the offline recognition result of the voice fragment based on an offline semantic analysis model, obtaining response information corresponding to the voice fragment, and outputting the response information.
In one embodiment of the present invention, the determining the offline recognition success rate of the offline speech recognition model for correctly recognizing the speech segment according to the offline recognition result of the speech segment, if the offline recognition success rate reaches a preset threshold, performing semantic analysis on the offline recognition result of the speech segment based on an offline semantic analysis model to obtain response information corresponding to the speech segment, and outputting the response information includes:
if any one of the following conditions is met and the offline recognition success rate reaches a preset threshold, performing semantic analysis on the offline recognition result of the voice segment based on an offline semantic analysis model, obtaining response information corresponding to the voice segment, and outputting the response information:
the cloud recognition result sent by the server is not received before the voice to be recognized is detected to have ended;
the cloud recognition result of the voice fragment is not received within a second preset duration after the voice to be recognized is detected to have ended;
after the cloud recognition result of the voice fragment is received, the response information of the voice fragment sent by the server is not received within a third preset duration.
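The any-of-three check above can be sketched as a small predicate. The names are illustrative; the boolean flags stand for whether each server message arrived within its respective window.

```python
def should_answer_offline(offline_success_rate, threshold,
                          got_cloud_before_end,
                          got_cloud_in_second_window,
                          got_response_in_third_window):
    """True when response information should be produced locally from the
    offline recognition result via the offline semantic analysis model."""
    any_condition = (
        not got_cloud_before_end             # no cloud result before voice end
        or not got_cloud_in_second_window    # none within the second preset duration
        or not got_response_in_third_window  # no response within the third preset duration
    )
    return any_condition and offline_success_rate >= threshold
```

Note that meeting one of the three conditions is necessary but not sufficient: the offline success rate must also reach the preset threshold before the offline semantic analysis is triggered.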
In one embodiment of the present invention, the determining the offline recognition success rate of the offline speech recognition model for correctly recognizing the speech segment according to the offline recognition result of the speech segment includes:
counting the total number of characters corresponding to the voice fragment, and counting the number of successfully recognized characters in the offline recognition result of the voice fragment;
and calculating the ratio of the number of successfully recognized characters to the total number of characters to obtain the offline recognition success rate.
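The ratio computation above is simply the following, assuming the two character counts are already available (the zero-total guard is an added assumption for the empty-fragment edge case):

```python
def offline_recognition_success_rate(recognized_chars, total_chars):
    """Ratio of successfully recognized characters to the total number of
    characters corresponding to the voice fragment."""
    if total_chars == 0:
        return 0.0   # guard for an empty fragment (assumption, not in the patent)
    return recognized_chars / total_chars
```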
In one embodiment of the invention, the method further comprises:
if the response information of the voice fragment sent by the server is received within a third preset duration after the cloud recognition result of the voice fragment is received, determining the received response information as the response information of the voice fragment and outputting it, where the voice fragment is: the voice collected from the detected start of the voice to be recognized to its detected end.
In a second aspect, an embodiment of the present invention provides a voice interaction device, where the device includes:
the network connection module is used for requesting to establish connection with the server after detecting the start of the voice to be recognized;
the voice recognition module is used for recognizing the collected voice to be recognized based on the offline voice recognition model if the connection with the server is not successfully established within the first preset time length, and obtaining and displaying an offline recognition result;
the result receiving module is used for sending the voice to be recognized to the server if the connection with the server is successfully established after the first preset time length, and receiving a cloud recognition result of the voice to be recognized, which is sent by the server;
and the display updating module is used for updating the displayed recognition result according to the received cloud recognition result if the updating requirement is met after the voice to be recognized is detected to be ended.
In one embodiment of the present invention, the display update module is specifically configured to:
if the cloud recognition result of the voice fragment returned by the server is received within a second preset duration after the voice to be recognized is detected to have ended, determine that the update requirement is met, and update the displayed recognition result according to the received cloud recognition result of the voice fragment, where the voice fragment is: the voice collected from the detected start of the voice to be recognized to its detected end.
In one embodiment of the present invention, the display update module is specifically configured to:
if the cloud recognition result of the voice to be recognized returned by the server is received before the voice to be recognized is detected to be finished, and the cloud recognition result of the voice fragment returned by the server is received within a second preset time period after the voice to be recognized is finished, the updating requirement is determined to be met, and the displayed recognition result is updated according to the received cloud recognition result.
In one embodiment of the invention, the apparatus further comprises:
the response information obtaining module is configured to determine, according to the offline recognition result of the voice fragment, the offline recognition success rate with which the offline voice recognition model correctly recognizes the voice fragment, and if the offline recognition success rate reaches a preset threshold, perform semantic analysis on the offline recognition result of the voice fragment based on an offline semantic analysis model, obtain the response information corresponding to the voice fragment, and output the response information, where the voice fragment is: the voice collected from the detected start of the voice to be recognized to its detected end.
In one embodiment of the present invention, the response information obtaining module is specifically configured to:
If any one of the following conditions is met and the offline recognition success rate reaches a preset threshold, performing semantic analysis on the offline recognition result of the voice segment based on an offline semantic analysis model, obtaining response information corresponding to the voice segment, and outputting the response information:
the cloud recognition result sent by the server is not received before the voice to be recognized is detected to have ended;
the cloud recognition result of the voice fragment is not received within a second preset duration after the voice to be recognized is detected to have ended;
after the cloud recognition result of the voice fragment is received, the response information of the voice fragment sent by the server is not received within a third preset duration.
In one embodiment of the present invention, the response information obtaining module is specifically configured to:
counting the total number of characters corresponding to the voice fragments, and counting the number of characters successfully recognized in the offline recognition result of the voice fragments;
calculating the ratio of the number of the characters to the total number of the characters to obtain the offline identification success rate;
and if the offline recognition success rate reaches a preset threshold, carrying out semantic analysis on the offline recognition result of the voice fragment based on an offline semantic analysis model, obtaining response information corresponding to the voice fragment, and outputting the response information.
In one embodiment of the invention, the apparatus further comprises:
the response information receiving module is configured to, if the response information of the voice fragment sent by the server is received within a third preset duration after the cloud recognition result of the voice fragment is received, determine the received response information as the response information of the voice fragment and output it, where the voice fragment is: the voice collected from the detected start of the voice to be recognized to its detected end.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the method steps of any one of the first aspects when executing the program stored in the memory.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored therein, which when executed by a processor, implements the method steps of any of the first aspects.
In a fifth aspect, embodiments of the present invention also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the above-described voice interaction methods.
The embodiment of the invention has the beneficial effects that:
when the scheme provided by the embodiment of the invention is applied to voice interaction, after the intelligent equipment detects the start of the voice to be recognized, the intelligent equipment requests to establish connection with the server, and if the connection with the server is not successfully established within the first preset time, the acquired voice to be recognized is recognized based on an offline voice recognition model, and an offline recognition result is obtained and displayed.
Therefore, when the scheme provided by the embodiment of the invention is applied to voice interaction, the offline recognition result produced by the intelligent device can be displayed even under poor network conditions; the user does not need to wait a long time for the cloud recognition result from the server, and voice interaction efficiency is improved.
In addition, if the connection with the server is successfully established after the first preset duration, the voice to be recognized is sent to the server, and the cloud recognition result of the voice to be recognized sent by the server is received. Because the server has ample computing resources and storage space and an excellent processor, it can run more complex models; such models are trained on richer sample data and are more robust, so the cloud recognition result obtained from them is more accurate. Therefore, after the voice to be recognized is detected to have ended, if the update requirement is met, the displayed recognition result is updated according to the received cloud recognition result. In this way, on the premise of improving voice interaction efficiency, the voice recognition result displayed to the user is ensured to be highly accurate.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a voice interaction method according to an embodiment of the present invention;
fig. 2 is a signaling diagram of a voice interaction process according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating another voice interaction method according to an embodiment of the present invention;
FIG. 4 is a signaling diagram of another voice interaction process according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a voice interaction device according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of another voice interaction device according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a voice interaction method according to an embodiment of the present invention, where the method may be applied to an intelligent device. For example, the smart device may be a robot, a smart phone, a tablet computer, a smart speaker, etc. The method includes the following steps 101 to 104.
Step 101, after detecting the start of the voice to be recognized, requesting to establish connection with the server.
Specifically, the intelligent device may collect surrounding voice and analyze it; when the start of the voice to be recognized is detected, this indicates that the voice collected next is the voice to be recognized. Because the server can perform voice recognition, the collected voice to be recognized can be sent to the server for recognition. Data interaction between the intelligent device and the server relies on a network connection, so the intelligent device needs to request to establish a connection with the server.
The intelligent device may collect voice using a sound pickup, for example a microphone. Voice collection may run continuously throughout the working process of the intelligent device. The device may collect surrounding voice continuously, or only within a preset period; for example, for a robot placed in a mall, the collection period may be the mall's business hours. The robot may start collecting surrounding voice after receiving a collection instruction, or after detecting that a user has approached the device.
In one embodiment of the invention, after requesting to establish a connection with the server, whether the connection has been successfully established can be judged by sending a detection packet to the server. Specifically, the intelligent device sends a detection packet to the server; after receiving it, the server returns confirmation information, and upon receiving the confirmation the intelligent device determines that the connection with the server has been successfully established.
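A minimal illustration of such a detection-packet exchange follows. The UDP transport and the `b"PING"`/`b"ACK"` payloads are assumptions for the sketch; the patent does not specify the protocol or packet format.

```python
import socket

def connection_established(host, port, timeout=0.5):
    """Send a detection packet and wait for the server's confirmation.
    The PING/ACK payloads and UDP transport are illustrative only."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(b"PING", (host, port))
            data, _ = sock.recvfrom(64)   # wait for the confirmation packet
            return data == b"ACK"
        except OSError:                   # timeout or network error
            return False
```

In practice the timeout passed here would be the first preset duration, so a slow or absent confirmation triggers the offline-recognition branch.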
In one embodiment of the present invention, a VAD (Voice Activity Detection) model may be used to detect whether the voice to be recognized has started, and also whether it has ended. Specifically, the VAD model may detect the front endpoint and the tail endpoint of the voice to be recognized within the collected voice: when the front endpoint is detected, the voice to be recognized is considered to have started, and when the tail endpoint is detected, it is considered to have ended.
The front endpoint may be understood as the first speech segment of the voice to be recognized; for example, the speech segment corresponding to "take" in the collected voice "take me to the meeting room" is the front endpoint.
The tail endpoint may be understood as the last speech segment of the voice to be recognized; for example, the speech segment corresponding to "room" in the collected voice "take me to the meeting room" is the tail endpoint.
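A toy stand-in for such endpoint detection, using per-frame energy against a threshold, is sketched below. A real VAD model is far more sophisticated; this is for illustration of the front/tail-endpoint idea only.

```python
def find_endpoints(frames, energy_threshold):
    """Return (front, tail) frame indices: front is the first frame whose
    mean energy reaches the threshold, tail is the last such frame; both
    are None if no frame is active."""
    front = tail = None
    for i, frame in enumerate(frames):
        energy = sum(s * s for s in frame) / len(frame)   # mean energy
        if energy >= energy_threshold:
            if front is None:
                front = i        # front endpoint: voice start detected
            tail = i             # tail endpoint: last active frame so far
    return front, tail
```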
In one embodiment of the invention, the intensity of the collected voice may be detected, and when the intensity reaches a voice intensity threshold, the voice is considered to be voice to be recognized. For example, a robot placed in a mall may pick up surrounding noisy voice, including the speech of passing pedestrians, played advertisements, and so on. Such noisy voice does not need to be recognized and is therefore not voice to be recognized; the sound made by a user talking to the robot, however, does need to be recognized and is the voice to be recognized. Because a user talking to the robot is usually close to it, the intensity of the user's voice is relatively high, so whether voice is to be recognized can be judged by detecting its intensity: when the intensity of the collected voice reaches the voice intensity threshold, the speaker is considered to be close to the intelligent device and in conversation with it, and the voice is considered the voice to be recognized.
In one embodiment of the invention, whether the voice to be recognized has started may also be determined by detecting a wake-up voice, which is a preset voice used to determine that a user has started talking to the intelligent device. Specifically, a wake-up voice may be preset; when the collected voice is detected to contain the wake-up voice, the user is considered to have started talking to the intelligent device, that is, the voice to be recognized starts, and the voice collected next is the voice to be recognized. For example, the wake-up voice may be the phrase "Hey, Siri" or another preset wake-up phrase.
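A trivial sketch of wake-word gating on already-transcribed text follows. Both the wake phrase and matching on text rather than raw audio are simplifying assumptions; real wake-word detectors operate on the audio signal.

```python
def contains_wake_word(transcript, wake_words=("hey siri",)):
    """True when the transcribed voice contains a preset wake-up phrase,
    marking the start of the voice to be recognized."""
    text = transcript.lower()
    return any(w in text for w in wake_words)
```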
Step 102, if connection with the server is not successfully established within the first preset time, the acquired voice to be recognized is recognized based on the offline voice recognition model, and an offline recognition result is obtained and displayed.
In one embodiment of the present invention, the first preset duration may be set according to, and smaller than, a timeout duration. The timeout duration is a preset duration used to judge that establishing a connection with the server has timed out: if the connection has not been established when the timeout duration is reached, the connection attempt is considered to have timed out. The timeout duration may be 3, 5, or 10 seconds, and the first preset duration may be 500, 800, or 1000 milliseconds; embodiments of the present invention are not limited in this regard.
Specifically, if the connection with the server is not successfully established within the first preset duration, the network is poor, and both sending the voice to be recognized to the server and receiving the recognition result back would take a long time. To avoid making the user wait a long time for the result returned by the server, the intelligent device directly recognizes the collected voice to be recognized based on the offline voice recognition model, obtains the offline recognition result, and displays it.
In one embodiment of the present invention, the smart device may display each offline recognition result as soon as it is obtained, or may display all offline recognition results after the entire voice to be recognized has been recognized.
If the connection with the server is successfully established within the first preset duration, the network is good, and little time is needed for the intelligent device to send the voice to be recognized to the server and for the server to return the recognition result. The collected voice to be recognized is therefore sent to the server; after receiving it, the server recognizes the voice and returns the recognition result, which the intelligent device displays. In one embodiment of the present invention, each time the intelligent device collects a voice frame, it may send that frame to the server; each time the server receives a frame, it may recognize it and send the frame's recognition result to the intelligent device, which displays each result as it is received. A voice frame is a segment of voice of a preset frame length; the duration of each frame may be 20, 30, or 50 milliseconds, etc.
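The per-frame send/receive loop described above might be sketched as follows; `send`, `receive_result`, and `display` are placeholder callbacks standing in for the real transport and UI, not an API defined by the patent.

```python
def stream_frames(frames, send, receive_result, display):
    """Send each collected voice frame to the server and display each
    returned per-frame recognition result as it arrives."""
    shown = []
    for frame in frames:
        send(frame)                 # one frame, e.g. 20-50 ms of voice
        partial = receive_result()  # that frame's recognition result, if any
        if partial is not None:
            shown.append(partial)
            display(partial)
    return shown
```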
In one embodiment of the invention, a character library may be constructed in advance. During voice recognition, the phonemes of the voice to be recognized are first recognized, the characters corresponding to the recognized phonemes are then determined according to the character library, and the found characters are combined into sentences. One phoneme may correspond to one character, or multiple phonemes may correspond to one character. The characters may be Chinese characters, numbers, English words, English letters, etc. The character library may include characters used at high frequency in the actual application scene; for example, if the scene is a mall, the library may include "elevator", "supermarket", "parking lot", and the like.
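A greedy longest-match sketch of the phoneme-to-character lookup follows. The library contents, the space-joined phoneme keys, and the maximum key length are all illustrative assumptions; the patent does not specify the lookup algorithm.

```python
def phonemes_to_text(phonemes, char_library, max_key_len=3):
    """Map recognized phonemes to characters via the character library;
    one or several phonemes may map to one character, and phonemes with
    no library entry yield a placeholder."""
    out, i = [], 0
    while i < len(phonemes):
        for length in range(min(max_key_len, len(phonemes) - i), 0, -1):
            key = " ".join(phonemes[i:i + length])
            if key in char_library:           # longest match wins
                out.append(char_library[key])
                i += length
                break
        else:
            out.append("?")                   # character cannot be determined
            i += 1
    return "".join(out)
```

The placeholder branch corresponds to the case described next: when no library entry matches, only phoneme-level information is available and the character is difficult to recognize.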
If the character corresponding to the voice to be recognized can be determined based on the character library, this indicates that the voice to be recognized can be successfully recognized; if the character corresponding to the voice to be recognized cannot be determined based on the character library, only information such as the phonemes of the voice to be recognized can be recognized, and the corresponding character is difficult to obtain.
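A minimal sketch of the character-library lookup described above; the phoneme groupings, the library contents, and the "×" failure mark are illustrative assumptions:

```python
# Hypothetical character library for a mall scenario: phoneme
# sequence -> high-frequency character/word.
CHAR_LIBRARY = {
    ("d", "ian4", "t", "i1"): "elevator",
    ("ch", "ao1", "sh", "i4"): "supermarket",
}

def decode(phoneme_groups):
    """Map each recognized phoneme group to a character via the
    library; groups not found are marked '×' (recognition failed)."""
    return [CHAR_LIBRARY.get(tuple(group), "×") for group in phoneme_groups]
```

A group of several phonemes may map to one character, matching the one-to-many correspondence described above.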
The intelligent device locally stores the character library, so that when the intelligent device cannot be connected to the server, the intelligent device can also recognize the collected voice to be recognized, and then the character corresponding to the voice to be recognized is determined based on the locally stored character library, so that the voice recognition efficiency can be improved.
The server can provide sufficient storage space and operation resources, so the character library stored by the server contains a large number of characters, the range of recognizable voice is wider, and the success rate of recognizing the voice to be recognized is higher. Moreover, the server's processor is more powerful and can run a more complex model to recognize the voice to be recognized, so the cloud recognition result obtained by the server is more accurate.
Step 103, if the connection is successfully established with the server after the first preset time, sending the voice to be recognized to the server, and receiving a cloud recognition result of the voice to be recognized, which is sent by the server.
Specifically, after a first preset period of time, the intelligent device may continue to request to establish connection with the server. After the connection is successfully established with the server, the collected voice to be recognized is sent to the server, the server receives the voice to be recognized, the voice is recognized, a cloud recognition result is obtained, and the cloud recognition result is sent to the intelligent device.
Step 104, after detecting that the voice to be recognized is finished, if the update requirement is met, updating the displayed recognition result according to the received cloud recognition result.
The update requirement may be a preset requirement for determining whether to update the displayed offline recognition result.
Specifically, the cloud recognition result obtained by the server recognizing the voice to be recognized is more accurate. Therefore, to display a more accurate recognition result to the user, after the end of the voice to be recognized is detected, if the update requirement is met, the displayed offline recognition result is updated to the more accurate cloud recognition result.
In addition to detecting whether the voice to be recognized ends by using the aforementioned VAD model, in one embodiment of the present invention the frequency and/or intensity of the collected voice may be monitored: when the detected frequency is lower than a preset frequency threshold and/or the detected intensity is lower than a preset intensity threshold, and the duration of such voice reaches a preset duration threshold, the voice to be recognized is considered to have ended. The preset duration threshold may be 2 seconds, 3 seconds, or the like; the embodiment of the present invention is not limited thereto.
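The frequency/intensity end-of-speech check can be sketched as follows. The threshold values and the per-frame measurement tuples are illustrative assumptions; the embodiment also allows an or-combination of the two thresholds:

```python
END_FREQ_HZ = 100.0      # preset frequency threshold (illustrative)
END_INTENSITY_DB = 30.0  # preset intensity threshold (illustrative)
END_DURATION_S = 2.0     # preset duration threshold, e.g. 2 or 3 seconds

def speech_ended(frames):
    """frames: (frequency_hz, intensity_db, duration_s) per frame.
    The voice to be recognized is considered ended once frames below
    both thresholds have lasted END_DURATION_S in a row."""
    quiet = 0.0
    for freq, intensity, dur in frames:
        if freq < END_FREQ_HZ and intensity < END_INTENSITY_DB:
            quiet += dur
            if quiet >= END_DURATION_S:
                return True
        else:
            quiet = 0.0  # a loud frame resets the quiet run
    return False
```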
When the scheme provided by the embodiment is applied to voice interaction, after the intelligent equipment detects that the voice to be recognized starts, the intelligent equipment requests to establish connection with the server, and if connection with the server is not successfully established within a first preset time period, the acquired voice to be recognized is recognized based on an offline voice recognition model, and an offline recognition result is obtained and displayed.
Therefore, when the scheme provided by the embodiment is applied to voice interaction, the offline recognition result of the intelligent device for recognizing the voice to be recognized can be displayed even under the condition of poor network, the user does not need to wait for the cloud recognition result sent by the server for a long time, and the voice interaction efficiency can be improved.
In addition, if the connection with the server is successfully established after the first preset time length, the voice to be recognized is sent to the server, and the cloud recognition result of the voice to be recognized sent by the server is received. Because the server has sufficient running resources and storage space and a high-performance processor, it can run more complex models; such models are trained with richer sample data and are more robust, so the cloud recognition result obtained from them is more accurate. Therefore, after the end of the voice to be recognized is detected, if the update requirement is met, the displayed recognition result is updated according to the received cloud recognition result. Thus, on the premise of improving voice interaction efficiency, the accuracy of the voice recognition result displayed to the user can be kept high.
In an embodiment of the present application, for the step 104, when updating the recognition result, if the cloud recognition result of the voice segment returned by the server is received within the second preset time period after the end of the voice to be recognized is detected, it is determined that the update requirement is met, and the displayed recognition result is updated according to the received cloud recognition result of the voice segment.
Wherein the speech segment comprises: all of the voice to be recognized collected from the detected beginning of the voice to be recognized to its detected end.
The second preset time period may be 500 milliseconds, 800 milliseconds, 1000 milliseconds, etc. For example, assuming the second preset time period is 1 second and the end of the voice to be recognized is detected at the 15th second, the second preset time period is the interval from the 15th to the 16th second.
Specifically, since the accuracy with which the server recognizes the voice to be recognized is high, in order to ensure the accuracy of the displayed recognition result, it is preferable to display the cloud recognition result produced by the server; the intelligent device therefore continues to wait for the second preset time period.
If the cloud recognition result of the voice segment is received within the second preset time period, it is determined that the update requirement is met, and the displayed offline recognition result is updated with the cloud recognition result of the voice segment. Failing to receive the cloud recognition result of the voice segment within the second preset time period covers two cases: no cloud recognition result sent by the server is received at all, or a cloud recognition result for only part of the voice to be recognized is received while the recognition result of the whole voice segment is not. In either case the network is poor, and waiting for the cloud recognition result of the voice segment may take a long time; to avoid a long wait for the user, the update requirement is considered unsatisfied and the displayed offline recognition result is no longer updated.
In one embodiment of the present invention, if it is detected that the cloud end recognition result of the voice to be recognized returned by the server has been received before the voice to be recognized is finished, and the cloud end recognition result of the voice segment returned by the server is received within a second preset time period after the voice to be recognized is finished, it is determined that the update requirement is met, and the displayed recognition result is updated according to the received cloud end recognition result.
Specifically, it may first be determined whether a cloud recognition result sent by the server was received before the end of the voice to be recognized was detected. If so, the intelligent device had successfully connected to the server before the voice to be recognized ended; in this case the probability of receiving the cloud recognition result of the voice segment within the second preset duration is high, so the intelligent device can continue waiting for the second preset duration. If the cloud recognition result of the voice segment sent by the server is then received within the second preset time period, it is determined that the update requirement is met, and the displayed offline recognition result is updated with the received cloud recognition result.
If no cloud recognition result is received before the end of the voice to be recognized is detected, the intelligent device has still not successfully connected to the server by that time. In this case, the probability of successfully establishing a connection within the second preset time period is low, and so is the probability of receiving the cloud recognition result of the voice segment within that time; the update requirement is considered unmet, there is no need to wait the second preset time period for a cloud recognition result, and the displayed offline recognition result is directly used as the final display result.
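The update decision described in the two embodiments above can be sketched as a single predicate; the parameter names and the use of timestamps instead of timers are assumptions for illustration:

```python
def should_update(cloud_received_before_end, segment_result_arrival_s,
                  end_detected_s, second_preset_s=1.0):
    """Return True when the displayed offline result should be
    replaced by the cloud result of the whole voice segment.
    segment_result_arrival_s is None when that result never arrived."""
    if not cloud_received_before_end:
        # Never connected before the voice ended: do not wait, keep
        # the offline result as the final display.
        return False
    if segment_result_arrival_s is None:
        return False
    # Update only if the segment's cloud result arrives within the
    # second preset duration after the end was detected.
    return segment_result_arrival_s <= end_detected_s + second_preset_s
```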
The above-mentioned voice interaction scheme is described in the following with reference to the signaling diagram.
Referring to fig. 2, fig. 2 is a signaling diagram of a voice interaction process according to an embodiment of the present invention.
The intelligent equipment continuously collects voice, and when the front end point of the voice to be recognized is detected, the voice to be recognized is considered to be detected to start, so that connection is requested to be established with the server;
under the condition that the intelligent equipment does not successfully establish connection with the server within a first preset time period, the intelligent equipment identifies the acquired voice to be identified based on the offline voice identification model, and an offline identification result is obtained and displayed;
the intelligent equipment continuously requests to establish connection with the server, and if the intelligent equipment is successfully connected to the server after a first preset time period, the intelligent equipment sends voice to be recognized to the server;
the server receives the voice to be recognized, recognizes the voice to be recognized, and returns a cloud recognition result obtained by recognition to the intelligent device;
and the intelligent equipment determines that the updating requirement is met under the condition that the cloud recognition result of the voice segment sent by the server is received within a second preset time period after the end of the voice to be recognized is detected, and updates the displayed offline recognition result according to the received cloud recognition result.
When the scheme provided by the embodiment is applied to voice interaction, after the intelligent equipment detects that the voice to be recognized starts, the intelligent equipment requests to establish connection with the server, and if connection with the server is not successfully established within a first preset time period, the acquired voice to be recognized is recognized based on an offline voice recognition model, and an offline recognition result is obtained and displayed.
Referring to fig. 3, fig. 3 is a schematic flow chart of another voice interaction method according to an embodiment of the present invention, which includes, in addition to the steps 101 to 104, the following steps 105 and 106:
step 105, determining the offline recognition success rate of the offline speech recognition model for correctly recognizing the speech fragments according to the offline recognition result of the speech fragments.
In one embodiment of the invention, when the intelligent device recognizes the voice to be recognized, for each voice frame: if the character corresponding to the voice frame can be found in the stored character library, the character corresponding to the voice frame is recognized, that is, the voice frame is successfully recognized; if the character corresponding to the voice frame cannot be found in the character library, only syllable information of the voice frame can be recognized, that is, recognition of the voice frame fails.
The above offline recognition success rate can be understood as: the ratio of the number of successfully recognized characters in the offline recognition result to the total number of characters corresponding to the speech segment.
In this case, in one embodiment of the present invention, when calculating the success rate of offline recognition, the total number of characters corresponding to the speech segment may be counted, and the number of characters of the characters successfully recognized in the offline recognition result of the speech segment may be counted, and the ratio of the number of characters to the total number may be calculated, thereby obtaining the success rate of offline recognition.
Specifically, since each character corresponds to a syllable, the total number of syllables in the speech segment can be counted as the total number of characters corresponding to the offline recognition result. For example, assume the speech segment is the voice corresponding to "take me to the meeting room" (six syllables in the original Chinese utterance); the total number of characters corresponding to the offline recognition result obtained by the intelligent device is then 6. The number of successfully recognized characters in the offline recognition result is the number of characters corresponding to the successfully recognized syllables.
In one embodiment of the present invention, when the offline recognition result includes both recognized characters and marks representing recognition failures, the sum of the numbers of characters and marks may be counted as the total number of characters corresponding to the offline recognition result, and the number of characters may be counted as the number of successfully recognized characters. The marks may be "*", "×", "?", etc.; one mark corresponds to one syllable whose recognition failed in the offline recognition result.
For example, assume the speech segment is the voice corresponding to "please ask where the milk tea shop is" (eight syllables in the original Chinese utterance), and the offline recognition result is "×× where the milk tea shop is", where each "×" is a mark indicating a recognition failure. The total number of syllables in the speech segment is 8 and the number of successfully recognized characters is 6, so the offline recognition success rate is 75%.
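The character-and-mark counting above can be sketched like this; the "×" failure mark and the one-unit-per-syllable representation are the ones used in the example:

```python
FAIL_MARK = "×"  # one mark per syllable whose recognition failed

def offline_success_rate(result_units):
    """result_units: the offline recognition result as one unit per
    syllable, i.e. recognized characters plus failure marks.
    Returns recognized characters / total units."""
    total = len(result_units)
    recognized = sum(1 for unit in result_units if unit != FAIL_MARK)
    return recognized / total
```

For the milk-tea-shop example, 8 units with 2 failure marks give a success rate of 6/8 = 75%.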
In one embodiment of the invention, the confidence level of each offline recognition result can be obtained when the offline speech recognition model recognizes the speech frame. The confidence represents the probability that the offline recognition result is the character corresponding to the voice frame. The confidence threshold may be preset, and when the confidence level of the offline recognition result reaches the confidence threshold, the offline recognition result may be considered as a character corresponding to the speech frame.
In this case, when calculating the offline recognition success rate, the ratio of the speech frames whose confidence reaches the confidence threshold in the offline recognition result of each speech frame of the speech segment may be calculated as the offline recognition success rate.
For example, assume that a speech segment contains 6 speech frames, and the confidence level of each speech frame is shown in table 1 below:
TABLE 1
Speech frame | Speech frame 1 | Speech frame 2 | Speech frame 3 | Speech frame 4 | Speech frame 5 | Speech frame 6 |
Confidence level | 0.8 | 0.3 | 0.2 | 0.6 | 0.9 | 0.7 |
Assuming the confidence threshold is 0.6, it can be seen from Table 1 that the speech frames whose confidence reaches the threshold are speech frames 1, 4, 5 and 6. Their ratio to the total number of speech frames is 4/6, which can be used as the offline recognition success rate.
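The confidence-based variant can be sketched with the Table 1 values; the function name is illustrative:

```python
def confidence_success_rate(confidences, threshold=0.6):
    """Ratio of speech frames whose per-frame confidence reaches the
    confidence threshold, used as the offline recognition success rate."""
    hits = sum(1 for c in confidences if c >= threshold)
    return hits / len(confidences)

# Table 1: frames 1, 4, 5 and 6 reach the 0.6 threshold -> 4/6.
```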
In one embodiment of the present invention, the offline speech recognition model may further output a confidence level of an offline recognition result of the speech segment, where the confidence level characterizes a probability that the offline recognition model is correct for recognizing the speech segment. In this case, the confidence level of the model output may be directly used as the offline recognition success rate.
And 106, if the success rate of the offline recognition reaches a preset threshold, carrying out semantic analysis on the offline recognition result of the voice fragment based on the offline semantic analysis model, obtaining response information corresponding to the voice fragment, and outputting the response information.
The preset threshold may be 80%, 90%, 60%, etc. Specifically, the preset threshold may be determined according to an application scenario. Under the condition of higher success rate of offline identification, the accuracy of response information obtained according to the offline identification result is also higher. Therefore, when the accuracy requirement of the application scene on the response information is high, a high preset threshold value, such as 80%, 90%, and the like, can be set. When the accuracy requirement of the application scene on the response information is low and the coverage rate of semantic analysis on the offline recognition result is expected to be high, a lower preset threshold value, such as 50%, 60% and the like, can be set.
When the offline recognition success rate reaches the preset threshold, the accuracy with which the intelligent device recognizes the voice segment is high, so semantic analysis can be performed on the offline recognition result of the voice segment based on the offline semantic analysis model to obtain a semantic analysis result; response information corresponding to the voice segment is obtained from that result and output.
The semantic analysis result contains the meaning the user is expected to express, such as the user's requirement and target. For example, assume the offline recognition result of a speech segment is "take me to the meeting room"; it can be parsed that the user's requirement is the "lead function" and the target is the "meeting room".
The response information is information for responding to the voice clip. The response information may be navigation guidance information, object introduction information, and the like. For example, assuming that the semantic analysis results in guiding the user to "meeting room" using the "lead function", the response information may be navigation map information.
In one embodiment of the invention, key information in the offline recognition result can be extracted, and the user's requirement is analyzed according to the key information. For example, assuming the offline recognition result is "take me to the meeting room", the key information "go to the meeting room" may be extracted; according to this key information, the user's requirement is determined to be the "lead function" and the target the "meeting room".
In one embodiment of the present invention, the correspondence between the semantic analysis result and the response information may be preset. Specifically, the correspondence may include a correspondence between the target and the response information in the semantic analysis result, and a correspondence between the requirement and the response information.
For example, the correspondence between the target and the response information may be a correspondence between the target "conference room" and the response information "conference room location information" in the semantic analysis result, so that when the semantic analysis result triggers to "conference room", the response information is determined to be "conference room location information".
The correspondence between requirement and response information may be, for example, a correspondence between the requirement "go to" and the response information "lead function" in the semantic analysis result; thus, when the semantic analysis result triggers "go to", it can be determined that the response information should lead the user to the target position, which facilitates responding to the user's requirement.
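A toy sketch of the preset correspondences and key-information extraction described above; the mapping contents and the substring matching are simplifying assumptions, not the patented parsing model:

```python
# Illustrative preset correspondences (semantic result -> response).
TARGET_RESPONSES = {"meeting room": "meeting room location information"}
REQUIREMENT_RESPONSES = {"go to": "lead function"}

def respond(recognition_result):
    """Extract key information by substring match, then look up the
    preset requirement and target correspondences."""
    requirement = next((r for r in REQUIREMENT_RESPONSES
                        if r in recognition_result), None)
    target = next((t for t in TARGET_RESPONSES
                   if t in recognition_result), None)
    return {"requirement": REQUIREMENT_RESPONSES.get(requirement),
            "target": TARGET_RESPONSES.get(target)}
```

For instance, an input containing "go to the meeting room" triggers both the requirement and the target correspondence.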
In one embodiment of the invention, when obtaining the response information, if the response information of the voice segment sent by the server is received within a third preset time period after the cloud recognition result of the voice segment is received, the received response information can be determined as the response information of the voice segment and output. The third preset duration may be 500 milliseconds, 800 milliseconds, 1000 milliseconds, etc.
Because the operation and storage resources of the server are more sufficient than those of the intelligent device, the server can run a more complex model to recognize the voice to be recognized. The cloud recognition result obtained by the server recognizing the voice segment is therefore more accurate than the offline recognition result obtained by the intelligent device, the semantic analysis result obtained by the server from the cloud recognition result is more accurate, and finally the response information obtained by the server is more accurate.
Therefore, if the response information sent by the server is received within the third preset time period after the cloud recognition result of the voice segment is received, the received response information is determined to be the response information for responding to the voice segment, which ensures accuracy. If it is not received within the third preset time period, the network is poor; to ensure response efficiency, the intelligent device no longer waits for the server and instead uses the response information it determines based on the offline semantic analysis model to respond to the voice segment.
Specifically, under the condition that the network is poor and the response information sent by the server cannot be obtained, the intelligent equipment can determine the response information for responding to the voice fragment offline; in the case of a good network, the server may determine the response information and send the determined response information to the intelligent device, where the intelligent device uses the received response information as the response information for responding to the speech segment.
In one embodiment of the invention, when the offline recognition success rate does not reach the preset threshold, the accuracy with which the intelligent device recognizes the voice segment is low, and so is the accuracy of response information determined by the intelligent device from the offline recognition result. In this case, the intelligent device can wait for the response information determined by the server according to the cloud recognition result. If the network worsens and the response information returned by the server is not received within a preset waiting time, a timeout prompt can be shown to the user, who may choose to continue waiting, request manual service, etc. If the response information returned by the server is received within the preset waiting time, the response is made according to the received response information. The waiting time is generally longer than the third preset time period.
In one embodiment of the invention, when the offline semantic analysis model does not support semantic analysis of the offline recognition result of the voice segment, so that the intelligent device cannot determine the response information offline, the intelligent device waits for and uses the response information sent by the server.
In one embodiment of the invention, as the operation resources, storage resources and the like of the server are more sufficient than those of the intelligent equipment, the server can run a more complex model to determine the response information, and the accuracy of the response information determined by the server is higher than that of the response information determined by the intelligent equipment. Therefore, under the condition that the network is good, response information which is sent by the server and is determined by semantic analysis based on the cloud identification result is adopted, and under the condition that the network is poor, in order to avoid the user waiting for the response information returned by the server for a long time, the response information determined by the intelligent equipment is adopted to respond to the user.
Specifically, the response information is determined offline by the intelligent device when the offline recognition success rate reaches the preset threshold and at least one of the following conditions is satisfied:
Condition 1: the cloud recognition result sent by the server is not received before the end of the voice to be recognized is detected.
In this case, either the connection to the server could not be successfully established before the voice to be recognized ended, or the connection was established but, due to the poor network, the voice to be recognized could not be successfully sent to the server or the cloud recognition result returned by the server could not be successfully received. If the intelligent device waited for the server to return response information, it might take a long time and give the user a poor experience.
Condition 2: the cloud recognition result of the voice segment is not received within the second preset time period after the end of the voice to be recognized is detected.
Specifically, if a cloud recognition result is received before the end of the voice to be recognized is detected, the intelligent device successfully connected to the server before the voice ended, the server successfully recognized part of the voice to be recognized, and that partial cloud recognition result was successfully delivered to the intelligent device. In order to obtain all the cloud recognition results sent by the server, the intelligent device continues to wait for the second preset duration.
If all the recognition results are not received within the second preset time period, the network has worsened; continuing to wait for the response information from the server would require first waiting for all the cloud recognition results to be returned and then waiting again for the response information, which is time-consuming. In this case, to avoid a long wait for the user, the response information is determined by the intelligent device.
Condition 3: after the cloud recognition result of the voice segment is received, the response information of the voice segment sent by the server is not received within the third preset time period.
If the cloud recognition result is received before the end of the voice to be recognized, and all the cloud recognition results are successfully received within the second preset time period after the voice ends, the network may have returned to normal; to ensure the user is responded to correctly, the intelligent device continues to wait for the response information returned by the server.
If, after all the cloud recognition results are received, the response information sent by the server is not received within the third preset time period, the network has degraded again. At this point the user has already waited for a long time, so to avoid further waiting, the response information is determined directly by the intelligent device.
In one embodiment of the invention, when the intelligent device determines the response information offline and has received the cloud recognition result of the voice segment sent by the server, it can perform semantic analysis on the cloud recognition result based on the offline semantic analysis model to obtain and output the response information of the voice segment. Because the cloud recognition result is more accurate than the offline recognition result, performing semantic analysis offline on the cloud recognition result, in the case where that result has been received but the response information from the server cannot be, improves both the voice interaction efficiency and the accuracy of the obtained response information.
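The fallback logic across Conditions 1 to 3 can be sketched as a selector; the boolean parameters model the three timeout checks, and the string return values are illustrative labels only:

```python
def response_source(offline_success_rate, threshold,
                    cloud_result_before_end, segment_result_in_window,
                    server_response_in_window):
    """Pick who produces the response information.  Each boolean is
    True when the corresponding result arrived in time."""
    fallback = (not cloud_result_before_end          # Condition 1
                or not segment_result_in_window      # Condition 2
                or not server_response_in_window)    # Condition 3
    if fallback and offline_success_rate >= threshold:
        return "device-offline"
    if server_response_in_window:
        return "server"
    # Offline analysis not accurate enough: wait longer, then show a
    # timeout prompt to the user if needed.
    return "wait-or-prompt"
```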
The above-mentioned voice interaction scheme is described in the following with reference to the signaling diagram.
Referring to fig. 4, fig. 4 is a signaling diagram of another voice interaction process according to an embodiment of the present invention.
The steps of recognizing the voice to be recognized are the same as those described for fig. 2; that is, the steps in fig. 4 before updating the display result are the same as those in fig. 2, and a detailed description thereof is omitted.
After updating the display result, the server performs semantic analysis on the cloud identification result to obtain a cloud semantic analysis result, determines response information based on the cloud semantic analysis result, and sends the response information to the intelligent device;
If the intelligent equipment receives the response information sent by the server within a third preset time after receiving the cloud identification result of the voice fragment, outputting the response information returned by the server;
if the intelligent device does not receive the response information sent by the server within a third preset time after receiving the cloud identification result of the voice fragment, the intelligent device performs semantic analysis offline to obtain a semantic analysis result, determines the response information based on the semantic analysis result, and outputs the response information.
When the scheme provided by this embodiment is applied to voice interaction, using the server to determine the response information improves response accuracy. When the network is poor and the response information returned by the server cannot be received, the intelligent device determines the response information offline, so the user does not need to wait a long time for a response result, which improves voice interaction efficiency.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a voice interaction device according to an embodiment of the present invention, where the device includes:
the network connection module 501 is configured to request to establish a connection with a server after detecting that a voice to be recognized starts;
the voice recognition module 502 is configured to, if connection with the server is not successfully established within a first preset duration, recognize the collected voice to be recognized based on an offline voice recognition model, obtain an offline recognition result, and display the offline recognition result;
The result receiving module 503 is configured to send the voice to be recognized to the server if the connection is successfully established with the server after the first preset duration, and receive a cloud recognition result of the voice to be recognized sent by the server;
and the display updating module 504 is configured to update the displayed recognition result according to the received cloud recognition result if the update requirement is satisfied after the end of the voice to be recognized is detected.
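The behavior of the network connection and voice recognition modules can be sketched as below. This is a rough sketch under assumed names (`recognize`, the `try_connect`/`offline_asr`/`cloud_asr` callbacks), not the patent's implementation: the device keeps trying to establish a connection until the first preset duration elapses, then either recognizes offline or sends the speech to the server.

```python
import time

def recognize(collect_audio, try_connect, offline_asr, cloud_asr, first_timeout):
    """Connection fallback sketched from the modules above: request a server
    connection after speech start; if none is established within the first
    preset duration, recognize the collected speech with the offline model."""
    deadline = time.monotonic() + first_timeout
    server = try_connect()
    while server is None and time.monotonic() < deadline:
        server = try_connect()  # keep retrying until the deadline passes
    audio = collect_audio()
    if server is None:
        # No connection within the first preset duration: offline recognition.
        return ("offline", offline_asr(audio))
    # Connection established: send the speech to the server for recognition.
    return ("cloud", cloud_asr(server, audio))
```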
In one embodiment of the present invention, the display update module 504 is specifically configured to:
if the cloud recognition result of the voice segment returned by the server is received within a second preset duration after the end of the voice to be recognized is detected, it is determined that the update requirement is met, and the displayed recognition result is updated according to the received cloud recognition result of the voice segment, wherein the voice segment comprises the voice to be recognized collected from the detected start of the voice to be recognized to the detected end of the voice to be recognized.
In one embodiment of the present invention, the display update module 504 is specifically configured to:
if the cloud recognition result of the voice to be recognized returned by the server is received before the end of the voice to be recognized is detected, and the cloud recognition result of the voice segment returned by the server is received within the second preset duration after the end of the voice to be recognized, the displayed recognition result is updated according to the received cloud recognition result.
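The update rule in these embodiments reduces to a simple check, sketched here with illustrative names (the patent leaves the concrete values of the timeouts open): the cloud result of the whole segment replaces the displayed offline result only if it arrived within the second preset duration after the end of speech was detected.

```python
from typing import Optional

def updated_display(displayed_offline: str, cloud_result: Optional[str],
                    elapsed_since_end: float, second_timeout: float) -> str:
    """Return what the screen should show: the cloud recognition result if it
    arrived within the second preset duration after speech end was detected,
    otherwise the offline result already on display."""
    if cloud_result is not None and elapsed_since_end <= second_timeout:
        return cloud_result
    return displayed_offline
```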
When the scheme provided by this embodiment is applied to voice interaction, after the intelligent device detects the start of the voice to be recognized, it requests to establish a connection with the server; if the connection is not successfully established within the first preset duration, the collected voice to be recognized is recognized based on the offline voice recognition model, and the offline recognition result is obtained and displayed.
Referring to fig. 6, in one embodiment of the invention, the apparatus further comprises:
the response information obtaining module 505 is configured to determine, according to the offline recognition result of the voice segment, an offline recognition success rate at which the offline voice recognition model correctly recognizes the voice segment, and, if the offline recognition success rate reaches a preset threshold, perform semantic analysis on the offline recognition result of the voice segment based on the offline semantic analysis model, obtain response information corresponding to the voice segment, and output the response information, wherein the voice segment comprises the voice to be recognized collected from the detected start of the voice to be recognized to the detected end of the voice to be recognized.
In one embodiment of the present invention, the response information obtaining module 505 is specifically configured to:
if any one of the following conditions is met and the offline recognition success rate reaches the preset threshold, semantic analysis is performed on the offline recognition result of the voice segment based on the offline semantic analysis model, response information corresponding to the voice segment is obtained, and the response information is output:
the cloud recognition result sent by the server is not received before the end of the voice to be recognized is detected;
the cloud recognition result of the voice segment is not received within the second preset duration after the end of the voice to be recognized is detected;
the response information of the voice segment sent by the server is not received within a third preset duration after the cloud recognition result of the voice segment is received.
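The trigger condition for offline semantic analysis can be expressed as one boolean check. The sketch below uses assumed names, and the 0.8 threshold is an example value only; the patent merely requires "a preset threshold":

```python
def should_answer_offline(received_before_end: bool,
                          received_within_second: bool,
                          response_within_third: bool,
                          success_rate: float,
                          threshold: float = 0.8) -> bool:
    """Offline semantic analysis is triggered when the offline recognition
    success rate reaches the preset threshold AND any of the three listed
    conditions (a missing cloud result or a missing server response) holds."""
    any_condition = (not received_before_end
                     or not received_within_second
                     or not response_within_third)
    return any_condition and success_rate >= threshold
```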
In one embodiment of the present invention, the response information obtaining module 505 is specifically configured to:
counting the total number of characters corresponding to the voice segment, and counting the number of characters successfully recognized in the offline recognition result of the voice segment;
calculating the ratio of the number of successfully recognized characters to the total number of characters to obtain the offline recognition success rate;
and if the offline recognition success rate reaches the preset threshold, performing semantic analysis on the offline recognition result of the voice segment based on the offline semantic analysis model, obtaining response information corresponding to the voice segment, and outputting the response information.
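The success-rate computation described above is a character-level ratio; a minimal sketch (function name assumed, with a guard for an empty segment that the patent does not address):

```python
def offline_recognition_success_rate(total_chars: int, recognized_chars: int) -> float:
    """Ratio of successfully recognized characters to the total number of
    characters corresponding to the voice segment."""
    if total_chars == 0:
        return 0.0  # avoid division by zero on an empty segment
    return recognized_chars / total_chars
```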
In one embodiment of the invention, the apparatus further comprises:
the response information receiving module is configured to, if the response information of the voice segment sent by the server is received within a third preset duration after the cloud recognition result of the voice segment is received, determine the received response information as the response information of the voice segment and output it, wherein the voice segment comprises the voice to be recognized collected from the detected start of the voice to be recognized to the detected end of the voice to be recognized.
When the scheme provided by this embodiment is applied to voice interaction, using the server to determine the response information improves response accuracy. When the network is poor and the response information returned by the server cannot be received, the intelligent device determines the response information itself, so the user does not need to wait a long time for a response result, which improves voice interaction efficiency.
The embodiment of the present invention further provides an electronic device, as shown in fig. 7, comprising a processor 701, a communication interface 702, a memory 703 and a communication bus 704, where the processor 701, the communication interface 702 and the memory 703 communicate with each other through the communication bus 704:
a memory 703 for storing a computer program;
the processor 701 is configured to implement any of the steps of the voice interaction method described above when executing the program stored in the memory 703.
The communication bus of the electronic device may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus, an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (Non-Volatile Memory, NVM), such as at least one magnetic disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, there is also provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the steps of any of the above-described voice interaction methods.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the voice interaction methods of the above embodiments.
When the scheme provided by this embodiment is applied to voice interaction, after the intelligent device detects the start of the voice to be recognized, it requests to establish a connection with the server; if the connection is not successfully established within the first preset duration, the collected voice to be recognized is recognized based on the offline voice recognition model, and the offline recognition result is obtained and displayed.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between the entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment mainly describes its differences from the others. In particular, the apparatus, electronic device, computer-readable storage medium, and computer program product embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.
Claims (9)
1. A method of voice interaction, the method comprising:
after detecting the start of the voice to be recognized, requesting to establish connection with a server;
if the connection with the server is not successfully established within the first preset time period, the collected voice to be recognized is recognized based on an offline voice recognition model, and an offline recognition result is obtained and displayed;
if connection is successfully established with the server after the first preset time length, sending the voice to be recognized to the server, and receiving a cloud recognition result of the voice to be recognized, which is sent by the server;
if the cloud recognition result of the voice segment returned by the server is received within a second preset duration after the end of the voice to be recognized is detected, determining that an update requirement is met, and updating the displayed recognition result according to the received cloud recognition result of the voice segment, wherein the voice segment comprises the voice to be recognized collected from the detected start of the voice to be recognized to the detected end of the voice to be recognized.
2. The method of claim 1, wherein, if the cloud recognition result of the voice segment returned by the server is received within the second preset duration after the end of the voice to be recognized is detected, determining that the update requirement is met and updating the displayed recognition result according to the received cloud recognition result of the voice segment comprises:
if the cloud recognition result of the voice to be recognized returned by the server is received before the end of the voice to be recognized is detected, and the cloud recognition result of the voice segment returned by the server is received within the second preset duration after the end of the voice to be recognized, determining that the update requirement is met, and updating the displayed recognition result according to the received cloud recognition result.
3. The method according to any one of claims 1-2, wherein the method further comprises:
determining, according to the offline recognition result of the voice segment, an offline recognition success rate at which the offline voice recognition model correctly recognizes the voice segment, wherein the voice segment comprises the voice to be recognized collected from the detected start of the voice to be recognized to the detected end of the voice to be recognized;
and if the offline recognition success rate reaches a preset threshold, performing semantic analysis on the offline recognition result of the voice segment based on an offline semantic analysis model, obtaining response information corresponding to the voice segment, and outputting the response information.
4. The method of claim 3, wherein determining, according to the offline recognition result of the voice segment, the offline recognition success rate at which the offline voice recognition model correctly recognizes the voice segment, and, if the offline recognition success rate reaches the preset threshold, performing semantic analysis on the offline recognition result of the voice segment based on the offline semantic analysis model, obtaining response information corresponding to the voice segment, and outputting the response information, comprises:
if any one of the following conditions is met and the offline recognition success rate reaches the preset threshold, performing semantic analysis on the offline recognition result of the voice segment based on the offline semantic analysis model, obtaining response information corresponding to the voice segment, and outputting the response information:
the cloud recognition result sent by the server is not received before the end of the voice to be recognized is detected;
the cloud recognition result of the voice segment is not received within the second preset duration after the end of the voice to be recognized is detected;
the response information of the voice segment sent by the server is not received within a third preset duration after the cloud recognition result of the voice segment is received.
5. The method of claim 3, wherein determining, according to the offline recognition result of the voice segment, the offline recognition success rate at which the offline voice recognition model correctly recognizes the voice segment comprises:
counting the total number of characters corresponding to the voice segment, and counting the number of characters successfully recognized in the offline recognition result of the voice segment;
and calculating the ratio of the number of successfully recognized characters to the total number of characters to obtain the offline recognition success rate.
6. The method according to any one of claims 1-2, wherein the method further comprises:
if the response information of the voice segment sent by the server is received within a third preset duration after the cloud recognition result of the voice segment is received, determining the received response information as the response information of the voice segment and outputting it, wherein the voice segment comprises the voice to be recognized collected from the detected start of the voice to be recognized to the detected end of the voice to be recognized.
7. A voice interaction device, the device comprising:
the network connection module is used for requesting to establish connection with the server after detecting the start of the voice to be recognized;
the voice recognition module is used for recognizing the collected voice to be recognized based on the offline voice recognition model if the connection with the server is not successfully established within the first preset time length, and obtaining and displaying an offline recognition result;
the result receiving module is used for sending the voice to be recognized to the server if the connection with the server is successfully established after the first preset time length, and receiving a cloud recognition result of the voice to be recognized, which is sent by the server;
the display updating module is configured to, if the cloud recognition result of the voice segment returned by the server is received within a second preset duration after the end of the voice to be recognized is detected, determine that an update requirement is met, and update the displayed recognition result according to the received cloud recognition result of the voice segment, wherein the voice segment comprises the voice to be recognized collected from the detected start of the voice to be recognized to the detected end of the voice to be recognized.
8. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
A memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1-6 when executing a program stored on a memory.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911402606.7A CN113129896B (en) | 2019-12-30 | 2019-12-30 | Voice interaction method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113129896A CN113129896A (en) | 2021-07-16 |
CN113129896B true CN113129896B (en) | 2023-12-12 |
Family
ID=76768302
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911402606.7A Active CN113129896B (en) | 2019-12-30 | 2019-12-30 | Voice interaction method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113129896B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102708865A (en) * | 2012-04-25 | 2012-10-03 | 北京车音网科技有限公司 | Method, device and system for voice recognition |
CN105261366A (en) * | 2015-08-31 | 2016-01-20 | 努比亚技术有限公司 | Voice identification method, voice engine and terminal |
CN106847291A (en) * | 2017-02-20 | 2017-06-13 | 成都启英泰伦科技有限公司 | Speech recognition system and method that a kind of local and high in the clouds is combined |
CN107785019A (en) * | 2017-10-26 | 2018-03-09 | 西安Tcl软件开发有限公司 | Mobile unit and its audio recognition method, readable storage medium storing program for executing |
CN107919130A (en) * | 2017-11-06 | 2018-04-17 | 百度在线网络技术(北京)有限公司 | Method of speech processing and device based on high in the clouds |
CN109961792A (en) * | 2019-03-04 | 2019-07-02 | 百度在线网络技术(北京)有限公司 | The method and apparatus of voice for identification |
Also Published As
Publication number | Publication date |
---|---|
CN113129896A (en) | 2021-07-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107240398B (en) | Intelligent voice interaction method and device | |
CN107316643B (en) | Voice interaction method and device | |
CN108962282B (en) | Voice detection analysis method and device, computer equipment and storage medium | |
CN109961792B (en) | Method and apparatus for recognizing speech | |
EP3767623A1 (en) | Server side hotwording | |
CN111797632B (en) | Information processing method and device and electronic equipment | |
CN108447471A (en) | Audio recognition method and speech recognition equipment | |
CN109036471B (en) | Voice endpoint detection method and device | |
CN105336324A (en) | Language identification method and device | |
CN110600008A (en) | Voice wake-up optimization method and system | |
CN112017642B (en) | Speech recognition method, apparatus, device and computer readable storage medium | |
CN112466302A (en) | Voice interaction method and device, electronic equipment and storage medium | |
CN112002349B (en) | Voice endpoint detection method and device | |
CN112767916A (en) | Voice interaction method, device, equipment, medium and product of intelligent voice equipment | |
CN110956958A (en) | Searching method, searching device, terminal equipment and storage medium | |
CN113658586A (en) | Training method of voice recognition model, voice interaction method and device | |
CN112863496B (en) | Voice endpoint detection method and device | |
CN113129896B (en) | Voice interaction method and device, electronic equipment and storage medium | |
CN111554288A (en) | Awakening method and device of intelligent device, electronic device and medium | |
CN113241071B (en) | Voice processing method, electronic equipment and storage medium | |
CN113129904B (en) | Voiceprint determination method, apparatus, system, device and storage medium | |
CN109255131B (en) | Translation method, translation device, translation terminal and storage medium | |
CN114420121A (en) | Voice interaction method, electronic device and storage medium | |
CN110189770B (en) | Voice data processing method, device, terminal, server and medium | |
CN113936649A (en) | Voice processing method and device and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||