CN112466304B - Offline voice interaction method, device, system, equipment and storage medium

Offline voice interaction method, device, system, equipment and storage medium

Info

Publication number: CN112466304B
Application number: CN202011411215.4A
Authority: CN (China)
Prior art keywords: voice, voice signal, wake, recognized, backtracking
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN112466304A (en)
Inventor: 孙洪菠
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Events: application CN202011411215.4A filed by Beijing Baidu Netcom Science and Technology Co Ltd; publication of application CN112466304A; application granted; publication of CN112466304B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 2015/223: Execution procedure of a spoken command
    • G10L 2015/225: Feedback of the input speech
    • Y02D: Climate change mitigation technologies in information and communication technologies (ICT), i.e. aiming at the reduction of their own energy use
    • Y02D 30/70: Reducing energy consumption in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses an offline voice interaction method, device, system, equipment and storage medium, relating to the field of computer technology and in particular to artificial intelligence technologies such as speech and deep learning. The offline voice interaction method comprises the following steps: after the local terminal wakes up, continuously transmitting the voice signal to be recognized sent by the user to a decoder in the local terminal, so that the decoder continuously decodes the voice signal to be recognized to obtain a voice recognition result; continuously receiving the voice recognition result sent by the decoder, and continuously responding to the voice recognition result until an ending instruction sent by the user is received; and after receiving the ending instruction, ending the continuous interaction. The application can support continuous recognition after a single wake-up in an offline voice interaction scenario.

Description

Offline voice interaction method, device, system, equipment and storage medium
Technical Field
The application relates to the field of computer technology, in particular to artificial intelligence technologies such as speech and deep learning, and specifically to an offline voice interaction method, device, system, equipment and storage medium.
Background
With the popularization of computer technology, daily life has gradually entered the intelligent era. Beyond computers, mobile phones and tablets, intelligent technologies have begun to serve people's everyday needs, for example smart televisions, smart navigation and smart homes, providing convenient and fast services in many aspects of life. Voice interaction belongs to the category of human-machine interaction; it is a new generation of interaction mode based on voice input, a process in which humans give instructions to a machine in natural language in order to achieve their goals.
A voice interaction process generally comprises stages such as wake-up, voice recognition and voice synthesis. In the prior art, only a single recognition after wake-up is supported: once the intelligent device has been woken up, it executes only one instruction, and if the intelligent device needs to be controlled again later, it must be woken up again before a new instruction can be issued.
Disclosure of Invention
The application provides an offline voice interaction method, an offline voice interaction device, an offline voice interaction system, offline voice interaction equipment and an offline voice interaction storage medium.
According to an aspect of the present application, there is provided an offline voice interaction method, including: after the local terminal wakes up, continuously transmitting a voice signal to be recognized sent by a user to a decoder in the local terminal, so that the decoder continuously decodes the voice signal to be recognized to obtain a voice recognition result; continuously receiving the voice recognition result sent by the decoder, and continuously responding to the voice recognition result until receiving an ending instruction sent by the user; and after receiving the ending instruction, ending the continuous interaction.
According to another aspect of the present application, there is provided an offline voice interaction apparatus, comprising: the transmission unit is used for continuously transmitting the voice signal to be recognized sent by the user to a decoder in the local terminal after the local terminal is awakened, so that the decoder continuously decodes the voice signal to be recognized to obtain a voice recognition result; the response unit is used for continuously receiving the voice recognition result sent by the decoder and continuously responding to the voice recognition result until receiving an ending instruction sent by the user; and the ending unit is used for ending the continuous interaction after receiving the ending instruction.
According to another aspect of the present application, there is provided an offline voice interaction system comprising the apparatus of any one of the above aspects.
According to another aspect of the present application, there is provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present application there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of the above aspects.
According to the technical scheme of the application, after the local terminal is woken up, voice signals are continuously transmitted and processed, and the voice interaction ends only when the user actively issues an end instruction. Continuous recognition after a single wake-up can therefore be supported in an offline voice interaction scenario, which improves user experience, avoids resource waste and improves voice interaction efficiency.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
FIG. 1 is a schematic diagram of a first embodiment according to the present application;
FIG. 2 is a schematic diagram of an offline voice interaction system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a second embodiment according to the present application;
FIG. 4 is a schematic diagram of a retrospective speech signal according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a third embodiment according to the present application;
FIG. 6 is a schematic diagram of a fourth embodiment according to the present application;
FIG. 7 is a schematic diagram of a fifth embodiment according to the present application;
FIG. 8 is a schematic diagram of an electronic device for implementing any of the offline voice interaction methods according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, voice interaction supports only a single recognition after wake-up. Suppose, for example, that the wake-up word is "Xiaodu". In a music scenario, a user who wants an intelligent device (such as a smart speaker) to play music first says "Xiaodu"; the smart speaker replies with a response word (such as "I'm here"); the user then speaks the voice command "play music", and after recognition the smart speaker performs the music-playing operation. If, once the music is playing, the user finds it is not the piece he wants to listen to and needs to change it, then in the related art the user must wake the smart speaker again: he says "Xiaodu" once more, the smart speaker again responds "I'm here", the user speaks the new voice command "play another song", and the smart speaker responds by switching to another piece of music. If the user then needs to turn the volume up, he must again say "Xiaodu", wait for the "I'm here" response, and speak the new voice command "turn up the volume" before the smart speaker raises the playback volume. As this flow shows, when the user needs to issue multiple voice commands, the voice interaction process of the related art wakes the intelligent device multiple times. For the user, the operation is cumbersome: the intelligent device must be woken before every new voice instruction, which hurts the user experience. For the intelligent device, the wake-up word must be recognized repeatedly, which wastes resources and reduces voice interaction efficiency.
In order to improve user experience, avoid resource waste and improve voice interaction efficiency, the application provides the following embodiments.
FIG. 1 is a schematic diagram according to a first embodiment of the present application. This embodiment provides an offline voice interaction method, which comprises the following steps:
101. After the local terminal wakes up, the voice signal to be recognized sent by the user is continuously transmitted to a decoder in the local terminal, so that the decoder continuously decodes the voice signal to be recognized to obtain a voice recognition result.
102. And continuously receiving the voice recognition result sent by the decoder, and continuously responding to the voice recognition result until receiving an ending instruction sent by the user.
103. And after receiving the ending instruction, ending the continuous interaction.
The offline voice interaction method provided by this embodiment is applied in an offline voice interaction scenario, so the execution subject of this embodiment is the local terminal used by the user. The specific form of the local terminal is not limited; it covers any intelligent device configured with an offline voice interaction function for the user, for example a vehicle-mounted terminal, a smart home terminal, or one of various mobile devices, including: mobile phones, tablet computers, handheld computing devices, PDAs (personal digital assistants), portable media players, devices that use headsets and headphones (e.g., Bluetooth-compatible devices), phablet devices (i.e., combined smartphone/tablet devices), wearable computers, and the like.
The local terminal can perform voice interaction with the user based on the offline voice interaction system; further, the offline voice interaction system can comprise a voice interaction interface so that the user can input voice instructions through it. The voice interaction interface may be provided by an APP (application), a web page, a program or the like, which is not limited by the present application. The APP may be installed explicitly on the interface of the local terminal, or may be called up by the user through a specific hardware and/or software button; the application is not limited in this regard either.
In this embodiment, "sustained" as opposed to "single" refers to being in progress rather than ending until offline voice interaction is not completed. For example, after the local terminal wakes up, N voice signals are received before an end instruction sent by the user is received, if the processing is single, only the first voice signal after waking up is processed, and the rest N-1 voice signals are regarded as invalid voice signals and are not processed. In this embodiment, all of the N voice signals are processed. Therefore, multiple interactions after single wake-up can be realized instead of single wake-up and single recognition.
As shown in FIG. 2, the offline voice interaction system 200 may include: a data acquisition module 201, a wake-up module 202, a recognition processing module 203, a voice endpoint detection module 204 and a decoder 205.
In connection with the offline voice interaction system shown in FIG. 2, the execution subject of the method shown in FIG. 1 may specifically be the recognition processing module 203.
In the offline voice interaction process, after the intelligent device collects a voice signal, it judges whether the voice signal contains the wake-up word. Once the wake-up word has been detected, the voice signals received afterwards are treated as voice signals to be recognized and go through subsequent recognition, response and other processing; for example, if the voice signal to be recognized is "play music", the device recognizes it and responds by performing the music-playing operation.
In the embodiments of the present application, for the sake of distinction, a voice signal received before the local terminal successfully wakes up is referred to as a "wake-up voice signal"; a wake-up voice signal may or may not contain the wake-up word. For example, if the wake-up word is "Xiaodu", a voice signal containing "Xiaodu" is a wake-up voice signal. A voice signal received after the local terminal has successfully woken up is referred to as a voice signal to be recognized; for example, after the local terminal has been successfully woken with the wake-up word "Xiaodu", a subsequent voice signal such as "play music" is a voice signal to be recognized.
The data acquisition module 201 is used to acquire voice signals. For example, after the user utters a voice signal, a microphone array collects it; the microphone array may either process (e.g. enhance) the voice signal or leave it as it is, and then sends the unprocessed voice signal (which may be referred to as the original voice signal) or the processed voice signal to the data acquisition module 201.
After the data acquisition module 201 acquires a voice signal, if it is a wake-up voice signal, it is sent to the wake-up module 202. Specifically, if the wake-up identifier fed back by the wake-up module 202 has not yet been received, the currently acquired voice signal is sent to the wake-up module 202 as a wake-up voice signal; once the wake-up identifier fed back by the wake-up module 202 has been received, the currently acquired voice signal is treated as a voice signal to be recognized and is no longer sent to the wake-up module. In addition, whether a signal is a wake-up voice signal or a voice signal to be recognized, the data acquisition module 201 sends it to the recognition processing module 203, as sketched below.
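A minimal sketch of this routing behavior follows; the class and method names are assumptions made for illustration, not the patent's implementation.

```python
# Hypothetical sketch of the data acquisition module's routing logic.
class DataAcquisitionModule:
    def __init__(self, wake_module, recognition_module, vad_module):
        self.wake_module = wake_module
        self.recognition_module = recognition_module
        self.vad_module = vad_module
        self.woken = False  # becomes True once the wake-up identifier arrives

    def on_wake_identifier(self, watermark_value: int) -> None:
        # Wake-up identifier fed back by the wake-up module (e.g. a voice
        # watermark value); forward it to the recognition processing module.
        self.woken = True
        self.recognition_module.on_wake_identifier(watermark_value)

    def on_audio(self, segment: bytes) -> None:
        # Every segment is sent to the recognition processing module,
        # whether it is wake-up speech or speech to be recognized.
        self.recognition_module.on_audio(segment)
        if not self.woken:
            self.wake_module.on_audio(segment)   # wake-up voice signal
        else:
            self.vad_module.on_audio(segment)    # voice signal to be recognized
```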
The wake-up module 202 is configured to detect whether the voice signal contains the wake-up word and to determine that the local terminal has woken up successfully when it does; when it does not, detection continues on subsequent voice signals. The wake-up module 202 may use various related techniques to detect the wake-up word, for example dividing the voice signal into multiple frames, extracting the voice features of each frame, and judging from the voice features and a wake-up acoustic model whether the frame contains the wake-up word.
After detecting that the voice signal contains the wake-up word, the wake-up module 202 feeds the wake-up identifier back to the data acquisition module 201. Upon receiving the wake-up identifier, the data acquisition module 201 determines that the local terminal has woken up successfully and carries out the subsequent post-wake-up processing; for example, feedback of response information by the local terminal can be triggered, such as the local terminal replying with the response word "I'm here" after the user wakes it with the wake-up word "Xiaodu".
After receiving the wake-up identifier fed back by the wake-up module 202, the data acquisition module 201 may send the wake-up identifier to the recognition processing module 203, so that the recognition processing module 203 determines a wake-up time point according to the wake-up success information, and performs subsequent processing based on the wake-up time point. And, after receiving the wake-up identifier, the data acquisition module 201 continuously transmits the voice signal received later to the recognition processing module 203 and the voice endpoint detection module 204 as the voice signal to be recognized.
The voice endpoint detection module 204 is configured to detect a voice start point and a voice tail point of a voice signal to be recognized, and send the detected voice start point and voice tail point to the recognition processing module 203. The voice endpoint detection module 204 is, for example, a voice activity detection (Voice Activity Detection, VAD) module. The voice endpoint detection module 204 may employ various related techniques to detect voice endpoints (voice start point and voice end point), for example, to extract voice features of a voice signal, and then detect the voice endpoints according to the voice features and the voice endpoint detection model.
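Since the description leaves the concrete endpoint-detection technique open, the following is only a toy energy-threshold detector illustrating what "voice start point" and "voice tail point" mean; a real VAD module would typically use voice features and a trained endpoint detection model as described above.

```python
import numpy as np

def detect_endpoints(frames: np.ndarray, threshold: float = 0.01):
    """frames: array of shape (num_frames, frame_len); returns the indices of
    the voice start point and voice tail point, or None if no speech is found."""
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)  # per-frame energy
    voiced = np.flatnonzero(energy > threshold)
    if voiced.size == 0:
        return None
    return int(voiced[0]), int(voiced[-1])
```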
The recognition processing module 203 is configured to determine the wake-up time point according to the received wake-up identifier, determine a backtracking start point with the wake-up time point as the base point, send the voice signal between the backtracking start point and the tail point of the first voice signal to be recognized to the decoder 205 as the backtracking voice signal, and, for each non-first voice signal to be recognized, select the signal between the voice start point and voice tail point detected by the voice endpoint detection module 204 and continuously transmit it to the decoder 205.
The decoder 205 is configured to perform decoding processing on the received voice signal to be recognized, obtain a voice recognition result, and continuously send the voice recognition result to the recognition processing module 203. The decoder may perform decoding processing using various correlation techniques, such as extracting speech features of the speech signal, and recognizing speech recognition results based on the speech features and an offline speech recognition model. When decoding, for example, "play music" in the form of speech is recognized as "play music" in the form of text.
The recognition processing module 203 is further configured to continuously receive the speech recognition result sent by the decoder, and continuously respond to the speech recognition result. For example, if the speech recognition result is "play music", the music playing interface is called to play music.
In the related art, only a single recognition is supported after the local terminal wakes up. In this embodiment, after the terminal wakes up, the data acquisition module, the voice endpoint detection module, the recognition processing module and the decoder support continuous voice transmission, continuous voice endpoint detection, continuous voice decoding, continuous response to voice recognition results and so on, until the end instruction sent by the user is received. For example, after the local terminal has been woken with "Xiaodu", the user successively speaks voice signals such as "play music", "play another song" and "turn up the volume". The related art recognizes and responds only to the "play music" operation, without responding to "play another song" or "turn up the volume"; in this embodiment, each of these operations is responded to in turn.
In the embodiments of the application, the first voice signal to be recognized refers to the first such signal spoken by the user after the local terminal wakes up, such as the above-mentioned "play music"; a non-first voice signal to be recognized is any voice signal to be recognized that follows the first one, such as the above-mentioned "play another song" and "turn up the volume".
The end instruction is issued actively by the user. It may be a voice signal spoken by the user, or an operation instruction generated by the user operating software or hardware on the local terminal. For example, the user may speak a voice signal such as "stop playing"; or an "exit" icon may be provided on the voice interaction interface, with the end instruction generated when the user clicks it; or a hardware button may be used, with the end instruction generated when the user presses a preset end button on the local terminal. The application does not limit the specific form of the end instruction issued by the user; an illustrative check over these forms is sketched below.
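A small illustrative check over the three end-instruction forms mentioned above; the event representation and the constants are assumptions, since the patent does not fix a concrete representation.

```python
END_PHRASES = {"stop playing"}  # assumed spoken end instruction(s)

def is_end_instruction(event: dict) -> bool:
    # event is a hypothetical unified representation of user input.
    if event.get("type") == "speech":
        return event.get("text") in END_PHRASES       # spoken end instruction
    if event.get("type") == "ui":
        return event.get("action") == "exit"          # "exit" icon clicked
    if event.get("type") == "hardware":
        return event.get("button") == "end"           # preset end button pressed
    return False
```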
After the recognition processing module 203 receives the end instruction issued by the user, the continuous interaction is ended: voice signals are no longer sent to the decoder, voice recognition results are no longer responded to, the relevant state can be set accordingly, and information indicating that continuous interaction has been exited successfully can be fed back to the application layer.
The wake-up process of the local terminal is repeatable: after one continuous interaction ends, the next can take place. For example, after the continuous interaction has been ended with the above-mentioned "stop playing", if the user later needs to play music again or to perform another operation such as making a call, the user can wake the local terminal again with the wake-up word "Xiaodu" and start a new continuous interaction. In other words, receiving the end instruction ends only the current continuous interaction, not all continuous interaction; the user can still wake the terminal again afterwards and start a new continuous interaction.
In this embodiment, after the local terminal wakes up, voice signals are continuously transmitted and processed, and the voice interaction ends only when the user actively issues an end instruction. Continuous recognition after a single wake-up can thus be supported in an offline voice interaction scenario, which improves user experience, avoids resource waste and improves voice interaction efficiency.
In the offline voice interaction scenario, all relevant modules, including the decoder, are integrated in the local terminal; for example, all modules of the offline voice interaction system shown in FIG. 2 are integrated on a chip of the local terminal. Constrained by the space and processing capacity of the chip, the success rate of the first recognition may be low. For this reason, the application also provides some embodiments that improve the first-recognition success rate.
In some embodiments, the wake-up is determined according to a wake-up voice signal sent by a user, the voice signal to be recognized includes a first voice signal to be recognized and a non-first voice signal to be recognized, and the continuously transmitting the voice signal to be recognized sent by the user to a decoder in the local terminal includes: determining a backtracking starting point in the awakening voice signal, determining a backtracking voice signal according to the backtracking starting point and the voice signal to be recognized for the first time, and transmitting the backtracking voice signal to a decoder in the local terminal; and continuously acquiring a starting point and a tail point of the non-first-time to-be-recognized voice signal, and continuously transmitting the non-first-time to-be-recognized voice signal between the starting point and the tail point to a decoder in the local terminal.
In this embodiment, by backtracking before the first speech signal to be recognized, the integrity of the first speech signal to be recognized can be ensured, and the success rate of first recognition can be improved.
FIG. 3 is a schematic diagram according to a second embodiment of the present application. This embodiment provides an offline voice interaction method which, in combination with the system shown in FIG. 2, includes:
301-302. After the data acquisition module acquires the wake-up voice signal, the wake-up voice signal is sent to the wake-up module and the recognition processing module.
The wake-up voice signal is, for example, a voice signal uttered by the user that contains the wake-up word "Xiaodu".
It can be appreciated that the timing relationship of the data acquisition module sending the wake-up voice signal to the wake-up module and to the recognition processing module is not limited: the data acquisition module may send the wake-up voice signal to both at the same time, or to the wake-up module first and then to the recognition processing module, or to the recognition processing module first and then to the wake-up module.
303. After receiving the wake-up voice signal, the wake-up module detects the wake-up word in it; upon detecting the wake-up word, it determines that the local terminal has woken up and sends the wake-up identifier to the data acquisition module.
The wake-up identifier is, for example, a voice watermark value.
The data acquisition module can add a voice watermark to the wake-up voice signal and send the watermarked wake-up voice signal to the wake-up module and the recognition processing module. When adding voice watermarks, the data acquisition module can allocate a voice watermark value to each voice watermark, counting in sequence from 0, i.e. the voice watermark values can be 0, 1, 2 and so on. The data acquisition module may add the voice watermark to the voice signal using various related techniques; the manner of adding the voice watermark is not limited in this embodiment.
The wake-up module may process on a per-frame basis when detecting the wake-up word. That is, the voice signal is divided into individual voice frames, for example one voice frame every 32 ms, and each voice frame is checked for the wake-up word. After the wake-up word has been detected, the voice watermark on the voice frame containing the wake-up word can be parsed according to a pre-configured protocol to obtain the corresponding voice watermark value, which is then sent to the data acquisition module as the wake-up identifier.
304-306. After the data acquisition module receives the wake-up identifier, it determines that the local terminal has woken up. It can then send the received wake-up identifier to the recognition processing module, and it takes the voice signals acquired after the local terminal has woken up as voice signals to be recognized and sends them to the recognition processing module and the voice endpoint detection module.
It is understood that the timing relationships of 304-306 are not limited.
307. After receiving the voice signal to be recognized, the voice endpoint detection module detects the starting point and the tail point of the voice signal to be recognized, and sends the starting point and the tail point to the recognition processing module.
308. After receiving the wake-up identifier (i.e. the voice watermark value), the recognition processing module determines the tail point of the voice frame carrying the voice watermark that corresponds to the voice watermark value as the wake-up time point, backtracks forward from the wake-up time point by a preset duration to obtain the backtracking start point, determines the voice signal between the backtracking start point and the tail point of the first voice signal to be recognized as the backtracking voice signal, and transmits the backtracking voice signal to the decoder.
After acquiring the wake-up voice signal, the data acquisition module transmits it not only to the wake-up module but also to the recognition processing module, which can buffer the wake-up voice signal upon receipt. Moreover, as described above, the data acquisition module may add a voice watermark to the transmitted wake-up voice signal. After receiving the voice watermark value as the wake-up identifier, the recognition processing module can parse the voice watermark on the wake-up voice signal according to a pre-configured protocol, find the voice watermark corresponding to the received voice watermark value, determine the voice frame carrying it, and take the tail point of that voice frame as the wake-up time point, as sketched below.
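A minimal sketch of this lookup, assuming one watermark value per 32 ms voice frame (the frame size comes from the description above; the data layout is an assumption):

```python
FRAME_MS = 32  # each voice frame spans 32 ms, per the description above

def wake_time_point_ms(frame_watermarks: list[int], wake_identifier: int) -> int:
    """frame_watermarks[i] is the voice watermark value carried by frame i.
    Returns the tail point (in ms) of the frame whose watermark matches the
    wake-up identifier; that tail point is the wake-up time point."""
    frame_index = frame_watermarks.index(wake_identifier)
    return (frame_index + 1) * FRAME_MS

# e.g. frames watermarked 0, 1, 2, ... and a wake-up identifier of 5:
assert wake_time_point_ms(list(range(10)), 5) == 192
```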
In this embodiment, backtracking forward with the wake-up time point as the reference improves the accuracy of the backtracking start point and thus ensures the integrity of the first voice signal to be recognized.
In this embodiment, basing the wake-up time point on the voice watermark value allows it to be determined simply and accurately.
For example, referring to FIG. 4, the wake-up time point can be determined in the wake-up voice signal according to the above procedure.
The preset duration is generally longer than the duration occupied by the wake-up word; for example, the preset duration is 2080 ms. Referring to FIG. 4, backtracking forward by 2080 ms from the wake-up time point yields the backtracking start point, as in the sketch below.
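Combining the two steps, a minimal sketch of forming the backtracking voice signal (the 2080 ms preset duration is from the description; the byte-buffer layout and constant sampling rate are assumptions):

```python
PRESET_BACKTRACK_MS = 2080  # preset duration, longer than the wake-up word

def backtracking_signal(buffer: bytes, bytes_per_ms: int,
                        wake_time_ms: int, first_tail_ms: int) -> bytes:
    """buffer holds the audio cached since acquisition began.
    Returns the voice signal between the backtracking start point and the
    tail point of the first voice signal to be recognized."""
    start_ms = max(0, wake_time_ms - PRESET_BACKTRACK_MS)  # backtracking start point
    return buffer[start_ms * bytes_per_ms : first_tail_ms * bytes_per_ms]
```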
The voice signal to be recognized can be divided into the first voice signal to be recognized and the non-first voice signals to be recognized. Through the processing of the voice endpoint detection module, the start point and tail point of the first voice signal to be recognized and of each non-first voice signal to be recognized can be detected, and the voice endpoint detection module sends the detected start points and tail points of these voice signals (both first and non-first) to the recognition processing module.
For the first voice signal to be recognized, as shown in FIG. 4, the voice signal between the backtracking start point and the tail point of the first voice signal to be recognized is determined as the backtracking voice signal and sent to the decoder. For example, if the first voice signal to be recognized is "play music", the voice signal between the backtracking start point and the tail point of "play music" is sent to the decoder as the backtracking voice signal.
In this embodiment, the integrity of the first speech signal to be recognized can be ensured at the decoder by sending the speech signal between the backtracking start point and the tail point of the first speech signal to be recognized as the backtracking speech signal to the decoder, so that the success rate of the first recognition is improved.
309. The recognition processing module continuously acquires a starting point and a tail point of the non-first-time to-be-recognized voice signal, and continuously transmits the non-first-time to-be-recognized voice signal between the starting point and the tail point to the decoder.
For the non-first voice signals to be recognized, for example "play another song" and "turn up the volume", the voice endpoint detection module performs endpoint detection on each of them and sends the detected endpoint information (start point and tail point) to the recognition processing module, which, according to that endpoint information, sends each non-first voice signal between its start point and tail point to the decoder.
310. The decoder continuously decodes the voice signal to be recognized to obtain a voice recognition result, and continuously transmits the voice recognition result to the recognition processing module.
Since the first received voice signal, i.e. the backtracking voice signal, contains a certain redundancy, the decoder needs to remove part of it: the preset duration (e.g. 2080 ms) is removed from the beginning, and decoding is then performed on the remaining voice signal, as sketched below.
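A one-line sketch of this trimming; the default of 32 bytes per ms (e.g. 16 kHz, 16-bit mono audio) is purely an assumed example.

```python
def trim_first_signal(signal: bytes, bytes_per_ms: int = 32,
                      preset_ms: int = 2080) -> bytes:
    # Drop the redundant leading preset duration of the backtracking voice
    # signal before decoding; subsequent signals are decoded without trimming.
    return signal[preset_ms * bytes_per_ms:]
```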
To enable the recognition processing module to respond to the voice recognition results in order, the decoder, after decoding a voice recognition result, can add a sequential identifier to it. The sequential identifiers may share the same identifier prefix, e.g. the sequential identifiers sn_1, sn_2 and sn_3.
In this embodiment, by sequentially responding to the voice recognition result, the accuracy of response can be ensured, and the user experience can be improved.
In this embodiment, the sequential identifiers have the same identifier prefix, so that uniform identification can be facilitated.
For example, the sequential identifier of "play music" is sn_1, that of "play another song" is sn_2, that of "turn up the volume" is sn_3, that of "stop playing" is sn_4, and so on.
311. The recognition processing module continuously responds to the voice recognition result.
The recognition processing module may respond to the voice recognition results in order according to the sequential identifiers; for example, the result identified sn_1 is responded to before the result identified sn_2, as sketched below.
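A brief sketch of ordered response keyed on the shared "sn_" prefix; the dictionary shape is an assumption made for the example.

```python
import re

def respond_in_order(results: dict[str, str]) -> None:
    """results maps sequential identifier -> recognized text,
    e.g. {"sn_2": "play another song", "sn_1": "play music"}."""
    def index(identifier: str) -> int:
        return int(re.fullmatch(r"sn_(\d+)", identifier).group(1))
    for identifier in sorted(results, key=index):
        print(f"{identifier}: responding to '{results[identifier]}'")

respond_in_order({"sn_2": "play another song", "sn_1": "play music"})
```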
312. After receiving the ending instruction sent by the user, the recognition processing module ends the continuous interaction.
For example, if the voice recognition result received by the recognition processing module is "stop playing", the current continuous interaction is ended after that result is received.
The following describes the interaction between the user and the local terminal with a specific example. The local terminal is taken to be a vehicle-mounted terminal; in the space inside a vehicle, there is frequently no network or only a poor network.

The voice instructions sent by the user to the local terminal are, in order: "Xiaodu." "Play music." "Louder." "Stop playing."
1) The user speaks the wake-up word, e.g. "Xiaodu", to the vehicle-mounted terminal, and the vehicle-mounted terminal wakes up based on the wake-up word;
2) The vehicle-mounted terminal plays the response sound "I'm here"; the current continuous interaction then starts, and the recognition processing module uploads the backtracking voice signal to the decoder;
3) The user continues with "play music"; the decoder returns the recognition result, and the recognition processing module calls the music resource of the vehicle-mounted terminal to play; the recognition processing module keeps transmitting the voice signal to the decoder;
4) The user continues with "louder"; the decoder returns the recognition result, and the recognition processing module calls the volume resource of the vehicle-mounted terminal to turn the volume up; the recognition processing module keeps transmitting data to the decoder;
5) The user continues with "stop playing"; the decoder returns the recognition result, and the recognition processing module stops playing music, ending the continuous interaction.
In this embodiment, after the local terminal wakes up, voice signals are continuously transmitted and processed, and the voice interaction ends only when the user actively issues an end instruction, so continuous recognition after a single wake-up can be supported in an offline voice interaction scenario, improving user experience, avoiding resource waste and improving voice interaction efficiency. Backtracking before the first voice signal to be recognized ensures its integrity and improves the first-recognition success rate. Backtracking forward from the wake-up time point improves the accuracy of the backtracking start point and thereby further ensures the integrity of the first voice signal to be recognized. Determining the wake-up time point from the voice watermark value makes the wake-up time point simple and accurate to determine. Responding to the voice recognition results in order ensures response accuracy and improves user experience. Sequential identifiers sharing the same identifier prefix facilitate uniform identification.
FIG. 5 is a schematic diagram according to a third embodiment of the present application. As shown in FIG. 5, this embodiment provides an offline voice interaction apparatus; the offline voice interaction apparatus 500 may include a transmission unit 501, a response unit 502 and an ending unit 503. The transmission unit 501 is configured to continuously transmit the voice signal to be recognized sent by the user to a decoder in the local terminal after the local terminal wakes up, so that the decoder continuously decodes the voice signal to be recognized to obtain a voice recognition result; the response unit 502 is configured to continuously receive the voice recognition result sent by the decoder and continuously respond to it until an ending instruction sent by the user is received; the ending unit 503 is configured to end the continuous interaction after the ending instruction is received.
In some embodiments, the wake-up is determined according to a wake-up voice signal sent by the user, and the voice signal to be recognized includes a first voice signal to be recognized and a non-first voice signal to be recognized. Referring to FIG. 6, the apparatus 600 includes a transmission unit 601, a response unit 602 and an ending unit 603, and the transmission unit 601 may include a first transmission module 6011 and a second transmission module 6012.
The first transmission module 6011 is configured to determine a backtracking start point in the wake-up voice signal, determine a backtracking voice signal according to the backtracking start point and the first voice signal to be recognized, and transmit the backtracking voice signal to the decoder in the local terminal; the second transmission module 6012 is configured to continuously acquire the start point and tail point of each non-first voice signal to be recognized, and continuously transmit the non-first voice signal between the start point and the tail point to the decoder in the local terminal.
In some embodiments, the first transmission module 6011 is specifically configured to: determine the wake-up time point corresponding to the wake-up in the wake-up voice signal; and backtrack forward from the wake-up time point by a preset duration to obtain the backtracking start point.
In some embodiments, the first transmission module 6011 is further specifically configured to: receiving a wake-up identifier, wherein the wake-up identifier comprises: a voice watermark value; and determining the tail point of the voice frame where the voice watermark corresponding to the voice watermark value is located as a wake-up time point.
In some embodiments, the first transmission module 6011 is specifically configured to: acquire the tail point of the first voice signal to be recognized; and determine the voice signal between the backtracking start point and the tail point of the first voice signal to be recognized as the backtracking voice signal.
In some embodiments, the voice recognition result includes a sequential identifier, and the response unit 602 is specifically configured to respond to the voice recognition results in order according to the sequential identifiers.
In some embodiments, the sequential identifiers have the same identifier prefix.
In this embodiment, after the local terminal wakes up, voice signals are continuously transmitted and processed, and the voice interaction ends only when the user actively issues an end instruction, so continuous recognition after a single wake-up can be supported in an offline voice interaction scenario, improving user experience, avoiding resource waste and improving voice interaction efficiency. Backtracking before the first voice signal to be recognized ensures its integrity and improves the first-recognition success rate. Backtracking forward from the wake-up time point improves the accuracy of the backtracking start point and thereby further ensures the integrity of the first voice signal to be recognized. Determining the wake-up time point from the voice watermark value makes the wake-up time point simple and accurate to determine. Responding to the voice recognition results in order ensures response accuracy and improves user experience. Sequential identifiers sharing the same identifier prefix facilitate uniform identification.
FIG. 7 is a schematic diagram according to a fifth embodiment of the present application. This embodiment provides an offline voice interaction system; the system 700 includes an offline voice interaction device 701, which may be as shown in FIG. 5 or FIG. 6 and is not described again here. The system 700 may further include a decoder 702, configured to remove the voice signal of the preset duration from the head of the first received voice signal and to perform decoding on the voice signal with that preset duration removed.
In some embodiments, the decoder 702 is further configured to add sequential identifiers to the voice recognition results in sequence.
In this embodiment, after the local terminal wakes up, voice signals are continuously transmitted and processed, and the voice interaction ends only when the user actively issues an end instruction, so continuous recognition after a single wake-up can be supported in an offline voice interaction scenario, improving user experience, avoiding resource waste and improving voice interaction efficiency. Backtracking before the first voice signal to be recognized ensures its integrity and improves the first-recognition success rate. Backtracking forward from the wake-up time point improves the accuracy of the backtracking start point and thereby further ensures the integrity of the first voice signal to be recognized. Determining the wake-up time point from the voice watermark value makes the wake-up time point simple and accurate to determine. Responding to the voice recognition results in order ensures response accuracy and improves user experience. Sequential identifiers sharing the same identifier prefix facilitate uniform identification.
According to embodiments of the present application, the present application also provides an electronic device, a readable storage medium and a computer program product.
FIG. 8 is a block diagram of an electronic device implementing an offline voice interaction method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in FIG. 8, the electronic device includes: one or more processors 801, a memory 802, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as required. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to an interface. In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, each providing part of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 801 is taken as the example in FIG. 8.
Memory 802 is a non-transitory computer readable storage medium provided by the present application. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the offline voice interaction method provided by the application.
The memory 802 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the offline voice interaction method in the embodiment of the present application. The processor 801 executes various functional applications of the server and data processing, i.e., implements the offline voice interaction method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 802.
The memory 802 may include a storage program area and a storage data area: the storage program area may store an operating system and at least one application program required for functionality; the storage data area may store data created according to the use of the electronic device of the offline voice interaction method, and the like. In addition, the memory 802 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 802 may optionally include memory located remotely from the processor 801, connected over a network to the electronic device performing the offline voice interaction method. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device performing the offline voice interaction method may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803 and the output device 804 may be connected by a bus or in other ways; connection by a bus is taken as the example in FIG. 8.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device performing the offline voice interaction method, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball or a joystick. The output device 804 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service expansibility found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It can be understood that although the present application is directed to an offline voice interaction system, it is not excluded that the terminal on which the offline voice interaction system is deployed has networking capability. For example, when the system is deployed on a mobile phone, the terminal may be in an offline state only under certain conditions, such as inside a vehicle where the network signal is poor; it need not remain offline at all times, and may regain networking capability when the network signal is good. The present application is directed to an offline voice interaction scheme for a terminal (such as a mobile phone) while it is in an offline state (for example, in a vehicle without network signal).
According to the technical scheme provided by the embodiments of the present application, the voice signal is continuously transmitted and processed after the local terminal is woken up, and the voice interaction ends only when the user actively ends it, so that continuous recognition after a single wake-up can be supported in an offline voice interaction scenario, which improves the user experience, avoids resource waste, and improves voice interaction efficiency. By backtracking to a point before the first voice signal to be recognized, the integrity of that first signal can be ensured and the success rate of the first recognition improved. By tracing back from the wake-up time point as a reference, the accuracy of the backtracking start point can be improved, further ensuring the integrity of the first voice signal to be recognized. Determining the wake-up time point based on a voice watermark value allows the wake-up time point to be determined simply, conveniently, and accurately. By responding to the voice recognition results in order, response accuracy can be ensured and the user experience improved. Giving the sequential identifiers the same identifier prefix facilitates unified identification.
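
By way of non-limiting illustration only, the one-wake-up continuous recognition flow summarized above might be sketched in Python as follows; every name here (mic, decoder, responder and their methods) is a hypothetical placeholder rather than a component disclosed by the application:

# Illustrative sketch: continuous offline recognition after a single wake-up.
def interaction_loop(mic, decoder, responder):
    """After one wake-up, keep decoding and responding until the user
    actively issues an end instruction."""
    for utterance in mic.utterances():        # endpointed speech segments
        result = decoder.decode(utterance)    # fully local, offline decoding
        if responder.is_end_instruction(result):
            break                             # user actively ended the session
        responder.respond(result)             # respond, then keep listening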
It should be appreciated that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired results of the disclosed embodiments can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (17)

1. An offline voice interaction method, comprising:
after the local terminal wakes up, continuously transmitting a voice signal to be recognized sent by a user to a decoder in the local terminal, so that the decoder continuously decodes the voice signal to be recognized to obtain a voice recognition result;
continuously receiving the voice recognition result sent by the decoder, and continuously responding to the voice recognition result until receiving an ending instruction sent by the user;
after receiving the ending instruction, ending the continuous interaction;
the wake-up is determined according to a wake-up voice signal sent by the user, the voice signal to be recognized includes a first voice signal to be recognized and a non-first voice signal to be recognized, and the continuous transmission of the voice signal to be recognized sent by the user to a decoder in the local terminal includes:
determining a backtracking start point in the wake-up voice signal, determining a backtracking voice signal according to the backtracking start point and the first voice signal to be recognized, and transmitting the backtracking voice signal to the decoder in the local terminal; and
continuously acquiring a start point and a tail point of each non-first voice signal to be recognized, and continuously transmitting the non-first voice signal to be recognized between the start point and the tail point to the decoder in the local terminal.
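
By way of non-limiting illustration of this claim (and not as a disclosed implementation), the transmission step might be sketched as follows; audio_buffer, vad, and decoder are assumed components, and start/tail values are sample offsets into the buffered audio:

# Hypothetical sketch: the first utterance is fed to the decoder together
# with backtracked audio from inside the wake-up signal; every later
# utterance is streamed between its own start and tail points.
def transmit_utterances(audio_buffer, vad, decoder, backtrack_start):
    first = True
    while True:
        start, tail = vad.next_utterance()    # start/tail points of speech
        if tail is None:                      # no further speech: stop
            break
        if first:
            # Backtracked signal: from the backtracking start point in the
            # wake-up audio up to the tail of the first utterance.
            decoder.feed(audio_buffer[backtrack_start:tail])
            first = False
        else:
            decoder.feed(audio_buffer[start:tail])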
2. The method of claim 1, wherein the determining a backtracking start point in the wake-up voice signal comprises:
determining a wake-up time point corresponding to the wake-up in the wake-up voice signal;
and determining, as the backtracking start point, the point obtained by tracing back a preset duration from the wake-up time point.
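
A minimal sketch of this step, assuming time points measured in milliseconds and a hypothetical preset backtracking duration (the concrete value is a tuning choice, not taken from the application):

BACKTRACK_MS = 500  # assumed preset duration

def backtracking_start_point(wake_time_ms: int) -> int:
    # Trace back a preset duration from the wake-up time point,
    # clamping at the beginning of the buffered audio.
    return max(0, wake_time_ms - BACKTRACK_MS)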
3. The method of claim 2, wherein the determining a wake-up time point corresponding to the wake-up comprises:
receiving a wake-up identifier, wherein the wake-up identifier comprises: a voice watermark value;
and determining the tail point of the voice frame where the voice watermark corresponding to the voice watermark value is located as a wake-up time point.
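
A sketch of this step under two stated assumptions: frames have a fixed length, and the voice watermark value indexes the frame carrying the watermark:

FRAME_MS = 10  # hypothetical fixed frame length

def wake_time_from_watermark(watermark_value: int) -> int:
    # Wake-up time point = tail point of the watermarked frame.
    return (watermark_value + 1) * FRAME_MS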
4. The method of claim 1, wherein the determining the backtracking voice signal according to the backtracking start point and the first voice signal to be recognized comprises:
acquiring the tail point of the first voice signal to be recognized;
and determining the voice signal between the backtracking start point and the tail point of the first voice signal to be recognized as the backtracking voice signal.
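
A sketch of this step, assuming 16 kHz 16-bit mono PCM so that one millisecond maps to a fixed number of bytes of buffered audio:

BYTES_PER_MS = 32  # 16 kHz * 2 bytes per sample (assumed format)

def backtracked_signal(buffer: bytes, start_ms: int, first_tail_ms: int) -> bytes:
    # The backtracking voice signal spans from the backtracking start
    # point to the tail point of the first voice signal to be recognized.
    return buffer[start_ms * BYTES_PER_MS : first_tail_ms * BYTES_PER_MS]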
5. The method of any one of claims 1-4, wherein the voice recognition result includes a sequential identifier, and the continuously responding to the voice recognition result includes:
responding to the voice recognition results in order according to the sequential identifiers.
6. The method of claim 5, wherein the sequential identifiers have the same identifier prefix.
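
A self-contained sketch of claims 5 and 6: each result carries a sequential identifier sharing one prefix (the session id "sess42" is purely illustrative), and the terminal responds strictly in identifier order:

results = [
    {"seq_id": "sess42-0002", "text": "play some music"},
    {"seq_id": "sess42-0001", "text": "open the map"},
]
for result in sorted(results, key=lambda r: r["seq_id"]):
    # Zero-padded suffixes make lexicographic order match decoding order.
    print("responding to:", result["text"])  # stand-in for the real response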
7. An offline voice interaction device, comprising:
the transmission unit is used for continuously transmitting the voice signal to be recognized sent by the user to a decoder in the local terminal after the local terminal is awakened, so that the decoder continuously decodes the voice signal to be recognized to obtain a voice recognition result;
the response unit is used for continuously receiving the voice recognition result sent by the decoder and continuously responding to the voice recognition result until receiving an ending instruction sent by the user;
the ending unit is used for ending the continuous interaction after receiving the ending instruction;
the wake-up is determined according to a wake-up voice signal sent by the user, the voice signal to be recognized comprises a first voice signal to be recognized and a non-first voice signal to be recognized, and the transmission unit comprises:
the first transmission module is used for determining a backtracking start point in the wake-up voice signal, determining a backtracking voice signal according to the backtracking start point and the first voice signal to be recognized, and transmitting the backtracking voice signal to the decoder in the local terminal; and
the second transmission module is used for continuously acquiring the start point and the tail point of each non-first voice signal to be recognized, and continuously transmitting the non-first voice signal to be recognized between the start point and the tail point to the decoder in the local terminal.
8. The apparatus of claim 7, wherein the first transmission module is specifically configured to:
determining a wake-up time point corresponding to the wake-up in the wake-up voice signal;
and determining, as the backtracking start point, the point obtained by tracing back a preset duration from the wake-up time point.
9. The apparatus of claim 8, wherein the first transmission module is further specifically configured to:
receiving a wake-up identifier, wherein the wake-up identifier comprises: a voice watermark value;
and determining the tail point of the voice frame where the voice watermark corresponding to the voice watermark value is located as a wake-up time point.
10. The apparatus of claim 7, wherein the first transmission module is specifically configured to:
acquiring the tail point of the first voice signal to be recognized;
and determining the voice signal between the backtracking start point and the tail point of the first voice signal to be recognized as the backtracking voice signal.
11. The apparatus according to any one of claims 7-10, wherein the voice recognition result includes a sequential identifier, and the response unit is specifically configured to:
respond to the voice recognition results in order according to the sequential identifiers.
12. The apparatus of claim 11, wherein the sequential identifiers have the same identifier prefix.
13. An offline voice interaction system, comprising:
the device of any one of claims 7-12.
14. The system of claim 13, further comprising:
a decoder, configured to remove a voice signal of a preset duration from the first received voice signal, and to decode the voice signal remaining after the removal.
15. The system of claim 14, wherein the decoder is further configured to:
add sequential identifiers, in order, to the voice recognition results.
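
A combined sketch of claims 14 and 15, under the stated assumptions that the removed preset-duration segment is taken from the head of the first signal, that audio is 16 kHz 16-bit mono PCM, and that recognize is a hypothetical offline recognizer callback:

TRIM_MS = 500       # assumed preset removal duration
BYTES_PER_MS = 32   # assumed audio format

class SketchDecoder:
    def __init__(self, recognize):
        self.recognize = recognize
        self.first = True
        self.seq = 0

    def decode(self, signal: bytes) -> dict:
        if self.first:
            # Remove a voice signal of a preset duration from the first
            # received voice signal before decoding (claim 14).
            signal = signal[TRIM_MS * BYTES_PER_MS:]
            self.first = False
        self.seq += 1  # sequential identifiers added in order (claim 15)
        return {"seq_id": f"sess42-{self.seq:04d}",
                "text": self.recognize(signal)}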
16. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
17. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202011411215.4A 2020-12-03 2020-12-03 Offline voice interaction method, device, system, equipment and storage medium Active CN112466304B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011411215.4A CN112466304B (en) 2020-12-03 2020-12-03 Offline voice interaction method, device, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112466304A CN112466304A (en) 2021-03-09
CN112466304B true CN112466304B (en) 2023-09-08

Family

ID=74805547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011411215.4A Active CN112466304B (en) 2020-12-03 2020-12-03 Offline voice interaction method, device, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112466304B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA1229924A (en) * 1985-01-17 1987-12-01 John W. Klovstad Speech recognition activation and deactivation method
CN103035243A (en) * 2012-12-18 2013-04-10 中国科学院自动化研究所 Real-time feedback method and system of long voice continuous recognition and recognition result
CN103531201A (en) * 2013-09-29 2014-01-22 上海云视科技有限公司 Terminal voice naming awakening method and system
CN105869637A (en) * 2016-05-26 2016-08-17 百度在线网络技术(北京)有限公司 Voice wake-up method and device
CN107564518A (en) * 2017-08-21 2018-01-09 百度在线网络技术(北京)有限公司 Smart machine control method, device and computer equipment
CN107832434A (en) * 2017-11-15 2018-03-23 百度在线网络技术(北京)有限公司 Method and apparatus based on interactive voice generation multimedia play list
CN109994106A (en) * 2017-12-29 2019-07-09 阿里巴巴集团控股有限公司 A kind of method of speech processing and equipment
CN109102806A (en) * 2018-09-29 2018-12-28 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and computer readable storage medium for interactive voice
CN109461456A (en) * 2018-12-03 2019-03-12 北京云知声信息技术有限公司 A method of it promoting voice and wakes up success rate
US10573312B1 (en) * 2018-12-04 2020-02-25 Sorenson Ip Holdings, Llc Transcription generation from multiple speech recognition systems
JP2020091405A (en) * 2018-12-06 2020-06-11 アルパイン株式会社 Speech analysis system and speech analysis method
CN109741753A (en) * 2019-01-11 2019-05-10 百度在线网络技术(北京)有限公司 A kind of voice interactive method, device, terminal and server
CN110047481A (en) * 2019-04-23 2019-07-23 百度在线网络技术(北京)有限公司 Method for voice recognition and device
CN110689887A (en) * 2019-09-24 2020-01-14 Oppo广东移动通信有限公司 Audio verification method and device, storage medium and electronic equipment
CN110808029A (en) * 2019-11-20 2020-02-18 斑马网络技术有限公司 Vehicle-mounted machine voice test system and method
CN111968642A (en) * 2020-08-27 2020-11-20 北京百度网讯科技有限公司 Voice data processing method and device and intelligent vehicle

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on automatic alignment of long audio speech and text for Mongolian; Niu Mijia; Feilong; Gao Guanglai; Journal of Chinese Information Processing (01); full text *

Similar Documents

Publication Publication Date Title
CN111192591B (en) Awakening method and device of intelligent equipment, intelligent sound box and storage medium
CN112669831B (en) Voice recognition control method and device, electronic equipment and readable storage medium
CN105122353A (en) Natural human-computer interaction for virtual personal assistant systems
CN112382285B (en) Voice control method, voice control device, electronic equipment and storage medium
WO2016078214A1 (en) Terminal processing method, device and computer storage medium
CN111968642A (en) Voice data processing method and device and intelligent vehicle
CN110675873B (en) Data processing method, device and equipment of intelligent equipment and storage medium
CN111755002B (en) Speech recognition device, electronic apparatus, and speech recognition method
CN112530419B (en) Speech recognition control method, device, electronic equipment and readable storage medium
CN111640426A (en) Method and apparatus for outputting information
CN106228047B (en) A kind of application icon processing method and terminal device
CN111862975A (en) Intelligent terminal control method, device, equipment, storage medium and system
CN113671846B (en) Intelligent device control method and device, wearable device and storage medium
CN111192590A (en) Voice wake-up method, device, equipment and storage medium
CN112581945A (en) Voice control method and device, electronic equipment and readable storage medium
CN112652304B (en) Voice interaction method and device of intelligent equipment and electronic equipment
CN111554298B (en) Voice interaction method, voice interaction equipment and electronic equipment
CN113157240A (en) Voice processing method, device, equipment, storage medium and computer program product
CN112466304B (en) Offline voice interaction method, device, system, equipment and storage medium
CN111627441B (en) Control method, device, equipment and storage medium of electronic equipment
CN112071323B (en) Method and device for acquiring false wake-up sample data and electronic equipment
CN111045641B (en) Electronic terminal and voice recognition method
CN112233681A (en) Method and device for determining mistakenly awakened corpus, electronic equipment and storage medium
CN111652344A (en) Method and apparatus for presenting information
CN112581969A (en) Voice control method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant