CN115881114A - Voice recognition method, device, storage medium and electronic device - Google Patents

Voice recognition method, device, storage medium and electronic device

Info

Publication number
CN115881114A
Authority
CN
China
Prior art keywords
audio
voice
determining
target
compensation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111137440.8A
Other languages
Chinese (zh)
Inventor
王伟龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Technology Co Ltd, Haier Smart Home Co Ltd filed Critical Qingdao Haier Technology Co Ltd
Priority to CN202111137440.8A priority Critical patent/CN115881114A/en
Publication of CN115881114A publication Critical patent/CN115881114A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Abstract

Embodiments of the invention provide a voice recognition method and device, a storage medium, and an electronic device. The method includes: collecting a first voice audio and a second voice audio within a preset time period after the first voice audio; recognizing the first voice audio to obtain a recognition result; in a case where the recognition result indicates that the first voice audio includes a target wake-up word, determining a compensation audio based on the collection time of the target wake-up word; and determining an audio to be recognized based on the compensation audio and the second voice audio, and sending the audio to be recognized to a target server to instruct the target server to recognize an instruction included in the audio to be recognized. The invention solves the problem in the related art that a user has to wait for a certain time after waking up a voice device before speaking a voice instruction, which degrades the user experience; a wake-up-without-waiting effect is achieved and the user experience is improved.

Description

Speech recognition method, speech recognition device, storage medium and electronic device
Technical Field
Embodiments of the invention relate to the field of communications, and in particular to a voice recognition method, a voice recognition device, a storage medium, and an electronic device.
Background
As intelligent voice technology has matured, more and more household devices incorporate it: to play music, check the weather, or control a home appliance, a user only needs to speak an instruction to the appliance, without touching the device or taking out a mobile phone.
In the related art, a voice device interacts in a 'wake once, ask once, answer once' manner. For example, the user wakes the intelligent device by calling 'Xiaoyou, Xiaoyou'; the device acknowledges the wake-up by voice or light and enters a wake-up mode; the user then asks 'How is the weather today?', and the device answers 'Today's weather is clear…'. However, the wake-up engine needs computation time: only after it detects the attenuation point of the sound energy does it submit the preceding audio segment for scoring against the acoustic features of the local wake-up model, and starting the player's power amplifier takes a further 100 ms or more, so the user has to pause briefly after the wake-up word before speaking the voice instruction. If the user speaks the instruction immediately, without waiting for the wake-up response, the first half of the instruction cannot be recognized. For example, if the user says 'Xiaoyou, Xiaoyou, turn off the living-room air conditioner' in one breath, the recognition result may be only 'living-room air conditioner', so the device does not know whether to turn the air conditioner on or off, and the experience is poor.
Therefore, in the related art, the user has to wait for a certain time after waking up the voice device before speaking a voice instruction, which results in a poor user experience.
No effective solution to this problem has yet been proposed.
Disclosure of Invention
Embodiments of the invention provide a voice recognition method and device, a storage medium, and an electronic device, so as to at least solve the problem in the related art that a user has to wait for a certain time after waking up a voice device before speaking a voice instruction, which results in a poor user experience.
According to an embodiment of the present invention, there is provided a speech recognition method including: collecting a first voice audio and a second voice audio within a preset time period after the first voice audio; recognizing the first voice audio to obtain a recognition result; in a case where the recognition result indicates that the first voice audio includes a target wake-up word, determining a compensation audio based on the collection time of the target wake-up word; and determining an audio to be recognized based on the compensation audio and the second voice audio, and sending the audio to be recognized to a target server to instruct the target server to recognize an instruction included in the audio to be recognized.
According to another embodiment of the present invention, there is provided a voice recognition apparatus including: an acquisition module configured to collect a first voice audio and a second voice audio within a preset time period after the first voice audio; a first recognition module configured to recognize the first voice audio to obtain a recognition result; a determining module configured to determine a compensation audio based on the collection time of a target wake-up word in a case where the recognition result indicates that the first voice audio includes the target wake-up word; and a second recognition module configured to determine an audio to be recognized based on the compensation audio and the second voice audio and send the audio to be recognized to a target server, so as to instruct the target server to recognize an instruction included in the audio to be recognized.
According to yet another embodiment of the invention, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program, when executed by a processor, performs the steps of the method as set forth in any one of the above.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
With the above method and apparatus, a first voice audio and a second voice audio within a preset time period after the first voice audio are collected; the first voice audio is recognized to obtain a recognition result; in a case where the recognition result indicates that the first voice audio includes a target wake-up word, a compensation audio is determined based on the collection time of the target wake-up word; an audio to be recognized is then determined based on the compensation audio and the second voice audio and sent to a target server, which recognizes the instruction contained in it. Because the audio to be recognized, built from the compensation audio, can be sent to the target server directly whenever the first voice audio contains the target wake-up word, the user does not have to repeat the compensated part of the command after the device has woken up. This solves the problem in the related art that the user has to wait for a certain time after waking up the voice device before speaking a voice instruction, achieves wake-up without waiting, and improves the user experience.
Drawings
Fig. 1 is a block diagram of a hardware configuration of a mobile terminal of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a speech recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of determining compensated audio according to an exemplary embodiment of the present invention;
FIG. 4 is a flow diagram of a method of speech recognition according to an embodiment of the present invention;
fig. 5 is a block diagram of a structure of a voice recognition apparatus according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
To make interaction between the user and an intelligent household appliance smoother and more natural, the voice instruction should be spoken together with the wake-up word, for example the user says 'Xiaoyou, Xiaoyou, turn on the air conditioner' in one breath. The related art generally adopts one of the following schemes:
in the first scheme, the computing performance of hardware is improved by adopting a better CPU, so that the computing time of the wake-up engine is shortened.
And in the second scheme, a better calculation path is found by optimizing the algorithm of the wake engine.
In the third scheme, wake-up is declared as soon as the first half of the wake-up word is heard. This is in fact a trade-off: the original wake-up word 'Xiaoyou, Xiaoyou' is effectively shortened so that 'Xiaoyou' alone triggers the device.
However, the above solution has the following disadvantages:
the first scheme is as follows: the hardware cost is improved and is not consistent with the cost reduction of enterprises.
Scheme II: the method has the advantages that the awakening engine algorithm is optimized, the research and development resource investment is huge, the research and development period is long, the awakening rate and the false awakening rate are influenced, and most importantly, the scheme has a bottleneck. The input-output ratio is not high.
The third scheme is as follows: this is an ingenious, which will increase the probability of false wake-up. For example, the 'Xiaoyouxiaomei' should not be woken up, but the scheme is woken up because the 'Xiaoyouxiaomei' is matched. Not only can the user be disturbed by false wake-up, but also the privacy of the user can be leaked.
In view of the above problems in the related art, the following embodiments are proposed:
the method embodiments provided in the embodiments of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking the example of the operation on the mobile terminal, fig. 1 is a hardware structure block diagram of the mobile terminal of a speech recognition method according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal may include one or more processors 102 (only one is shown in fig. 1) (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA, etc.) and a memory 104 for storing data, wherein the mobile terminal may further include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the speech recognition method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet via wireless.
In the present embodiment, a speech recognition method is provided. Fig. 2 is a flowchart of a speech recognition method according to an embodiment of the present invention; as shown in fig. 2, the flow includes the following steps:
step S202, collecting a first voice audio and a second voice audio in a preset time period after the first voice audio;
step S204, recognizing the first voice audio to obtain a recognition result;
step S206, under the condition that the recognition result indicates that the first voice audio comprises a target awakening word, determining a compensation audio based on the acquisition time of the target awakening word;
step S208, determining an audio to be recognized based on the compensation audio and the second voice audio, and sending the audio to be recognized to a target server to instruct the target server to recognize an instruction included in the audio to be recognized.
In the above embodiment, the first voice audio and the second voice audio may be audio collected by a sound collection device, such as a microphone integrated in the voice device. The voice device may be a smart speaker, a smart home device, or any device with a voice interaction function, such as a mobile phone, a tablet computer, or a smart wearable device with voice interaction.
In the above embodiment, when the user wants to issue an instruction through the voice device, the user may directly speak the wake-up word together with the voice containing the instruction, such as 'Xiaoyou, Xiaoyou, turn on the air conditioner'. After the sound collection device captures the voice, a processor included in the voice device recognizes it and obtains a recognition result. Whether the voice includes the target wake-up word can be recognized locally on the voice device, so the collected voice does not need to be uploaded to the target server in real time; the compensation audio is uploaded to the target server only after the target wake-up word is recognized, which protects the user's privacy.
In the above embodiment, in a case where the recognition result indicates that the voice includes the target wake-up word, the compensation audio may be determined according to the collection time of the target wake-up word. The target wake-up word may be, for example, 'Xiaoyou, Xiaoyou'. The target wake-up word also supports user customization: the user may issue a wake-up-word change instruction to the voice device to replace the system's default wake-up word with a user-defined one.
In the above embodiment, the compensation audio may be acquired after the target wake-up word, and the second acquisition period for acquiring the compensation audio is adjacent to the first acquisition period for acquiring the target wake-up word. For example, the target wake-up word may be speech captured between 1 minute 40 seconds and 1 minute 42 seconds, and the compensation audio may be speech captured between 1 minute 42 seconds and 1 minute 44 seconds.
In the above embodiment, the audio to be recognized includes the compensation audio and other audio collected after the compensation audio, with the compensation audio placed before the other audio. When the user says 'Xiaoyou, Xiaoyou, turn on the air conditioner', the sound collection device captures the utterance and recognizes it; once 'Xiaoyou, Xiaoyou' is detected, the recognition result is determined to include the target wake-up word. Between the moment the device starts recognizing and the moment the recognition result is obtained, the user may already have spoken part of the voice instruction, such as 'turn on'. If collection of the voice instruction only started after the recognition result were obtained, only 'the air conditioner' might be captured. Therefore, the already-collected 'turn on' can be used as the compensation audio and 'the air conditioner' as the other audio; at upload time the compensation audio is prepended to the other audio to form the audio to be recognized, which is uploaded to the target server. The voice instruction thus does not need to be repeated after the voice device responds, which improves the user experience. That is, after the compensation audio is determined, it is prepended to the other audio to obtain the audio to be recognized, which is sent to the target server; on receiving it, the target server analyzes it and recognizes the instruction it contains, such as 'turn on the air conditioner'. The other audio may be audio determined from the second voice audio.
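The prepend-and-upload step described above can be sketched as follows (a minimal sketch in Python, assuming raw 16-bit PCM held as bytes; the function names, the HTTP transport, and the raw-PCM payload are illustrative assumptions, not details specified by the patent):

```python
import urllib.request


def build_audio_to_recognize(compensation_audio: bytes, second_voice_audio: bytes) -> bytes:
    """Prepend the buffered compensation audio (e.g. 'turn on') to the audio
    collected after the wake decision (e.g. 'the air conditioner'), so that the
    server receives the complete command."""
    return compensation_audio + second_voice_audio


def upload_to_server(audio_to_recognize: bytes, url: str) -> None:
    """Send the combined audio to the target server as a raw PCM payload.
    The endpoint and payload format are assumptions for illustration."""
    request = urllib.request.Request(
        url,
        data=audio_to_recognize,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        response.read()  # the server's recognition result / control instruction
```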
In the above embodiment, the user can speak the wake-up word and the voice instruction in succession, instead of first speaking the wake-up word and only speaking the voice instruction after the voice device has woken up, which reduces the waiting time and improves the user experience.
Optionally, the main body of the above steps may be a voice device, a background processor, or other devices with similar processing capabilities, and may also be a machine integrated with at least a sound collection device and a data processing device, where the sound collection device may include a sound collection module such as a microphone, and the data processing device may include a terminal such as a computer and a mobile phone, but is not limited thereto.
With the above method and apparatus, a first voice audio and a second voice audio within a preset time period after the first voice audio are collected; the first voice audio is recognized to obtain a recognition result; in a case where the recognition result indicates that the first voice audio includes a target wake-up word, a compensation audio is determined based on the collection time of the target wake-up word; an audio to be recognized is then determined based on the compensation audio and the second voice audio and sent to a target server, which recognizes the instruction contained in it. Because the audio to be recognized, built from the compensation audio, can be sent to the target server directly whenever the first voice audio contains the target wake-up word, the user does not have to repeat the compensated part of the command after the device has woken up. This solves the problem in the related art that the user has to wait for a certain time after waking up the voice device before speaking a voice instruction, achieves wake-up without waiting, and improves the user experience.
In one exemplary embodiment, the method further comprises: determining a sound characteristic of the target wake-up word and the collection time of the target wake-up word based on the recognition result; wherein determining the compensation audio based on the collection time of the target wake-up word comprises: determining a collection end time of the target wake-up word based on the sound characteristic and the collection time; determining a result report time at which the recognition result is obtained; and determining the audio collected in the period from the end time to the result report time as the compensation audio. In this embodiment, the sound characteristic of the target wake-up word (for example, its waveform) and its collection time may be determined by the wake-up engine from the recognition result, and the collection start time and collection end time of the target wake-up word are derived from them. While the voice is being recognized, the result report time of the recognition result can also be determined. The period from the end time to the result report time is taken as the period over which the compensation audio was collected, and the audio collected in that period is the compensation audio. The compensation audio may be contained in the audio collected by the sound collection device: for example, if the voice device is playing audio during the period from the end time to the result report time, the collected audio contains both the device's own playback and the compensation audio. Therefore, after the compensation audio is collected, it can be preprocessed to remove the audio played by the device itself.
In the above embodiment, the period from the end time to the result report time need not be of fixed length; that is, the compensation audio is not audio of a fixed duration, because the duration of each wake-up is not fixed. The wake-up duration is affected by the user's speaking rate, environmental noise, distributed wake-up, and other factors, so it varies, and the difference can be large. If too little is compensated, characters are still lost; if too much is compensated, the wake-up word 'Xiaoyou' may be compensated in, causing recognition errors and harming the interaction. A compensation calculation module can therefore be introduced: using the acoustic information obtained by the wake-up engine, it accurately obtains the start time of the wake-up word, the end time of the wake-up word (i.e., the collection end time), and the time at which the wake-up decision is reported (i.e., the result report time). The length of audio to be compensated is then obtained simply by subtracting the end time of the wake-up word from the report time of the wake-up decision. Because the compensation algorithm computes dynamically, from the wake-up event itself, the exact interval between the end of the wake-up word and the reporting of the wake-up event, the length of audio to be compensated can be calculated accurately. A schematic diagram of determining the compensation audio is shown in fig. 3.
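A minimal sketch of this dynamic compensation-length calculation (assuming timestamps in seconds, 16 kHz 16-bit mono PCM, and a cache exposed as a bytes object; all names are hypothetical):

```python
def compensation_samples(wake_word_end_time: float,
                         result_report_time: float,
                         sample_rate: int = 16000) -> int:
    """Dynamically compute how many samples to compensate: the interval between
    the end of the wake-up word and the moment the wake-up decision is reported,
    converted to a sample count."""
    gap_seconds = max(0.0, result_report_time - wake_word_end_time)
    return int(gap_seconds * sample_rate)


def extract_compensation_audio(cached_audio: bytes, n_samples: int,
                               bytes_per_sample: int = 2) -> bytes:
    """Take the most recent n_samples from the cached recording as the
    compensation audio (16-bit mono assumed)."""
    n_bytes = n_samples * bytes_per_sample
    return cached_audio[-n_bytes:] if n_bytes > 0 else b""
```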
In one exemplary embodiment, the method further comprises: caching the collected voice audio to obtain a cached audio; wherein caching the collected voice audio to obtain the cached audio comprises at least one of: caching the collected first voice audio and second voice audio in real time and determining them as the cached audio; and determining the voice audio cached in real time before the recognition result is obtained as the cached audio. In this embodiment, the sound collection device may cache the audio in real time while collecting it, including caching the collected first voice audio and second voice audio in real time.
In the above embodiment, a recording cache may be added to the voice device without changing the existing hardware resources or the wake-up algorithm. After a wake-up occurs, the data in the recording cache is compensated to the head of the instruction-bearing audio in the second voice audio, so that no part of the instruction to be recognized is discarded. If no wake-up occurs, the data in the recording cache is never uploaded to the cloud, so the user's privacy is not leaked. By caching the recording and compensating the cached audio into the audio to be recognized after wake-up, wake-up-without-waiting interaction is achieved.
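One way such a recording cache could be realized is a fixed-capacity FIFO of PCM frames; this is a sketch under the assumption of 16 kHz 16-bit mono audio and a few seconds of capacity, not the patent's concrete implementation:

```python
from collections import deque


class RecordingCache:
    """Fixed-capacity recording cache: continuously stores the newest audio
    frames so that, on wake-up, the segment after the wake-up word can be
    recovered; if no wake-up occurs, nothing is ever uploaded."""

    def __init__(self, capacity_seconds: float = 3.0,
                 sample_rate: int = 16000, bytes_per_sample: int = 2):
        self._max_bytes = int(capacity_seconds * sample_rate * bytes_per_sample)
        self._frames = deque()
        self._size = 0

    def append(self, frame: bytes) -> None:
        self._frames.append(frame)
        self._size += len(frame)
        # Drop the oldest frames once the capacity is exceeded.
        while self._size > self._max_bytes and self._frames:
            self._size -= len(self._frames.popleft())

    def snapshot(self) -> bytes:
        return b"".join(self._frames)

    def clear(self) -> None:
        # Called when no wake-up word is detected or after the upload is done.
        self._frames.clear()
        self._size = 0
```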
In one exemplary embodiment, the method further comprises: deleting the cached audio in a case where the recognition result indicates that the first voice audio does not include the target wake-up word; and determining the compensation audio from the cached audio in a case where the recognition result indicates that the first voice audio includes the target wake-up word. In this embodiment, deleting the cached audio when no target wake-up word is present saves storage space, while determining the compensation audio from the cached audio when the target wake-up word is present prevents the situation where a complete compensation audio cannot be obtained because the audio was not cached.
In one exemplary embodiment, determining the audio to be recognized based on the compensation audio and the second voice audio comprises: identifying a sound characteristic of the second voice audio; in a case where it is determined, based on the sound characteristic of the second voice audio, that the second voice audio includes a voice instruction, determining the sub-audio of the second voice audio that includes the voice instruction, and determining the compensation audio and the sub-audio as the audio to be recognized; and in a case where it is determined, based on the sound characteristic of the second voice audio, that the second voice audio includes no voice instruction, determining the compensation audio as the audio to be recognized. In this embodiment, the second voice audio is the audio collected after the target wake-up word. When determining the audio to be recognized, the device checks whether the second voice audio contains a sub-audio carrying a voice instruction; if it does, the sub-audio is combined with the compensation audio to obtain the audio to be recognized, and if it does not, the compensation audio alone is determined as the audio to be recognized.
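The patent does not say how the sound characteristic of the second voice audio is evaluated; as an illustration only, the sketch below substitutes a crude energy threshold for that check and then applies the selection rule described above (16-bit mono PCM of even byte length assumed; names are hypothetical):

```python
import array


def contains_speech(pcm: bytes, threshold: int = 500) -> bool:
    """Very rough stand-in for the 'sound characteristic' check: treat the
    segment as containing a spoken command if its mean absolute amplitude
    exceeds a threshold (16-bit mono PCM assumed, len(pcm) must be even)."""
    samples = array.array("h", pcm)
    if not samples:
        return False
    return sum(abs(s) for s in samples) / len(samples) > threshold


def select_audio_to_recognize(compensation_audio: bytes, second_voice_audio: bytes) -> bytes:
    # If the second voice audio carries a command, send compensation + sub-audio;
    # otherwise the compensation audio alone is sent for recognition.
    if contains_speech(second_voice_audio):
        return compensation_audio + second_voice_audio
    return compensation_audio
```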
In one exemplary embodiment, the method further comprises: preprocessing the cached audio to obtain a preprocessed cached audio, the preprocessing comprising at least one of echo cancellation and automatic gain control; wherein determining the compensation audio from the cached audio comprises determining the compensation audio from the preprocessed cached audio.
In the above embodiment, because the voice device may be playing sound while it collects the voice, the collected voice can be preprocessed to cancel the sound played by the voice device itself and to enhance the voice spoken by the user. The preprocessing may include one or both of echo cancellation and automatic gain control.
In the above embodiment, either all the voice collected by the sound collection device or only the voice collected after the target wake-up word, i.e., the audio to be recognized, may be preprocessed. Because the audio to be recognized is collected after the target wake-up word and is the voice that contains the voice instruction, preprocessing only the audio to be recognized does not affect recognition of the voice instruction, while saving the processing resources of the voice device and increasing the processing speed.
In the above embodiment, the cached audio may first undergo echo cancellation, which removes the sound the voice device is playing, noise around the device, and the like; the echo cancellation may be performed by an AEC module. After echo cancellation, automatic gain control may be applied to the echo-cancelled audio to enhance the voice spoken by the user; the enhancement may be performed by an AGC module. The cached audio data has thus already been processed by the AEC and AGC modules and the device's own playback has been filtered out, which prevents the wake-up response prompt from being compensated into the command.
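Real AEC and AGC are adaptive DSP components; the sketch below only mirrors the data flow described above (microphone signal minus the device's own playback reference, followed by gain normalization) and should not be read as a faithful echo canceller or gain controller:

```python
import array


def naive_echo_cancel(mic: bytes, playback_ref: bytes) -> bytes:
    """Placeholder for the AEC module: subtract the device's own playback
    reference from the microphone signal sample by sample (the output is
    truncated to the shorter of the two signals). Real echo cancellation
    uses adaptive filtering; this only illustrates the data flow."""
    m = array.array("h", mic)
    r = array.array("h", playback_ref[: len(mic)])
    out = array.array("h", (max(-32768, min(32767, a - b)) for a, b in zip(m, r)))
    return out.tobytes()


def simple_agc(pcm: bytes, target_peak: int = 20000) -> bytes:
    """Placeholder for the AGC module: scale the signal so its peak approaches
    a target level, i.e. 'enhance the voice spoken by the user'."""
    samples = array.array("h", pcm)
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return pcm
    gain = target_peak / peak
    scaled = array.array("h", (max(-32768, min(32767, int(s * gain))) for s in samples))
    return scaled.tobytes()


def preprocess(buffered_audio: bytes, playback_ref: bytes) -> bytes:
    # AEC first, then AGC, matching the order described in the embodiment.
    return simple_agc(naive_echo_cancel(buffered_audio, playback_ref))
```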
In an exemplary embodiment, after the audio to be recognized is sent to the target server, the method further comprises: receiving a control instruction sent by the target server, the control instruction being generated according to the audio to be recognized; and executing the target operation indicated by the control instruction and deleting the cached audio. In this embodiment, after receiving the audio to be recognized, the target server recognizes it, determines a speech recognition result, determines a control instruction according to that result, and sends the control instruction. After the voice device receives the control instruction, it determines the target operation corresponding to the instruction and executes it; it can also delete the locally cached audio, so as to save storage space.
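A sketch of the device-side handling after upload; the dictionary-shaped instruction and the execute_target_operation hook are assumptions, since the patent only states that the indicated operation is executed and the cache deleted:

```python
def execute_target_operation(operation: str) -> None:
    """Stand-in for real device control (e.g. switching the air conditioner)."""
    print(f"executing: {operation}")


def handle_control_instruction(instruction: dict, cache) -> None:
    """After the audio to be recognized is uploaded, the target server returns
    a control instruction; execute the indicated operation and delete the
    local cache. The instruction format here is assumed, not defined by the
    patent; `cache` is expected to provide a clear() method."""
    operation = instruction.get("operation")  # e.g. "turn_on_air_conditioner"
    execute_target_operation(operation)
    cache.clear()  # free local storage once the interaction is complete
```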
The following describes the speech recognition method with reference to the specific embodiments:
fig. 4 is a flowchart of a speech recognition method according to an embodiment of the present invention, as shown in fig. 4, the method includes:
in step S402, a microphone (corresponding to the sound collection apparatus described above) collects sound.
In step S404, AEC (echo cancellation) processing is performed on the picked-up audio.
In step S406, AGC (automatic gain control) processing is performed on the AEC-processed speech.
In step S408, the wakeup decision is performed, and if the determination result is yes, step S410 is performed, and if the determination result is no, step S402 is performed.
In step S410, the compensation module performs sound compensation, prepending the recording B (corresponding to the compensation audio) to the recording A (corresponding to the other audio) to obtain a recording C (corresponding to the audio to be recognized).
Step S412, the recording C is uploaded to the server.
In step S414, the recording C is recognized through ASR (automatic speech recognition).
In this embodiment, the audio cache is used and the cached data is taken out after wake-up, achieving wake-up without waiting: 'wake-up + instruction' is completed in a single sentence, and the instruction does not have to wait for the wake-up response.
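Tying the steps of fig. 4 together, a client-side loop might look as follows; the mic and wake_engine objects are assumed interfaces (frame reads, a playback reference, and a wake decision carrying the two timestamps), and the helpers reuse the sketches given earlier:

```python
def client_loop(mic, wake_engine, server_url: str) -> None:
    """Sketch of the flow in fig. 4: capture -> AEC -> AGC -> wake decision ->
    compensate -> upload. `mic` is assumed to yield PCM frames and expose a
    playback reference; `wake_engine` is assumed to return None until a wake-up
    is decided, then an object with wake_word_end_time and result_report_time."""
    cache = RecordingCache()
    while True:
        frame = mic.read_frame()
        frame = preprocess(frame, mic.playback_reference())   # AEC + AGC (S404/S406)
        cache.append(frame)                                    # real-time caching
        decision = wake_engine.feed(frame)                     # wake decision (S408)
        if decision is None:
            continue
        n = compensation_samples(decision.wake_word_end_time,
                                 decision.result_report_time)
        recording_b = extract_compensation_audio(cache.snapshot(), n)  # compensation audio
        recording_a = mic.read_until_silence()                 # audio after the wake decision
        recording_c = select_audio_to_recognize(recording_b, recording_a)
        upload_to_server(recording_c, server_url)              # S412; ASR runs server-side (S414)
        cache.clear()
```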
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method according to the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a speech recognition apparatus is further provided, and the speech recognition apparatus is used to implement the foregoing embodiments and preferred embodiments, which have already been described and are not described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 5 is a block diagram showing the structure of a speech recognition apparatus according to an embodiment of the present invention, as shown in fig. 5, the apparatus including:
a collecting module 52, configured to collect a first voice audio and a second voice audio within a preset time period after the first voice audio;
The first recognition module 54 is configured to recognize the first voice audio to obtain a recognition result;
a determining module 56, configured to determine a compensation audio based on a collection time of a target wake-up word if the recognition result indicates that the first voice audio includes the target wake-up word;
a second recognition module 58, configured to determine, based on the compensation audio and the second speech audio, an audio to be recognized, and send the audio to be recognized to a target server, so as to instruct the target server to recognize an instruction included in the audio to be recognized.
In the above embodiment, the first voice audio and the second voice audio may be audio collected by a sound collection device, such as a microphone integrated in the voice device. The voice device may be a smart speaker, a smart home device, or any device with a voice interaction function, such as a mobile phone, a tablet computer, or a smart wearable device with voice interaction.
In the above embodiment, when the user wants to issue an instruction through the voice device, the user may directly speak the wake-up word together with the voice containing the instruction, such as 'Xiaoyou, Xiaoyou, turn on the air conditioner'. After the sound collection device captures the voice, a processor included in the voice device recognizes it and obtains a recognition result. Whether the voice includes the target wake-up word can be recognized locally on the voice device, so the collected voice does not need to be uploaded to the target server in real time; the compensation audio is uploaded to the target server only after the target wake-up word is recognized, which protects the user's privacy.
In the above embodiment, in a case where the recognition result indicates that the voice includes the target wake-up word, the compensation audio may be determined according to the collection time of the target wake-up word. The target wake-up word may be, for example, 'Xiaoyou, Xiaoyou'. The target wake-up word also supports user customization: the user may issue a wake-up-word change instruction to the voice device to replace the system's default wake-up word with a user-defined one.
In the above embodiment, the compensation audio may be acquired after the target wake-up word, and the second acquisition period for acquiring the compensation audio is adjacent to the first acquisition period for acquiring the target wake-up word. For example, the target wake-up word may be speech captured between 1 minute 40 seconds and 1 minute 42 seconds, and the compensation audio may be speech captured between 1 minute 42 seconds and 1 minute 44 seconds.
In the above embodiment, the audio to be recognized includes the compensation audio and other audio collected after the compensation audio, with the compensation audio placed before the other audio. When the user says 'Xiaoyou, Xiaoyou, turn on the air conditioner', the sound collection device captures the utterance and recognizes it; once the target wake-up word 'Xiaoyou, Xiaoyou' is detected, the recognition result is determined to include the target wake-up word. Between the moment the device starts recognizing and the moment the recognition result is obtained, the user may already have spoken part of the voice instruction, such as 'turn on'. If collection of the voice instruction only started after the recognition result were obtained, only 'the air conditioner' might be captured. Therefore, the already-collected 'turn on' can be used as the compensation audio and 'the air conditioner' as the other audio; at upload time the compensation audio is prepended to the other audio to form the audio to be recognized, which is uploaded to the target server. The voice instruction thus does not need to be repeated after the voice device responds, which improves the user experience. That is, after the compensation audio is determined, it is prepended to the other audio to obtain the audio to be recognized, which is sent to the target server; on receiving it, the target server analyzes it and recognizes the instruction it contains, such as 'turn on the air conditioner'. The other audio may be audio determined from the second voice audio.
In the above embodiment, the user can speak the wake-up word and the voice instruction in succession, instead of first speaking the wake-up word and only speaking the voice instruction after the voice device has woken up, which reduces the waiting time and improves the user experience.
With the above method and apparatus, a first voice audio and a second voice audio within a preset time period after the first voice audio are collected; the first voice audio is recognized to obtain a recognition result; in a case where the recognition result indicates that the first voice audio includes a target wake-up word, a compensation audio is determined based on the collection time of the target wake-up word; an audio to be recognized is then determined based on the compensation audio and the second voice audio and sent to a target server, which recognizes the instruction contained in it. Because the audio to be recognized, built from the compensation audio, can be sent to the target server directly whenever the first voice audio contains the target wake-up word, the user does not have to repeat the compensated part of the command after the device has woken up. This solves the problem in the related art that the user has to wait for a certain time after waking up the voice device before speaking a voice instruction, achieves wake-up without waiting, and improves the user experience.
In one exemplary embodiment, the apparatus is further configured to determine a sound characteristic of the target wake-up word and the collection time of the target wake-up word based on the recognition result; wherein determining the compensation audio based on the collection time of the target wake-up word comprises: determining a collection end time of the target wake-up word based on the sound characteristic and the collection time; determining a result report time at which the recognition result is obtained; and determining the audio collected in the period from the end time to the result report time as the compensation audio. In this embodiment, the sound characteristic of the target wake-up word (for example, its waveform) and its collection time may be determined by the wake-up engine from the recognition result, and the collection start time and collection end time of the target wake-up word are derived from them. While the voice is being recognized, the result report time of the recognition result can also be determined. The period from the end time to the result report time is taken as the period over which the compensation audio was collected, and the audio collected in that period is the compensation audio. The compensation audio may be contained in the audio collected by the sound collection device: for example, if the voice device is playing audio during the period from the end time to the result report time, the collected audio contains both the device's own playback and the compensation audio. Therefore, after the compensation audio is collected, it can be preprocessed to remove the audio played by the device itself.
In the above embodiment, the period from the end time to the result report time need not be of fixed length; that is, the compensation audio is not audio of a fixed duration, because the duration of each wake-up is not fixed. The wake-up duration is affected by the user's speaking rate, environmental noise, distributed wake-up, and other factors, so it varies, and the difference can be large. If too little is compensated, characters are still lost; if too much is compensated, the wake-up word 'Xiaoyou' may be compensated in, causing recognition errors and harming the interaction. A compensation calculation module can therefore be introduced: using the acoustic information obtained by the wake-up engine, it accurately obtains the start time of the wake-up word, the end time of the wake-up word (i.e., the collection end time), and the time at which the wake-up decision is reported (i.e., the result report time). The length of audio to be compensated is then obtained simply by subtracting the end time of the wake-up word from the report time of the wake-up decision. Because the compensation algorithm computes dynamically, from the wake-up event itself, the exact interval between the end of the wake-up word and the reporting of the wake-up event, the length of audio to be compensated can be calculated accurately. A schematic diagram of determining the compensation audio is shown in fig. 3.
In an exemplary embodiment, the apparatus may be configured to cache the collected voice audio to obtain a cached audio; caching the collected voice audio to obtain the cached audio may be implemented in at least one of the following ways: caching the collected first voice audio and second voice audio in real time and determining them as the cached audio; and determining the voice audio cached in real time before the recognition result is obtained as the cached audio. In this embodiment, the sound collection device may cache the audio in real time while collecting it, may cache the second voice audio once it is determined that the first voice audio includes the target wake-up word, and may cache the collected first voice audio and second voice audio in real time.
In the above embodiment, a recording cache may be added to the voice device without changing the existing hardware resources or the wake-up algorithm. After a wake-up occurs, the data in the recording cache is compensated to the head of the instruction-bearing audio in the second voice audio, so that no part of the instruction to be recognized is discarded. If no wake-up occurs, the data in the recording cache is never uploaded to the cloud, so the user's privacy is not leaked. By caching the recording and compensating the cached audio into the audio to be recognized after wake-up, wake-up-without-waiting interaction is achieved.
In one exemplary embodiment, the apparatus is further configured to: delete the cached audio in a case where the recognition result indicates that the first voice audio does not include the target wake-up word; and determine the compensation audio from the cached audio in a case where the recognition result indicates that the first voice audio includes the target wake-up word. In this embodiment, deleting the cached audio when no target wake-up word is present saves storage space, while determining the compensation audio from the cached audio when the target wake-up word is present prevents the situation where a complete compensation audio cannot be obtained because the audio was not cached.
In an exemplary embodiment, the second recognition module 58 may determine the audio to be recognized based on the compensation audio and the second voice audio by: identifying a sound characteristic of the second voice audio; in a case where it is determined, based on the sound characteristic of the second voice audio, that the second voice audio includes a voice instruction, determining the sub-audio of the second voice audio that includes the voice instruction, and determining the compensation audio and the sub-audio as the audio to be recognized; and in a case where it is determined, based on the sound characteristic of the second voice audio, that the second voice audio includes no voice instruction, determining the compensation audio as the audio to be recognized. In this embodiment, the second voice audio is the audio collected after the target wake-up word. When determining the audio to be recognized, the module checks whether the second voice audio contains a sub-audio carrying a voice instruction; if it does, the sub-audio is combined with the compensation audio to obtain the audio to be recognized, and if it does not, the compensation audio alone is determined as the audio to be recognized.
In an exemplary embodiment, the apparatus may be further configured to preprocess the cached audio to obtain a preprocessed cached audio, the preprocessing comprising at least one of echo cancellation and automatic gain control; wherein determining the compensation audio from the cached audio comprises determining the compensation audio from the preprocessed cached audio.
In the above embodiment, because the voice device may be playing sound while it collects the voice, the collected voice can be preprocessed to cancel the sound played by the voice device itself and to enhance the voice spoken by the user. The preprocessing may include one or both of echo cancellation and automatic gain control.
In the above embodiment, either all the voice collected by the sound collection device or only the voice collected after the target wake-up word, i.e., the audio to be recognized, may be preprocessed. Because the audio to be recognized is collected after the target wake-up word and is the voice that contains the voice instruction, preprocessing only the audio to be recognized does not affect recognition of the voice instruction, while saving the processing resources of the voice device and increasing the processing speed.
In the above embodiment, the cached audio may first undergo echo cancellation, which removes the sound the voice device is playing, noise around the device, and the like; the echo cancellation may be performed by an AEC module. After echo cancellation, automatic gain control may be applied to the echo-cancelled audio to enhance the voice spoken by the user; the enhancement may be performed by an AGC module. The cached audio data has thus already been processed by the AEC and AGC modules and the device's own playback has been filtered out, which prevents the wake-up response prompt from being compensated into the command.
In an exemplary embodiment, the apparatus may be further configured to receive, after sending the audio to be recognized to the target server, a control instruction sent by the target server, the control instruction being generated according to the audio to be recognized, and to execute the target operation indicated by the control instruction and delete the cached audio. In this embodiment, after receiving the audio to be recognized, the target server recognizes it, determines a speech recognition result, determines a control instruction according to that result, and sends the control instruction. After the voice device receives the control instruction, it determines the target operation corresponding to the instruction and executes it; it can also delete the locally cached audio, so as to save storage space.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are located in different processors in any combination.
Embodiments of the present invention further provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method described in any of the above.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention further provide an electronic device, comprising a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.
It will be apparent to those skilled in the art that the various modules or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device or distributed across a network of computing devices, and they may be implemented using program code executable by the computing devices, such that they may be stored in a memory device and executed by the computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A speech recognition method, comprising:
collecting a first voice audio and a second voice audio within a preset time period after the first voice audio;
recognizing the first voice audio to obtain a recognition result;
in a case where the recognition result indicates that the first voice audio includes a target wake-up word, determining a compensation audio based on the collection time of the target wake-up word;
and determining an audio to be recognized based on the compensation audio and the second voice audio, and sending the audio to be recognized to a target server to instruct the target server to recognize an instruction included in the audio to be recognized.
2. The method of claim 1, further comprising:
determining a sound characteristic of the target wake-up word and the collection time of the target wake-up word based on the recognition result;
wherein determining the compensation audio based on the collection time of the target wake-up word comprises:
determining a collection end time of the target wake-up word based on the sound characteristic and the collection time;
determining a result report time at which the recognition result is obtained;
and determining the audio collected in the period from the end time to the result report time as the compensation audio.
3. The method according to claim 1 or 2, further comprising: caching collected voice audio to obtain a cached audio;
wherein caching the collected voice audio to obtain the cached audio comprises at least one of the following:
caching the collected first voice audio and second voice audio in real time, and determining the first voice audio and the second voice audio as the cached audio;
and determining voice audio cached in real time before the recognition result is obtained as the cached audio.
4. The method of claim 3, further comprising:
deleting the cached audio in a case where the recognition result indicates that the first voice audio does not include the target wake-up word;
determining the compensation audio from the cached audio in a case where the recognition result indicates that the first voice audio includes the target wake-up word.
5. The method of any one of claims 1 to 4, wherein determining the audio to be recognized based on the compensation audio and the second voice audio comprises:
recognizing a sound feature of the second voice audio;
in a case where it is determined, based on the sound feature of the second voice audio, that the second voice audio includes a voice instruction, determining a sub-audio including the voice instruction in the second voice audio, and determining the compensation audio and the sub-audio as the audio to be recognized;
and in a case where it is determined, based on the sound feature of the second voice audio, that the second voice audio does not include a voice instruction, determining the compensation audio as the audio to be recognized.
6. The method according to any one of claims 3 to 5, further comprising:
preprocessing the cached audio to obtain a preprocessed cached audio, wherein the preprocessing comprises at least one of: echo cancellation processing and automatic gain control processing;
wherein determining the compensation audio from the cached audio comprises:
determining the compensation audio from the preprocessed cached audio.
7. The method of any one of claims 3 to 6, wherein after sending the audio to be recognized to the target server, the method further comprises:
receiving a control instruction sent by the target server, wherein the control instruction is generated according to the audio to be recognized;
and executing a target operation indicated by the control instruction, and deleting the cached audio.
8. A speech recognition apparatus, comprising:
an acquisition module, configured to collect a first voice audio and a second voice audio within a preset time period after the first voice audio;
a first recognition module, configured to recognize the first voice audio to obtain a recognition result;
a determining module, configured to determine a compensation audio based on an acquisition time of a target wake-up word in a case where the recognition result indicates that the first voice audio comprises the target wake-up word;
and a second recognition module, configured to determine an audio to be recognized based on the compensation audio and the second voice audio, and send the audio to be recognized to a target server to instruct the target server to recognize an instruction included in the audio to be recognized.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
10. An electronic device comprising a memory and a processor, wherein the memory has a computer program stored therein, and the processor is configured to execute the computer program to perform the method of any of claims 1 to 7.
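Read together, claims 1 to 4 describe a cache-then-compensate flow: audio is cached from the moment collection starts; once the wake-up word is confirmed, the audio cached between the acquisition end time of the wake-up word and the result reporting time becomes the compensation audio, which is combined with the second voice audio before upload; if no wake-up word is found, the cache is simply discarded. The following Python sketch illustrates that flow under the assumptions of a fixed 16 kHz sample rate and an in-memory cache; the keyword-spotting stub, the function names, and the omitted server upload are illustrative only and not part of the claims.

```python
import time

SAMPLE_RATE = 16000  # assumed sampling rate (samples per second)

def spot_wake_word(first_audio: list[float]) -> tuple[bool, float]:
    """Stub for on-device keyword spotting.

    Returns (wake-up word detected, acquisition end time of the wake-up word
    in seconds, relative to the start of the cache).
    """
    return (len(first_audio) > 0, 0.8)

def slice_seconds(cache: list[float], start_s: float, end_s: float) -> list[float]:
    """Return the cached samples collected between start_s and end_s."""
    return cache[int(start_s * SAMPLE_RATE): int(end_s * SAMPLE_RATE)]

def handle_utterance(first_audio: list[float], second_audio: list[float],
                     cache_start: float) -> list[float] | None:
    # Claim 3: cache the collected first and second voice audio in real time.
    cache = first_audio + second_audio
    detected, wake_end_s = spot_wake_word(first_audio)
    # Claim 2: the result reporting time, measured from the start of caching.
    report_s = time.monotonic() - cache_start

    if not detected:
        cache.clear()       # claim 4: discard the cached audio
        return None

    # Claim 2: compensation audio = audio cached between the end of the
    # wake-up word and the moment the recognition result was reported.
    compensation = slice_seconds(cache, wake_end_s, report_s)

    # Claim 1: audio to be recognized = compensation audio + second voice audio
    # (claim 5 would further narrow the second voice audio to the sub-audio
    # that actually contains the voice instruction); the upload to the target
    # server is omitted in this sketch.
    return compensation + second_audio
```

The design point of this flow is that speech uttered between the end of the wake-up word and the reporting of the recognition result is recovered from the cache as compensation audio rather than being lost, so the user does not have to pause before issuing a command.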
CN202111137440.8A 2021-09-27 2021-09-27 Voice recognition method, device, storage medium and electronic device Pending CN115881114A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111137440.8A CN115881114A (en) 2021-09-27 2021-09-27 Voice recognition method, device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111137440.8A CN115881114A (en) 2021-09-27 2021-09-27 Voice recognition method, device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN115881114A true CN115881114A (en) 2023-03-31

Family

ID=85763061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111137440.8A Pending CN115881114A (en) 2021-09-27 2021-09-27 Voice recognition method, device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN115881114A (en)

Similar Documents

Publication Publication Date Title
CN111223497B (en) Nearby wake-up method and device for terminal, computing equipment and storage medium
CN106463112B (en) Voice recognition method, voice awakening device, voice recognition device and terminal
CN109410952B (en) Voice awakening method, device and system
CN109326289A (en) Exempt to wake up voice interactive method, device, equipment and storage medium
WO2020228270A1 (en) Speech processing method and device, computer device and storage medium
WO2019227580A1 (en) Voice recognition method, apparatus, computer device, and storage medium
CN110970016B (en) Awakening model generation method, intelligent terminal awakening method and device
CN104078045B (en) Identifying method and electronic device
WO2019233228A1 (en) Electronic device and device control method
CN109036393A (en) Wake-up word training method, device and the household appliance of household appliance
CN111145763A (en) GRU-based voice recognition method and system in audio
CN111083678A (en) Playing control method and system of Bluetooth sound box and intelligent device
CN111223489B (en) Specific keyword identification method and system based on Attention mechanism
CN114791771A (en) Interaction management system and method for intelligent voice mouse
CN112017663A (en) Voice generalization method and device and computer storage medium
CN109377993A (en) Intelligent voice system and its voice awakening method and intelligent sound equipment
CN111654782B (en) Intelligent sound box and signal processing method
CN111862965A (en) Awakening processing method and device, intelligent sound box and electronic equipment
CN112233676A (en) Intelligent device awakening method and device, electronic device and storage medium
CN110197663B (en) Control method and device and electronic equipment
CN112420043A (en) Intelligent awakening method and device based on voice, electronic equipment and storage medium
CN115881114A (en) Voice recognition method, device, storage medium and electronic device
CN108399918B (en) Intelligent device connection method, intelligent device and terminal
CN111464644B (en) Data transmission method and electronic equipment
CN112837694B (en) Equipment awakening method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination