CN112927685A

CN112927685A - Dynamic voice recognition method and device

Info

Publication number: CN112927685A
Application number: CN201911242880.2A
Authority: CN
Inventors: 王美华; 陈庆隆
Original assignee: Realtek Semiconductor Corp
Current assignee: Realtek Semiconductor Corp
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2021-06-08

Abstract

The invention provides a dynamic voice recognition method and device. The dynamic voice recognition method comprises executing a first stage: sound data is detected by a digital microphone and stored in a first memory; a human voice is detected in the sound data to generate a human voice detection signal; and a first processing circuit selectively determines to execute a second stage or a third stage according to a total effective data amount, a transmission bit rate of the digital microphone, and a recognition interval time. When the second stage is executed, the first processing circuit outputs a first command to a second processing circuit, and the second processing circuit, according to the first command, makes a memory access circuit transfer the sound data to a second memory, where it is stored as voice data. When the third stage is executed, the first processing circuit outputs a second command to the second processing circuit; the second processing circuit, according to the second command, makes the memory access circuit transfer the sound data to the second memory, where it is stored as voice data, and the second processing circuit confirms whether the voice data matches a preset voice command.

Description

Dynamic voice recognition method and device

Technical Field

The present invention relates to a speech recognition technology, and more particularly, to a dynamic speech recognition method and apparatus.

Background

In conventional electronic devices, voice assistant technology is widely applied in various fields and supports a voice wake-up function. When the voice assistant is in standby mode, it still needs to listen for a hot word and respond when the hot word appears, so the voice assistant must wake up periodically. In standby mode, the processing system of the voice assistant is started to detect whether a human voice is present by using a voice activity detection circuit. When a human voice appears, the system further performs voice recognition to confirm whether the voice contains the hot word, and then determines whether to execute system startup of the electronic device or to perform a corresponding operation.

However, waking up the voice assistant periodically at a fixed frequency for detection results in poor sensitivity. At the same time, the processing system of the voice assistant must also operate at low power to meet the relevant energy specifications.

Disclosure of Invention

In view of the above, the present invention provides a dynamic speech recognition method, which includes executing a first stage: detecting sound data by using a digital microphone and storing the sound data in a first memory; detecting a human voice in the sound data to generate a human voice detection signal; and selectively determining, by a first processing circuit, to execute a second stage or a third stage according to a total effective data amount, a transmission bit rate of the digital microphone and a recognition interval time. When the second stage is executed, the first processing circuit outputs a first command to a second processing circuit, and the second processing circuit, according to the first command, makes a memory access circuit transfer the sound data to a second memory and store the sound data as voice data. When the third stage is executed, the first processing circuit outputs a second command to the second processing circuit, the second processing circuit, according to the second command, makes the memory access circuit transfer the sound data to the second memory and store the sound data as voice data, and the second processing circuit confirms whether the voice data in the second memory matches a preset voice command.

The invention further provides a dynamic voice recognition device, which comprises a digital microphone, a first memory, a voice activity detection circuit, a memory access circuit, a second memory, a first processing circuit and a second processing circuit. The digital microphone is used for detecting sound data. The first memory is electrically connected to the digital microphone for storing the sound data. The voice activity detection circuit is electrically connected to the digital microphone and is used for detecting the sound data and generating a human voice detection signal. The memory access circuit is electrically connected to the first memory and is used for transferring the sound data to the second memory according to a first command, so as to store the sound data as voice data. The first processing circuit is electrically connected to the voice activity detection circuit. The second processing circuit is electrically connected to the first processing circuit, the second memory and the memory access circuit. The dynamic voice recognition device is used for executing the dynamic voice recognition method described above.

According to some embodiments, when the first processing circuit receives the human voice detection signal, the first processing circuit outputs the first command or the second command after the recognition interval time.

According to some embodiments, the recognition interval time is determined by a budget relation value. When the budget relation value is less than or equal to target average power consumption × previous period time × 1/3, the recognition interval time is 2 seconds; when the budget relation value is greater than target average power consumption × previous period time × 1/3 and less than or equal to target average power consumption × previous period time × 2/3, the recognition interval time is 1.5 seconds; and when the budget relation value is greater than target average power consumption × previous period time × 2/3, the recognition interval time is 1 second.

According to some embodiments, the budget relation value is target average power consumption × previous period time − (first average power consumption of the first stage × first time of the first stage + second average power consumption of the second stage × second time of the second stage + third average power consumption of the third stage × third time of the third stage), wherein the previous period time is equal to the sum of the first time, the second time and the third time.

According to some embodiments, the third average power consumption is greater than the second average power consumption, and the second average power consumption is greater than the first average power consumption.

According to some embodiments, after the human voice detection signal is generated, the first processing circuit determines whether the first memory is full of the sound data, and proceeds to the next step when the first memory is full of the sound data.

In summary, the present invention takes the user experience into account when performing dynamic speech recognition, and can reduce the average power consumption when a search for the preset voice command (hot word) is triggered in standby mode, thereby providing a method with better sensitivity.

Drawings

The above and other objects, features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.

FIG. 1 is a block diagram of an electronic device according to an embodiment of the invention.

FIG. 2 is a flowchart illustrating a dynamic speech recognition method according to an embodiment of the invention.

FIG. 3 is a waveform diagram of a dynamic speech recognition device according to an embodiment of the present invention.

FIG. 4 is a flowchart illustrating a dynamic speech recognition method according to another embodiment of the invention.

Description of reference numerals:

10 electronic device

20 dynamic voice recognition device

21 digital microphone

22 first memory

23 voice activity detection circuit

24 memory access circuit

25 first processing circuit

26 second processing circuit

27 second memory

30 audio/video processing circuit

31-33 core processing circuits

34-36 third memories

C1 first command

C2 second command

SD1 sound data

SD2 voice data

SS human voice detection signal

ST1 first stage

ST2 second stage

ST3 third stage

T period time

T1-T2 time

Ti recognition interval time

S10-S28 steps

S30-S36 steps

Detailed Description

FIG. 1 is a block diagram of an electronic device according to an embodiment of the invention. Referring to FIG. 1, the electronic device 10 includes a dynamic speech recognition device 20, an audio/video processing circuit 30, a plurality of core processing circuits 31-33 and a plurality of third memories 34-36, and the core processing circuits 31-33 are electrically connected to the third memories 34-36. When the dynamic speech recognition device 20 recognizes the preset voice command in standby mode, the electronic device 10 executes a system boot program, so that the audio/video processing circuit 30, the core processing circuits 31-33, and the third memories 34-36 can cooperate with each other to play the audio/video signal received by the electronic device 10. In one embodiment, the electronic device 10 may be a television, but is not limited thereto.

The dynamic speech recognition device 20 includes a digital microphone 21, a first memory 22, a voice activity detection circuit 23, a memory access circuit 24, a first processing circuit 25, a second processing circuit 26, and a second memory 27. The digital microphone 21 is used to detect sound data SD1. The first memory 22 is electrically connected to the digital microphone 21 for storing the sound data SD1. In one embodiment, the first memory 22 may be, but is not limited to, a Static Random Access Memory (SRAM).

The voice activity detection circuit 23 is electrically connected to the digital microphone 21 for detecting the sound data SD1 and generating a human voice detection signal SS. In one embodiment, the voice activity detection circuit 23 may be, but is not limited to, a voice recognition chip or a voice recognition processing circuit.

The memory access circuit 24 is electrically connected to the first memory 22 and the second memory 27, and is used for transferring the sound data SD1 to the second memory 27 according to a first command, so as to store the sound data SD1 as voice data SD2. In one embodiment, the memory access circuit 24 may be, but is not limited to, a Direct Memory Access (DMA) circuit, and the second memory 27 may be, but is not limited to, a Dynamic Random Access Memory (DRAM).

The first processing circuit 25 is electrically connected to the voice activity detection circuit 23, and is configured to generate a first command C1 or a second command C2 according to the human voice detection signal SS. The second processing circuit 26 is electrically connected to the first processing circuit 25, the second memory 27 and the memory access circuit 24. According to the first command C1, the second processing circuit 26 makes the memory access circuit 24 transfer the sound data SD1 to the second memory 27 and store it as the voice data SD2. Alternatively, according to the second command C2, the second processing circuit 26 makes the memory access circuit 24 transfer the sound data SD1 to the second memory 27 and store it as the voice data SD2, and confirms whether the voice data SD2 in the second memory 27 matches a preset voice command. In one embodiment, the first processing circuit 25 may be a low-power microcontroller, such as an 8051 microcontroller, but the invention is not limited thereto. The second processing circuit 26 may be a general microprocessor, a microcontroller, a central processing unit, or another type of processing circuit, but the invention is not limited thereto.

In one embodiment, the first command C1 or the second command C2 is a command for modifying a common state.

FIG. 2 is a flowchart illustrating a dynamic speech recognition method according to an embodiment of the invention, and FIG. 3 is a waveform diagram of a dynamic speech recognition device according to an embodiment of the invention. Referring to FIG. 1, FIG. 2 and FIG. 3, the dynamic speech recognition method uses the dynamic speech recognition device 20 to execute a first stage ST1 (steps S10-S18 and step S22) and then a second stage ST2 (step S20) or a third stage ST3 (steps S24-S26). Each stage is described in detail below.

In executing the first stage ST1 (pure standby stage), as shown in step S10, the sound data SD1 is detected by the digital microphone 21 and stored in the first memory 22. In step S12, the voice activity detection circuit 23 detects whether there is a human voice in the sound data SD1; when a human voice is detected in the sound data SD1, the circuit is triggered to generate the human voice detection signal SS and transmits it to the first processing circuit 25. In step S14, the first processing circuit 25 determines whether the first memory 22 is full of the sound data SD1, and proceeds to step S16 when it is full, so as to ensure that enough sound data SD1 is available for the following steps. In step S16, the first processing circuit 25 selectively determines to execute the second stage ST2 (DMA stage) or the third stage ST3 (voice recognition stage) according to a total effective data amount, the transmission bit rate of the digital microphone 21 and a recognition interval time Ti.

In one embodiment, the target average power consumption, the first average power consumption of the first stage ST1, the second average power consumption of the second stage ST2, and the third average power consumption of the third stage ST3 are known. The time occupied by each stage in the previous period time T is also obtained, including the first time Ta of the first stage ST1, the second time Tb of the second stage ST2, and the third time Tc of the third stage ST3, wherein the previous period time T is equal to the sum of the first time Ta, the second time Tb, and the third time Tc, i.e., T = Ta + Tb + Tc. In one embodiment, the period time T may be, but is not limited to, 16 seconds. Thus, a budget relation value (Budget) relating to power usage can be obtained from the above parameters: budget relation value = target average power consumption × previous period time T − (first average power consumption of the first stage ST1 × first time Ta + second average power consumption of the second stage ST2 × second time Tb + third average power consumption of the third stage ST3 × third time Tc).
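The budget relation value can be illustrated with a minimal Python sketch. The function name `budget_value`, the 1.0 W target, the 2.0 W second-stage figure and the 14 s / 1 s / 1 s split of the 16-second period are assumptions chosen only for illustration; they are not values given by the description.

```python
def budget_value(target_avg_power, stage_powers, stage_times):
    """Return target power x period minus the energy actually used.

    stage_powers: average power of ST1, ST2, ST3 in watts.
    stage_times:  time Ta, Tb, Tc spent in ST1, ST2, ST3 in seconds.
    The previous period time T is taken as Ta + Tb + Tc.
    """
    period = sum(stage_times)
    consumed = sum(p * t for p, t in zip(stage_powers, stage_times))
    return target_avg_power * period - consumed


# Hypothetical figures: 0.5 W, 2.0 W and 4.0 W for the three stages.
budget = budget_value(1.0, (0.5, 2.0, 4.0), (14.0, 1.0, 1.0))
print(budget)  # 16.0 - (7.0 + 2.0 + 4.0) = 3.0
```

A larger result means that less of the energy allowance for the previous period was used.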

After the budget relation value is obtained, the recognition interval time Ti can be dynamically determined according to the budget relation value. Specifically, when the budget relation value is less than or equal to target average power consumption × previous period time T × 1/3, the recognition interval time Ti is determined to be 2 seconds. When the budget relation value is greater than target average power consumption × previous period time T × 1/3 and less than or equal to target average power consumption × previous period time T × 2/3, the recognition interval time Ti is determined to be 1.5 seconds. When the budget relation value is greater than target average power consumption × previous period time T × 2/3, the recognition interval time Ti is determined to be 1 second. The total effective data amount is the sum of the effective data amount of the first memory 22 and the effective data amount of the second memory 27, and the transmission bit rate of the digital microphone 21 is known. When the total effective data amount is smaller than the product of the transmission bit rate of the digital microphone 21 and the recognition interval time Ti, the first processing circuit 25 determines to execute the second stage ST2 (DMA stage). When the total effective data amount is greater than or equal to the product of the transmission bit rate of the digital microphone 21 and the recognition interval time Ti, the first processing circuit 25 determines to execute the third stage ST3 (voice recognition stage).
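The two decisions described above, choosing Ti from the budget relation value and then choosing between the second and third stage, can be sketched as follows. The function names and the 256,000 bit/s microphone rate are hypothetical and serve only as an example; the thresholds and interval values are the ones stated in the text.

```python
def recognition_interval(budget, target_avg_power, period):
    """Pick the recognition interval time Ti (seconds) from the budget value."""
    full = target_avg_power * period
    if budget <= full / 3:
        return 2.0   # little budget left: wait longest before acting
    if budget <= full * 2 / 3:
        return 1.5
    return 1.0       # ample budget: shortest interval


def choose_stage(total_effective_bits, mic_bit_rate, interval):
    """Return 'ST3' (voice recognition) once enough data has accumulated,
    otherwise 'ST2' (DMA transfer only)."""
    if total_effective_bits >= mic_bit_rate * interval:
        return "ST3"
    return "ST2"


ti = recognition_interval(budget=3.0, target_avg_power=1.0, period=16.0)
print(ti)                                   # 2.0
print(choose_stage(400_000, 256_000, ti))   # ST2 (400000 < 512000)
print(choose_stage(512_000, 256_000, ti))   # ST3 (512000 >= 512000)
```

In this sketch the effective data of the first and second memories is assumed to have been summed by the caller before `choose_stage` is invoked.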

When the first processing circuit 25 determines to execute the second stage ST2, as shown in step S18, the first processing circuit 25 first wakes up the second processing circuit 26 and then enters the second stage ST2. In the second stage ST2, as shown in step S20, the first processing circuit 25 outputs a first command C1 to the second processing circuit 26, and the second processing circuit 26, according to the first command C1, makes the memory access circuit 24 transfer the sound data SD1 in the first memory 22 to the second memory 27 for storage as the voice data SD2. In the second stage ST2, the sound data SD1 is only transferred to the second memory 27 through the memory access circuit 24 and stored as the voice data SD2; no voice recognition is performed.

When the first processing circuit 25 determines to execute the third stage ST3, as shown in step S22, the first processing circuit 25 first wakes up the second processing circuit 26 and then enters the third stage ST3. In the third stage ST3, as shown in step S24, the first processing circuit 25 outputs a second command C2 to the second processing circuit 26, and the second processing circuit 26, according to the second command C2, makes the memory access circuit 24 transfer the sound data SD1 in the first memory 22 to the second memory 27 for storage as the voice data SD2. In step S26, the second processing circuit 26 determines whether the voice data SD2 in the second memory 27 matches the preset voice command. If the voice data SD2 matches the preset voice command, the system boot program is executed in step S28 to wake up the other circuits, including the audio/video processing circuit 30, the core processing circuits 31-33, and the third memories 34-36, to perform system boot.
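The difference between the two commands can be sketched in Python as below. This is only an illustrative model: the memories are plain lists, the function `handle_command` is hypothetical, and `matches_preset` is a placeholder supplied by the caller, because the description does not specify how the voice data is compared with the preset voice command.

```python
from typing import Callable, List


def handle_command(command: str,
                   first_memory: List[bytes],
                   second_memory: List[bytes],
                   matches_preset: Callable[[bytes], bool]) -> bool:
    """Model of the second processing circuit reacting to C1 or C2.

    Returns True when the system boot program should be executed.
    """
    if command not in ("C1", "C2"):
        raise ValueError(f"unknown command: {command}")
    # Both commands move the sound data SD1 into the second memory as SD2.
    second_memory.extend(first_memory)
    first_memory.clear()
    if command == "C1":
        return False  # second stage: transfer only, no recognition
    # Third stage: also check the accumulated voice data against the preset.
    return matches_preset(b"".join(second_memory))


# Toy matcher for demonstration only (not a real recognizer).
toy_matcher = lambda data: b"hi-tv" in data
sram, dram = [b"hi"], []
print(handle_command("C1", sram, dram, toy_matcher))  # False
sram.append(b"-tv")
print(handle_command("C2", sram, dram, toy_matcher))  # True
```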

FIG. 4 is a flowchart illustrating a dynamic speech recognition method according to another embodiment of the invention. Referring to FIG. 1, FIG. 3 and FIG. 4, the dynamic speech recognition method uses the dynamic speech recognition device 20 to execute a first stage ST1 (steps S10-S16) and then a second stage ST2 (step S30) or a third stage ST3 (steps S32-S34), each of which is described in detail below.

In executing the first stage ST1 (pure standby stage), as shown in step S10, the sound data SD1 is detected by the digital microphone 21 and stored in the first memory 22. In step S12, the voice activity detection circuit 23 detects whether there is a human voice in the sound data SD1; when a human voice is detected, the circuit is triggered to generate the human voice detection signal SS and transmits it to the first processing circuit 25. In step S14, the first processing circuit 25 determines whether the first memory 22 is full of the sound data SD1, and proceeds to step S16 when it is full, so as to ensure that enough sound data SD1 is available for the following steps. In step S16, the first processing circuit 25 selectively determines to execute the second stage ST2 (DMA stage) or the third stage ST3 (voice recognition stage) according to a total effective data amount, the transmission bit rate of the digital microphone 21 and a recognition interval time Ti.

When the first processing circuit 25 determines to execute the second stage ST2, as shown in step S30, the first processing circuit 25 outputs a first command C1 to wake up the second processing circuit 26, and the second processing circuit 26, according to the first command C1, makes the memory access circuit 24 transfer the sound data SD1 in the first memory 22 to the second memory 27 for storage as the voice data SD2.

When the first processing circuit 25 determines to execute the third stage ST3, as shown in step S32, the first processing circuit 25 outputs a second command C2 to wake up the second processing circuit 26, and the second processing circuit 26, according to the second command C2, makes the memory access circuit 24 transfer the sound data SD1 in the first memory 22 to the second memory 27 for storage as the voice data SD2. In step S34, the second processing circuit 26 determines whether the voice data SD2 in the second memory 27 matches the preset voice command. If the voice data SD2 matches the preset voice command, a system boot program is executed in step S28 to wake up all circuits for system boot.

The steps (S10-S26 and S30-S34) of the dynamic speech recognition method are merely examples and are not limited to the above-described sequential execution. Various operations under the dynamic speech recognition method may be added, substituted, omitted, or performed in a different order as appropriate without departing from the spirit and scope of the invention.

In one embodiment, when the first processing circuit 25 receives the human voice detection signal SS, the first processing circuit 25 outputs the first command C1 or the second command C2 after the recognition interval time Ti. As shown in FIG. 1 and FIG. 3, when the first processing circuit 25 receives the human voice detection signal SS at time T1, the first processing circuit 25 outputs the first command C1 or the second command C2 at time T2, after the recognition interval time Ti has elapsed. The recognition interval time Ti can be dynamically determined in the manner described above, so that the received sound data SD1 sufficiently reflects the preset voice command before the second processing circuit 26 and the second memory 27 are enabled. In this way, low-power operation can be achieved to meet the relevant energy specifications.

In one embodiment, assume that the keyword set by the preset voice command is "Hi, TV". Referring to FIG. 1 and FIG. 3, at time T1 the digital microphone 21 detects an external sound and generates the sound data SD1, and the first memory 22 stores the sound data SD1; for example, the digital microphone 21 detects that the user utters a voice command such as "Hi, TV …" to the dynamic speech recognition device 20. Meanwhile, the voice activity detection circuit 23 determines that the sound data SD1 contains a human voice and outputs the human voice detection signal SS. At time T2, the first processing circuit 25 outputs either the first command C1 or the second command C2. The second processing circuit 26 and the second memory 27 are also enabled, and the second processing circuit 26 then enables the memory access circuit 24 according to the first command C1 or the second command C2 to transfer the sound data SD1 to the second memory 27 and store it as the voice data SD2. The second processing circuit 26 can therefore analyze the voice data SD2 to determine whether it matches the preset voice command "Hi, TV", and, when the second processing circuit 26 determines that the voice data SD2 matches the preset voice command, wake up the other circuits to execute the system boot program.

In one embodiment, the first stage ST1 uses the digital microphone 21, the first memory 22, the voice activity detection circuit 23 and the first processing circuit 25 in the dynamic speech recognition device 20. The second stage ST2 uses the digital microphone 21, the first memory 22, the voice activity detection circuit 23, the memory access circuit 24, the first processing circuit 25, a portion of the second processing circuit 26 (only a portion of the functions of the second memory are enabled), and the second memory 27 in the dynamic speech recognition device 20. The third stage ST3 uses all the circuits in the dynamic speech recognition device 20, namely the digital microphone 21, the first memory 22, the voice activity detection circuit 23, the memory access circuit 24, the first processing circuit 25, the second processing circuit 26, and the second memory 27. Therefore, the third average power consumption of the third stage ST3 is greater than the second average power consumption of the second stage ST2, and the second average power consumption is greater than the first average power consumption of the first stage ST1. For example, the power consumption of the first stage ST1 is about 0.5 watts, the power consumption of the third stage ST3 is about 4 watts, and the power consumption of the second stage ST2 lies between the two.
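The per-stage circuit usage listed above can be written down as a small table to make the nesting explicit. The sketch below is illustrative only: the three activity levels and all identifiers are assumptions, and the check shows only that each stage enables at least what the previous stage does, which is consistent with (but does not by itself prove) the stated power ordering.

```python
# Activity level per component: 0 = off, 1 = partly enabled, 2 = fully enabled.
OFF, PARTIAL, FULL = 0, 1, 2

COMPONENTS = (
    "digital_microphone_21", "first_memory_22", "vad_circuit_23",
    "memory_access_circuit_24", "first_processing_circuit_25",
    "second_processing_circuit_26", "second_memory_27",
)

# First stage: microphone, first memory, VAD and first processing circuit.
ST1 = dict.fromkeys(COMPONENTS, OFF)
ST1.update(digital_microphone_21=FULL, first_memory_22=FULL,
           vad_circuit_23=FULL, first_processing_circuit_25=FULL)

# Second stage: adds DMA, second memory and part of the second processor.
ST2 = dict(ST1, memory_access_circuit_24=FULL,
           second_processing_circuit_26=PARTIAL, second_memory_27=FULL)

# Third stage: everything fully enabled.
ST3 = dict(ST2, second_processing_circuit_26=FULL)


def uses_at_least(a, b):
    """True if stage a enables every component at least as much as stage b."""
    return all(a[c] >= b[c] for c in COMPONENTS)


print(uses_at_least(ST2, ST1), uses_at_least(ST3, ST2))  # True True
print(uses_at_least(ST1, ST2))                           # False
```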

Therefore, the present invention can determine the budget relation value according to the time occupied by each stage in the previous period time T (the first time, the second time and the third time) and the average power consumption of each stage, dynamically determine the length of the recognition interval time Ti according to the budget relation value, and accordingly determine whether to perform voice recognition on the voice data (that is, whether to execute the second stage ST2 or the third stage ST3). Voice recognition can thus be performed dynamically according to the power consumption of actual operation. In this way, the invention takes the user experience into account when performing dynamic speech recognition, and can reduce the average power consumption when a search for the preset voice command is triggered in standby mode, thereby providing a method with better sensitivity.

The embodiments described above merely illustrate the technical ideas and features of the present invention so that those skilled in the art can understand and implement it; the present invention is not limited to these embodiments.

Claims

1. A dynamic speech recognition method, comprising:

executing a first stage:

detecting a sound data by a digital microphone and storing the sound data in a first memory;

detecting a human voice in the sound data to generate a human voice detection signal; and

selectively determining to execute a second stage or a third stage by a first processing circuit according to a total effective data amount, a transmission bit rate of the digital microphone and a recognition interval time;

executing the second stage:

the first processing circuit outputs a first command to a second processing circuit, and the second processing circuit enables a memory access circuit to transfer the sound data to a second memory according to the first command and store the sound data as voice data; and

executing the third stage:

the first processing circuit outputs a second command to the second processing circuit, the second processing circuit enables the memory access circuit to transfer the sound data to the second memory according to the second command and store the sound data as the voice data, and the second processing circuit confirms whether the voice data in the second memory matches a preset voice command.

2. The dynamic speech recognition method of claim 1, wherein the first processing circuit determines to execute the second stage when the total effective data amount is less than a product of the transmission bit rate of the digital microphone and the recognition interval time; and when the total effective data amount is greater than or equal to the product of the transmission bit rate of the digital microphone and the recognition interval time, the first processing circuit determines to execute the third stage, wherein the total effective data amount is the sum of the effective data amount of the first memory and the effective data amount of the second memory.

3. The dynamic speech recognition method of claim 2, wherein when the first processing circuit receives the human voice detection signal, the first processing circuit outputs the first command or the second command after the recognition interval.

4. The method according to claim 3, wherein the recognition interval time is determined by a budget relation value; the recognition interval time is 2 seconds when the budget relation value is less than or equal to target average power consumption × previous period time × 1/3; the recognition interval time is 1.5 seconds when the budget relation value is greater than the target average power consumption × the previous period time × 1/3 and less than or equal to the target average power consumption × the previous period time × 2/3; and the recognition interval time is 1 second when the budget relation value is greater than the target average power consumption × the previous period time × 2/3.

5. The method of claim 4, wherein the budget relation value is the target average power consumption × the previous period time − (a first average power consumption of the first stage × a first time of the first stage + a second average power consumption of the second stage × a second time of the second stage + a third average power consumption of the third stage × a third time of the third stage), and wherein the previous period time is equal to a sum of the first time, the second time and the third time.

6. The method of claim 5, wherein the third average power consumption is greater than the second average power consumption, and the second average power consumption is greater than the first average power consumption.

7. The method of claim 1, further comprising, after the step of generating the human voice detection signal: judging whether the first memory is full of the sound data, and proceeding to the next step when the first memory is full of the sound data.

8. The method of claim 1, wherein the step of selectively determining whether to perform the second stage or the third stage in performing the first stage further comprises: the first processing circuit wakes up the second processing circuit.

9. The method of claim 1, wherein the first processing circuit wakes up the second processing circuit when the first processing circuit outputs the first command or the second command.

10. A dynamic speech recognition device, comprising:

a digital microphone for detecting a sound data;

a first memory electrically connected to the digital microphone for storing the sound data;

a voice activity detection circuit electrically connected to the digital microphone for detecting the sound data and generating a human voice detection signal;

a memory access circuit electrically connected to the first memory, the memory access circuit transferring the sound data to a second memory for storage as a voice data;

a first processing circuit electrically connected to the voice activity detection circuit; and

a second processing circuit electrically connected to the first processing circuit, the second memory and the memory access circuit;

wherein, the dynamic voice recognition device is used for executing the following steps:

executing a first stage:

detecting the sound data by using the digital microphone and storing the sound data in the first memory;

the voice activity detection circuit detects a human voice in the sound data to generate the human voice detection signal; and

selectively determining to execute a second stage or a third stage according to a total effective data amount, a transmission bit rate of the digital microphone and a recognition interval time by the first processing circuit;

executing the second stage:

the first processing circuit outputs a first command to the second processing circuit, and the second processing circuit makes the memory access circuit transfer the sound data to the second memory and store the sound data as the voice data according to the first command; and

executing the third stage: