CN111414071B - Processing system and voice detection method - Google Patents

Processing system and voice detection method

Info

Publication number
CN111414071B
Authority
CN
China
Prior art keywords
processing circuit
voice
instruction
processing system
command
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910011390.5A
Other languages
Chinese (zh)
Other versions
CN111414071A (en)
Inventor
陈庆隆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Realtek Semiconductor Corp
Original Assignee
Realtek Semiconductor Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Realtek Semiconductor Corp filed Critical Realtek Semiconductor Corp
Priority to CN201910011390.5A
Publication of CN111414071A
Application granted
Publication of CN111414071B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3206 Monitoring of events, devices or parameters that trigger a change in power modality
    • G06F1/3231 Monitoring the presence, absence or movement of users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234 Power saving characterised by the action undertaken
    • G06F1/3243 Power saving in microcontroller unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234 Power saving characterised by the action undertaken
    • G06F1/3287 Power saving characterised by the action undertaken by switching off individual functional units in the computer system

Abstract

The application discloses a processing system and a voice detection method. The processing system operates in a first power domain and includes a first memory bank, a memory bank access circuit, and a first processing circuit. The first memory bank stores sound data detected by a microphone. The memory bank access circuit transfers the sound data to a second memory bank according to a first instruction, where it is stored as voice data. The first processing circuit outputs a second instruction according to a human voice detection signal. The second instruction enables a second processing circuit to determine whether the voice data in the second memory bank matches a preset voice command, and one of the first processing circuit and the second processing circuit outputs the first instruction. The second processing circuit operates in a second power domain, and the power consumption corresponding to the first power domain is lower than the power consumption corresponding to the second power domain.

Description

Processing system and voice detection method
Technical Field
The present disclosure relates to a processing system, and more particularly, to a processing system and a voice detection method applied in an electronic device that supports voice wake-up.
Background
More and more electronic devices support a voice wake-up function. In the prior art, an additional digital signal processing circuit is required to analyze voice commands. As a result, device cost increases and overall power consumption rises, making it difficult to meet the relevant energy-consumption specifications.
Disclosure of Invention
To address the above issues, one aspect of the present disclosure provides a processing system that operates in a first power domain and includes a first memory bank, a memory bank access circuit, and a first processing circuit. The first memory bank is configured to store sound data detected by a microphone. The memory bank access circuit is configured to transfer the sound data to a second memory bank according to a first instruction, to be stored as voice data. The first processing circuit is configured to output a second instruction according to a human voice detection signal. The second instruction is used to enable a second processing circuit to determine whether the voice data in the second memory bank matches a preset voice command, and one of the first processing circuit and the second processing circuit outputs the first instruction. The second processing circuit operates in a second power domain, and the power consumption corresponding to the first power domain is lower than the power consumption corresponding to the second power domain.
Some aspects of the present disclosure provide a voice detection method, comprising the following operations: storing, by a first memory bank, sound data detected by a microphone; transferring the sound data to a second memory bank according to a first instruction, to be stored as voice data; and outputting, by a first processing circuit, a second instruction according to a human voice detection signal. The second instruction is used to enable a second processing circuit to determine whether the voice data in the second memory bank matches a preset voice command, and one of the first processing circuit and the second processing circuit outputs the first instruction. The first memory bank and the first processing circuit operate in a first power domain, the second processing circuit operates in a second power domain, and the power consumption corresponding to the first power domain is lower than the power consumption corresponding to the second power domain.
In summary, the processing system and the voice detection method provided by the embodiments of the present disclosure reuse a low-power core processing circuit already present in the device to implement the voice wake-up application. In this way, a better trade-off between device cost and average power consumption can be achieved while still complying with existing energy-consumption requirements.
Drawings
The drawings attached to this document are described as follows:
FIG. 1 is a schematic diagram of an electronic device according to some embodiments of the present disclosure;
FIG. 2 is a waveform diagram illustrating various data/signals/instructions of FIG. 1 according to some embodiments of the disclosure;
FIG. 3 is a waveform diagram illustrating various data/signals/instructions of FIG. 1 according to some other embodiments of the present disclosure;
FIG. 4 is a waveform diagram illustrating various data/signals/instructions of FIG. 1 according to yet further embodiments of the present disclosure; and
FIG. 5 is a flowchart illustrating a voice detection method according to some embodiments of the present disclosure.
Detailed Description
All terms used herein have their ordinary dictionary meaning. Any examples of usage discussed in this disclosure are illustrative only and should not be taken as limiting the scope or meaning of the present disclosure. Likewise, the present disclosure is not limited to the embodiments shown in this specification.
As used herein, "coupled" or "connected" means that two or more elements are in direct or indirect physical or electrical contact with each other, or that two or more elements operate on or interact with each other.
As used herein, the term "circuit system" generally refers to a single system comprising one or more circuits. The term "circuit" broadly refers to an object connected in some manner by one or more transistors and/or one or more active and passive elements to process signals.
Referring to fig. 1, fig. 1 is a schematic diagram of an electronic device 100 according to some embodiments of the present disclosure. In some embodiments, the electronic device 100 may be a television, but the disclosure is not limited thereto. In some embodiments, the electronic device 100 includes a main processor 110, a plurality of memory banks 120, an audio/video processing circuit 130, and a processing system 140.
The main processor 110 is a multi-core processor that includes a plurality of core processing circuits 111-114. The core processing circuits 111-114 are respectively coupled to the memory banks 120 and the audio/video processing circuit 130. In some embodiments, the memory banks 120 may be implemented by dynamic random access memory (DRAM), but the disclosure is not limited thereto.
In some embodiments, the audio/video processing circuit 130 is used to perform audio/video encoding/decoding, scaling, motion compensation, and the like on video (not shown) provided by an external source. The core processing circuits 111-114, the memory banks 120, and the audio/video processing circuit 130 may cooperate with one another to play the received video.
In some embodiments, the processing system 140 is activated in the standby mode and is configured to determine whether to enable the core processing circuit 111 according to an external audio signal to perform a boot operation. In other words, the electronic device 100 may support the voice wake-up function. For example, when it is determined that a voice occurs in the environment, the processing system 140 may enable the core processing circuit 111 to analyze whether the voice is a predetermined command. If so, the other core processing circuits 112-114 are also enabled to perform the system booting operation.
The processing system 140 includes a microphone 141, a memory bank 142, a voice activity detection (VAD) circuit 143, a processing circuit 144, and a memory bank access circuit 145. The microphone 141 detects sound data SD1. In some embodiments, the microphone 141 may be implemented by one or more digital microphones. Various suitable sound-receiving components for implementing the microphone 141 are contemplated.
The memory bank 142 is coupled to the microphone 141 to receive and store the sound data SD1. In some embodiments, the memory bank 142 may be implemented by a static random access memory (SRAM).
The VAD circuit 143 is coupled to the microphone 141 to determine whether the sound data SD1 contains a human voice. For example, the VAD circuit 143 may analyze information such as energy and pitch in the sound data SD1 to determine whether a human voice is present. When a human voice is detected, the VAD circuit 143 outputs a human voice detection signal SS. In some embodiments, the VAD circuit 143 may be implemented by a voice recognition chip. Alternatively, in some embodiments, the VAD circuit 143 may be implemented by a processing circuit that executes various types of speech recognition algorithms.
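As a concrete illustration of the kind of analysis such a circuit performs, the following is a minimal C sketch of an energy-based voice-activity check on 16-bit PCM frames; the frame size, threshold, and function name are illustrative assumptions, not taken from the patent.

```c
#include <stdbool.h>
#include <stdint.h>

#define FRAME_SAMPLES 256                     /* assumed frame length        */
#define ENERGY_THRESHOLD (1000LL * 1000LL)    /* assumed detection threshold */

/* Returns true when the short-term energy of one audio frame exceeds a
 * fixed threshold, a crude stand-in for the energy/pitch analysis that a
 * real VAD circuit would perform. */
bool vad_frame_has_voice(const int16_t frame[FRAME_SAMPLES])
{
    int64_t energy = 0;
    for (int i = 0; i < FRAME_SAMPLES; i++) {
        energy += (int64_t)frame[i] * frame[i];
    }
    return energy > ENERGY_THRESHOLD;
}
```

A practical VAD would smooth the decision over several frames and adapt the threshold to background noise; the single-frame check above only conveys the idea.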
The processing circuit 144 is coupled to the VAD circuit 143 to receive the human voice detection signal SS. In some embodiments, the processing circuit 144 outputs a command C1 and a command C2 according to the human voice detection signal SS, wherein the command C1 is used to enable the memory bank access circuit 145, and the command C2 is used to enable the core processing circuit 111. In some embodiments, the processing circuit 144 may be implemented by a low-power microcontroller circuit. For example, the processing circuit 144 may be implemented by an 8051 microcontroller, but the disclosure is not limited thereto. Various types of microcontrollers are within the scope of the present disclosure.
The memory bank access circuit 145 is coupled to the processing circuit 144, the memory bank 142, and one of the memory banks 120 (hereinafter referred to as memory bank 120A). In some embodiments, the memory bank access circuit 145 transfers the sound data SD1 in the memory bank 142 to the memory bank 120A according to the command C1, where it is stored as voice data SD2. The core processing circuit 111 and the memory bank 120A may be enabled based on the command C2.
In other words, in some embodiments, when the VAD circuit 143 determines that a human voice is present, the core processing circuit 111 and the memory bank 120A are enabled in response to the command C2, and the memory bank access circuit 145 is enabled in response to the command C1 to transfer the sound data SD1 as the voice data SD2. Thus, the core processing circuit 111 can determine whether the voice data SD2 matches a preset voice command, in order to decide whether to wake up the remaining memory banks 120 and the remaining core processing circuits 112-114 for system boot.
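The following C sketch summarizes this control flow under the stated assumptions; the buffer sizes and all function names (stand-ins for the hardware enables driven by C1 and C2) are hypothetical, and a plain memcpy stands in for the DMA-style transfer performed by the memory bank access circuit 145.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SAMPLES 16000                 /* ~1 s of 16 kHz audio, assumed */

static int16_t bank_142[BUF_SAMPLES];     /* SRAM buffer holding SD1       */
static int16_t bank_120a[BUF_SAMPLES];    /* DRAM bank receiving SD2       */

/* Stand-ins for the hardware actions triggered by commands C2 and C1. */
static void enable_core_111_and_bank_120a(void) { /* power-up in PD2 */ }
static bool core_111_matches_preset(const int16_t *sd2, size_t n)
{
    (void)sd2; (void)n;                   /* placeholder keyword matcher   */
    return false;
}

/* Handler invoked when the VAD circuit asserts the detection signal SS. */
bool handle_voice_detected(size_t valid_samples)
{
    enable_core_111_and_bank_120a();                 /* command C2          */
    memcpy(bank_120a, bank_142,                      /* command C1: SD1->SD2 */
           valid_samples * sizeof(int16_t));
    return core_111_matches_preset(bank_120a, valid_samples);
}
```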
In some embodiments, the memory bank access circuit 145 may be implemented by a direct memory access (DMA) controller circuit, but is not limited thereto.
In some embodiments, as shown in FIG. 1, the core processing circuit 111 operates in a power domain PD2. In some embodiments, the memory bank 120A also operates in the power domain PD2. The remaining memory banks 120 and the remaining core processing circuits 112-114 operate in a power domain PD1. The processing system 140 operates in a power domain PD3. Generally, because of the aforementioned audio/video-related operations, the power consumption of the circuit elements in the power domain PD1 is the highest. The processing system 140 in the power domain PD3 is configured to receive voice commands in the standby mode, so its power consumption is the lowest. The memory bank 120A and the core processing circuit 111 in the power domain PD2 are used to recognize, under the control of the processing system 140, whether the voice data matches a preset voice command. Accordingly, the power consumption corresponding to the power domain PD1 is higher than that of the power domain PD2, and the power consumption corresponding to the power domain PD2 is higher than that of the power domain PD3. In a non-limiting example, as shown in FIG. 1, the power consumption of the power domain PD2 is approximately 4 watts (W), and the power consumption of the power domain PD3 is approximately 0.3 W.
In some related techniques, an additional digital signal processing circuit is used to detect voice commands. In those techniques, the additional digital signal processing circuit increases device cost and consumes more power, so international energy-consumption requirements cannot be met.
In contrast to the above techniques, the processing system 140 of the present embodiment uses the low-power memory bank 142 and processing circuit 144 as a buffer for receiving the sound data. When the received sound data is determined to contain a human voice, the low-power core processing circuit 111 already present in the electronic device 100 is enabled to recognize whether the received voice is a preset voice command. With this arrangement, a better trade-off between device cost and average power consumption can be achieved, and existing energy-consumption requirements can be met.
In some embodiments, the core processing circuit 111 may output the command C1 (shown in dashed lines in FIG. 1). More specifically, when a human voice is detected, the VAD circuit 143 outputs the human voice detection signal SS. The processing circuit 144 then generates the command C2 to enable the core processing circuit 111, and the core processing circuit 111 in turn outputs the command C1 to enable the memory bank access circuit 145. This arrangement likewise provides a better balance between device cost and average power consumption while meeting existing energy-consumption requirements.
Referring to FIG. 2, FIG. 2 is a waveform diagram illustrating various data/signals/instructions of FIG. 1 according to some embodiments of the disclosure.
As shown in FIG. 2, at time T1 the microphone 141 detects an external sound and generates the sound data SD1; for example, the microphone 141 detects that the user utters a voice command such as "Hi, TV" to the electronic device 100. Meanwhile, the VAD circuit 143 determines that the sound data SD1 contains a human voice and outputs the human voice detection signal SS. In addition, at time T1 the memory bank 142 stores the sound data SD1.
At time T2, the processing circuit 144 outputs the command C1 and the command C2 (shown as pulse P1). In response to the command C2, the core processing circuit 111 and the memory bank 120A are enabled (represented as pulse P2). Meanwhile, in response to the command C1, the memory bank access circuit 145 is enabled (i.e., pulse P1) to transfer the sound data SD1 to the memory bank 120A, where it is stored as the voice data SD2.
Accordingly, the core processing circuit 111 can analyze the voice data SD2 to determine whether it matches the preset voice command (i.e., the operation corresponding to pulse P2). For example, if the preset voice command is "Hi, TV", the core processing circuit 111 in this example confirms that the voice data SD2 matches the preset voice command and wakes up the other circuits in the power domain PD1 to perform system boot.
In some embodiments, a predetermined delay time TD1 is set between the time T1 and the time T2. As shown in FIG. 2, when the processing circuit 144 receives the human voice detection signal SS at time T1, the processing circuit 144 outputs the command C1 and the command C2 after the predetermined delay time TD1 (i.e., at time T2). In some embodiments, the predetermined delay time TD1 may be determined based on the keyword of the preset voice command. For example, if the keyword of the preset voice command is "Hi, TV" and a user generally needs about 1 second to finish saying "Hi, TV", the predetermined delay time TD1 can be set to 1 second. Setting the predetermined delay time TD1 ensures that the received sound data SD1 sufficiently reflects the preset voice command before the core processing circuit 111 and the memory bank 120A are enabled. In this way, the average power consumption can be further reduced. The values given above for the predetermined delay time TD1 are examples only, and the disclosure is not limited thereto.
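A minimal sketch of this delayed-enable behavior is shown below, assuming a platform-provided millisecond delay routine; the 1-second value and both function names are illustrative assumptions.

```c
#include <stdint.h>

#define TD1_MS 1000u                    /* assumed keyword duration for "Hi, TV" */

extern void sleep_ms(uint32_t ms);      /* assumed platform delay routine        */
extern void issue_commands_c1_c2(void); /* assumed: emit C1 and C2 to hardware   */

/* Called when the human voice detection signal SS first arrives (time T1). */
void on_first_voice_detection(void)
{
    sleep_ms(TD1_MS);          /* wait TD1 so the whole keyword lands in bank 142 */
    issue_commands_c1_c2();    /* time T2: enable the transfer and core circuit 111 */
}
```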
In some embodiments, when the preset voice command has two or more keywords (e.g., "Hi" and "TV"), the core processing circuit 111 may be enabled at least once to perform the subsequent operations. For example, as shown in FIG. 2, the core processing circuit 111 may be enabled once (i.e., pulse P2) to perform the subsequent operations. Alternatively, in some embodiments, as shown in FIG. 3 and FIG. 4, the core processing circuit 111 may be enabled two or more times (e.g., pulses P2 and P4) to perform the subsequent operations. A predetermined delay time TD2 exists between two consecutive enablements of the core processing circuit 111.
Referring to FIG. 3, FIG. 3 is a waveform diagram illustrating data/signals/instructions of FIG. 1 according to other embodiments of the present disclosure.
As shown in FIG. 3, similar to FIG. 2, at time T1 the microphone 141 generates the sound data SD1; for example, the microphone 141 detects that the user says "Hi" to the electronic device 100. Meanwhile, the VAD circuit 143 determines that the sound data SD1 contains a human voice and outputs the human voice detection signal SS, and the memory bank 142 stores the sound data SD1.
At time T2, the processing circuit 144 outputs the command C1 and the command C2 (i.e., pulse P1). In response to the command C2, the core processing circuit 111 and the memory bank 120A are enabled (i.e., pulse P2). In response to the command C1, the memory bank access circuit 145 transfers the sound data SD1 to the memory bank 120A, where it is stored as the voice data SD2. Accordingly, the core processing circuit 111 can analyze the voice data SD2 and confirm that it includes the keyword "Hi" (i.e., pulse P2). Thereafter, the core processing circuit 111 and the memory bank 120A re-enter the standby mode (i.e., the interval between pulses P2 and P4).
Meanwhile, during the foregoing process, the user continues giving the voice command to the electronic device 100; for example, the microphone 141 detects that the user then says "TV". Therefore, similar to the foregoing operation, the memory bank 142 continues storing the sound data SD1, and the VAD circuit 143 outputs the human voice detection signal SS again. At time T3, the processing circuit 144 again outputs the command C1 and the command C2 (i.e., pulse P3) to enable the memory bank access circuit 145 and the core processing circuit 111 and to transfer the current sound data SD1 to the memory bank 120A.
Accordingly, the core processing circuit 111 can analyze the voice data SD2 and confirm that it includes the keyword "TV" (i.e., pulse P4). In this way, the core processing circuit 111 finds both keywords "Hi" and "TV", confirming that the voice data SD2 matches the preset voice command, and wakes up the other circuits in the power domain PD1 to perform system boot.
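The two-stage confirmation of FIG. 3 can be pictured with the C sketch below, assuming the preset command is the keyword pair "Hi" then "TV"; the string comparison is a toy stand-in for real keyword spotting, and all names are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static const char *keywords[] = { "Hi", "TV" };
static const size_t num_keywords = sizeof(keywords) / sizeof(keywords[0]);
static size_t next_keyword = 0;

/* Called each time core processing circuit 111 is enabled with freshly
 * transferred voice data; 'recognized' is the keyword spotted in SD2.
 * Returns true once every keyword has been confirmed in order. */
bool confirm_keyword_and_maybe_boot(const char *recognized)
{
    if (strcmp(recognized, keywords[next_keyword]) == 0) {
        next_keyword++;
    }
    if (next_keyword == num_keywords) {   /* both "Hi" and "TV" confirmed */
        next_keyword = 0;
        return true;                      /* wake power domain PD1, boot  */
    }
    return false;                         /* back to standby, await SS    */
}
```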
Referring to FIG. 4, FIG. 4 is a waveform diagram illustrating data/signals/instructions of FIG. 1 according to still other embodiments of the disclosure.
In the example shown in FIG. 4, at time T2 the core processing circuit 111 and the memory bank access circuit 145 are enabled to transfer the sound data SD1 to the memory bank 120A and store it as the voice data SD2, without confirming whether the voice data SD2 matches the preset voice command (i.e., pulse P2). At time T3, the core processing circuit 111 and the memory bank access circuit 145 are enabled again to transfer the current sound data SD1 to the memory bank 120A. Accordingly, the core processing circuit 111 can analyze the voice data SD2 and confirm that it includes the two keywords "Hi" and "TV" (i.e., pulse P4). In this way, the core processing circuit 111 confirms that the voice data SD2 matches the preset voice command and wakes up the other circuits in the power domain PD1 to perform system boot.
In other words, compared with FIG. 3, in this example the core processing circuit 111 only needs a short active period (i.e., pulse P2) the first time it is enabled, just long enough to transfer the sound data SD1 into the voice data SD2. When the core processing circuit 111 is enabled again later, it analyzes the voice data SD2 to determine whether to perform system boot. With this arrangement, the memory bank 142 needs to store less sound data SD1 than in the embodiment of FIG. 2, so the memory bank 142 may be implemented with a lower-capacity SRAM to further reduce device cost.
Specifically, in some embodiments, as shown in FIG. 4, when the processing circuit 144 receives the human voice detection signal SS for the first time at time T1, the processing circuit 144 outputs only the command C1 (e.g., pulse P1) at time T2 to enable the memory bank access circuit 145 and transfer the current sound data SD1 to the memory bank 120A (e.g., pulse P2). Then, at time T3, the processing circuit 144 outputs the command C1 and the command C2 (e.g., pulse P3). Thus, in addition to the memory bank access circuit 145 being enabled to transfer the sound data SD1, the core processing circuit 111 is also enabled to analyze the voice data SD2 accumulated from the transfers at times T2 and T3 (e.g., pulse P4) and determine whether to perform system boot. In other words, by the time the core processing circuit 111 is first enabled by the command C2, the memory bank access circuit 145 has already been enabled at least twice by the command C1.
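A sketch of this variant is shown below, assuming a simple counter to track how many times C1 has been issued; the counting scheme and function names are assumptions used only to illustrate the ordering described above.

```c
extern void issue_command_c1(void);     /* assumed: enable bank access circuit 145     */
extern void issue_command_c2(void);     /* assumed: enable core processing circuit 111 */

static unsigned c1_count = 0;

/* Called on each human voice detection event in the FIG. 4 style flow. */
void on_voice_detection_fig4(void)
{
    issue_command_c1();                 /* always flush current SD1 into bank 120A */
    c1_count++;
    if (c1_count >= 2) {                /* core 111 is first enabled only after C1
                                           has already fired at least twice        */
        issue_command_c2();             /* let core 111 analyze accumulated SD2    */
    }
}
```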
In some embodiments, the predetermined delay times TD1 and TD2 may be set by a user. In some embodiments, the predetermined delay times TD1 and TD2 may have the same or different values; for example, TD1 and TD2 may both be set to 0.5 seconds, or be set to 0.5 seconds and 0.7 seconds, respectively.
For ease of illustration, FIG. 2 to FIG. 4 show the case where the processing circuit 144 outputs the command C1. As described above, in other embodiments the command C1 may instead be output by the core processing circuit 111. In these embodiments, the operations corresponding to pulse P1 or pulse P3 in FIG. 2 to FIG. 4 include: the processing circuit 144 first generates the command C2 to enable the core processing circuit 111, and the core processing circuit 111 then outputs the command C1 to enable the memory bank access circuit 145.
Referring to FIG. 5, FIG. 5 is a flowchart illustrating a voice detection method 500 according to some embodiments of the disclosure.
In operation S510, the sound data SD1 detected by the microphone 141 is stored in the memory bank 142.
In operation S520, the command C1 and the command C2 are output according to the human voice detection signal SS.
In operation S530, the memory bank access circuit 145 is enabled according to the command C1.
In operation S540, the sound data SD1 is transferred to the memory bank 120A, where it is stored as the voice data SD2.
In operation S550, the voice data SD2 is analyzed according to the command C2 to determine whether it matches the preset voice command, and the system is booted when the voice data SD2 matches the preset voice command.
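Read end to end, operations S510 through S550 amount to the loop sketched below in C; every function is an illustrative stand-in declared extern, since the actual microphone, DMA, and recognition hardware are outside the scope of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

extern size_t mic_read(int16_t *buf, size_t max);                 /* S510        */
extern bool   vad_has_voice(const int16_t *buf, size_t n);        /* drives SS   */
extern void   dma_transfer_to_bank_120a(const int16_t *buf,
                                        size_t n);                /* S530, S540  */
extern bool   core_111_matches_preset(void);                      /* S550        */
extern void   system_boot(void);

/* Simplified standby loop corresponding to voice detection method 500. */
void speech_detection_loop(int16_t *bank_142, size_t capacity)
{
    for (;;) {
        size_t n = mic_read(bank_142, capacity);   /* S510: store SD1 in bank 142 */
        if (!vad_has_voice(bank_142, n)) {
            continue;                              /* remain in standby           */
        }
        /* S520: SS asserted, commands C1 and C2 are issued. */
        dma_transfer_to_bank_120a(bank_142, n);    /* S530/S540: SD1 -> SD2       */
        if (core_111_matches_preset()) {           /* S550: compare with preset   */
            system_boot();                         /* wake power domain PD1       */
            return;
        }
    }
}
```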
The above operations can be understood with reference to the embodiments of FIG. 1 to FIG. 4 and are therefore not repeated here. The steps of the voice detection method 500 are merely exemplary and are not limited to the order shown above. Operations of the voice detection method 500 may be added, substituted, omitted, or performed in a different order as appropriate without departing from the manner of operation and the scope of the various embodiments of the present disclosure.
In various embodiments, the processing system 140 may be implemented in software, hardware, and/or firmware. For example, the circuits within the processing system 140 may be integrated into an application-specific integrated circuit. In some embodiments, the processing system 140 may be implemented by software that executes the voice detection method 500. Alternatively, the processing system 140 may be implemented by a digital signal processing circuit that performs the voice detection method 500. In other embodiments, each circuit or unit in the processing system 140 may also be implemented by a combination of software, hardware, and firmware.
In summary, the processing system and the voice detection method provided by the embodiments of the present disclosure reuse a low-power core processing circuit already present in the device to implement the voice wake-up application. In this way, a better trade-off between device cost and average power consumption can be achieved while still complying with existing energy-consumption requirements.
Although the present invention has been described with reference to the above embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
[Description of Reference Numerals]
100: electronic device
110: main processor
111-114: core processing circuits
120: memory banks
120A: memory bank
130: audio/video processing circuit
140: processing system
141: microphone
142: memory bank
143: voice activity detection circuit
144: processing circuit
145: memory bank access circuit
500: voice detection method
C1, C2: commands
SS: human voice detection signal
SD1: sound data
SD2: voice data
PD1-PD3: power domains
P1-P4: pulses
T1-T3: times
TD1, TD2: predetermined delay times
S510-S550: operations

Claims (10)

1. A processing system operating in a first power domain, the processing system comprising:
a first memory bank, configured to store sound data detected by a microphone;
a memory bank access circuit, configured to transfer the sound data to a second memory bank according to a first instruction, to be stored as voice data; and
a first processing circuit, configured to output a second instruction according to a human voice detection signal, wherein the second instruction is used to enable a second processing circuit outside the processing system to determine whether the voice data in the second memory bank matches a preset voice command, one of the first processing circuit and the second processing circuit outside the processing system is configured to output the first instruction, the second processing circuit outside the processing system operates in a second power domain, and the power consumption corresponding to the first power domain is lower than the power consumption corresponding to the second power domain.
2. The processing system of claim 1, further comprising:
a voice activity detection circuit, configured to output the human voice detection signal according to the sound data.
3. The processing system of claim 1, wherein when the first processing circuit receives the human voice detection signal, the one of the first processing circuit and the second processing circuit outputs the first instruction after a predetermined delay time.
4. The processing system of claim 1, wherein when the first processing circuit receives the human voice detection signal, the first processing circuit outputs the second instruction after a predetermined delay time.
5. The processing system of claim 1, wherein the first processing circuit outputs the first instruction, the preset voice command includes a plurality of keywords, and the first processing circuit is configured to output the second instruction to enable the second processing circuit at least once to determine whether the voice data includes the keywords.
6. The processing system of claim 5, wherein the second processing circuit is enabled twice consecutively with a predetermined delay time therebetween.
7. The processing system of claim 5, wherein when the second processing circuit is enabled by the second instruction for the first time, the second processing circuit does not confirm whether the voice data matches the preset voice command.
8. The processing system of claim 5, wherein when the second processing circuit is first enabled by the second instruction, the memory bank access circuit has already been enabled by the first instruction at least twice.
9. A method of speech detection, comprising:
storing, by a first memory bank of a processing system operating in a first power domain, sound data detected by a microphone;
transferring, by a memory bank access circuit of the processing system, the sound data to a second memory bank according to a first instruction, to be stored as voice data; and
outputting, by a first processing circuit of the processing system, a second instruction according to a human voice detection signal, wherein the second instruction is used to enable a second processing circuit outside the processing system to determine whether the voice data in the second memory bank matches a preset voice command, one of the first processing circuit and the second processing circuit outside the processing system is configured to output the first instruction, the second processing circuit outside the processing system operates in a second power domain, and the power consumption corresponding to the first power domain is lower than the power consumption corresponding to the second power domain.
10. The method of claim 9, wherein the first processing circuit outputs the first instruction, the preset voice command includes a plurality of keywords, and the first processing circuit enables the second processing circuit at least once according to the second instruction to determine whether the voice data includes the keywords.
CN201910011390.5A 2019-01-07 2019-01-07 Processing system and voice detection method Active CN111414071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910011390.5A CN111414071B (en) 2019-01-07 2019-01-07 Processing system and voice detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910011390.5A CN111414071B (en) 2019-01-07 2019-01-07 Processing system and voice detection method

Publications (2)

Publication Number Publication Date
CN111414071A (en) 2020-07-14
CN111414071B 2021-11-02

Family

ID=71490641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910011390.5A Active CN111414071B (en) 2019-01-07 2019-01-07 Processing system and voice detection method

Country Status (1)

Country Link
CN (1) CN111414071B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105869655A (en) * 2015-02-06 2016-08-17 美商富迪科技股份有限公司 Audio device and method for voice detection
CN106157950A (en) * 2016-09-29 2016-11-23 合肥华凌股份有限公司 Speech control system and awakening method, Rouser and household electrical appliances, coprocessor
CN108352168A (en) * 2015-11-24 2018-07-31 英特尔Ip公司 The low-resource key phrase detection waken up for voice
CN108711427A (en) * 2018-05-18 2018-10-26 出门问问信息科技有限公司 The acquisition method and device of voice messaging
TW201843582A (en) * 2017-05-02 2018-12-16 瑞昱半導體股份有限公司 Electronic device having wake on voice function and operating method thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103021411A (en) * 2012-11-27 2013-04-03 威盛电子股份有限公司 Speech control device and speech control method
US10403279B2 (en) * 2016-12-21 2019-09-03 Avnera Corporation Low-power, always-listening, voice command detection and capture
KR20180082033A (en) * 2017-01-09 2018-07-18 삼성전자주식회사 Electronic device for recogniting speech

Also Published As

Publication number Publication date
CN111414071A (en) 2020-07-14

Similar Documents

Publication Publication Date Title
US11862173B2 (en) Always-on audio control for mobile device
TWI713016B (en) Speech detection processing system and speech detection method
US9928838B2 (en) Clock switching in always-on component
CN1941182B (en) Semiconductor memory device including reset control circuit
CN111414071B (en) Processing system and voice detection method
US11417334B2 (en) Dynamic speech recognition method and apparatus therefor
CN112015258B (en) Processing system and control method
US20220199072A1 (en) Voice wake-up device and method of controlling same
CN112927685A (en) Dynamic voice recognition method and device
US10872651B2 (en) Volatile memory device and self-refresh method by enabling a voltage boost signal
CN114649000A (en) Voice wake-up device and control method thereof

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant