CN114944155A - Offline voice recognition method combining terminal hardware and algorithm software processing - Google Patents
- Publication number: CN114944155A
- Application number: CN202110186016.6A
- Authority: CN (China)
- Prior art keywords: voice, module, cache, CPU, voice signal
- Prior art date: 2021-02-14
- Legal status: Granted
Classifications
- G10L15/16: Speech classification or search using artificial neural networks
- G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
- G10L15/32: Multiple recognisers used in sequence or in parallel; score combination systems therefor, e.g. voting systems
- G10L25/87: Detection of discrete points within a voice signal
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
An offline speech recognition method, and a corresponding chip, combining terminal hardware and algorithm software processing, comprises the following steps: S1, a microphone captures an external analog voice signal in real time; S2, the CPU transfers the data in the first cache to a voice preprocessing module; S3, a clean voice signal is obtained and stored; S4, the direct memory access module sends the clean voice signal to the voice endpoint detection module and the hardware calculation module simultaneously; S5, the voice endpoint detection module determines whether the clean voice signal is a valid voice signal; S6, the voice acoustic features are calculated; and S7, the neural network calculation module processes the voice acoustic features and the CPU performs the voice recognition processing. The CPU in the chip and the hardware calculation modules are connected in an effective parallel processing arrangement; transferring data in parallel reduces the required CPU processing capability and lowers the chip cost.
Description
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to an offline voice recognition method combining terminal hardware and algorithm software processing.
Background
Voice recognition technology has been developing for many years. In particular, as neural network technology has matured in recent years, a large proportion of voice recognition systems have adopted neural networks, which has improved recognition accuracy and allowed voice recognition to become genuinely commercial. The mainstream approach is cloud-based voice recognition, as used by smart speakers and the voice assistants on smartphones: the terminal collects the speech and uploads it to a server, the server hardware runs the voice recognition algorithms, and the result is fed back to the terminal.
Cloud-based voice recognition solves the computing-power problem and can achieve good recognition results, but it risks leaking voice privacy, depends on the network, and has poor real-time performance, so it is not fully suitable for applications such as device control. The industry therefore also needs offline speech recognition solutions. In offline voice recognition, cloud hardware resources cannot be called upon, the processing capability of the terminal hardware is limited, and terminal products have demanding overall requirements on cost and performance, including response time and decision accuracy. How to use the limited hardware processing resources, combined with algorithm software designed for high cost-effectiveness, high real-time performance and a high recognition rate, is a challenging technical problem.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention discloses an off-line speech recognition method combining terminal hardware and algorithm software processing.
The invention relates to an off-line voice recognition method combining terminal hardware and algorithm software processing, which is characterized by comprising the following steps:
S1. A microphone captures an external analog voice signal in real time and sends it to a voice data acquisition module in an offline voice recognition chip; the voice data acquisition module converts the analog voice signal into a digital voice signal and sends it to a first cache through a direct memory access module in the chip;
S2. The CPU monitors the amount of data in the first cache and, when the data in the first cache has accumulated to a preset threshold, transfers the data from the first cache to a voice preprocessing module;
S3. When the voice preprocessing module receives the digital voice signal transferred by the CPU from the first cache, it processes the signal to obtain a clean voice signal and notifies the CPU, and the CPU stores the clean voice signal into a second cache;
S4. The direct memory access module sends the clean voice signal in the second cache to the voice endpoint detection module and the hardware calculation module simultaneously;
S5. The voice endpoint detection module determines whether the clean voice signal is a valid voice signal and, if so, the start and end endpoints of the valid voice signal; if it is valid, the voice endpoint detection module notifies the hardware calculation module, sends the start and end endpoint information, and the method proceeds to step S6; if not, the current pass is terminated and the system waits for the next pass;
S6. According to the notification from the voice endpoint detection module, the hardware calculation module determines whether the clean voice signal sent by the direct memory access module is a valid voice signal; if it is, the module obtains the start and end endpoints of the valid voice signal sent by the voice endpoint detection module, calculates the voice acoustic features, notifies the CPU, and the method proceeds to step S7; if it is not, the previously sent clean voice signal data are not processed and the module waits for the next clean voice signal data;
S7. The CPU stores the voice acoustic feature result data calculated by the hardware calculation module into a neural-network-processing front cache; the direct memory access module sends the voice acoustic feature results to the neural network calculation module in parallel, and the neural network calculation module reads, in real time, the acoustic model parameters stored in a FLASH memory external to the chip and performs the neural network calculation on the voice acoustic feature result data;
the calculated result data are transferred in parallel and in real time to the third cache SRAM 3 through the direct memory access module, and the CPU is simultaneously notified to perform the voice recognition processing.
Preferably, in S3, the processing includes performing noise reduction, filtering, speech enhancement and sound source localization.
Preferably, the voice acoustic feature calculation includes removing invalid voice signal data, loading the Mel filter coefficients, and then performing FFT calculation, mean and variance calculation, normalization and floating-point to fixed-point quantization in sequence.
Preferably, the specific process of the CPU recognition processing in step S7 is:
S71. The CPU reads the wake-up word language model and the command word language model from the external FLASH memory and stores them into the third cache SRAM 3;
S72. After receiving the voice recognition processing notification from the neural network calculation module in step S7, the CPU reads a calculation result T from the third cache and, at the same time, determines whether the current device is in the wake-up state;
S73. If the current device is not in the wake-up state, the CPU further determines whether the calculation result T is the wake-up word; if it is not, the wake-up is invalid, the device remains asleep and cannot be woken up, and the system continues to wait for a wake-up operation;
if the calculation result T is the wake-up word, the CPU further determines whether the corresponding confidence exceeds the threshold set for the wake-up word; if it does not, the result may be a misrecognition, the wake-up is considered invalid, and the system continues to wait for a wake-up operation; if the corresponding confidence exceeds the threshold set for the wake-up word, the wake-up word is judged valid and the device is woken up;
S74. If the current device is in the wake-up state, the CPU further evaluates the text and the confidence corresponding to the calculation result T read from the third cache;
first, it determines whether the text has a corresponding command; if not, the command word is invalid and the system continues to wait for the next command operation;
if the text does have a corresponding command, the CPU further checks whether the confidence corresponding to the text is higher than the threshold set for that command word; if it is not, the result is treated as a misrecognition, the command word is invalid, and the system continues to wait for the next command operation; if the corresponding confidence exceeds the threshold set for the command word, the command word is judged valid and the operation corresponding to the command word is executed.
Preferably, when the CPU determines that the current device is not in the wake-up state, only the wake-up word language model is loaded.
The invention also discloses an offline voice recognition chip, which comprises a CPU and a direct memory access module; a first cache, a second cache and a third cache are connected between the direct memory access module and the CPU; the first cache is also connected with a voice preprocessing module, and the second cache is also connected with a voice endpoint detection module and a hardware calculation module; the hardware calculation module is connected with a neural network front cache; the direct memory access module is also connected with a voice data acquisition module and with a neural network calculation module, and the neural network calculation module is connected with the third cache and the neural network front cache; the neural network calculation module and the CPU are provided with external memory connection ports; the direct memory access module, the voice preprocessing module, the voice endpoint detection module, the hardware calculation module, the voice data acquisition module and the neural network calculation module are all hardware modules.
Preferably, the hardware module is implemented in an ASIC manner.
Preferably, the caches are realized within the same cache device, with different storage partitions accessed through different data read-write channels of the direct memory access module.
The off-line speech recognition method combining the terminal hardware and the algorithm software processing has the following advantages:
the CPU and each hardware computing module in the chip are connected in an effective parallel processing mode, and the processing capacity requirement of the CPU can be reduced and the chip cost is reduced by carrying data in parallel.
The parallel processing work of the hardware computing module and the CPU can ensure that the CPU and the hardware module of the chip can independently operate, the operation speed is improved, voice data cannot be omitted when real-time voice recognition is processed, the recognition instantaneity is ensured, and the recognition effect is improved.
The chip can automatically read the acoustic model from the external FLASH memory, an internal RAM is not needed for storing the acoustic model, and the memory space is greatly saved.
Drawings
FIG. 1 is a flow chart illustrating an embodiment of an off-line speech recognition method according to the present invention;
FIG. 2 is a schematic diagram of an embodiment of the offline speech recognition chip according to the present invention; FIG. 2 does not show the connections between all modules.
Detailed Description
The following provides a more detailed description of embodiments of the present invention.
The off-line speech recognition method combining the terminal hardware and the algorithm software processing of the invention, as shown in figure 1, comprises the following steps:
S1. A microphone captures an external analog voice signal in real time and sends it to a voice data acquisition module in the offline voice recognition chip; the voice data acquisition module converts the analog voice signal into a digital voice signal and sends it directly, in real time, to a first cache through a direct memory access (DMA) module in the chip;
S2. The CPU monitors the amount of data in the first cache and, when the data in the first cache SRAM 1 has accumulated to a preset threshold, transfers the data from the first cache SRAM 1 to the voice preprocessing module. The threshold is configurable and is generally chosen according to the size of the cache and the data processing capability of the CPU; for example, the first cache may be 1 MB and the threshold 512 KB.
After the data in the first cache have been transferred, the corresponding storage locations in the first cache SRAM 1 can again receive the real-time digital voice signal continuously sent by the voice data acquisition module, so real-time data can be stored continuously using a smaller SRAM 1, which reduces cost.
In step S1 the voice data acquisition module transmits the real-time digital voice signal through the direct memory access module without involving the CPU, which facilitates parallel operation and reduces the required CPU processing capability. In step S2 the CPU decides when the data in the first cache SRAM 1 can be fetched for processing, which also gives the processing the flexibility to meet different speech recognition requirements.
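As an illustration of the threshold-triggered handoff in steps S1 and S2, the following Python sketch models the first cache as a software buffer; the cache size, the threshold value and the function interfaces are assumptions made only for illustration, since in the invention these operations are performed by the DMA hardware and the CPU.

```python
# Illustrative sketch of steps S1-S2: acquisition fills the first cache and the
# CPU hands a chunk to preprocessing once a fill threshold is reached.
from collections import deque

FIRST_CACHE_BYTES = 1 * 1024 * 1024   # assumed 1 MB first cache
THRESHOLD_BYTES   = 512 * 1024        # assumed 512 KB trigger threshold

first_cache = deque()                 # stands in for SRAM 1 filled by the DMA module
cached_bytes = 0

def on_dma_samples(chunk: bytes):
    """Called as the acquisition module DMAs digital samples into the first cache."""
    global cached_bytes
    first_cache.append(chunk)
    cached_bytes += len(chunk)

def cpu_poll_first_cache(preprocess):
    """CPU-side monitor: when enough data has accumulated, move it to preprocessing."""
    global cached_bytes
    if cached_bytes >= THRESHOLD_BYTES:
        block = b"".join(first_cache)
        first_cache.clear()
        cached_bytes = 0              # freed locations keep receiving real-time data
        preprocess(block)
```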
S3. When the voice preprocessing module has received a certain amount of digital voice signal transferred by the CPU from the first cache SRAM 1, the signal is subjected to noise reduction, filtering, speech enhancement, sound source localization and similar processing to obtain a clean voice signal, and the CPU is notified to store the clean voice signal into the second cache SRAM 2.
and S4, after the CPU finishes storing the signals, the direct memory access module carries the clean voice signals from the second cache SRAM 2 to the voice endpoint detection module and the hardware calculation module simultaneously in parallel.
This step is performed by the direct memory access module, which guarantees real-time, parallel processing and reduces the required CPU processing capability.
S5. For the clean voice signal it receives, the voice endpoint detection module determines whether it is a valid voice signal and, if so, the start and end endpoints of the valid voice signal; if the signal is valid, the voice endpoint detection module notifies the hardware calculation module; if not, the current pass is terminated and the module continues to wait.
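The patent does not specify how the endpoint detection is computed; the following sketch uses a simple short-time-energy criterion purely to illustrate the kind of valid/invalid decision and start/end endpoints produced in step S5. The frame length and thresholds are illustrative assumptions, not parameters of the invention.

```python
import numpy as np

def detect_endpoints(clean, frame_len=256, energy_thresh=1e-3, min_voiced_frames=5):
    """Toy endpoint detector: returns (is_valid, start_sample, end_sample).

    clean: 1-D float array of preprocessed samples in [-1, 1].
    Thresholds are illustrative; the patent's module is a hardware block.
    """
    n_frames = len(clean) // frame_len
    frames = clean[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)              # short-time energy per frame
    voiced = np.where(energy > energy_thresh)[0]
    if len(voiced) < min_voiced_frames:
        return False, None, None                      # not a valid voice signal
    start = int(voiced[0] * frame_len)                # first voiced frame -> start endpoint
    end = int((voiced[-1] + 1) * frame_len)           # last voiced frame -> end endpoint
    return True, start, end
```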
S6. The hardware calculation module receives the notification from the voice endpoint detection module and determines whether the clean voice signal sent by the direct memory access module is a valid voice signal. If it is, the module obtains the start and end endpoints of the valid voice signal sent by the voice endpoint detection module, calculates the voice acoustic features and notifies the CPU; if it is not, the previously sent clean voice signal data are not processed and the module waits for the next clean voice signal data.
the voice acoustic feature is calculated by removing invalid voice signal data, loading a Mel filter coefficient, finally calculating to obtain the voice acoustic feature through the steps of FFT calculation, mean variance calculation, normalization calculation, floating point conversion fixed point quantization and the like, and a hardware calculation module informs a CPU after the calculation is finished;
S7. The CPU stores the calculated result data into the neural-network-processing front cache; the direct memory access module sends the data to the neural network calculation module in parallel for computation, and this parallel computation guarantees the real-time performance of the voice signal processing.
the CPU stores data, and meanwhile, if the calculation mode or parameters of the hardware calculation module need to be adjusted, the CPU can be involved in configuring the hardware calculation module, so that the flexibility of data calculation processing is considered.
When the neural network calculation module performs its computation, it automatically reads the acoustic model parameters stored in the FLASH memory in real time through the direct memory access module, and performs the neural network calculation on the acoustic model and the valid voice signal result data. After the calculation is finished, the result is transferred in parallel and in real time to the third cache SRAM 3 through the direct memory access module, and the CPU is simultaneously notified to perform the voice recognition processing.
In this offline voice recognition method combining terminal hardware and algorithm software, the CPU is not used in steps S1, S4, S5 and S6 and is used in steps S2, S3 and S7; during continuous voice data processing, while one pass is executing steps S1, S4, S5 and S6, the otherwise idle CPU can execute steps S2, S3 and S7 of another pass, as shown in FIG. 1.
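This overlap behaves like a two-stage pipeline: the hardware stages and the CPU stages of successive passes run concurrently. The following thread-and-queue sketch illustrates only the scheduling idea; the queue stands in for the caches between the modules, and the placeholder processing functions are assumptions rather than the chip's actual hardware.

```python
import queue
import threading

hw_to_cpu = queue.Queue()       # stands in for the caches between hardware and CPU

def hardware_stages(audio_chunks):
    """Steps S1, S4, S5, S6: acquisition, DMA fan-out, endpoint detection, features."""
    for chunk in audio_chunks:
        feats = f"features({chunk})"         # placeholder for the hardware feature result
        hw_to_cpu.put(feats)
    hw_to_cpu.put(None)                       # end-of-stream marker

def cpu_stages():
    """Steps S2, S3, S7: buffer management, result storage, recognition post-processing."""
    while True:
        feats = hw_to_cpu.get()
        if feats is None:
            break
        print("CPU recognises from", feats)   # wake-word / command-word decision would go here

t_hw = threading.Thread(target=hardware_stages, args=(["chunk0", "chunk1"],))
t_cpu = threading.Thread(target=cpu_stages)
t_hw.start(); t_cpu.start()
t_hw.join(); t_cpu.join()
```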
As shown in FIG. 2, an offline speech recognition chip capable of implementing the offline speech recognition method of the present invention includes a CPU and a direct memory access module; a first cache, a second cache and a third cache are connected between the direct memory access module and the CPU; the first cache is also connected with a voice preprocessing module, and the second cache is also connected with a voice endpoint detection module and a hardware calculation module; the hardware calculation module is connected with a neural network front cache; the direct memory access module is also connected with a voice data acquisition module and with a neural network calculation module, and the neural network calculation module is connected with the third cache and the neural network front cache; the neural network calculation module and the CPU are provided with external memory connection ports; the direct memory access module, the voice preprocessing module, the voice endpoint detection module, the hardware calculation module, the voice data acquisition module and the neural network calculation module are all hardware modules.
Because of its instruction-based architecture, a CPU offers good generality, but for specialized tasks an instruction-architecture CPU has no advantage in area or performance. In neural-network speech recognition, if the CPU were to perform specialized operations such as preprocessing, endpoint detection, hardware feature calculation and neural network calculation, a very large CPU with a clock frequency above 1 GHz would usually be required; such a high-performance CPU is expensive, which hinders the adoption of voice modules, and its power consumption is high. Many offline speech recognition devices, such as handheld devices, are battery powered and sensitive to power consumption, so a high-clock-frequency CPU works against the low-cost goal of a speech recognition chip.
The invention implements the fixed, specialized calculations required by voice recognition in dedicated hardware ASIC modules that run in parallel with the CPU, which reduces power consumption and cost, achieves multi-core parallel operation, and reaches a better balance of cost, power consumption and processing speed. Different storage partitions can be realized within the same cache device through different data read-write channels, and these partitions serve respectively as the first cache, the second cache, the third cache and the neural network front cache.
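As an illustration of such partitioning, the following sketch divides one flat memory into the four logical caches by address range; the sizes and the byte-array model are assumptions made only for illustration.

```python
# Illustrative partitioning of a single cache device into the four logical caches.
SRAM = bytearray(2 * 1024 * 1024)             # one physical cache device (assumed 2 MB)

PARTITIONS = {                                # name: (offset, size) - assumed layout
    "first_cache":    (0x000000, 0x080000),
    "second_cache":   (0x080000, 0x080000),
    "third_cache":    (0x100000, 0x080000),
    "nn_front_cache": (0x180000, 0x080000),
}

def dma_write(partition: str, offset: int, data: bytes):
    """Write into one logical cache, as one DMA read-write channel would."""
    base, size = PARTITIONS[partition]
    assert offset + len(data) <= size, "write exceeds partition"
    SRAM[base + offset : base + offset + len(data)] = data

def dma_read(partition: str, offset: int, length: int) -> bytes:
    """Read from one logical cache through its own channel."""
    base, size = PARTITIONS[partition]
    assert offset + length <= size, "read exceeds partition"
    return bytes(SRAM[base + offset : base + offset + length])
```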
In one embodiment, the CPU first reads the wake-up word language model and the command word language model from the external FLASH memory and stores them in the third cache SRAM 3.
When it receives the processing notification from the neural network calculation module in step S7, the CPU reads the calculation result T from the third cache and, at the same time, determines whether the current device is in the wake-up state. If it is not, the CPU determines whether the calculation result T is the wake-up word; if it is not, the wake-up is invalid, the device is not in the wake-up state and cannot be woken up, and the system continues to wait for a wake-up operation;
if the calculation result T is the wake-up word, the CPU further determines whether the corresponding confidence exceeds the threshold set for the wake-up word; if it does not, the result may be a misrecognition, the wake-up is considered invalid, and the system continues to wait for a wake-up operation; if the corresponding confidence exceeds the threshold set for the wake-up word, the wake-up word is judged valid and the device is woken up.
The CPU reads from the third cache SRAM 3 the calculation result T produced by the neural network calculation module and performs computations such as Viterbi decoding, in combination with the wake-up word language model and the command word language model, to obtain the text and confidence corresponding to the valid speech signal. If the device has not been woken up, only the wake-up word language model needs to be loaded and stored in the third cache SRAM 3; the wake-up word language model is usually much smaller than the command word language model, and only wake-up-type commands are recognized at this point, which reduces the computation and storage load of the CPU and the neural network and lowers system power consumption; it also avoids false triggering of the device. Since waking the device is a frequent operation, loading only the wake-up word model improves wake-up recognition efficiency, shortens the wake-up time and improves the user experience.
When the device is in the wake-up state and a command word is to be recognized, the command word language model is loaded for recognition. The specific process can be as follows:
After receiving the processing notification from the neural network calculation module in step S7, the CPU determines that the current device is already in the wake-up state. When it evaluates the command word and confidence corresponding to the calculation result T read from the third cache, it first determines whether the text has a corresponding command; if not, the command word is invalid and the system continues to wait for the next command operation. If the text does have a corresponding command, the CPU further checks whether the confidence corresponding to the text is higher than the threshold set for that command word; if it is not, the result is treated as a misrecognition, the command word is invalid, and the system continues to wait for the next command operation; if the corresponding confidence exceeds the threshold set for the command word, the command word is judged valid and the operation corresponding to the command word is executed.
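The decision flow of steps S71 to S74 amounts to a small state machine over the recognized text and its confidence. The sketch below illustrates that flow; the wake-up word, the command table and the threshold values are placeholder assumptions, and the loading of the language models is abstracted away.

```python
class RecognitionController:
    """Sketch of the S71-S74 decision flow; thresholds and command table are illustrative."""

    def __init__(self, wake_word="hello device", wake_thresh=0.8, cmd_thresh=0.7):
        self.awake = False
        self.wake_word = wake_word
        self.wake_thresh = wake_thresh
        self.cmd_thresh = cmd_thresh
        self.commands = {"25 degrees": "set_temperature(25)"}   # command-word table (assumed)

    def on_nn_result(self, text, confidence):
        if not self.awake:
            # only the wake-up word model is loaded in this state
            if text == self.wake_word and confidence >= self.wake_thresh:
                self.awake = True
                return "device woken up"
            return "ignored (wake-up invalid)"
        # device awake: the command word language model is loaded
        action = self.commands.get(text)
        if action is None or confidence < self.cmd_thresh:
            return "ignored (command word invalid)"
        return f"execute {action}"

ctrl = RecognitionController()
print(ctrl.on_nn_result("hello device", 0.92))   # wakes the device
print(ctrl.on_nn_result("25 degrees", 0.85))     # executes the command
```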
Specific example 1
The operation of the air conditioner by the offline voice recognition chip shown in fig. 2 is described as an example:
Assume the air conditioner is initially not woken up; the CPU detects that the device is not in the wake-up state and loads only the wake-up word model into the third cache.
The user issues the voice command "25 degrees", and the system performs the following operations:
S1. The microphone captures the analog voice signal of the command word "25 degrees" in real time and sends it to the voice data acquisition module in the offline voice recognition chip; the voice data acquisition module converts the analog voice signal into a digital voice signal and sends it to the first cache through the direct memory access module in the chip;
S2. The CPU monitors the amount of data in the first cache and, when the data in the first cache has accumulated to the preset threshold, transfers the data from the first cache to the voice preprocessing module;
S3. When the voice preprocessing module receives the digital voice signal transferred by the CPU from the first cache, it processes the signal to obtain a clean voice signal and notifies the CPU, and the CPU stores the clean voice signal into the second cache;
S4. The direct memory access module sends the clean voice signal in the second cache to the voice endpoint detection module and the hardware calculation module simultaneously;
S5. The voice endpoint detection module determines whether the clean voice signal is a valid voice signal and its start and end endpoints; finding that it is a valid voice signal, it notifies the hardware calculation module, sends the start and end endpoint information, and the method proceeds to step S6;
S6. According to the notification from the voice endpoint detection module, the hardware calculation module determines that the clean voice signal "25 degrees" sent by the direct memory access module is a valid voice signal, and obtains the start and end endpoints of the valid voice signal sent by the voice endpoint detection module; the hardware calculation module calculates the voice acoustic features, notifies the CPU, and the method proceeds to step S7;
S7. The CPU stores the voice acoustic feature result data calculated by the hardware calculation module into the neural-network-processing front cache; the direct memory access module sends the voice acoustic feature results to the neural network calculation module in parallel, and the neural network calculation module reads in real time the acoustic model parameters stored in the FLASH memory external to the chip and performs the neural network calculation on the voice acoustic feature result data;
the calculated result data are transferred in parallel and in real time to the third cache SRAM 3 through the direct memory access module, and the CPU is simultaneously notified to perform the voice recognition processing.
The speech recognition processing procedure is as follows:
S71. The CPU detects that the device is not in the wake-up state and loads only the wake-up word model into the third cache;
S72. After receiving the processing notification from the neural network calculation module in step S7, the CPU reads the calculation result T from the third cache;
S73. The CPU finds that the current device is not in the wake-up state and determines that the calculation result T is not the wake-up word; the wake-up is therefore invalid, the device remains asleep and cannot be woken up, and the system continues to wait for a wake-up operation.
Specific example 2
The difference from example 1 is that the air conditioner is already in the wake-up state when the user issues the command word "25 degrees".
Step S73 is skipped and the flow proceeds directly to step S74; the CPU loads the command word language model into the third cache.
S74. Since the current device is in the wake-up state, the CPU evaluates the text and confidence corresponding to the calculation result T read from the third cache.
It first determines whether the text has a corresponding command; the text "25 degrees" does have a corresponding command, so the CPU further checks whether the confidence corresponding to the text exceeds the threshold set for that command word. If it does not, the result is treated as a misrecognition, the command word is invalid, and the system continues to wait for the next command operation; since here the confidence exceeds the threshold set for the command word, the command word is judged valid, the corresponding operation is executed, and the air conditioner temperature is set to 25 degrees.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction system which implements the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing describes preferred embodiments of the present invention. Where the preferred embodiments are not obviously contradictory, they may be combined with one another in any overlapping manner. The specific parameters in the embodiments and examples serve only to clearly illustrate the inventors' verification process and are not intended to limit the scope of the invention, which is defined by the claims; equivalent structural changes made using the description and drawings of the present invention are also intended to fall within the scope of the present invention.
Claims (8)
1. An off-line speech recognition method combining terminal hardware and algorithm software processing, characterized by comprising the following steps:
S1. a microphone captures an external analog voice signal in real time and sends it to a voice data acquisition module in an off-line voice recognition chip; the voice data acquisition module converts the analog voice signal into a digital voice signal and sends it to a first cache through a direct memory access module in the chip;
S2. the CPU monitors the amount of data in the first cache and, when the data in the first cache has accumulated to a preset threshold, transfers the data from the first cache to a voice preprocessing module;
S3. when the voice preprocessing module receives the digital voice signal transferred by the CPU from the first cache, it processes the signal to obtain a clean voice signal and notifies the CPU, and the CPU stores the clean voice signal into a second cache;
S4. the direct memory access module sends the clean voice signal in the second cache to the voice endpoint detection module and the hardware calculation module simultaneously;
S5. the voice endpoint detection module determines whether the clean voice signal is a valid voice signal and the start and end endpoints of the valid voice signal; if it is valid, the voice endpoint detection module notifies the hardware calculation module, sends the start and end endpoint information, and the method proceeds to step S6; if not, the current pass is terminated and the system waits for the next pass;
S6. according to the notification from the voice endpoint detection module, the hardware calculation module determines whether the clean voice signal sent by the direct memory access module is a valid voice signal; if it is, the module obtains the start and end endpoints of the valid voice signal sent by the voice endpoint detection module, calculates the voice acoustic features, notifies the CPU, and the method proceeds to step S7; if it is not, the previously sent clean voice signal data are not processed and the module waits for the next clean voice signal data;
S7. the CPU stores the voice acoustic feature result data calculated by the hardware calculation module into a neural-network-processing front cache; the direct memory access module sends the voice acoustic feature results to the neural network calculation module in parallel, and the neural network calculation module reads, in real time, the acoustic model parameters stored in a FLASH memory external to the chip and performs the neural network calculation on the voice acoustic feature result data;
the calculated result data are transferred in parallel and in real time to the third cache SRAM 3 through the direct memory access module, and the CPU is simultaneously notified to perform the voice recognition processing.
2. The offline speech recognition method according to claim 1, wherein in S3, said processing includes performing noise reduction, filtering, speech enhancement and sound source localization.
3. The off-line speech recognition method of claim 1, wherein the voice acoustic feature calculation is performed by removing invalid voice signal data, loading the Mel filter coefficients, and then performing FFT calculation, mean and variance calculation, normalization and floating-point to fixed-point quantization in sequence.
4. The off-line speech recognition method of claim 1, wherein the specific flow of the CPU recognition processing in step S7 is:
S71. the CPU reads the wake-up word language model and the command word language model from the external FLASH memory and stores them into the third cache SRAM 3;
S72. after receiving the voice recognition processing notification from the neural network calculation module in step S7, the CPU reads a calculation result T from the third cache and, at the same time, determines whether the current device is in the wake-up state;
S73. if the current device is not in the wake-up state, the CPU further determines whether the calculation result T is the wake-up word; if it is not, the wake-up is invalid, the device is not in the wake-up state and cannot be woken up, and the system continues to wait for a wake-up operation;
if the calculation result T is the wake-up word, the CPU further determines whether the corresponding confidence exceeds the threshold set for the wake-up word; if it does not, the result may be a misrecognition, the wake-up is considered invalid, and the system continues to wait for a wake-up operation; if the corresponding confidence exceeds the threshold set for the wake-up word, the wake-up word is judged valid and the device is woken up;
S74. if the current device is in the wake-up state, the CPU further evaluates the text and the confidence corresponding to the calculation result T read from the third cache;
first, it determines whether the text has a corresponding command; if not, the command word is invalid and the system continues to wait for the next command operation;
if the text does have a corresponding command, the CPU further checks whether the confidence corresponding to the text is higher than the threshold set for that command word; if it is not, the result is treated as a misrecognition, the command word is invalid, and the system continues to wait for the next command operation; if the corresponding confidence exceeds the threshold set for the command word, the command word is judged valid and the operation corresponding to the command word is executed.
5. The off-line speech recognition method of claim 4, wherein the CPU loads only the wake-up word language model when it determines that the current device is not in the wake-up state.
6. An off-line voice recognition chip is characterized by comprising a CPU and a direct memory access module, wherein a first cache, a second cache and a third cache are connected between the direct memory access module and the CPU, the first cache is also connected with a voice preprocessing module, and the second cache is also connected with a voice endpoint detection module and a hardware calculation module; the hardware computing module is connected with a neural network front cache; the direct memory access module is also connected with a voice data acquisition module and a neural network computing module, and the neural network computing module is connected with a third cache and a neural network front cache; the neural network computing module and the CPU are provided with external memory connecting ports; the direct memory access module, the voice preprocessing module, the voice endpoint detection module, the hardware calculation module, the voice data acquisition module and the neural network calculation module are all hardware modules.
7. The offline speech recognition chip of claim 6, wherein the hardware module is implemented as an ASIC.
8. The off-line speech recognition chip of claim 6, wherein the caches are implemented within the same cache device, with different storage partitions accessed through different data read-write channels of the direct memory access module.
Priority application: CN202110186016.6A, filed 2021-02-14
Publications: CN114944155A, published 2022-08-26; CN114944155B, granted 2024-06-04
Family ID: 82905770