CN114944155B - Off-line voice recognition method combining terminal hardware and algorithm software processing

Info

Publication number: CN114944155B
Application number: CN202110186016.6A
Authority: CN
Inventors: 许兵; 高君效
Original assignee: Chipintelli Technology Co Ltd
Current assignee: Chipintelli Technology Co Ltd
Filing date: 2021-02-14
Publication date: 2024-06-04
Anticipated expiration: 2041-02-14

Abstract

An off-line voice recognition method and chip combining terminal hardware and algorithm software processing. The method comprises the following steps: S1, a microphone captures an external analog voice signal in real time; S2, the data in the first cache is carried to a voice preprocessing module; S3, a clean voice signal is obtained and stored; S4, the direct memory access module simultaneously sends the clean voice signal to the voice endpoint detection module and the hardware calculation module; S5, the voice endpoint detection module calculates and judges whether the clean voice signal is an effective voice signal; S6, the acoustic features of the voice are calculated; S7, the neural network computing module performs computation on the acoustic features of the voice, and the CPU performs voice recognition processing. The CPU and the hardware calculation modules in the chip are connected so that they process in parallel, and data is carried in parallel, which reduces the processing capacity required of the CPU and therefore the chip cost.

Description

Off-line voice recognition method combining terminal hardware and algorithm software processing

Technical Field

The invention belongs to the technical field of voice recognition, and particularly relates to an off-line voice recognition method combining terminal hardware and algorithm software processing.

Background

Voice recognition technology has been developing for many years. In particular, as neural network technology has matured in recent years, a great deal of voice recognition has adopted it, which has improved recognition accuracy and made voice recognition genuinely commercial. Applying neural network technology to voice recognition requires both algorithms and hardware computing power. The mainstream approach is cloud voice recognition, as used in existing smart speakers and in voice assistants on smartphones: the terminal collects the voice and uploads it to a server, the server hardware runs the relevant voice recognition algorithms, and the result is fed back to the terminal.

Cloud voice recognition solves the problem of the computing power that voice recognition requires and achieves good recognition results, but it also risks leaking private voice data, depends on a network, and has poor real-time performance, so it is not well suited to applications such as device control. The industry therefore also needs offline voice recognition solutions. In offline voice recognition, cloud hardware resources cannot be called upon and terminal hardware processing capacity is limited, while terminal products place high combined demands on cost and performance, including response time and judgment accuracy. How to use limited hardware processing resources, in combination with algorithm software, to design a voice recognition method with high cost performance, high real-time performance and a high recognition rate is therefore a challenging technical problem.

Disclosure of Invention

In order to overcome the defects in the prior art, the invention discloses an off-line voice recognition method combining terminal hardware and algorithm software processing.

The invention relates to an off-line voice recognition method combining terminal hardware and algorithm software processing, which is characterized by comprising the following steps:

S1, capturing an external analog voice signal in real time by a microphone, and sending the analog voice signal to a voice data acquisition module inside an offline voice recognition chip; the voice data acquisition module converts the analog voice signal into a digital voice signal and sends the digital voice signal to the first cache through a direct memory access module in the chip;

S2, the CPU monitors the data quantity of the first cache, and when the data in the first cache is accumulated to a preset threshold value, the CPU carries the data in the first cache to the voice preprocessing module;

S3, when the voice preprocessing module receives a digital voice signal transmitted by the CPU from the first buffer memory, processing the signal to obtain a clean voice signal, informing the CPU, and storing the clean voice signal into the second buffer memory by the CPU;

S4, the direct memory access module simultaneously sends the clean voice signals in the second cache to the voice endpoint detection module and the hardware calculation module;

S5, the voice endpoint detection module calculates and judges whether the clean voice signal is an effective voice signal and determines the starting endpoint and the ending endpoint of the effective voice signal; if the clean voice signal is an effective voice signal, the module notifies the hardware calculation module, sends the starting endpoint and ending endpoint information, and the method proceeds to S6; if not, processing terminates and the module waits for the next signal;

S6, the hardware computing module judges whether the clean voice signal sent by the direct memory access module is an effective voice signal or not according to the notification of the voice endpoint detection module; if the voice signal is the effective voice signal, acquiring starting and ending endpoints of the effective voice signal sent by the voice endpoint detection module; the hardware calculation module calculates the voice acoustic characteristics and informs the CPU to enter S7; if the voice signal is not a valid voice signal, the data of the clean voice signal sent before is not processed and enters a state of waiting for the data of the next clean voice signal;

S7, the CPU stores the voice acoustic feature result data calculated by the hardware calculation module into a neural network processing front-end cache; the direct memory access module sends the voice acoustic feature result data to the neural network computing module in parallel, and the neural network computing module reads, in real time, acoustic model parameters stored in a FLASH memory external to the chip and performs neural network computation on the voice acoustic feature result data;

the computed result data is transmitted to the third cache SRAM 3 in parallel and in real time through the direct memory access module, and at the same time the CPU is notified to perform voice recognition processing.

Preferably, in S3, the processing includes noise reduction, filtering, voice enhancement, and sound source localization.

Preferably, the voice acoustic feature calculation sequentially performs invalid voice signal data removal, mel filter coefficient loading, FFT calculation, mean variance calculation, normalization calculation and floating-point to fixed-point quantization.

Preferably, the specific flow of the CPU processing identification in the step S7 is as follows:

S71, the CPU reads the wake-up word language model and the command word language model from the external FLASH memory and stores the wake-up word language model and the command word language model in the third cache SRAM 3;

S72, after the CPU receives the voice recognition processing notification of the neural network computing module in the step S7, the CPU reads a computing result T from the third cache, and meanwhile, the CPU judges whether the current equipment is in an awake state or not;

S73, if the current equipment is not in the awake state, the CPU judges whether the calculation result T is a wake-up word; if it is not, the wake-up is invalid, the equipment remains not awake, and the CPU continues to wait for a wake-up operation;

if the calculation result T is a wake-up word, the CPU judges whether the corresponding confidence exceeds the threshold set for the wake-up word; if not, a false recognition is possible, the wake-up is considered invalid, and the CPU continues to wait for a wake-up operation; if the corresponding confidence exceeds the threshold set for the wake-up word, the wake-up word is judged valid and the equipment is woken up;

S74, if the current equipment is in an awake state, the CPU continues to judge the text and the confidence corresponding to the calculation result T read from the third cache;

the CPU first judges whether the text has a corresponding command; if not, the command word is judged invalid and the CPU continues to wait for the next command operation;

If the text is judged to have a corresponding command, continuing to judge whether the confidence coefficient corresponding to the text exceeds a threshold set by a corresponding command word, if not, judging that the text is misidentified, and continuing to wait for the next command operation; and if the corresponding confidence coefficient exceeds the threshold set by the command word, judging that the command word is valid, and executing the operation corresponding to the command word.

Preferably, when the CPU judges that the current device is not in the wake state, only the wake word language model is loaded.

The invention also discloses an offline voice recognition chip, which comprises a CPU and a direct memory access module, wherein a first cache, a second cache and a third cache are connected between the direct memory access module and the CPU, the first cache is also connected with a voice preprocessing module, and the second cache is also connected with a voice endpoint detection module and a hardware calculation module; the hardware computing module is connected with a neural network pre-cache; the direct memory access module is also connected with a voice data acquisition module and a neural network calculation module, and the neural network calculation module is connected with a third cache and a neural network front cache; the neural network computing module and the CPU are provided with external memory connection ports; the direct memory access module, the voice preprocessing module, the voice endpoint detection module, the hardware calculation module, the voice data acquisition module and the neural network calculation module are all hardware modules.

Preferably, the hardware modules are implemented as ASICs.

Preferably, the caches are implemented in a single cache device as different memory partitions, accessed through different data read-write channels of the direct memory access module.

The off-line voice recognition method combining the terminal hardware and the algorithm software processing has the following advantages:

First, the CPU and the hardware calculation modules in the chip are connected so that they process in parallel, and data is carried in parallel, which reduces the processing capacity required of the CPU and lowers the chip cost.

Second, because the hardware calculation modules and the CPU work in parallel, the CPU and the hardware modules of the chip operate independently. This increases operating speed, ensures that no voice data is missed during real-time voice recognition, guarantees real-time recognition, and improves the recognition result.

Third, the chip reads the acoustic model automatically from the external FLASH memory, so the acoustic model need not be stored in internal RAM, which greatly saves memory space.

Drawings

FIG. 1 is a flowchart illustrating an embodiment of an offline speech recognition method according to the present invention;

FIG. 2 is a schematic diagram of an embodiment of the offline speech recognition chip according to the present invention; not all connection lines between the modules are shown in FIG. 2.

Detailed Description

The following describes the present invention in further detail.

The offline speech recognition method combining terminal hardware and algorithm software processing in the invention, as shown in figure 1, comprises the following steps:

S1, a microphone captures an external analog voice signal in real time and sends it to a voice data acquisition module inside the offline voice recognition chip; the voice data acquisition module converts the analog voice signal into a digital voice signal and sends it directly, in real time, to the first cache through a direct memory access module (DMA) in the chip;

S2, the CPU monitors the amount of data in the first cache, and when the data in the first cache SRAM 1 has accumulated to a preset threshold, the CPU carries that data from the first cache SRAM 1 to the voice preprocessing module. The threshold is configurable and is generally chosen according to the cache size and the data processing capability of the CPU; for example, the first cache may be set to 1M and the threshold to 512K.

After the data has been carried out of the first cache, the corresponding storage locations in the first cache SRAM 1 can again store the real-time digital voice signal that the voice data acquisition module sends continuously, so a smaller cache SRAM 1 suffices to store real-time data continuously, which reduces cost.

In step S1, the voice data acquisition module uses the direct memory access module to transmit the real-time digital voice signal without CPU participation, which facilitates parallel work and reduces the processing capacity required of the CPU. In step S2, the CPU decides when the data in the first cache SRAM 1 can be fetched for processing, which preserves processing flexibility so that different voice recognition requirements can be accommodated.
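The hand-off described in steps S1 and S2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the chip's firmware: the class `FirstCache` and its methods `dma_write` and `cpu_poll` are hypothetical names, and the capacity and threshold are counted in bytes and scaled down from the 1M and 512K figures given above.

```python
from typing import Optional


class FirstCache:
    """Toy model of the first cache (SRAM 1): the DMA writes, the CPU drains."""

    def __init__(self, capacity: int, threshold: int) -> None:
        if not 0 < threshold <= capacity:
            raise ValueError("threshold must be in (0, capacity]")
        self.capacity = capacity
        self.threshold = threshold
        self._data = bytearray()

    def dma_write(self, chunk: bytes) -> None:
        """Models S1: digitized samples arrive without CPU involvement."""
        if len(self._data) + len(chunk) > self.capacity:
            raise OverflowError("first cache overrun: data was not drained in time")
        self._data.extend(chunk)

    def cpu_poll(self) -> Optional[bytes]:
        """Models S2: the CPU moves data out only once the threshold is reached."""
        if len(self._data) < self.threshold:
            return None
        block = bytes(self._data[: self.threshold])
        del self._data[: self.threshold]  # freed space is reused for new samples
        return block
```

Because `cpu_poll` frees the drained region immediately, a cache only modestly larger than the threshold can absorb a continuous stream, which is the cost argument made above.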

S3, when the voice preprocessing module receives a certain amount of digital voice signals transmitted by the CPU from the first cache SRAM 1, carrying out noise reduction, filtering, voice enhancement, sound source positioning and other processing on the signals to obtain clean voice signals, informing the CPU, and storing the clean voice signals into the second cache SRAM 2 by the CPU;

S4, after the clean voice signal has been stored, the direct memory access module simultaneously carries the clean voice signal from the second cache SRAM 2 to the voice endpoint detection module and the hardware calculation module in parallel.

The step is carried by the direct memory access module, so that the real-time performance and parallel work of processing are ensured, and the processing capacity requirement of a CPU is reduced.

S5, from the clean voice signal it receives, the voice endpoint detection module calculates and judges whether the clean voice signal is an effective voice signal and determines the starting endpoint and the ending endpoint of the effective voice signal; if the clean voice signal is an effective voice signal, the module notifies the hardware calculation module; if not, processing terminates and the module continues waiting;

S6, the hardware computing module receives the notification of the voice endpoint detection module and judges whether the clean voice signal sent by the direct memory access module is an effective voice signal. If the voice signal is an effective voice signal, acquiring starting and ending endpoints of the effective voice signal sent by the voice endpoint detection module, calculating to obtain voice acoustic characteristics and notifying a CPU; if the voice signal is not a valid voice signal, the data of the clean voice signal sent before is not processed and enters a state of waiting for the data of the next clean voice signal;

The voice acoustic features are typically calculated by removing invalid voice signal data, loading the Mel filter coefficients, and then performing the FFT (fast Fourier transform), mean and variance calculation, normalization, and floating-point to fixed-point quantization. Once the voice acoustic features have been obtained, the hardware calculation module notifies the CPU.
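The order of operations in this feature chain can be illustrated in software. The patent performs these steps in a dedicated hardware module, so the Python below is only a sketch under assumptions: the function names, the frame length, the filter count, the log-energy step and the int8 output format are not from the patent, and a direct DFT stands in for the FFT so that the example needs no external libraries.

```python
import cmath
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters: int, frame_len: int, sample_rate: int) -> List[List[float]]:
    """Triangular mel filters over the frame_len // 2 + 1 spectrum bins."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [int((frame_len + 1) * hz / sample_rate) for hz in edges_hz]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * (frame_len // 2 + 1)
        for k in range(lo, mid):  # rising edge of the triangle
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):  # falling edge of the triangle
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank


def power_spectrum(frame: List[float]) -> List[float]:
    """Direct DFT power spectrum (a stand-in for the FFT in this sketch)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]


def acoustic_features(samples: List[float], start: int, end: int, frame_len: int = 64,
                      num_filters: int = 8, sample_rate: int = 16000,
                      frac_bits: int = 4) -> List[List[int]]:
    valid = samples[start:end]  # remove data outside the detected endpoints
    bank = mel_filterbank(num_filters, frame_len, sample_rate)  # load filter coefficients
    feats = []
    for i in range(0, len(valid) - frame_len + 1, frame_len):
        spec = power_spectrum(valid[i:i + frame_len])
        # log filter energy (assumed step); the small constant avoids log(0)
        feats.append([math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10) for row in bank])
    if not feats:
        return []
    out = [[0] * num_filters for _ in feats]
    for j in range(num_filters):
        col = [f[j] for f in feats]
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col)) or 1.0
        for i, v in enumerate(col):
            q = round((v - mean) / std * (1 << frac_bits))  # float to fixed point
            out[i][j] = max(-128, min(127, q))              # saturate to int8
    return out
```

The saturation to the int8 range is a design choice of this sketch; the patent only states that floating-point values are quantized to fixed point.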

S7, the CPU stores the calculated result data into a neural network processing pre-cache; the direct memory access module sends the data to the neural network calculation module in parallel for calculation, and this parallel calculation ensures real-time voice signal processing;

Because the CPU stores this data, it can intervene to reconfigure the hardware calculation module whenever the calculation mode or parameters of that module need to be adjusted, which preserves flexibility in data calculation and processing.

After the calculation of the neural network calculation module is completed, the result is transmitted to the third cache SRAM 3 in parallel and in real time through the direct memory access module, and meanwhile, the CPU is notified to process.

When the neural network calculation module calculates, the acoustic model parameters stored in the FLASH memory are automatically read in real time through the direct memory access module, the acoustic model and the obtained effective voice signal result data are subjected to neural network calculation, the result is transmitted to the third cache SRAM 3 in parallel and in real time through the direct memory access module after the calculation is completed, and the CPU is notified to perform voice recognition processing.
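One way to picture reading acoustic model parameters from external FLASH during computation, rather than holding the whole model in internal RAM, is a layer evaluated one weight row at a time. The sketch below is illustrative only: the int8 row-major weight layout, the ReLU activation and the function name `dense_layer_from_flash` are assumptions not found in the patent, and a Python `bytes` object stands in for the FLASH memory.

```python
import struct
from typing import List


def dense_layer_from_flash(flash: bytes, offset: int, n_in: int, n_out: int,
                           x: List[int]) -> List[int]:
    """Evaluate one fully connected layer, fetching a single weight row at a time."""
    if len(x) != n_in:
        raise ValueError("input length does not match n_in")
    out = []
    for r in range(n_out):
        # Only this row of int8 weights is resident in working memory.
        row = struct.unpack_from(f"{n_in}b", flash, offset + r * n_in)
        acc = sum(w * v for w, v in zip(row, x))
        out.append(max(0, acc))  # ReLU activation (assumed)
    return out
```

For example, with `flash = struct.pack("6b", 1, 2, 3, -1, -2, -3)`, the call `dense_layer_from_flash(flash, 0, 3, 2, [1, 1, 1])` evaluates a 3-input, 2-output layer while keeping at most three weights in working memory at once.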

In the offline speech recognition method combining terminal hardware and algorithm software processing, the CPU is not used in steps S1, S4, S5 and S6, whereas it is used in steps S2, S3 and S7; during continuous voice data processing, while one process is executing steps S1, S4, S5 and S6, the idle CPU can handle steps S2, S3 and S7 of another process, as shown in fig. 1.
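The overlap between CPU steps and hardware steps can be illustrated with a small tick-based simulation. It is a sketch under simplifying assumptions that are not in the patent: every step takes one tick, block `b` of voice data arrives at tick `b`, the CPU is the only contended resource, and the hardware modules are always free when needed. The names `STEPS` and `simulate` are hypothetical.

```python
from typing import List, Tuple

# (step, resource): "cpu" steps need the CPU; "hw" steps run on dedicated modules.
STEPS = [("S1", "hw"), ("S2", "cpu"), ("S3", "cpu"), ("S4", "hw"),
         ("S5", "hw"), ("S6", "hw"), ("S7", "cpu")]


def simulate(num_blocks: int) -> List[List[Tuple[int, str]]]:
    """Return, for each tick, the (block, step) pairs that ran in that tick."""
    next_step = [0] * num_blocks
    timeline: List[List[Tuple[int, str]]] = []
    while any(s < len(STEPS) for s in next_step):
        tick = len(timeline)
        cpu_taken = False
        ran = []
        for block in range(num_blocks):
            s = next_step[block]
            if s == len(STEPS) or tick < block:  # finished, or not yet arrived
                continue
            name, resource = STEPS[s]
            if resource == "cpu":
                if cpu_taken:  # CPU is busy with an older block: wait a tick
                    continue
                cpu_taken = True
            ran.append((block, name))
            next_step[block] += 1
        timeline.append(ran)
    return timeline
```

With two blocks, the simulation finishes in 9 ticks instead of the 14 that strictly serial execution would need, and in tick 3 the CPU runs S2 of the second block while the first block is in the hardware step S4.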

An offline voice recognition chip capable of realizing the offline voice recognition method of the invention is shown in fig. 2, and comprises a CPU and a direct memory access module, wherein a first cache, a second cache and a third cache are connected between the direct memory access module and the CPU, the first cache is also connected with a voice preprocessing module, and the second cache is also connected with a voice endpoint detection module and a hardware calculation module; the hardware computing module is connected with a neural network pre-cache; the direct memory access module is also connected with a voice data acquisition module and a neural network calculation module, and the neural network calculation module is connected with a third cache and a neural network front cache; the neural network computing module and the CPU are provided with external memory connection ports; the direct memory access module, the voice preprocessing module, the voice endpoint detection module, the hardware calculation module, the voice data acquisition module and the neural network calculation module are all hardware modules.

Because a CPU is based on an instruction architecture, it is highly versatile, but for dedicated tasks an instruction-architecture CPU has no advantage in area or performance. In neural-network voice recognition, if the CPU performs dedicated operations such as preprocessing, endpoint detection, hardware calculation and neural network calculation, the CPU must be very large and generally needs a main frequency of 1 GHz or more. Such a high-performance CPU is costly, which hinders the adoption of voice modules, and it consumes considerable power. Many offline voice recognition devices are battery-powered, such as handheld devices, and are sensitive to power consumption, and a high-frequency CPU also works against the low-cost goal of a voice recognition chip.

The invention uses purpose-designed hardware ASIC modules that process in parallel with the CPU, achieving multi-core parallel operation, reducing power consumption and cost, and striking a better balance among cost, power consumption and operating speed. The caches may be implemented in a single cache device as different memory partitions accessed through different data read-write channels, serving respectively as the first cache, the second cache, the third cache and the neural network front cache.
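The idea of carving one cache device into several named partitions can be sketched as follows. This is an illustration only: the partition names and sizes are hypothetical, a `bytearray` stands in for the cache device, and `memoryview` slices stand in for the separate read-write channels.

```python
from typing import Dict


def partition(device: bytearray, sizes: Dict[str, int]) -> Dict[str, memoryview]:
    """Split one backing buffer into consecutive, non-overlapping named views."""
    if sum(sizes.values()) > len(device):
        raise ValueError("partitions exceed the cache device")
    whole = memoryview(device)
    views: Dict[str, memoryview] = {}
    offset = 0
    for name, size in sizes.items():
        views[name] = whole[offset:offset + size]  # a view, not a copy
        offset += size
    return views
```

Writing through one view changes only that region of the shared device, so the four caches can share one physical memory without interfering with one another.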

In one embodiment, the CPU first reads the wake-up word language model and the command word language model from the external FLASH memory, and stores them in the third cache SRAM 3.

When the CPU receives the processing notification from the neural network computing module in step S7, the CPU reads a calculation result T from the third cache and, at the same time, judges whether the current equipment is in an awake state. If it is not, the CPU judges whether the calculation result T is a wake-up word; if it is not a wake-up word, the wake-up is invalid, the equipment remains not awake, and the CPU continues to wait for a wake-up operation;

if the calculation result T is a wake-up word, continuously judging whether the corresponding confidence coefficient exceeds a threshold set by the wake-up word, if not, judging that the false recognition is possible, considering wake-up is invalid, and continuously waiting for wake-up operation; if the corresponding confidence coefficient exceeds the threshold set by the wake-up word, judging that the wake-up word is valid, and waking up the equipment.

The CPU reads from the third cache SRAM 3 the calculation result T produced by the neural network calculation module, combines it with the wake-up word language model and the command word language model, and performs Viterbi decoding and other calculations to obtain the text and the confidence corresponding to the effective voice signal. If the equipment has not been awakened, only the wake-up word language model is loaded and stored in the third cache SRAM 3. The wake-up word language model is usually smaller than the command word language model, and only wake-up words are recognized, which reduces the computation and storage load on the CPU and the neural network calculation module and lowers system power consumption; it also avoids false triggering of the equipment. Because waking up is a common operation, loading only the wake-up word model improves wake-up recognition efficiency and shortens wake-up time, improving the user experience.

When the equipment is in an awake state and command words are to be recognized, the command word language model is then loaded for recognition. The specific process can be as follows:

When the CPU receives the processing notification from the neural network computing module in step S7 and judges that the current equipment is in an awake state, the CPU goes on to judge the command word and the confidence corresponding to the calculation result T read from the third cache. It first judges whether the text has a corresponding command; if not, the command word is judged invalid and the CPU continues to wait for the next command operation. If the text has a corresponding command, the CPU judges whether the confidence corresponding to the text exceeds the threshold set for the corresponding command word; if not, the text is judged to be misrecognized and the CPU continues to wait for the next command operation. If the confidence exceeds the threshold set for the command word, the command word is judged valid and the operation corresponding to the command word is executed.
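The wake-up and command decisions described above amount to a two-state check on the decoded text and its confidence. The following Python sketch models that flow under assumptions: the class name `WakeCommandJudge`, the model names returned by `language_model`, the returned status strings and the example thresholds are all hypothetical, and decoding itself is outside the sketch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class WakeCommandJudge:
    """Toy model of the CPU-side decision flow; every name here is hypothetical."""
    wake_thresholds: Dict[str, float]     # wake-up word text -> confidence threshold
    command_thresholds: Dict[str, float]  # command word text -> confidence threshold
    awake: bool = False

    def language_model(self) -> str:
        """Language model to load for the current state (assumed naming)."""
        return "command_word_lm" if self.awake else "wake_word_lm"

    def handle(self, text: str, confidence: float) -> str:
        table = self.command_thresholds if self.awake else self.wake_thresholds
        threshold = table.get(text)
        if threshold is None:            # not a wake-up word / no matching command
            return "ignored: no match"
        if confidence <= threshold:      # must strictly exceed the threshold
            return "ignored: low confidence"
        if not self.awake:
            self.awake = True            # valid wake-up word: wake the equipment
            return "woken"
        return "execute: " + text        # valid command word
```

A judge built as `WakeCommandJudge({"hello air conditioner": 0.8}, {"25 degrees": 0.7})` ignores "25 degrees" while the equipment is not awake and executes it after a valid wake-up word has been recognized. The wake-up phrase here is invented for the illustration.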

Example 1

The operation is described below for an air conditioner, taking the offline speech recognition chip shown in fig. 2 as an example:

Assuming that the air conditioner is not awakened at the initial stage, the CPU detects that the equipment is not in an awakened state, and only the awakening word model is loaded into a third cache;

The user issues the voice command "25 degrees", and the system performs the following operations:

S1, the microphone captures the analog voice signal of the "25 degrees" command word in real time and sends it to the voice data acquisition module inside the offline voice recognition chip; the voice data acquisition module converts the analog voice signal into a digital voice signal and sends it to the first cache through the direct memory access module in the chip;

S5, the voice endpoint detection module calculates and judges whether the clean voice signal is an effective voice signal and determines its starting endpoint and ending endpoint; finding that it is an effective voice signal, the module notifies the hardware calculation module, sends the starting endpoint and ending endpoint information, and the method proceeds to S6;

S6, according to the notification from the voice endpoint detection module, the hardware computing module judges that the clean voice signal "25 degrees" sent by the direct memory access module is an effective voice signal, and obtains the starting endpoint and the ending endpoint of the effective voice signal sent by the voice endpoint detection module; the hardware calculation module calculates the voice acoustic features and notifies the CPU, and the method proceeds to S7;

The speech recognition process is as follows:

S71, the CPU detects that the equipment is not in an awake state, and only the wake-up word model is loaded into the third cache;

S72, after the CPU receives the processing notification from the neural network computing module in step S7, it reads a calculation result T from the third cache;

S73, the CPU finds that the current equipment is not in the awake state and judges that the calculation result T is not a wake-up word; the wake-up is therefore invalid, the equipment remains not awake, and the CPU continues to wait for a wake-up operation.

Example 2

The difference from Example 1 is that the air conditioner is already in the awake state when the user issues the "25 degrees" command word;

Step S73 is skipped and step S74 is entered directly; the CPU loads the command word language model into the third cache;

S74, the current equipment is in an awake state, and the CPU continues to judge the text and the confidence corresponding to the calculation result T read from the third cache;

The CPU first judges whether the text has a corresponding command and finds that the text "25 degrees" does; it then judges whether the confidence corresponding to the text exceeds the threshold set for the corresponding command word. If not, the text is judged to be misrecognized, the command word is invalid, and the CPU continues to wait for the next command operation; if the confidence exceeds the threshold set for the command word, the command word is judged valid, the corresponding operation is executed, and the air conditioner temperature is adjusted to 25 degrees.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing describes preferred embodiments of the present invention. Unless they clearly contradict one another or one is premised on another, the preferred embodiments may be combined in any way. The embodiments and the specific parameters in them serve only to describe clearly the inventor's process of verifying the invention and are not intended to limit its scope of protection, which remains defined by the claims; all equivalent structural changes made using the specification and drawings of the present invention fall within the scope of protection of the invention.

Claims

1. The off-line voice recognition method combining terminal hardware and algorithm software processing is characterized by comprising the following steps:

the specific process of CPU processing identification is as follows:

If the text is judged to have a corresponding command, continuing to judge whether the confidence coefficient corresponding to the text exceeds a threshold set by a corresponding command word, if not, judging that the text is misidentified, and continuing to wait for the next command operation; if the corresponding confidence coefficient exceeds the threshold set by the command word, judging that the command word is effective, and executing the operation corresponding to the command word;

2. The offline speech recognition method of claim 1, wherein in S3, the processing includes performing noise reduction, filtering, speech enhancement, and sound source localization.

3. The offline speech recognition method of claim 1, wherein the speech acoustic feature calculation is performed sequentially with invalid speech signal data removal, mel filter coefficient loading, FFT calculation, mean variance calculation, normalization calculation, and floating point to fixed point quantization.

4. The offline speech recognition method of claim 1, wherein the CPU only loads the wake word language model when the CPU determines that the current device is not in the wake state.

5. An offline speech recognition chip for implementing the offline speech recognition method according to claim 1, comprising a CPU and a direct memory access module, wherein a first buffer, a second buffer and a third buffer are connected between the direct memory access module and the CPU, the first buffer is further connected with a speech preprocessing module, and the second buffer is further connected with a speech endpoint detection module and a hardware calculation module; the hardware computing module is connected with a neural network pre-cache; the direct memory access module is also connected with a voice data acquisition module and a neural network calculation module, and the neural network calculation module is connected with a third cache and a neural network front cache; the neural network computing module and the CPU are provided with external memory connection ports; the direct memory access module, the voice preprocessing module, the voice endpoint detection module, the hardware calculation module, the voice data acquisition module and the neural network calculation module are all hardware modules.

6. The offline speech recognition chip of claim 5, wherein the hardware module is implemented as an ASIC.

7. The offline speech recognition chip of claim 5, wherein each buffer is implemented by the same buffer device implementing different memory partitions through different data read/write channels of the direct memory access module.