CN114944155A - Offline voice recognition method combining terminal hardware and algorithm software processing - Google Patents

Offline voice recognition method combining terminal hardware and algorithm software processing

Info

Publication number
CN114944155A
Authority
CN
China
Prior art keywords
voice
module
cache
cpu
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110186016.6A
Other languages
Chinese (zh)
Other versions
CN114944155B (en)
Inventor
许兵 (Xu Bing)
高君效 (Gao Junxiao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chipintelli Technology Co Ltd
Original Assignee
Chipintelli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chipintelli Technology Co Ltd filed Critical Chipintelli Technology Co Ltd
Priority to CN202110186016.6A priority Critical patent/CN114944155B/en
Publication of CN114944155A publication Critical patent/CN114944155A/en
Application granted granted Critical
Publication of CN114944155B publication Critical patent/CN114944155B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Neurology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

An offline speech recognition method and chip combining terminal hardware and algorithm software processing comprise the following steps: S1, a microphone captures an external analog voice signal in real time; S2, the data in the first cache are carried to a voice preprocessing module; S3, a clean voice signal is obtained and stored; S4, the direct memory access module sends the clean voice signal to the voice endpoint detection module and the hardware calculation module at the same time; S5, the voice endpoint detection module calculates and judges whether the clean voice signal is an effective voice signal; S6, the voice acoustic features are calculated; and S7, the neural network calculation module processes the voice acoustic features and the CPU performs voice recognition. The CPU in the chip and the hardware calculation modules are connected for effective parallel processing; carrying data in parallel lowers the required CPU processing capability and reduces the chip cost.

Description

Offline voice recognition method combining terminal hardware and algorithm software processing
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to an offline voice recognition method combining terminal hardware and algorithm software processing.
Background
Speech recognition technology has been developing for many years. In particular, as neural network techniques have matured in recent years, a large share of speech recognition systems have adopted neural networks, which has improved recognition accuracy and gradually brought speech recognition into genuine commercial use. The mainstream approach is cloud-based speech recognition, as used by smart speakers and by voice assistants on smartphones: speech is collected at the terminal and uploaded to a server, the server hardware runs the speech recognition algorithms, and the result is fed back to the terminal.
Cloud speech recognition solves the computing-power problem of speech processing and can achieve a good recognition effect, but it also carries the risk of leaking private voice data, depends on the network, and has limited real-time performance, so it is not fully suitable for applications such as device control. The industry therefore also needs offline speech recognition solutions. In offline speech recognition, cloud hardware resources cannot be called upon and the processing capability of the terminal hardware is limited, while terminal products have demanding overall requirements on cost and performance, including response time and decision accuracy. How to use the limited hardware processing resources, combined with well-designed algorithm software, to achieve high cost-effectiveness, high real-time performance and a high recognition rate is a challenging technical problem.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention discloses an off-line speech recognition method combining terminal hardware and algorithm software processing.
The invention relates to an off-line voice recognition method combining terminal hardware and algorithm software processing, which is characterized by comprising the following steps:
S1, a microphone captures an external analog voice signal in real time and sends it to a voice data acquisition module in an offline voice recognition chip; the voice data acquisition module converts the analog voice signal into a digital voice signal and sends it to a first cache through a direct memory access module in the chip;
S2, the CPU monitors the data volume in the first cache; when the data in the first cache accumulate to a preset threshold, the CPU carries the data from the first cache to a voice preprocessing module;
S3, when the voice preprocessing module receives the digital voice signal carried by the CPU from the first cache, it processes the signal to obtain a clean voice signal and notifies the CPU, and the CPU stores the clean voice signal into a second cache;
S4, the direct memory access module sends the clean voice signal in the second cache to the voice endpoint detection module and the hardware calculation module at the same time;
S5, the voice endpoint detection module calculates and judges whether the clean voice signal is an effective voice signal, together with the start and end endpoints of the effective voice signal; if so, the voice endpoint detection module notifies the hardware calculation module, sends the start and end endpoint information, and the method proceeds to step S6; if not, this round of processing ends and the module continues to wait for the next round;
S6, according to the notification of the voice endpoint detection module, the hardware calculation module judges whether the clean voice signal sent by the direct memory access module is an effective voice signal; if it is an effective voice signal, the hardware calculation module acquires the start and end endpoints of the effective voice signal sent by the voice endpoint detection module, calculates the voice acoustic features, notifies the CPU, and the method proceeds to step S7; if it is not an effective voice signal, the previously sent clean voice signal data are not processed and the module waits for the next batch of clean voice signal data;
S7, the CPU stores the voice acoustic feature result data calculated by the hardware calculation module into a neural network processing front cache; the direct memory access module sends the voice acoustic feature results to the neural network calculation module in parallel, and the neural network calculation module reads in real time the acoustic model parameters stored in a FLASH memory external to the chip and performs neural network calculation on the voice acoustic feature result data;
the calculated result data are transferred in real time and in parallel to the third cache SRAM 3 through the direct memory access module, and at the same time the CPU is notified to perform voice recognition processing.
Preferably, in S3, the processing includes noise reduction, filtering, speech enhancement and sound source localization.
Preferably, the voice acoustic feature calculation includes, in sequence, removing invalid voice signal data, loading the Mel filter coefficients, and performing FFT calculation, mean-variance calculation, normalization calculation and floating-point-to-fixed-point quantization.
Preferably, the specific flow of the CPU recognition processing in step S7 is:
S71, the CPU reads the wake-up word language model and the command word language model from the external FLASH memory and stores them into the third cache SRAM 3;
S72, after receiving the voice recognition processing notification from the neural network calculation module in step S7, the CPU reads a calculation result T from the third cache and at the same time judges whether the device is currently in the wake-up state;
S73, if the device is not currently in the wake-up state, the CPU further judges whether the calculation result T is the wake-up word; if not, the wake-up is invalid, the device stays asleep and cannot be woken, and the CPU continues to wait for a wake-up operation;
if the calculation result T is the wake-up word, the CPU further judges whether the corresponding confidence exceeds the threshold set for the wake-up word; if not, the result is treated as a possible misrecognition, the wake-up is considered invalid, and the CPU continues to wait for a wake-up operation; if the corresponding confidence exceeds the threshold set for the wake-up word, the wake-up word is judged valid and the device is woken;
S74, if the device is currently in the wake-up state, the CPU further examines the text and the confidence corresponding to the calculation result T read from the third cache;
it first judges whether the text has a corresponding command; if not, the command word is invalid and the CPU continues to wait for the next command operation;
if the text does have a corresponding command, the CPU further checks the confidence corresponding to the text and judges whether it is higher than the threshold set for that command word; if not, the result is treated as a misrecognition, the command word is invalid, and the CPU continues to wait for the next command operation; if the corresponding confidence exceeds the threshold set for the command word, the command word is judged valid and the operation corresponding to the command word is executed.
Preferably, when the CPU determines that the device is not currently in the wake-up state, only the wake-up word language model is loaded.
The invention also discloses an offline voice recognition chip, which comprises a CPU and a direct memory access module; a first cache, a second cache and a third cache are connected between the direct memory access module and the CPU; the first cache is further connected with a voice preprocessing module, and the second cache is further connected with a voice endpoint detection module and a hardware calculation module; the hardware calculation module is connected with a neural network front cache; the direct memory access module is further connected with a voice data acquisition module and a neural network calculation module, and the neural network calculation module is connected with the third cache and the neural network front cache; the neural network calculation module and the CPU are provided with external memory connection ports; the direct memory access module, the voice preprocessing module, the voice endpoint detection module, the hardware calculation module, the voice data acquisition module and the neural network calculation module are all hardware modules.
Preferably, the hardware modules are implemented as ASICs.
Preferably, the caches are implemented as different storage partitions of the same cache device, accessed through different data read/write channels of the direct memory access module.
The offline speech recognition method combining terminal hardware and algorithm software processing of the invention has the following advantages:
The CPU and the hardware calculation modules in the chip are connected for effective parallel processing; carrying data in parallel lowers the required CPU processing capability and reduces the chip cost.
The parallel operation of the hardware calculation modules and the CPU lets the CPU and the hardware modules of the chip run independently, increases processing speed, ensures that no voice data are missed during real-time voice recognition, guarantees real-time recognition, and improves the recognition effect.
The chip can read the acoustic model automatically from the external FLASH memory, so no internal RAM is needed to hold the acoustic model, which greatly saves memory space.
Drawings
FIG. 1 is a flow chart of an embodiment of the offline speech recognition method of the invention;
FIG. 2 is a schematic diagram of an embodiment of the offline speech recognition chip of the invention; FIG. 2 does not show all the connection lines between the modules.
Detailed Description
The following provides a more detailed description of embodiments of the present invention.
The offline speech recognition method combining terminal hardware and algorithm software processing of the invention, as shown in FIG. 1, comprises the following steps:
S1, a microphone captures an external analog voice signal in real time and sends it to a voice data acquisition module in an offline voice recognition chip; the voice data acquisition module converts the analog voice signal into a digital voice signal and sends it directly, in real time, to a first cache through a direct memory access (DMA) module in the chip;
S2, the CPU monitors the data volume of the first cache; when the data in the first cache SRAM 1 accumulate to a preset threshold, the CPU carries the relevant data from the first cache SRAM 1 to the voice preprocessing module. The threshold is configurable and is generally determined by the size of the cache and by the data processing capability of the CPU; for example, the first cache may be set to 1 MB and the threshold to 512 KB.
After the data in the first cache have been carried away, the freed locations of the first cache SRAM 1 can store the real-time digital voice signal that the voice data acquisition module keeps sending, so a relatively small cache SRAM 1 can store the real-time data continuously, which reduces cost.
In step S1 the voice data acquisition module transfers the real-time digital voice signal through the direct memory access module without involving the CPU, which facilitates parallel work and reduces the demand on CPU processing capability. In step S2 the CPU decides when the data in the first cache SRAM 1 can be fetched for processing, which also provides processing flexibility to meet different speech recognition requirements.
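As a concrete illustration of the S1/S2 hand-off described above, the following C sketch shows how firmware might poll the first cache's fill level against a configurable threshold and then hand the accumulated samples to the preprocessing module. Only the 1 MB cache and 512 KB threshold follow the example in the text; the register names and addresses (SRAM1_BASE, DMA_SRAM1_FILL_LEVEL, PREPROC_INPUT_FIFO) are hypothetical placeholders, not taken from the patent.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical memory map; only the 1 MB cache / 512 KB threshold follow the text. */
#define SRAM1_BASE            ((volatile int16_t *)0x20000000u)
#define SRAM1_SIZE_SAMPLES    (512u * 1024u)            /* 1 MB of 16-bit PCM samples  */
#define SRAM1_THRESHOLD       (SRAM1_SIZE_SAMPLES / 2u) /* trigger at ~512 KB of data  */

#define DMA_SRAM1_FILL_LEVEL  (*(volatile uint32_t *)0x40001000u) /* samples written    */
#define PREPROC_INPUT_FIFO    (*(volatile uint32_t *)0x40002000u) /* preprocessing port */

/* CPU-side step S2: move the accumulated raw samples to the voice preprocessing module. */
static void poll_first_cache_and_forward(void)
{
    static size_t read_pos = 0;             /* CPU read pointer into the ring buffer    */
    uint32_t filled = DMA_SRAM1_FILL_LEVEL; /* DMA keeps writing in the background (S1) */

    if (filled < SRAM1_THRESHOLD)
        return;                             /* not enough data accumulated yet          */

    /* Copy one threshold-sized block to the preprocessing module, freeing the space.   */
    for (uint32_t i = 0; i < SRAM1_THRESHOLD; i++)
        PREPROC_INPUT_FIFO = (uint32_t)(uint16_t)SRAM1_BASE[(read_pos + i) % SRAM1_SIZE_SAMPLES];

    read_pos = (read_pos + SRAM1_THRESHOLD) % SRAM1_SIZE_SAMPLES;
}
```

Because the DMA keeps refilling the freed region while the CPU copies, a relatively small first cache can sustain continuous real-time capture, which is the cost saving described above.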
S3, when the voice preprocessing module has received a certain amount of digital voice signal carried over by the CPU from the first cache SRAM 1, it applies noise reduction, filtering, speech enhancement, sound source localization and similar processing to obtain a clean voice signal, and notifies the CPU to store the clean voice signal into the second cache SRAM 2;
S4, after the CPU has finished storing the signal, the direct memory access module carries the clean voice signal from the second cache SRAM 2 to the voice endpoint detection module and the hardware calculation module simultaneously, in parallel.
Because this step is carried out by the direct memory access module, real-time behaviour and parallel operation are preserved and the demand on CPU processing capability is reduced.
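A minimal sketch of the S4 fan-out, assuming a descriptor-based DMA controller with one channel per destination; the descriptor layout, port addresses and the dma_start() helper are assumptions made for illustration. The point is only that the same clean-speech block reaches both the endpoint-detection module and the hardware calculation module without the CPU copying it.

```c
#include <stdint.h>

/* Hypothetical DMA descriptor; a real controller defines its own layout. */
typedef struct {
    const void    *src;   /* source address (second cache, SRAM 2)      */
    volatile void *dst;   /* destination peripheral port                */
    uint32_t       len;   /* transfer length in bytes                   */
} dma_desc_t;

#define VAD_INPUT_PORT   ((volatile void *)0x40003000u) /* endpoint detection module   */
#define FEAT_INPUT_PORT  ((volatile void *)0x40004000u) /* hardware calculation module */

extern void dma_start(int channel, const dma_desc_t *d); /* assumed driver entry point */

/* Step S4: one clean-speech block, two destinations, two independent DMA channels.    */
static void fan_out_clean_speech(const int16_t *sram2_block, uint32_t bytes)
{
    dma_desc_t to_vad  = { sram2_block, VAD_INPUT_PORT,  bytes };
    dma_desc_t to_feat = { sram2_block, FEAT_INPUT_PORT, bytes };

    dma_start(0, &to_vad);   /* both transfers run concurrently with the CPU,          */
    dma_start(1, &to_feat);  /* which is free to work on the next S2/S3 block          */
}
```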
S5, for the clean voice signal it has received, the voice endpoint detection module calculates and judges whether it is an effective voice signal and determines the start and end endpoints of the effective voice signal; if it is an effective voice signal, the voice endpoint detection module notifies the hardware calculation module; if not, this round of processing ends and the module continues to wait;
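The patent does not prescribe a particular endpoint-detection algorithm. Purely as an assumed illustration of how a start and an end endpoint could be produced in step S5, the sketch below uses a common short-term-energy detector with a hangover counter; the frame length, threshold and hangover values are placeholders.

```c
#include <stdint.h>

#define FRAME_LEN      256        /* samples per analysis frame (assumed)              */
#define ENERGY_THRESH  1000000LL  /* tuned empirically in a real system                */
#define HANGOVER       10         /* quiet frames tolerated before closing the segment */

/* Returns 1 and fills *start / *end (frame indices) when an effective voice signal
 * with both endpoints is found in the buffer; returns 0 otherwise.                    */
static int detect_endpoints(const int16_t *x, int n_frames, int *start, int *end)
{
    int in_speech = 0, quiet = 0;
    *start = *end = -1;

    for (int f = 0; f < n_frames; f++) {
        int64_t energy = 0;
        for (int i = 0; i < FRAME_LEN; i++) {
            int64_t s = x[f * FRAME_LEN + i];
            energy += s * s;                /* short-term frame energy                  */
        }
        if (energy > ENERGY_THRESH) {
            if (!in_speech) { in_speech = 1; *start = f; }
            quiet = 0;
            *end = f;
        } else if (in_speech && ++quiet > HANGOVER) {
            return 1;                       /* speech segment closed: start/end valid   */
        }
    }
    return (in_speech && *end >= *start);
}
```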
S6, the hardware calculation module receives the notification from the voice endpoint detection module and judges whether the clean voice signal sent by the direct memory access module is an effective voice signal. If it is, the hardware calculation module acquires the start and end endpoints of the effective voice signal sent by the voice endpoint detection module, calculates the voice acoustic features and notifies the CPU; if it is not, the previously sent clean voice signal data are not processed and the module waits for the next batch of clean voice signal data.
The voice acoustic features are calculated by removing the invalid voice signal data, loading the Mel filter coefficients, and then performing FFT calculation, mean-variance calculation, normalization calculation and floating-point-to-fixed-point quantization; the hardware calculation module notifies the CPU when the calculation is finished.
S7, the CPU stores the calculated result data into the neural network processing front cache; the direct memory access module sends the data in parallel to the neural network calculation module for computation, and this parallel computation preserves the real-time behaviour of the voice signal processing.
While the CPU is storing the data, it can also reconfigure the hardware calculation module if the calculation mode or parameters need to be adjusted, which keeps the data processing flexible.
After the neural network calculation module finishes, the result is transferred in real time and in parallel to the third cache SRAM 3 through the direct memory access module, and at the same time the CPU is notified to process the result.
During its computation the neural network calculation module automatically reads, in real time and through the direct memory access module, the acoustic model parameters stored in the FLASH memory, and performs neural network calculation on the acoustic model and the effective voice signal result data; after the calculation is finished, the result is transferred in real time and in parallel to the third cache SRAM 3 through the direct memory access module, and at the same time the CPU is notified to perform voice recognition processing.
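A minimal sketch, under assumed interfaces, of how the weight streaming could look on the neural-network side: each layer's parameters are fetched from external FLASH by DMA into a small staging buffer just before they are used, so no full copy of the acoustic model is ever held in RAM. flash_dma_read(), the layer descriptor table and the int8 weight format are assumptions for illustration only.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_LAYER_PARAMS (64 * 1024)   /* staging buffer for one layer (assumed)       */
#define MAX_WIDTH        1024          /* maximum layer width (assumed)                */

typedef struct {
    uint32_t flash_offset;  /* where this layer's weights live in external FLASH       */
    uint32_t n_in, n_out;   /* layer dimensions                                        */
} layer_desc_t;

extern void flash_dma_read(uint32_t offset, void *dst, size_t bytes); /* assumed driver */

static int8_t  weight_buf[MAX_LAYER_PARAMS];        /* reused for every layer          */
static int32_t act_a[MAX_WIDTH], act_b[MAX_WIDTH];  /* ping-pong activation buffers    */

/* Run a small fully connected acoustic model, streaming weights layer by layer.       */
static int32_t *run_acoustic_model(const layer_desc_t *layers, int n_layers,
                                   const int16_t *features, uint32_t n_feat)
{
    int32_t *in = act_a, *out = act_b;
    for (uint32_t i = 0; i < n_feat; i++)
        in[i] = features[i];

    for (int l = 0; l < n_layers; l++) {
        const layer_desc_t *d = &layers[l];
        flash_dma_read(d->flash_offset, weight_buf,
                       (size_t)d->n_in * d->n_out);       /* stream weights from FLASH */

        for (uint32_t o = 0; o < d->n_out; o++) {         /* matrix-vector product     */
            int32_t acc = 0;
            for (uint32_t k = 0; k < d->n_in; k++)
                acc += (int32_t)weight_buf[o * d->n_in + k] * in[k];
            out[o] = acc > 0 ? acc : 0;                    /* ReLU                      */
        }
        int32_t *tmp = in; in = out; out = tmp;            /* swap ping-pong buffers    */
    }
    return in;   /* activations of the last layer (acoustic posterior scores)          */
}
```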
In the offline voice recognition method combining terminal hardware and algorithm software, the CPU is not used in steps S1, S4, S5 and S6, while it is used in steps S2, S3 and S7; during continuous voice data processing, while one block of data is going through steps S1, S4, S5 and S6, the otherwise idle CPU can handle steps S2, S3 and S7 of another block, as shown in FIG. 1.
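The overlap described above can be pictured as a two-slot software pipeline: while block N occupies the hardware-only stages (S1, S4-S6), the CPU services stages S2, S3 and S7 for neighbouring blocks. The scheduling loop below is schematic, with assumed helper functions, and is not the chip's actual firmware.

```c
/* Schematic main loop: hardware stages and CPU stages overlap on different blocks.    */
extern int  hw_stages_done(int block);   /* have S1/S4/S5/S6 finished for this block?  */
extern void cpu_stages_s2_s3(int block); /* cache hand-off + preprocessing dispatch    */
extern void cpu_stage_s7(int block);     /* feature hand-off + recognition processing  */

static void scheduler_loop(void)
{
    int hw_block  = 0;   /* block currently flowing through the hardware modules       */
    int cpu_block = 1;   /* block the CPU is preparing in the meantime                 */

    for (;;) {
        cpu_stages_s2_s3(cpu_block);      /* CPU work proceeds while ...               */
        if (hw_stages_done(hw_block)) {   /* ... the hardware crunches hw_block        */
            cpu_stage_s7(hw_block);       /* finish recognition for hw_block           */
            hw_block  = cpu_block;        /* rotate the pipeline                       */
            cpu_block = hw_block + 1;
        }
    }
}
```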
As shown in FIG. 2, an offline speech recognition chip capable of implementing the offline speech recognition method of the invention comprises a CPU and a direct memory access module; a first cache, a second cache and a third cache are connected between the direct memory access module and the CPU; the first cache is further connected with a voice preprocessing module, and the second cache is further connected with a voice endpoint detection module and a hardware calculation module; the hardware calculation module is connected with a neural network front cache; the direct memory access module is further connected with a voice data acquisition module and a neural network calculation module, and the neural network calculation module is connected with the third cache and the neural network front cache; the neural network calculation module and the CPU are provided with external memory connection ports; the direct memory access module, the voice preprocessing module, the voice endpoint detection module, the hardware calculation module, the voice data acquisition module and the neural network calculation module are all hardware modules.
Thanks to its instruction-based architecture, a CPU offers good generality, but for specialized tasks a CPU has no advantage in area or performance. In neural-network speech recognition, if the CPU were used for specialized operations such as preprocessing, endpoint detection, feature calculation and neural network calculation, a very large CPU would be needed, usually clocked above 1 GHz. Such a high-performance CPU is expensive, which hinders the spread of voice modules, and its power consumption is high; many offline speech recognition devices, such as handheld devices, are battery powered and power sensitive, so a high-clock CPU works against the low-cost goal of a speech recognition chip.
The invention implements the fixed, specialized calculations required for speech recognition in dedicated hardware ASIC modules that run in parallel with the CPU, which lowers power consumption and cost, achieves multi-core parallel operation, and strikes a better balance between cost, power consumption and speed. The same cache device can provide different storage partitions through different data read/write channels, serving respectively as the first cache, the second cache, the third cache and the neural network front cache.
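One way to read the partitioning remark is as a simple base/size table over a single physical SRAM, with each region bound to its own DMA read/write channel. The layout and sizes below are an assumed illustration rather than the chip's actual memory map.

```c
#include <stdint.h>

/* Assumed: a single 2 MB on-chip SRAM carved into the four logical caches. */
#define SRAM_BASE 0x20000000u

typedef struct {
    const char *name;
    uint32_t    base;
    uint32_t    size;
    int         dma_channel;  /* DMA read/write channel bound to this partition */
} cache_partition_t;

static const cache_partition_t partitions[] = {
    { "SRAM1: raw digital speech",       SRAM_BASE + 0x000000u, 0x080000u, 0 },
    { "SRAM2: clean speech",             SRAM_BASE + 0x080000u, 0x080000u, 1 },
    { "SRAM3: NN results + models",      SRAM_BASE + 0x100000u, 0x080000u, 2 },
    { "NN front cache: acoustic feats",  SRAM_BASE + 0x180000u, 0x080000u, 3 },
};
```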
In one embodiment, the CPU first reads the wake-up word language model and the command word language model from the external FLASH memory and stores them into the third cache SRAM 3.
On receiving the processing notification from the neural network calculation module in step S7, the CPU reads the calculation result T from the third cache and at the same time judges whether the device is currently in the wake-up state; if it is not, the CPU judges whether the calculation result T is the wake-up word; if it is not, the wake-up is invalid, the device stays asleep and cannot be woken, and the CPU continues to wait for a wake-up operation;
if the calculation result T is the wake-up word, the CPU further judges whether the corresponding confidence exceeds the threshold set for the wake-up word; if not, the result is treated as a possible misrecognition, the wake-up is considered invalid, and the CPU continues to wait for a wake-up operation; if the corresponding confidence exceeds the threshold set for the wake-up word, the wake-up word is judged valid and the device is woken.
The CPU reads from the third cache SRAM 3 the calculation result T produced by the neural network calculation module and, combining the wake-up word language model and the command word language model, performs Viterbi decoding and related computation to obtain the text and the confidence corresponding to the effective voice signal. If the device has not been woken, only the wake-up word language model needs to be loaded into the third cache SRAM 3; the wake-up word language model is usually much smaller than the command word language model, and at that point only wake-up-type commands are recognized, which reduces the computation and storage load on the CPU and the neural network and lowers system power consumption, while also avoiding false triggering of the device. Since waking up is a frequent operation of a device, loading only the wake-up word model also improves wake-up recognition efficiency and wake-up latency, improving the user experience.
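The memory and power saving from loading only the wake-up word language model while the device is asleep can be sketched as follows; the FLASH offsets, model sizes and the flash_dma_read() helper are assumptions carried over from the earlier sketches.

```c
#include <stdint.h>
#include <stddef.h>

#define WAKE_MODEL_FLASH_OFF 0x00100000u     /* assumed external FLASH layout           */
#define CMD_MODEL_FLASH_OFF  0x00110000u
#define WAKE_MODEL_BYTES     (16u * 1024u)   /* wake-up word model: small               */
#define CMD_MODEL_BYTES      (256u * 1024u)  /* command word model: much larger         */

extern void flash_dma_read(uint32_t offset, void *dst, size_t bytes); /* assumed driver */

static uint8_t sram3_models[WAKE_MODEL_BYTES + CMD_MODEL_BYTES];      /* third cache    */

/* Load only what the current device state needs into the third cache (SRAM 3).         */
static size_t load_language_models(int device_awake)
{
    size_t used = 0;

    flash_dma_read(WAKE_MODEL_FLASH_OFF, sram3_models, WAKE_MODEL_BYTES);
    used += WAKE_MODEL_BYTES;

    if (device_awake) {   /* command words only matter once the device is awake         */
        flash_dma_read(CMD_MODEL_FLASH_OFF, sram3_models + used, CMD_MODEL_BYTES);
        used += CMD_MODEL_BYTES;
    }
    return used;          /* memory, computation and power shrink while asleep          */
}
```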
When the device is in the wake-up state and a command word is to be recognized, the command word language model is loaded for recognition. The specific flow can be as follows:
after receiving the processing notification from the neural network calculation module in step S7, the CPU determines that the device is already in the wake-up state; it then examines the command word and the confidence corresponding to the calculation result T read from the third cache. It first judges whether the text has a corresponding command; if not, the command word is invalid and the CPU waits for the next command operation. If the text does have a corresponding command, the CPU further checks the confidence corresponding to the text and judges whether it is higher than the threshold set for that command word; if not, the result is treated as a misrecognition, the command word is invalid, and the CPU waits for the next command operation; if the corresponding confidence exceeds the threshold set for the command word, the command word is judged valid and the operation corresponding to the command word is executed.
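Putting the two branches together, the decision logic the CPU runs on each decoding result T can be written as a small state machine keyed on the current wake-up state and on per-word confidence thresholds. The structure fields, the placeholder wake word and the threshold values are assumptions; the decoder output (text plus confidence) is taken as already available.

```c
#include <string.h>

typedef struct {
    const char *text;        /* decoded text for result T                        */
    float       confidence;  /* decoder confidence for that text                 */
} decode_result_t;

#define WAKE_WORD      "hello device"   /* placeholder wake-up word               */
#define WAKE_THRESHOLD 0.80f            /* placeholder thresholds                 */
#define CMD_THRESHOLD  0.70f

extern int  lookup_command(const char *text);   /* returns -1 if no command matches */
extern void execute_command(int cmd_id);
extern void enter_wake_state(void);

/* Steps S72-S74: one decision per neural-network result notification.              */
static void handle_result(const decode_result_t *t, int *device_awake)
{
    if (!*device_awake) {
        /* Not awake: only a sufficiently confident wake-up word may wake the device. */
        if (strcmp(t->text, WAKE_WORD) == 0 && t->confidence > WAKE_THRESHOLD) {
            enter_wake_state();
            *device_awake = 1;
        }
        return;   /* otherwise the wake-up is invalid; keep waiting                   */
    }

    /* Awake: the text must map to a command and clear that command's threshold.      */
    int cmd = lookup_command(t->text);
    if (cmd >= 0 && t->confidence > CMD_THRESHOLD)
        execute_command(cmd);
    /* else: treated as a misrecognition; wait for the next command                   */
}
```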
Specific example 1
The operation of an air conditioner by the offline voice recognition chip shown in FIG. 2 is described as an example:
assume the air conditioner has not been woken initially; the CPU detects that the device is not in the wake-up state and loads only the wake-up word model into the third cache;
the user issues the voice command "25 degrees", and the system performs the following operations:
S1, the microphone captures the analog voice signal of the command word "25 degrees" in real time and sends it to the voice data acquisition module in the offline voice recognition chip; the voice data acquisition module converts the analog voice signal into a digital voice signal and sends it to the first cache through the direct memory access module in the chip;
S2, the CPU monitors the data volume of the first cache; when the data in the first cache accumulate to the preset threshold, the CPU carries the data from the first cache to the voice preprocessing module;
S3, when the voice preprocessing module receives the digital voice signal carried by the CPU from the first cache, it processes the signal to obtain a clean voice signal and notifies the CPU, and the CPU stores the clean voice signal into the second cache;
S4, the direct memory access module sends the clean voice signal in the second cache to the voice endpoint detection module and the hardware calculation module at the same time;
S5, the voice endpoint detection module calculates and judges whether the clean voice signal is an effective voice signal and determines the start and end endpoints of the effective voice signal; having found that the clean voice signal is an effective voice signal, the voice endpoint detection module notifies the hardware calculation module, sends the start and end endpoint information, and the flow proceeds to step S6;
S6, according to the notification of the voice endpoint detection module, the hardware calculation module judges that the clean voice signal "25 degrees" sent by the direct memory access module is an effective voice signal and acquires the start and end endpoints of the effective voice signal sent by the voice endpoint detection module; the hardware calculation module calculates the voice acoustic features, notifies the CPU, and the flow proceeds to step S7;
S7, the CPU stores the voice acoustic feature result data calculated by the hardware calculation module into the neural network processing front cache; the direct memory access module sends the voice acoustic feature results to the neural network calculation module in parallel, and the neural network calculation module reads in real time the acoustic model parameters stored in the FLASH memory external to the chip and performs neural network calculation on the voice acoustic feature result data;
the calculated result data are transferred in real time and in parallel to the third cache SRAM 3 through the direct memory access module, and at the same time the CPU is notified to perform voice recognition processing.
The speech recognition processing then proceeds as follows:
S71, the CPU detects that the device is not in the wake-up state and loads only the wake-up word model into the third cache;
S72, after receiving the processing notification from the neural network calculation module in step S7, the CPU reads the calculation result T from the third cache;
S73, the CPU finds that the device is not currently in the wake-up state and judges that the calculation result T is not the wake-up word; the wake-up is therefore invalid, the device is not woken, and the CPU continues to wait for a wake-up operation.
Specific example 2
The difference from example 1 is that the air conditioner is already in the wake-up state when the user issues the command word "25 degrees";
step S73 is skipped and the flow goes directly to step S74, and the CPU loads the command word language model into the third cache;
S74, since the device is currently in the wake-up state, the CPU examines the text and the confidence corresponding to the calculation result T read from the third cache;
it first judges whether the text has a corresponding command, that is, whether the text "25 degrees" maps to a command; it then checks the confidence corresponding to the text against the threshold set for that command word; if the confidence did not exceed the threshold, the result would be treated as a misrecognition, the command word would be invalid, and the CPU would wait for the next command operation; since here the corresponding confidence exceeds the threshold set for the command word, the command word is judged valid, the operation corresponding to the command word is executed, and the air conditioner temperature is set to 25 degrees.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing describes preferred embodiments of the invention. Provided they are not clearly contradictory or mutually exclusive, the preferred embodiments may be combined with one another in any manner. The specific parameters in the embodiments and examples are only intended to illustrate the inventors' verification process clearly and are not intended to limit the scope of the invention, which is defined by the claims; equivalent structural changes made using the description and drawings of the invention are likewise intended to fall within the scope of the invention.

Claims (8)

1. An offline speech recognition method combining terminal hardware and algorithm software processing, characterized by comprising the following steps:
S1, a microphone captures an external analog voice signal in real time and sends it to a voice data acquisition module in an offline voice recognition chip; the voice data acquisition module converts the analog voice signal into a digital voice signal and sends it to a first cache through a direct memory access module in the chip;
S2, the CPU monitors the data volume in the first cache; when the data in the first cache accumulate to a preset threshold, the CPU carries the data from the first cache to a voice preprocessing module;
S3, when the voice preprocessing module receives the digital voice signal carried by the CPU from the first cache, it processes the signal to obtain a clean voice signal and notifies the CPU, and the CPU stores the clean voice signal into a second cache;
S4, the direct memory access module sends the clean voice signal in the second cache to the voice endpoint detection module and the hardware calculation module at the same time;
S5, the voice endpoint detection module calculates and judges whether the clean voice signal is an effective voice signal, together with the start and end endpoints of the effective voice signal; if so, the voice endpoint detection module notifies the hardware calculation module, sends the start and end endpoint information, and the method proceeds to step S6; if not, this round of processing ends and the module continues to wait for the next round;
S6, according to the notification of the voice endpoint detection module, the hardware calculation module judges whether the clean voice signal sent by the direct memory access module is an effective voice signal; if it is an effective voice signal, the hardware calculation module acquires the start and end endpoints of the effective voice signal sent by the voice endpoint detection module, calculates the voice acoustic features, notifies the CPU, and the method proceeds to step S7; if it is not an effective voice signal, the previously sent clean voice signal data are not processed and the module waits for the next batch of clean voice signal data;
S7, the CPU stores the voice acoustic feature result data calculated by the hardware calculation module into a neural network processing front cache; the direct memory access module sends the voice acoustic feature results to the neural network calculation module in parallel, and the neural network calculation module reads in real time the acoustic model parameters stored in a FLASH memory external to the chip and performs neural network calculation on the voice acoustic feature result data;
the calculated result data are transferred in real time and in parallel to the third cache SRAM 3 through the direct memory access module, and at the same time the CPU is notified to perform voice recognition processing.
2. The offline speech recognition method according to claim 1, wherein in S3 said processing includes noise reduction, filtering, speech enhancement and sound source localization.
3. The offline speech recognition method according to claim 1, wherein the speech acoustic feature calculation is performed by, in sequence, removing invalid speech signal data, loading the Mel filter coefficients, and performing FFT calculation, mean-variance calculation, normalization calculation and floating-point-to-fixed-point quantization.
4. The offline speech recognition method of claim 1, wherein the specific flow of the CPU recognition processing in step S7 is:
S71, the CPU reads the wake-up word language model and the command word language model from the external FLASH memory and stores them into the third cache SRAM 3;
S72, after receiving the voice recognition processing notification from the neural network calculation module in step S7, the CPU reads a calculation result T from the third cache and at the same time judges whether the device is currently in the wake-up state;
S73, if the device is not currently in the wake-up state, the CPU further judges whether the calculation result T is the wake-up word; if not, the wake-up is invalid, the device stays asleep and cannot be woken, and the CPU continues to wait for a wake-up operation;
if the calculation result T is the wake-up word, the CPU further judges whether the corresponding confidence exceeds the threshold set for the wake-up word; if not, the result is treated as a possible misrecognition, the wake-up is considered invalid, and the CPU continues to wait for a wake-up operation; if the corresponding confidence exceeds the threshold set for the wake-up word, the wake-up word is judged valid and the device is woken;
S74, if the device is currently in the wake-up state, the CPU further examines the text and the confidence corresponding to the calculation result T read from the third cache;
it first judges whether the text has a corresponding command; if not, the command word is invalid and the CPU continues to wait for the next command operation;
if the text does have a corresponding command, the CPU further checks the confidence corresponding to the text and judges whether it is higher than the threshold set for that command word; if not, the result is treated as a misrecognition, the command word is invalid, and the CPU continues to wait for the next command operation; if the corresponding confidence exceeds the threshold set for the command word, the command word is judged valid and the operation corresponding to the command word is executed.
5. The offline speech recognition method of claim 4, wherein the CPU loads only the wake-up word language model when it determines that the device is not currently in the wake-up state.
6. An offline voice recognition chip, characterized by comprising a CPU and a direct memory access module, wherein a first cache, a second cache and a third cache are connected between the direct memory access module and the CPU; the first cache is further connected with a voice preprocessing module, and the second cache is further connected with a voice endpoint detection module and a hardware calculation module; the hardware calculation module is connected with a neural network front cache; the direct memory access module is further connected with a voice data acquisition module and a neural network calculation module, and the neural network calculation module is connected with the third cache and the neural network front cache; the neural network calculation module and the CPU are provided with external memory connection ports; the direct memory access module, the voice preprocessing module, the voice endpoint detection module, the hardware calculation module, the voice data acquisition module and the neural network calculation module are all hardware modules.
7. The offline speech recognition chip of claim 6, wherein the hardware modules are implemented as ASICs.
8. The offline speech recognition chip of claim 6, wherein the caches are implemented as different storage partitions of the same cache device, accessed through different data read/write channels of the direct memory access module.
CN202110186016.6A 2021-02-14 2021-02-14 Off-line voice recognition method combining terminal hardware and algorithm software processing Active CN114944155B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110186016.6A CN114944155B (en) 2021-02-14 2021-02-14 Off-line voice recognition method combining terminal hardware and algorithm software processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110186016.6A CN114944155B (en) 2021-02-14 2021-02-14 Off-line voice recognition method combining terminal hardware and algorithm software processing

Publications (2)

Publication Number Publication Date
CN114944155A true CN114944155A (en) 2022-08-26
CN114944155B CN114944155B (en) 2024-06-04

Family

ID=82905770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110186016.6A Active CN114944155B (en) 2021-02-14 2021-02-14 Off-line voice recognition method combining terminal hardware and algorithm software processing

Country Status (1)

Country Link
CN (1) CN114944155B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001020597A1 (en) * 1999-09-15 2001-03-22 Conexant Systems, Inc. Automatic speech recognition to control integrated communication devices
CA2407930A1 (en) * 2000-06-23 2002-01-03 Minebea Co., Ltd. Input device for voice recognition and articulation using keystroke data.
CN101996627A (en) * 2009-08-21 2011-03-30 索尼公司 Speech processing apparatus, speech processing method and program
CN102903362A (en) * 2011-09-02 2013-01-30 微软公司 Integrated local and cloud based speech recognition
WO2013033481A1 (en) * 2011-09-02 2013-03-07 Microsoft Corporation Integrated local and cloud based speech recognition
US20140214416A1 (en) * 2013-01-30 2014-07-31 Tencent Technology (Shenzhen) Company Limited Method and system for recognizing speech commands
CN106611599A (en) * 2015-10-21 2017-05-03 展讯通信(上海)有限公司 Voice recognition method and device based on artificial neural network and electronic equipment
CN105976808A (en) * 2016-04-18 2016-09-28 成都启英泰伦科技有限公司 Intelligent speech recognition system and method
CN107622770A (en) * 2017-09-30 2018-01-23 百度在线网络技术(北京)有限公司 voice awakening method and device
CN109997154A (en) * 2017-10-30 2019-07-09 上海寒武纪信息科技有限公司 Information processing method and terminal device
CN111091819A (en) * 2018-10-08 2020-05-01 蔚来汽车有限公司 Voice recognition device and method, voice interaction system and method
CN109378000A (en) * 2018-12-19 2019-02-22 科大讯飞股份有限公司 Voice awakening method, device, system, equipment, server and storage medium
CN109961792A (en) * 2019-03-04 2019-07-02 百度在线网络技术(北京)有限公司 The method and apparatus of voice for identification
CN210575088U (en) * 2019-03-15 2020-05-19 上海华镇电子科技有限公司 Voice recognition household appliance control device
CN110297702A (en) * 2019-05-27 2019-10-01 北京蓦然认知科技有限公司 A kind of multi-task parallel treating method and apparatus
CN112102816A (en) * 2020-08-17 2020-12-18 北京百度网讯科技有限公司 Speech recognition method, apparatus, system, electronic device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GONDI S: "Performance evaluation of offline speech recognition on edge devices", ELECTRONICS, vol. 10, no. 21, 1 November 2021 (2021-11-01) *
柯家年: "语音识别在视频会议中的应用研究及实现", 中国优秀硕士学位论文全文数据库信息科技辑, no. 1, 15 January 2015 (2015-01-15) *

Also Published As

Publication number Publication date
CN114944155B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
EP3830716B1 (en) Storage edge controller with a metadata computational engine
CN110534099B (en) Voice wake-up processing method and device, storage medium and electronic equipment
US20200027462A1 (en) Voice control system, wakeup method and wakeup apparatus therefor, electrical appliance and co-processor
US10601599B2 (en) Voice command processing in low power devices
WO2019157888A1 (en) Control device, method and equipment for processor
WO2020038010A1 (en) Intelligent device, voice wake-up method, voice wake-up apparatus, and storage medium
US20180293974A1 (en) Spoken language understanding based on buffered keyword spotting and speech recognition
CN105976808B (en) Intelligent voice recognition system and method
US11822958B2 (en) Method and a device for data transmission between an internal memory of a system-on-chip and an external memory
CN110223687B (en) Instruction execution method and device, storage medium and electronic equipment
CN110570873A (en) voiceprint wake-up method and device, computer equipment and storage medium
CN111199733A (en) Multi-stage recognition voice awakening method and device, computer storage medium and equipment
CN111192590A (en) Voice wake-up method, device, equipment and storage medium
CN103543814A (en) Signal processing device and signal processing method
CN114724564A (en) Voice processing method, device and system
CN114944155A (en) Offline voice recognition method combining terminal hardware and algorithm software processing
CN111654782B (en) Intelligent sound box and signal processing method
CN111179924B (en) Method and system for optimizing awakening performance based on mode switching
CN111833874B (en) Man-machine interaction method, system, equipment and storage medium based on identifier
CN116705033A (en) System on chip for wireless intelligent audio equipment and wireless processing method
CN115881124A (en) Voice wake-up recognition method, device and storage medium
CN112579005B (en) Method, device, computer equipment and storage medium for reducing average power consumption of SSD
CN110430508B (en) Microphone noise reduction processing method and computer storage medium
CN114299998A (en) Voice signal processing method and device, electronic equipment and storage medium
EP3846162A1 (en) Smart audio device, calling method for audio device, electronic device and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant