CN117877486A - Electronic equipment based on voice recognition and control method thereof


Info

Publication number: CN117877486A
Application number: CN202410025380.8A
Authority: CN (China)
Prior art keywords: voice, module, signal, unit
Legal status: Pending (status assumed; not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 唐瑞芳, 姚国康
Current assignee / original assignee: Shenzhen Zhongke Microelectronics Co., Ltd.
Priority / filing date: 2024-01-08
Publication date: 2024-04-12
Application filed by Shenzhen Zhongke Microelectronics Co., Ltd.
Priority to CN202410025380.8A
Publication of CN117877486A


Abstract

The invention belongs to the technical field of automatic speech recognition and provides a control method based on voice recognition. The method is applied to a control system based on voice recognition that comprises a voice acquisition module, a signal processing module, a feature extraction module, a voice recognition module, a text conversion module, a voice database, an alarm, a memory and a processing center; the voice acquisition module, the signal processing module, the feature extraction module, the voice recognition module, the text conversion module, the voice database, the alarm and the memory are each connected to the processing center. An electronic device based on voice recognition is also provided. The invention can automatically convert, online, the speech of any regional Chinese dialect or of an ethnic-minority (or other) language that has no script of its own into Chinese characters, solving the problem that the many dialects without a written form cannot be recognized as Chinese text by speech recognition.

Description

Electronic equipment based on voice recognition and control method thereof
Technical Field
The invention belongs to the technical field of automatic voice recognition, and particularly relates to an electronic device based on voice recognition and a control method thereof.
Background
Speech recognition is a technology that enables a machine to convert speech signals into corresponding text or commands through recognition and understanding; it involves feature extraction techniques, pattern matching criteria and model training techniques.
Existing speech recognition is mainly aimed at national languages that have a written form; it cannot be carried out for the many dialects that have no script of their own.
Disclosure of Invention
In order to remedy the defects of the prior art, the invention aims to provide an electronic device based on voice recognition and a control method thereof. By providing a voice acquisition module, a signal processing module, a feature extraction module, a voice recognition module and a text conversion module, the speech of any regional Chinese dialect, or of an ethnic-minority language without a script of its own, can be automatically converted into Chinese characters online, solving the problem that the many dialects without a written form cannot be recognized as Chinese text by speech recognition.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a control method based on voice recognition is applied to a control system based on voice recognition, and comprises a voice acquisition module, a signal processing module, a feature extraction module, a voice recognition module, a text conversion module, a voice database, an alarm, a memory and a processing center; the voice acquisition module, the signal processing module, the feature extraction module, the voice recognition module, the text conversion module, the voice database, the alarm and the memory are respectively connected with the processing center; the voice database is used for collecting voices of various dialects in each Chinese area and voices of various minority languages without own characters;
the alarm is matched with the voice standard library of various languages stored in the memory according to the acquired voice characteristic information by the processing center, if the voice characteristic information is inconsistent with the voice standard library of various languages, the alarm automatically gives out an audible alarm, proves to be the voice of a new language and transmits the voice to the voice database to serve as the standard of the voice recognition of the subsequent similar language;
the memory is responsible for information storage of a voice acquisition module, a signal processing module, a feature extraction module, a voice recognition module, a text conversion module, a voice database and an alarm, and storage of dialect voices of various recorded areas or minority nationalities;
the processing center is responsible for information transmission of a voice acquisition module, a signal processing module, a feature extraction module, a voice recognition module, a text conversion module, a voice database, an alarm and a memory, is a system hub center, is matched with a voice standard library of various languages stored in the memory according to the acquired voice feature information, and is recognized as a specific voice type if the acquired voice feature information is consistent with the voice standard library of various languages stored in the memory, and is transmitted to the voice recognition module; if the voice is inconsistent, the voice is transmitted to an alarm, the voice is proved to be the voice of a new language, and the voice is transmitted to a voice database to be used as the standard of the voice recognition of the subsequent similar language;
the control voice acquisition module acquires accents, speaking speeds, polarities and other factors of Chinese dialects and dialects of all minority nations through the recording equipment to generate target voice signals, converts the target voice signals into analog electric signals, converts the analog electric signals into digital signals through the analog-to-digital converter, and transmits the digital signals to the signal processing module;
the signal processing module is responsible for filtering and converting continuous acoustic signals into discrete digital signals, quantizing the amplitude of the digital signals into discrete numerical values, converting the quantized digital signals into binary codes, removing noise, enhancing the identifiability of voice signals and transmitting the binary codes to the feature extraction module;
the feature extraction module comprises a preparation emphasis unit, a windowing framing unit, a frequency domain transformation unit, a Mel frequency unit, a cepstrum analysis unit, a Fourier inversion unit and a feature calculation unit, wherein the processed voice signal is decomposed into frames with small time slices, the spectral features of each frame, which are the characteristics of voice, are extracted and analyzed into a series of feature vectors, and the feature vectors are transmitted to the processing center;
the voice recognition module is responsible for matching the feature vector of the voice of the recognized specific language with the trained voice model, determining the feature vector as specific text content through training and optimization of the HMM model and the ASR system, and transmitting the text content to the text conversion module;
the text conversion module is responsible for converting specific text content recognized by voice into readable standard Chinese text information, and can enable a conversion result to be more in line with grammar rules and semantic logic of natural language through methods of error correction, sentence breaking, punctuation mark addition and the like, so that the accuracy and the readability of voice-to-text are further improved.
The invention also provides a control method based on voice recognition, which comprises the following steps:
S10, when speech needs to be recognized, the voice acquisition module is controlled to capture, through a recording device, the accent, speaking rate, polarity and other characteristics of the Chinese dialects and the dialects of the ethnic minorities, generating a target voice signal; the target voice signal is converted into an analog electrical signal, which is converted into a digital signal by an analog-to-digital converter and transmitted to the signal processing module;
S20, the signal processing module filters the continuous acoustic signal and converts it into a discrete digital signal, quantizes the amplitude of the digital signal into discrete values, encodes the quantized signal as binary codes, removes noise, enhances the recognizability of the voice signal and transmits the result to the feature extraction module;
S30, the feature extraction module decomposes the processed voice signal into frames of short time slices, extracts the spectral features of each frame, analyzes them into a series of feature vectors and transmits the vectors to the processing center;
S40, the processing center matches the acquired voice feature information against the standard voice libraries of the various languages stored in the memory: if the information matches, the input is recognized as a specific language type and passed to the voice recognition module; if it does not match, the input is passed to the alarm, treated as the speech of a new language, and transmitted to the voice database to serve as the reference for recognizing similar speech in the future;
S50, the voice recognition module matches the feature vectors of the speech of the recognized language against the trained acoustic model, determines the specific text content through training and optimization of the HMM model and the ASR system, and transmits the text content to the text conversion module;
S60, the text conversion module converts the text content recognized from the speech into readable standard Chinese text; through error correction, sentence segmentation, punctuation insertion and similar methods, the conversion result is made to conform better to the grammatical rules and semantic logic of natural language, further improving the accuracy and readability of the speech-to-text output.
Further, the step S30 includes the following steps:
S31, the pre-emphasis unit transforms a given time-domain input signal x[n] into y[n] = x[n] - α·x[n-1], where 0.9 ≤ α ≤ 1.0, x[n] is the original signal and α·x[n-1] is the attenuated signal, so as to boost the high-frequency components and prevent their attenuation, and passes the result to the windowing and framing unit;
S32, the windowing and framing unit divides the audio digital signal of indefinite length into several fixed-length frames of 10-30 ms; the signal becomes y[n] = w[n]·x[n], where y[n] is the framed signal, w[n] is the window function and x[n] is the original signal, which avoids interference between frames, and the frames are passed to the frequency-domain transformation unit;
S33, the frequency-domain transformation unit transforms the 10-30 ms speech frames from time-domain signals into frequency-domain signals through the Fourier transform F(ω) = ∫ f(t)·e^(-iωt) dt, where F(ω) is the image function of f(t), f(t) is the original function of F(ω), ω is frequency, t is time, e is the base of the natural logarithm and i is the imaginary unit (i² = -1); in practice the discrete Fourier transform (DFT) is used, so that both the time domain and the frequency domain are discrete; the spectral features are obtained by taking the modulus of the DFT coefficients and are passed to the Mel frequency unit;
S34, the Mel frequency unit converts the linear frequency f into the Mel frequency mel(f) = 2595·log10(1 + f/700), then applies a logarithmic operation and passes the result to the cepstrum analysis unit;
S35, the cepstrum analysis unit uses the Mel cepstral coefficients to decompose the frequency-domain signal, written as the product X[m] = H[m]·E[m] of the spectral envelope and the spectral detail, into the sum log|X[m]| = log|H[m]| + log|E[m]|, thereby obtaining the characteristics of the speech, and passes them to the inverse Fourier unit;
S36, the inverse Fourier unit applies the inverse discrete Fourier transform (IDFT) and extracts points 1 through K as the K-dimensional MFCC feature vector, which is passed to the feature calculation unit;
S37, the feature calculation unit applies first-order and second-order difference operations to the MFCC features of a segment of the voice signal to obtain the dynamic features.
The invention provides a control system based on voice recognition that further comprises computer equipment and a computer-readable storage medium. The computer equipment comprises a memory and the functional modules; the memory stores a computer program, and the steps of the control method based on voice recognition are carried out when the functional modules execute the computer program. The computer-readable storage medium stores a computer program which, when executed by the functional modules, implements the steps of the control method based on voice recognition described above.
The invention also provides an electronic device based on voice recognition, which is realized by the control method based on voice recognition.
Compared with the prior art, the invention has the beneficial effects that:
By providing the voice acquisition module, the signal processing module, the feature extraction module, the voice recognition module and the text conversion module, the speech of any regional Chinese dialect, or of an ethnic-minority (or other) language without a script of its own, can be automatically converted into Chinese characters online, solving the problem that the many dialects without a written form cannot be recognized as Chinese text by speech recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for describing the embodiments or the exemplary techniques are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present application, and that other drawings can be obtained from them by a person of ordinary skill in the art without inventive effort.
FIG. 1 is a schematic diagram of a system module of the present invention;
FIG. 2 is a schematic diagram of a feature extraction module according to the present invention;
FIG. 3 is a flow chart of the method of the present invention;
fig. 4 is a schematic exploded view of the step S30 process in the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The following describes in detail the implementation of the present invention in connection with specific embodiments:
in order to make the technical problems, technical schemes and beneficial effects to be solved by the present application more clear, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that when a module is referred to as being "disposed on" another module, it can be directly on the other module or be indirectly on the other module. When a module is referred to as being "connected to" another module, it can be directly connected to the other module or be indirectly connected to the other module.
In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise. The meaning of "a number" is one or more than one unless specifically defined otherwise.
In the description of the present application it should be noted that, unless otherwise explicitly specified and limited, the terms "mounted," "connected to," and "connected" are to be construed broadly: a connection may be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or an internal communication or interaction between two elements. The specific meaning of these terms in this application will be understood by those of ordinary skill in the art as the case may be. The terms "comprising," "including," "having," and their variants mean "including but not limited to," unless expressly specified otherwise.
Referring to fig. 1, the invention provides a control method based on voice recognition, applied to a control system based on voice recognition. The system comprises a voice acquisition module, a signal processing module, a feature extraction module, a voice recognition module, a text conversion module, a voice database, an alarm, a memory and a processing center; the voice acquisition module, the signal processing module, the feature extraction module, the voice recognition module, the text conversion module, the voice database, the alarm and the memory are each connected to the processing center. The voice database is used to collect the speech of the various dialects of each region of China and of the ethnic-minority languages that have no script of their own.
The alarm works with the processing center: the processing center matches the acquired voice feature information against the standard voice libraries of the various languages stored in the memory, and if the information matches none of them, the alarm automatically sounds an audible alert, the input is treated as the speech of a new language, and it is transmitted to the voice database to serve as the reference for recognizing similar speech in the future.
The memory is responsible for storing the information of the voice acquisition module, the signal processing module, the feature extraction module, the voice recognition module, the text conversion module, the voice database and the alarm, as well as the recorded dialect speech of the various regions and ethnic minorities.
The processing center is responsible for the information transfer among the voice acquisition module, the signal processing module, the feature extraction module, the voice recognition module, the text conversion module, the voice database, the alarm and the memory, and is the hub of the system. It matches the acquired voice feature information against the standard voice libraries of the various languages stored in the memory: if the information matches, the input is recognized as a specific language type and passed to the voice recognition module; if it does not match, the input is passed to the alarm, treated as the speech of a new language, and transmitted to the voice database to serve as the reference for recognizing similar speech in the future.
The voice acquisition module is controlled to capture, through a recording device, the accent, speaking rate, polarity and other characteristics of the Chinese dialects and the dialects of the ethnic minorities, generating a target voice signal; the target voice signal is converted into an analog electrical signal, which is converted into a digital signal by an analog-to-digital converter, making subsequent processing and analysis more convenient, and is transmitted to the signal processing module.
The signal processing module is responsible for filtering the continuous acoustic signal and converting it into a discrete digital signal, quantizing the amplitude of the digital signal into discrete values, encoding the quantized signal as binary codes, removing noise and enhancing the recognizability of the voice signal, thereby improving the accuracy of speech-to-text conversion, and transmitting the result to the feature extraction module.
Referring to fig. 2, the feature extraction module comprises a pre-emphasis unit, a windowing and framing unit, a frequency-domain transformation unit, a Mel frequency unit, a cepstrum analysis unit, an inverse Fourier unit and a feature calculation unit. The processed voice signal is decomposed into frames of short time slices, and the spectral features that characterize the speech are extracted from each frame and analyzed into a series of feature vectors, so that the time-domain and frequency-domain features of the speech are better expressed; the feature vectors are transmitted to the processing center.
More specifically, the pre-emphasis unit passes a given time-domain input signal x[n] through a first-order high-pass filter, producing y[n] = x[n] - α·x[n-1] with 0.9 ≤ α ≤ 1.0, where x[n] is the original signal and α·x[n-1] is the attenuated signal. This boosts the high-frequency components, increases the energy of the high-frequency part of the signal and prevents its attenuation, and the result is passed to the windowing and framing unit.
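For illustration, the pre-emphasis filter described above can be sketched in a few lines of NumPy. This is a minimal sketch and not the patented implementation; the test signal and the coefficient value α = 0.97 (which lies within the stated 0.9-1.0 range) are assumptions chosen for the example.

```python
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]                      # no previous sample for n = 0
    y[1:] = x[1:] - alpha * x[:-1]   # boost high-frequency content
    return y

# Example: a one-second 16 kHz test tone stands in for recorded dialect speech.
fs = 16000
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 440 * t)
y = pre_emphasis(x, alpha=0.97)
```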
More specifically, the windowing and framing unit is responsible for dividing the audio digital signal of indefinite length into several fixed-length frames of 10-30 ms; the signal becomes y[n] = w[n]·x[n], where y[n] is the framed signal, w[n] is the window function and x[n] is the original signal. Each frame then contains enough signal periods and does not change too abruptly, interference between frames is avoided, and the frames are passed to the frequency-domain transformation unit.
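A minimal framing-and-windowing sketch, assuming a 25 ms frame with a 10 ms hop (both within the 10-30 ms range stated above) and a Hamming window; these particular values are illustrative, not prescribed by the patent.

```python
import numpy as np

def frame_and_window(y: np.ndarray, fs: int, frame_ms: float = 25.0,
                     hop_ms: float = 10.0) -> np.ndarray:
    """Split a signal into fixed-length frames and apply y[n] = w[n] * x[n].

    Assumes the signal is at least one frame long.
    """
    frame_len = int(fs * frame_ms / 1000)
    hop_len = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(y) - frame_len) // hop_len)
    window = np.hamming(frame_len)        # tapering reduces spectral leakage
    frames = np.stack([y[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * window                # shape: (n_frames, frame_len)
```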
More specifically, the frequency-domain transformation unit transforms the 10-30 ms speech frames from time-domain signals into frequency-domain signals through the Fourier transform F(ω) = ∫ f(t)·e^(-iωt) dt, where F(ω) is the image function of f(t), f(t) is the original function of F(ω), ω is frequency, t is time, e is the base of the natural logarithm and i is the imaginary unit (i² = -1); in practice the discrete Fourier transform (DFT) is used, so that both the time domain and the frequency domain are discrete. The spectral features are obtained by taking the modulus of the DFT coefficients and are passed to the Mel frequency unit.
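In practice the transform is computed with the real FFT, and the modulus of the DFT coefficients gives the magnitude spectrum mentioned above; the 512-point FFT size is an assumption made here for illustration.

```python
import numpy as np

def magnitude_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """|DFT| of each windowed frame; rows are frames, columns are frequency bins."""
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # shape: (n_frames, n_fft//2 + 1)
```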
More specifically, the Fourier transform is derived from the Fourier series f(t) = a0/2 + Σ (an·cos(n·ω0·t) + bn·sin(n·ω0·t)), where f(t) is a periodic function (the original function of F(ω)), ω0 is the angular frequency, a0, an and bn are the Fourier coefficients, n is a positive integer and t is time.
More specifically, the Mel frequency unit is responsible for converting the linear frequency f into the Mel frequency mel(f) = 2595·log10(1 + f/700), then applying a logarithmic operation, and passing the result to the cepstrum analysis unit.
More specifically, because the human ear does not perceive frequency on a linearly spaced scale but approximately on a logarithmic one, a logarithmic operation is required. In the Mel filter design, the number of filters P must be determined; given the sampling rate fs, the number of DFT points N and the number of filters P, the starting, centre and cut-off frequencies of each filter are generated at equal intervals in the Mel domain, with the centre frequency of one filter serving as the starting frequency of the next (the filters overlap). The starting, centre and cut-off frequencies of each triangular filter are then converted from the Mel domain back to the linear frequency domain, the spectral features after the DFT are filtered to obtain the P filter-bank energies, and a log operation yields the Fbank features.
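A sketch of the triangular Mel filter-bank design just described, using mel(f) = 2595·log10(1 + f/700) and its inverse; the FFT size N = 512 and filter count P = 26 are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(fs: int, n_fft: int = 512, n_filters: int = 26) -> np.ndarray:
    """P triangular filters equally spaced in the Mel domain; adjacent filters overlap."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    bins = np.floor((n_fft + 1) * hz_pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for p in range(1, n_filters + 1):
        left, center, right = bins[p - 1], bins[p], bins[p + 1]
        fb[p - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[p - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def fbank_features(power_spec: np.ndarray, fb: np.ndarray) -> np.ndarray:
    """Filter the power spectrum with the P filters and take logs -> Fbank features."""
    return np.log(power_spec @ fb.T + 1e-10)
```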
More specifically, the DFT coefficients refer to the Fourier coefficients in the progression from the FT and the DTFT to the DFT. The FT (Fourier transform) transforms a continuous time-domain signal into a continuous frequency-domain signal; since both domains remain continuous and a digital system can only process discrete signals, it cannot be used directly in a digital system. The DTFT (discrete-time Fourier transform) samples the continuous time-domain signal and then transforms it into the frequency domain, but the frequency-domain signal is still continuous and still cannot be processed by a digital system. The DFT (discrete Fourier transform) samples the continuous time-domain signal, transforms it into the frequency domain, and samples the frequency-domain signal as well, yielding a discrete frequency-domain signal.
More specifically, the Fourier coefficients are the coefficients of the expansion terms when an arbitrary periodic signal satisfying certain conditions is expanded as a linear combination of trigonometric functions.
More specifically, the cepstrum analysis unit uses the Mel cepstral coefficients to decompose the frequency-domain signal, written as the product X[m] = H[m]·E[m] of the spectral envelope and the spectral detail, into the sum log|X[m]| = log|H[m]| + log|E[m]|, thereby obtaining the characteristics of the speech, and passes them to the inverse Fourier unit.
More specifically, the inverse Fourier unit applies the inverse discrete Fourier transform (IDFT) and extracts points 1 through K as the K-dimensional MFCC feature vector, which is passed to the feature calculation unit.
More specifically, because the static MFCC features ignore the dynamic continuity of the speech signal, the feature calculation unit applies first-order and second-order difference operations to the MFCC features of a segment of the voice signal to obtain the dynamic features.
More specifically, the first-order difference is the difference between two consecutive adjacent terms of a discrete function; it relates the current speech frame to the previous one and represents the relationship between two adjacent frames. The second-order difference is the difference of the first-order differences and captures the dynamic relationship across three adjacent frames.
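A sketch of these final stages, assuming K = 13 cepstral coefficients; the type-II DCT used here is the transform commonly applied to the log filter-bank energies in MFCC pipelines and stands in for the inverse Fourier step described above, so this is an illustrative approximation rather than the patent's exact procedure.

```python
import numpy as np

def mfcc_from_fbank(fbank: np.ndarray, n_ceps: int = 13) -> np.ndarray:
    """Type-II DCT of the log filter-bank energies; keep the first K coefficients."""
    P = fbank.shape[1]
    n = np.arange(P)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * P))
    return fbank @ basis.T                      # shape: (n_frames, n_ceps)

def deltas(feat: np.ndarray) -> np.ndarray:
    """First-order difference between adjacent frames (edges padded)."""
    padded = np.pad(feat, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

# Dynamic features: apply deltas() once for the first-order difference and
# twice for the second-order difference, then stack with the static MFCC:
#   full = np.hstack([mfcc, deltas(mfcc), deltas(deltas(mfcc))])
```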
The voice recognition module is responsible for matching the feature vectors of the speech of the recognized language against the trained acoustic model; through training and optimization of the HMM model and the ASR system it determines the specific text content and transmits it to the text conversion module.
More specifically, the HMM (Hidden Markov Model) is a statistical model that establishes a probabilistic relationship between a hidden state sequence and an observable sequence; it is widely used in acoustic modelling, natural language processing, speech recognition and related fields. The speech signal is decomposed into a time series of feature vectors, and the feature vectors are mapped to the phonemes of the text. The model is defined by its states, transition probabilities and emission probabilities: a state is the current hidden condition, a transition probability is the probability of moving from one state to another, and an emission probability is the probability that a given state generates a given observation. In speech recognition a state may correspond to a phoneme, the transition probabilities measure the transitions between adjacent phonemes, and the emission probabilities are the probabilities that a given state generates the observations (i.e. the Mel-frequency cepstral coefficients).
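To make the state / transition-probability / emission-probability description concrete, the sketch below evaluates a tiny discrete HMM with the forward algorithm; the two states, three observation symbols and all probability values are invented for illustration and are not taken from the patent.

```python
import numpy as np

# Toy HMM: 2 hidden states (e.g. two phonemes), 3 discrete observation symbols.
pi = np.array([0.6, 0.4])                 # initial state distribution
A = np.array([[0.7, 0.3],                 # transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],            # emission probabilities per state
              [0.1, 0.3, 0.6]])

def forward(obs: list[int]) -> float:
    """P(observation sequence | model) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())

print(forward([0, 1, 2]))   # likelihood of the toy observation sequence
```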
More specifically, the ASR (Automatic Speech Recognition) system converts the lexical content of human speech into a computer-readable input such as binary codes or character sequences. A phoneme recognition model is built with the HMM so that the audio stream can be translated according to the basic units of the speech signal (i.e. the phonemes); the system comprises preliminary signal processing, feature extraction, HMM acoustic modelling and a language model.
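The step from recognized phonemes to written output can be pictured as a lookup against a pronunciation lexicon. The sketch below uses an invented two-entry lexicon and a greedy longest-match decode purely for illustration; it is one simple decoding strategy, not the method claimed by the patent.

```python
# Hypothetical pronunciation lexicon mapping phoneme sequences to written words.
LEXICON = {
    ("n", "i", "h", "ao"): "你好",
    ("x", "ie", "x", "ie"): "谢谢",
}

def phonemes_to_text(phonemes: tuple[str, ...]) -> str:
    """Greedy longest-match decode of a phoneme sequence into text."""
    out, i = [], 0
    while i < len(phonemes):
        for j in range(len(phonemes), i, -1):
            if phonemes[i:j] in LEXICON:
                out.append(LEXICON[phonemes[i:j]])
                i = j
                break
        else:
            i += 1          # skip an unmatched phoneme
    return "".join(out)

print(phonemes_to_text(("n", "i", "h", "ao", "x", "ie", "x", "ie")))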
The text conversion module is responsible for converting the text content recognized from the speech into readable standard Chinese text; through error correction, sentence segmentation, punctuation insertion and similar methods, the conversion result is made to conform better to the grammatical rules and semantic logic of natural language, further improving the accuracy and readability of the speech-to-text output.
The working principle of the system is as follows:
When speech needs to be recognized, the voice acquisition module is controlled to capture, through a recording device, the accent, speaking rate, polarity and other characteristics of the Chinese dialects and the dialects of the ethnic minorities, generating a target voice signal; the target voice signal is converted into an analog electrical signal, which is converted into a digital signal by an analog-to-digital converter and transmitted to the signal processing module. The signal processing module filters the continuous acoustic signal and converts it into a discrete digital signal, quantizes the amplitude of the digital signal into discrete values, encodes the quantized signal as binary codes, removes noise, enhances the recognizability of the voice signal and transmits the result to the feature extraction module. The feature extraction module is controlled to decompose the processed voice signal into frames of short time slices, extract the spectral features of each frame, analyze them into a series of feature vectors and transmit the vectors to the processing center. The processing center matches the acquired voice feature information against the standard voice libraries of the various languages stored in the memory: if the information matches, the input is recognized as a specific language type and passed to the voice recognition module; if it does not match, the input is passed to the alarm, treated as the speech of a new language, and transmitted to the voice database to serve as the reference for recognizing similar speech in the future. The voice recognition module is controlled to match the feature vectors of the speech of the recognized language against the trained acoustic model, determine the specific text content through training and optimization of the HMM model and the ASR system, and transmit the text content to the text conversion module. The text conversion module converts the text content recognized from the speech into readable standard Chinese text; through error correction, sentence segmentation, punctuation insertion and similar methods, the conversion result is made to conform better to the grammatical rules and semantic logic of natural language, further improving the accuracy and readability of the speech-to-text output, so that the speech of any regional dialect or ethnic-minority language without a script of its own can be converted into Chinese text online.
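To make the working principle concrete, the sketch below chains the illustrative helper functions defined in the sketches above (pre_emphasis, frame_and_window, magnitude_spectrum, mel_filterbank, fbank_features, mfcc_from_fbank, deltas, all assumptions of this description rather than parts of the patent) into a single feature-extraction call; the recognizer itself is left as a comment because the patent does not prescribe a concrete ASR implementation.

```python
import numpy as np

def extract_features(x: np.ndarray, fs: int = 16000) -> np.ndarray:
    """End-to-end feature chain composed from the sketches above (steps S31-S37)."""
    y = pre_emphasis(x)                                   # S31: pre-emphasis
    frames = frame_and_window(y, fs)                      # S32: framing + windowing
    power = magnitude_spectrum(frames) ** 2               # S33: |DFT|^2 power spectrum
    fb = mel_filterbank(fs, n_fft=512, n_filters=26)      # S34: Mel filter bank
    fbank = fbank_features(power, fb)                     # S34/S35: log Mel energies
    mfcc = mfcc_from_fbank(fbank)                         # S36: K-dimensional MFCC
    return np.hstack([mfcc, deltas(mfcc), deltas(deltas(mfcc))])  # S37: dynamics

# The resulting feature matrix would then be matched against the trained
# acoustic model (HMM/ASR) and the decoded text handed to text conversion.
```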
Referring to fig. 3, the invention further provides a control method based on voice recognition, which comprises the following steps:
S10, when speech needs to be recognized, the voice acquisition module is controlled to capture, through a recording device, the accent, speaking rate, polarity and other characteristics of the Chinese dialects and the dialects of the ethnic minorities, generating a target voice signal; the target voice signal is converted into an analog electrical signal, which is converted into a digital signal by an analog-to-digital converter, making subsequent processing and analysis more convenient, and is transmitted to the signal processing module;
S20, the signal processing module filters the continuous acoustic signal and converts it into a discrete digital signal, quantizes the amplitude of the digital signal into discrete values, encodes the quantized signal as binary codes, removes noise, enhances the recognizability of the voice signal, improves the accuracy of speech-to-text conversion, and transmits the result to the feature extraction module;
S30, the feature extraction module decomposes the processed voice signal into frames of short time slices, extracts the spectral features of each frame and analyzes them into a series of feature vectors, so that the time-domain and frequency-domain features of the speech are better expressed, and transmits the vectors to the processing center;
S40, the processing center matches the acquired voice feature information against the standard voice libraries of the various languages stored in the memory: if the information matches, the input is recognized as a specific language type and passed to the voice recognition module; if it does not match, the input is passed to the alarm, treated as the speech of a new language, and transmitted to the voice database to serve as the reference for recognizing similar speech in the future;
S50, the voice recognition module matches the feature vectors of the speech of the recognized language against the trained acoustic model, determines the specific text content through training and optimization of the HMM model and the ASR system, and transmits the text content to the text conversion module;
S60, the text conversion module converts the text content recognized from the speech into readable standard Chinese text; through error correction, sentence segmentation, punctuation insertion and similar methods, the conversion result is made to conform better to the grammatical rules and semantic logic of natural language, further improving the accuracy and readability of the speech-to-text output, so that the speech of any regional Chinese dialect or ethnic-minority language without a script of its own can be automatically converted into Chinese text online.
Referring to fig. 4, the step S30 includes the following steps:
S31, the pre-emphasis unit passes a given time-domain input signal x[n] through a first-order high-pass filter, producing y[n] = x[n] - α·x[n-1] with 0.9 ≤ α ≤ 1.0, where x[n] is the original signal and α·x[n-1] is the attenuated signal, so as to boost the high-frequency components, increase the energy of the high-frequency part of the signal and prevent its attenuation, and passes the result to the windowing and framing unit;
S32, the windowing and framing unit divides the audio digital signal of indefinite length into several fixed-length frames of 10-30 ms; the signal becomes y[n] = w[n]·x[n], where y[n] is the framed signal, w[n] is the window function and x[n] is the original signal, so that each frame contains enough signal periods and does not change too abruptly, interference between frames is avoided, and the frames are passed to the frequency-domain transformation unit;
S33, the frequency-domain transformation unit transforms the 10-30 ms speech frames from time-domain signals into frequency-domain signals through the Fourier transform F(ω) = ∫ f(t)·e^(-iωt) dt, where F(ω) is the image function of f(t), f(t) is the original function of F(ω), ω is frequency, t is time, e is the base of the natural logarithm and i is the imaginary unit (i² = -1); in practice the discrete Fourier transform (DFT) is used, so that both the time domain and the frequency domain are discrete; the spectral features are obtained by taking the modulus of the DFT coefficients and are passed to the Mel frequency unit;
S34, the Mel frequency unit converts the linear frequency f into the Mel frequency mel(f) = 2595·log10(1 + f/700), then applies a logarithmic operation and passes the result to the cepstrum analysis unit;
S35, the cepstrum analysis unit uses the Mel cepstral coefficients to decompose the frequency-domain signal, written as the product X[m] = H[m]·E[m] of the spectral envelope and the spectral detail, into the sum log|X[m]| = log|H[m]| + log|E[m]|, thereby obtaining the characteristics of the speech, and passes them to the inverse Fourier unit;
S36, the inverse Fourier unit applies the inverse discrete Fourier transform (IDFT) and extracts points 1 through K as the K-dimensional MFCC feature vector, which is passed to the feature calculation unit;
S37, because the static MFCC features ignore the dynamic continuity of the speech signal, the feature calculation unit applies first-order and second-order difference operations to the MFCC features of a segment of the voice signal to obtain the dynamic features.
The invention provides a control system based on voice recognition that further comprises computer equipment and a computer-readable storage medium. The computer equipment comprises a memory and the functional modules; the memory stores a computer program, and the steps of the control method based on voice recognition are carried out when the functional modules execute the computer program. The computer-readable storage medium stores a computer program which, when executed by the functional modules, implements the steps of the control method based on voice recognition described above.
Further, in terms of software, the control system based on voice recognition is divided into a plurality of modules or units corresponding to the steps of the control method based on voice recognition, so as to implement the program instructions generated by the steps of the control method described above.
The invention also provides an electronic device based on voice recognition, which is realized by the control method based on voice recognition.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Finally, it should be noted that the foregoing embodiments are merely for illustrating the technical solution of the present application and not for limiting the same, and although the present application has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present application, and are included in the protection scope of the present application.

Claims (5)

1. A control method based on voice recognition, characterized in that: the method is applied to a control system based on voice recognition comprising a voice acquisition module, a signal processing module, a feature extraction module, a voice recognition module, a text conversion module, a voice database, an alarm, a memory and a processing center; the voice acquisition module, the signal processing module, the feature extraction module, the voice recognition module, the text conversion module, the voice database, the alarm and the memory are each connected to the processing center; the voice database is used to collect the speech of the various dialects of each region of China and of the ethnic-minority languages that have no script of their own;
the alarm works with the processing center: the processing center matches the acquired voice feature information against the standard voice libraries of the various languages stored in the memory, and if the information matches none of them, the alarm automatically sounds an audible alert, the input is treated as the speech of a new language, and it is transmitted to the voice database to serve as the reference for recognizing similar speech in the future;
the memory is responsible for storing the information of the voice acquisition module, the signal processing module, the feature extraction module, the voice recognition module, the text conversion module, the voice database and the alarm, as well as the recorded dialect speech of the various regions and ethnic minorities;
the processing center is responsible for the information transfer among the voice acquisition module, the signal processing module, the feature extraction module, the voice recognition module, the text conversion module, the voice database, the alarm and the memory, and is the hub of the system; it matches the acquired voice feature information against the standard voice libraries of the various languages stored in the memory: if the information matches, the input is recognized as a specific language type and passed to the voice recognition module; if it does not match, the input is passed to the alarm, treated as the speech of a new language, and transmitted to the voice database to serve as the reference for recognizing similar speech in the future;
the voice acquisition module is controlled to capture, through a recording device, the accent, speaking rate, polarity and other characteristics of the Chinese dialects and the dialects of the ethnic minorities, generating a target voice signal; the target voice signal is converted into an analog electrical signal, which is converted into a digital signal by an analog-to-digital converter and transmitted to the signal processing module;
the signal processing module is responsible for filtering the continuous acoustic signal and converting it into a discrete digital signal, quantizing the amplitude of the digital signal into discrete values, encoding the quantized signal as binary codes, removing noise and enhancing the recognizability of the voice signal, and transmitting the result to the feature extraction module;
the feature extraction module comprises a pre-emphasis unit, a windowing and framing unit, a frequency-domain transformation unit, a Mel frequency unit, a cepstrum analysis unit, an inverse Fourier unit and a feature calculation unit; the processed voice signal is decomposed into frames of short time slices, the spectral features that characterize the speech are extracted from each frame and analyzed into a series of feature vectors, and the feature vectors are transmitted to the processing center;
the voice recognition module is responsible for matching the feature vectors of the speech of the recognized language against the trained acoustic model, determining the specific text content through training and optimization of the HMM model and the ASR system, and transmitting it to the text conversion module;
the text conversion module is responsible for converting the text content recognized from the speech into readable standard Chinese text; through error correction, sentence segmentation, punctuation insertion and similar methods, the conversion result is made to conform better to the grammatical rules and semantic logic of natural language, further improving the accuracy and readability of the speech-to-text output.
2. A control method based on voice recognition, characterized in that the method comprises the following steps:
S10, when speech needs to be recognized, the voice acquisition module is controlled to capture, through a recording device, the accent, speaking rate, polarity and other characteristics of the Chinese dialects and the dialects of the ethnic minorities, generating a target voice signal; the target voice signal is converted into an analog electrical signal, which is converted into a digital signal by an analog-to-digital converter and transmitted to the signal processing module;
S20, the signal processing module filters the continuous acoustic signal and converts it into a discrete digital signal, quantizes the amplitude of the digital signal into discrete values, encodes the quantized signal as binary codes, removes noise, enhances the recognizability of the voice signal and transmits the result to the feature extraction module;
S30, the feature extraction module decomposes the processed voice signal into frames of short time slices, extracts the spectral features of each frame, analyzes them into a series of feature vectors and transmits the vectors to the processing center;
S40, the processing center matches the acquired voice feature information against the standard voice libraries of the various languages stored in the memory: if the information matches, the input is recognized as a specific language type and passed to the voice recognition module; if it does not match, the input is passed to the alarm, treated as the speech of a new language, and transmitted to the voice database to serve as the reference for recognizing similar speech in the future;
S50, the voice recognition module matches the feature vectors of the speech of the recognized language against the trained acoustic model, determines the specific text content through training and optimization of the HMM model and the ASR system, and transmits the text content to the text conversion module;
S60, the text conversion module converts the text content recognized from the speech into readable standard Chinese text; through error correction, sentence segmentation, punctuation insertion and similar methods, the conversion result is made to conform better to the grammatical rules and semantic logic of natural language, further improving the accuracy and readability of the speech-to-text output.
3. The control method based on voice recognition according to claim 2, characterized in that the step S30 comprises the following steps:
S31, the pre-emphasis unit transforms a given time-domain input signal x[n] into y[n] = x[n] - α·x[n-1], where 0.9 ≤ α ≤ 1.0, x[n] is the original signal and α·x[n-1] is the attenuated signal, so as to boost the high-frequency components and prevent their attenuation, and passes the result to the windowing and framing unit;
S32, the windowing and framing unit divides the audio digital signal of indefinite length into several fixed-length frames of 10-30 ms; the signal becomes y[n] = w[n]·x[n], where y[n] is the framed signal, w[n] is the window function and x[n] is the original signal, which avoids interference between frames, and the frames are passed to the frequency-domain transformation unit;
S33, the frequency-domain transformation unit transforms the 10-30 ms speech frames from time-domain signals into frequency-domain signals through the Fourier transform F(ω) = ∫ f(t)·e^(-iωt) dt, where F(ω) is the image function of f(t), f(t) is the original function of F(ω), ω is frequency, t is time, e is the base of the natural logarithm and i is the imaginary unit (i² = -1); in practice the discrete Fourier transform (DFT) is used, so that both the time domain and the frequency domain are discrete; the spectral features are obtained by taking the modulus of the DFT coefficients and are passed to the Mel frequency unit;
S34, the Mel frequency unit converts the linear frequency f into the Mel frequency mel(f) = 2595·log10(1 + f/700), then applies a logarithmic operation and passes the result to the cepstrum analysis unit;
S35, the cepstrum analysis unit uses the Mel cepstral coefficients to decompose the frequency-domain signal, written as the product X[m] = H[m]·E[m] of the spectral envelope and the spectral detail, into the sum log|X[m]| = log|H[m]| + log|E[m]|, thereby obtaining the characteristics of the speech, and passes them to the inverse Fourier unit;
S36, the inverse Fourier unit applies the inverse discrete Fourier transform (IDFT) and extracts points 1 through K as the K-dimensional MFCC feature vector, which is passed to the feature calculation unit;
S37, the feature calculation unit applies first-order and second-order difference operations to the MFCC features of a segment of the voice signal to obtain the dynamic features.
4. The control method based on voice recognition according to claim 1, characterized in that: the control system based on voice recognition further comprises computer equipment and a computer-readable storage medium; the computer equipment comprises a memory and the functional modules, the memory stores a computer program, and the steps of the control method based on voice recognition according to claim 2 or 3 are implemented when the functional modules execute the computer program; the computer-readable storage medium stores a computer program which, when executed by the functional modules, implements the steps of the control method based on voice recognition according to claim 2 or 3.
5. An electronic device based on voice recognition, characterized in that: the device is implemented by the control method based on voice recognition according to any one of claims 1 to 4.
CN202410025380.8A, filed 2024-01-08: Electronic equipment based on voice recognition and control method thereof (Pending, CN117877486A)

Priority Application (1)

CN202410025380.8A, priority and filing date 2024-01-08: Electronic equipment based on voice recognition and control method thereof

Publication (1)

CN117877486A, published 2024-04-12

Family ID: 90591167

Country Status (1)

CN: CN117877486A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination