CN109377982B - Effective voice obtaining method - Google Patents
- Publication number
- CN109377982B (application CN201810956017.2A)
- Authority
- CN
- China
- Prior art keywords
- sampling
- voice
- energy value
- frequency
- point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
Abstract
The invention discloses an effective voice acquisition method comprising the following steps: acquiring a starting point and an end point of a voice to be recognized; sequentially sampling the voice to be recognized at a preset sampling frequency and a preset sampling size to obtain a plurality of sampled audio data, which correspond to a plurality of sampling points of the voice to be recognized; applying an FFT to all sampled audio data to obtain a plurality of sampled frequency spectra; when the energy value obtained for frequencies of a sampled spectrum within the 300-1000 Hz band is greater than a preset energy value n1 and the obtained energy variance is greater than a preset energy value n2, judging that the sampling point corresponding to that spectrum lies within the range of effective voice; otherwise, judging that the sampling point corresponding to that spectrum lies within the range of noise; taking the first sampling point in the sampling point sequence of the effective voice as the starting point of the effective voice; and taking the first sampling point in the sampling point sequence of the noise as the end point of the effective voice. The method can accurately acquire the effective voice from the voice to be recognized.
Description
Technical Field
The invention relates to the field of voice signal processing, in particular to a method for acquiring effective voice.
Background
In recent decades, key progress has been made in detailed model design, parameter extraction and optimization, and system adaptation techniques. Speech recognition technology has become increasingly mature, its accuracy has steadily improved, and corresponding speech products are available on the market.
In intelligent recording and broadcasting systems, the human-machine interaction experience keeps improving, so that a teacher does not need to manage the system: voice command words are recognized to control its common functions, and the teacher can forget that the system is there and concentrate on teaching. When a class begins, the teacher only needs to say "start recording" and the system starts recording video. When the class ends, saying "stop recording" ends the recording of the class.
Command word recognition modules are already available on the market, but most of them require a network connection to recognize command words, which hinders the use of command word recognition in embedded recording and broadcasting systems. Small, efficient command word recognition is therefore very promising for embedded systems.
A small, efficient command word recognition system must first detect and process a segment of speech spoken by the teacher and extract the effective voice from it, so that the effective voice can then be recognized.
Disclosure of Invention
In view of the above technical problems, an object of the present invention is to provide a method for obtaining effective speech, which can achieve accurate obtaining of effective speech from speech to be recognized.
The invention adopts the following technical scheme:
an efficient speech acquisition method comprising the steps of:
acquiring a starting point and an end point of a voice to be recognized;
obtaining effective voice of voice to be recognized; the effective voice of the voice to be recognized is complete voice which starts from the starting point and ends at the ending point;
the method for acquiring the starting point and the end point of the speech to be recognized comprises the following steps:
sequentially sampling the voice to be recognized according to a preset sampling frequency and a preset sampling size to obtain a plurality of sampled audio data, wherein the sampled audio data correspond to a plurality of sampling points of the voice to be recognized; all sampled audio data are subjected to FFT to obtain a plurality of sampled frequency spectrums;
acquiring energy values of all sampling frequency spectrum frequencies at 100-1000 Hz; sequentially comparing the energy values with a preset energy value n1;
acquiring energy variances of all sampling frequency spectrum frequencies within a frequency range of 300-1000 Hz; sequentially comparing the energy variance with a preset energy value n2;
when the energy value acquired when the frequency in the sampling frequency spectrum is located in the frequency band of 300-1000 Hz is greater than a preset energy value n1 and the acquired energy variance is greater than a preset energy value n2, judging that the sampling point corresponding to the sampling frequency spectrum is located in the range of effective voice;
when the energy value obtained when the frequency in the sampling frequency spectrum frequency is located in the frequency range of 300-1000 Hz is not greater than the preset energy value n1 or the obtained energy variance is not greater than the preset energy value n2, judging that the sampling point corresponding to the sampling frequency spectrum is located in the range of noise;
arranging all sampling points within the range of the effective voice in time order to obtain a time-ordered sampling point sequence of the effective voice, and taking the first sampling point in that sequence as the starting point of the effective voice;
and arranging all sampling points that lie within the range of the noise and whose sampling time is after the starting point of the effective voice in time order to obtain a time-ordered noise sampling point sequence, and taking the first sampling point in that sequence as the end point of the effective voice.
Further, the preset sample size is 2048 pieces of audio data.
Further, the preset energy value n1 is 38000-60000J.
Further, the preset energy value n2 is 30-70J.
Compared with the prior art, the invention has the beneficial effects that:
By acquiring the starting point and the end point of the voice to be recognized and taking the complete voice that starts at the starting point and ends at the end point as the effective voice, the method detects and processes the voice to be recognized and extracts the effective voice from it, so that the effective voice can be recognized. Furthermore, comparing the energy variance of the frequency band with the preset energy value n2 improves the accuracy of judging the starting point and the end point of the voice to be recognized.
Drawings
FIG. 1 is a flow chart of an efficient speech acquisition method according to the present invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawing and specific embodiments. It should be noted that, provided there is no conflict, the embodiments or technical features described below may be combined arbitrarily to form new embodiments.
Example:
Referring to FIG. 1, the effective voice acquisition method includes the following steps:
Step S100: acquiring a starting point and an end point of the voice to be recognized;
Step S200: acquiring the effective voice of the voice to be recognized; the effective voice of the voice to be recognized is the complete voice that starts at the starting point and ends at the end point;
The method for acquiring the starting point and the end point of the voice to be recognized comprises the following steps:
Step S1001: sequentially sampling the voice to be recognized according to the preset sampling frequency and sampling size to obtain a plurality of sampled audio data, wherein the sampled audio data correspond to a plurality of sampling points of the voice to be recognized; all sampled audio data are subjected to FFT to obtain a plurality of sampled frequency spectra. Specifically, the voice to be recognized is a finite-length discrete signal x(n), n = 0, 1, ..., N-1, and the sampling size N preferred in the present invention is 2048. The finite-length discrete signal x(n) is divided into the sum of an even sequence and an odd sequence: x(n) = x1(n) + x2(n), where x1(n) and x2(n) are both of length N/2, x1(n) is the even sequence, and x2(n) is the odd sequence. Applying the FFT calculation formula to these sequences yields N complex frequency-domain values X(k); taking the modulus of each X(k) gives N amplitudes complx(n) (n = 0, 1, ..., N);
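The even/odd split described in step S1001 matches the usual radix-2 decimation-in-time FFT. The following is a minimal sketch under that assumption, written in pure Python; the function names `fft_radix2` and `amplitudes` are illustrative and do not come from the patent, and a production system would more likely use an optimized FFT library.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT.

    len(x) must be a power of two (the patent's preferred size, 2048, is one).
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])  # x1: even-indexed samples, length N/2
    odd = fft_radix2(x[1::2])   # x2: odd-indexed samples, length N/2
    half = n // 2
    out = [0j] * n
    for k in range(half):
        # Twiddle factor W_N^k applied to the odd half
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def amplitudes(frame):
    """Modulus of each complex FFT output X(k), i.e. the complx values."""
    return [abs(c) for c in fft_radix2(frame)]
```

For an 8-sample cosine that completes two cycles per frame, `amplitudes` concentrates the energy in bins 2 and 6, which illustrates the left-right symmetry of the spectrum that the next step relies on.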
Step S1002: acquiring the energy values of all sampled spectrum frequencies at 100-1000 Hz, and sequentially comparing the energy values with the preset energy value n1. The energy value is calculated as follows: the frequency domain obtained by the FFT is symmetric about N/2, so only N/2 frequencies need to be calculated, namely F = i × FS / N, where F is the frequency to be calculated, i = 0, 1, ..., N/2, N is the number of samples, and FS is the sampling frequency of the audio segment. This gives the N/2 frequencies of the spectrum, which correspond one-to-one with the amplitudes complx(n), so the amplitude (energy) corresponding to each frequency is obtained;
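Step S1002 can be sketched as follows, assuming the usual relation between bin index and frequency (bin i lies at i × FS / N) and assuming that each in-band amplitude is compared with n1 individually. The helper names `band_amplitudes` and `exceeding` are hypothetical, not taken from the patent.

```python
def band_amplitudes(amps, fs, f_lo=100.0, f_hi=1000.0):
    """Return (frequency, amplitude) pairs for bins inside [f_lo, f_hi].

    amps: the N FFT amplitudes of one frame; fs: sampling frequency in Hz.
    Only bins 0..N/2 are scanned because the spectrum is symmetric about N/2.
    """
    n = len(amps)
    pairs = []
    for i in range(n // 2 + 1):
        freq = i * fs / n  # assumed bin-to-frequency mapping
        if f_lo <= freq <= f_hi:
            pairs.append((freq, amps[i]))
    return pairs


def exceeding(pairs, n1):
    """Amplitudes in the band that are greater than the threshold n1."""
    return [amp for _, amp in pairs if amp > n1]
```

The list returned by `exceeding` is the set of amplitudes that the variance in the next step is computed over.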
Step S1003: acquiring the energy variances of all sampled spectrum frequencies within the frequency range of 300-1000 Hz, and sequentially comparing the energy variance with the preset energy value n2. Specifically, in the energy variance formula, S is the variance value, m is the number of amplitudes exceeding the preset energy value n1, complx(i) is an amplitude exceeding the preset energy value n1, and averageComplx is the average of all amplitudes exceeding the preset energy value n1;
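The variance formula itself did not survive in this text. Given the symbol definitions in step S1003, a plausible form is the ordinary population variance over the m amplitudes that exceed n1. This is an assumption: the original may divide by m - 1 or differ in another detail.

```latex
\[
\mathrm{averageComplx} = \frac{1}{m}\sum_{i=1}^{m} \mathrm{complx}(i),
\qquad
S = \frac{1}{m}\sum_{i=1}^{m}\bigl(\mathrm{complx}(i) - \mathrm{averageComplx}\bigr)^{2}
\]
```

Under this reading, S is small when the amplitudes above n1 are all of similar size and large when a few bins dominate, which is the contrast between noise and voice that the description draws later.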
Step S1004: when the energy value obtained for frequencies of the sampled spectrum within the 300-1000 Hz band is greater than the preset energy value n1 and the obtained energy variance is greater than the preset energy value n2, judging that the sampling point corresponding to the sampled spectrum lies within the range of effective voice;
Step S1005: arranging all sampling points within the range of the effective voice in time order to obtain a time-ordered sampling point sequence of the effective voice, and taking the first sampling point in that sequence as the starting point of the effective voice;
Step S1006: when the energy value obtained for frequencies of the sampled spectrum within the 300-1000 Hz band is not greater than the preset energy value n1, or the energy variance is not greater than the preset energy value n2, judging that the sampling point corresponding to the sampled spectrum lies within the range of noise;
Step S1007: arranging all sampling points that lie within the range of the noise and whose sampling time is after the starting point of the effective voice in time order to obtain a time-ordered noise sampling point sequence, and taking the first sampling point in that sequence as the end point of the effective voice. Time order here means the order in which the sampling points appear in the voice to be recognized, which is also the order in which they were sampled.
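Steps S1004 to S1007 amount to labelling each frame and then scanning the labels in time order. The sketch below assumes that each frame has already been reduced to one (energy, variance) pair; how the per-frame energy is aggregated is not spelled out in the text, so that reduction and the names `classify_frames` and `find_endpoints` are illustrative assumptions.

```python
def classify_frames(features, n1, n2):
    """Label frames in time order: True = effective voice, False = noise.

    features: list of (energy, variance) pairs, one per sampled frame.
    A frame is effective voice only if energy > n1 AND variance > n2.
    """
    return [energy > n1 and variance > n2 for energy, variance in features]


def find_endpoints(labels):
    """Return (start, end) frame indices, or None if no voice frame exists.

    start: first frame labelled effective voice.
    end: first noise frame after start (None if the voice runs to the end).
    """
    try:
        start = labels.index(True)
    except ValueError:
        return None
    for i in range(start + 1, len(labels)):
        if not labels[i]:
            return (start, i)
    return (start, None)
```

With made-up feature values and thresholds inside the ranges the patent gives (n1 = 38000, n2 = 30), the frames `[(10, 1), (50000, 40), (52000, 45), (100, 2), (51000, 50)]` are labelled noise, voice, voice, noise, voice, and `find_endpoints` returns `(1, 3)`: the voice starts at frame 1 and ends at the first noise frame after it.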
Digitized sound data is audio data. Two metrics matter when digitizing sound: the sampling frequency and the sampling size. The sampling frequency is the number of samples taken per unit time: the higher it is, the smaller the interval between sampling points and the more faithful the digitized sound, but the data volume grows and processing becomes harder. The sampling size is the number of bits used to record each sample value and determines the dynamic range of the sampling: the more bits, the finer the recorded variation in the sound and the larger the resulting data volume. Preferably, the preset sampling size is 2048 audio data. If the sampling size is too small, the audio segment obtained is inaccurate and the frequency resolution is too low, so the FFT input has to be zero-padded, which costs CPU resources and time; if it is too large, too many resources are consumed as well. A sampling size of 2048 audio data is therefore adopted, which guarantees adequate resolution without consuming excessive CPU resources.
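The trade-off between sampling size and frequency resolution can be checked with a little arithmetic. The patent does not state the sampling frequency, so the 16 kHz used in the usage note is purely an assumption for illustration, and both helper names are hypothetical.

```python
def frequency_resolution(fs, n):
    """Spacing in Hz between adjacent FFT bins for n samples at rate fs."""
    return fs / n


def bins_in_band(fs, n, f_lo, f_hi):
    """Number of FFT bins (0..n/2) whose frequency lies in [f_lo, f_hi]."""
    return sum(1 for i in range(n // 2 + 1) if f_lo <= i * fs / n <= f_hi)
```

At an assumed 16 kHz, 2048 samples give a bin spacing of 7.8125 Hz and 90 bins inside 300-1000 Hz, whereas 256 samples give 62.5 Hz spacing and only 12 bins in the same band, so the variance would be estimated from far fewer amplitudes.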
Once a segment of speech is converted from the time domain to the frequency domain, it has quantifiable parameters: whether the segment contains frequencies in the range of the human voice, and the energy values at those frequencies. The invention further compares the energy variance of the frequency band with the preset energy value n2, which improves the accuracy of judging the starting point and the end point of the voice to be recognized: for most noise, the energy values across the 100-1000 Hz band differ little, so the variance of noise is small.
The smaller the values of n1 and n2, the more sensitive the detection: the program more readily judges a sound to be human voice rather than noise, but the probability of false triggering is higher. According to the various tests of the project, when the preset energy value n1 is set to 38000-.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.
Claims (4)
1. An efficient speech acquisition method, comprising the steps of:
acquiring a starting point and an end point of a voice to be recognized;
obtaining effective voice of voice to be recognized; the effective voice of the voice to be recognized is a complete voice which starts from the starting point and ends at the ending point;
the method for acquiring the starting point and the end point of the speech to be recognized comprises the following steps:
sequentially sampling the voice to be recognized according to a preset sampling frequency and a preset sampling size to obtain a plurality of sampled audio data, wherein the sampled audio data correspond to a plurality of sampling points of the voice to be recognized; all sampled audio data are subjected to FFT to obtain a plurality of sampled frequency spectrums;
acquiring energy values of all sampling frequency spectrum frequencies at 100-1000 Hz; sequentially comparing the energy values with a preset energy value n1; dividing the finite-length discrete signal x(n) into the sum of an even sequence and an odd sequence: x(n) = x1(n) + x2(n), where x1(n) and x2(n) are both of length N/2, x1(n) is the even sequence, and x2(n) is the odd sequence; and, by the FFT calculation formula,
obtaining N complex frequency-domain values X(k), and taking the modulus of each X(k) to obtain N amplitudes complx(n), where n = 0, 1, ..., N;
the energy value calculation method comprises the following steps: the frequency domain obtained by the FFT is symmetric about N/2, so only N/2 frequencies need to be calculated, namely F = i × FS / N, where F is the frequency to be calculated, i = 0, 1, ..., N/2, N is the number of samples, and FS is the sampling frequency of the audio segment; the N/2 frequencies of the spectrum are obtained and correspond one-to-one with the amplitudes complx(n), so that the amplitude energy corresponding to each frequency is obtained;
acquiring energy variances of all sampling frequency spectrum frequencies within a frequency range of 300-1000 Hz; sequentially comparing the energy variance with a preset energy value n2; in particular, in the energy variance formula, S is the variance value, m is the number of amplitudes exceeding the preset energy value n1, complx(i) is an amplitude exceeding the preset energy value n1, and averageComplx is the average of all amplitudes exceeding the preset energy value n1;
when the energy value acquired when the frequency in the sampling frequency spectrum is located in the frequency band of 300-1000 Hz is greater than a preset energy value n1 and the acquired energy variance is greater than a preset energy value n2, judging that the sampling point corresponding to the sampling frequency spectrum is located in the range of effective voice;
when the energy value obtained when the frequency in the sampling frequency spectrum frequency is located in the frequency range of 300-1000 Hz is not greater than the preset energy value n1 or the obtained energy variance is not greater than the preset energy value n2, judging that the sampling point corresponding to the sampling frequency spectrum is located in the range of noise;
arranging all sampling points within the range of the effective voice in time order to obtain a time-ordered sampling point sequence of the effective voice, and taking the first sampling point in that sequence as the starting point of the effective voice;
and arranging all sampling points that lie within the range of the noise and whose sampling time is after the starting point of the effective voice in time order to obtain a time-ordered noise sampling point sequence, and taking the first sampling point in that sequence as the end point of the effective voice.
2. The efficient speech acquisition method of claim 1, wherein the predetermined sample size is 2048 audio data.
3. The method as claimed in claim 1, wherein the predetermined energy value n1 is 38000-.
4. The efficient speech acquisition method of claim 3 wherein the predetermined energy value n2 is 30-70J.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810956017.2A CN109377982B (en) | 2018-08-21 | 2018-08-21 | Effective voice obtaining method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810956017.2A CN109377982B (en) | 2018-08-21 | 2018-08-21 | Effective voice obtaining method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109377982A (en) | 2019-02-22 |
CN109377982B (en) | 2022-07-05 |
Family
ID=65404358
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810956017.2A Active CN109377982B (en) | 2018-08-21 | 2018-08-21 | Effective voice obtaining method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109377982B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN109377982A (en) | 2019-02-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||