CN109377982B - Effective voice obtaining method - Google Patents

Info

Publication number
CN109377982B
Authority
CN
China
Prior art keywords
sampling
voice
energy value
frequency
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810956017.2A
Other languages
Chinese (zh)
Other versions
CN109377982A (en)
Inventor
赵定金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Baolun Electronics Co Ltd
Original Assignee
Guangzhou Baolun Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Baolun Electronics Co Ltd
Priority to CN201810956017.2A
Publication of CN109377982A
Application granted
Publication of CN109377982B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Abstract

The invention discloses an effective voice obtaining method, which comprises the following steps: acquiring a starting point and an end point of a voice to be recognized; sequentially sampling the voice to be recognized according to a preset sampling frequency and a preset sampling size to obtain sampled audio data, wherein the sampled audio data correspond to a plurality of sampling points of the voice to be recognized; subjecting all sampled audio data to FFT to obtain a plurality of sampled frequency spectrums; when the energy value acquired for frequencies of the sampled frequency spectrum in the 300-1000 Hz band is greater than a preset energy value n1 and the acquired energy variance is greater than a preset energy value n2, judging that the sampling point corresponding to the sampled frequency spectrum lies within the range of effective voice; otherwise, judging that the sampling point corresponding to the sampled frequency spectrum lies within the range of noise; taking the first sampling point in the sampling point sequence of the effective voice as the starting point of the effective voice; and taking the first sampling point in the sampling point sequence of the noise as the end point of the effective voice. The method can accurately obtain the effective voice from the voice to be recognized.

Description

Effective voice obtaining method
Technical Field
The invention relates to the field of voice signal processing, in particular to a method for acquiring effective voice.
Background
In recent decades, key progress has been made in detailed model design, parameter extraction and optimization, and system adaptation techniques. Speech recognition technology has become increasingly mature, its accuracy has gradually improved, and corresponding speech products are available on the market.
In an intelligent recording and broadcasting system, the human-computer interaction experience keeps improving so that the teacher does not need to manage the system: voice command words are recognized to control its common functions, so the teacher can forget that the system exists and concentrate on teaching. When class begins, the teacher only needs to say "start recording" and the system starts recording video. When class ends, saying "stop recording" ends the recording.
At present, command word recognition modules are available on the market, but most of them must be connected to a network to recognize command words, which hinders the use of command word recognition in embedded recording and broadcasting systems; small and efficient command word recognition is therefore very promising for embedded systems.
A small and efficient command word recognition system first needs to detect and process a segment of voice spoken by the teacher and extract the effective voice from it, so that the effective voice can be recognized.
Disclosure of Invention
In view of the above technical problems, an object of the present invention is to provide a method for obtaining effective speech, which can achieve accurate obtaining of effective speech from speech to be recognized.
The invention adopts the following technical scheme:
an effective voice obtaining method, comprising the following steps:
acquiring a starting point and an end point of a voice to be recognized;
obtaining effective voice of voice to be recognized; the effective voice of the voice to be recognized is complete voice which starts from the starting point and ends at the ending point;
the method for acquiring the starting point and the end point of the speech to be recognized comprises the following steps:
sequentially sampling the voice to be recognized according to a preset sampling frequency and a preset sampling size to obtain a plurality of sampled audio data, wherein the sampled audio data correspond to a plurality of sampling points of the voice to be recognized; all sampled audio data are subjected to FFT to obtain a plurality of sampled frequency spectrums;
acquiring energy values of all sampling frequency spectrum frequencies at 100-1000 Hz; sequentially comparing the energy values with a preset energy value n1;
acquiring energy variances of all sampling frequency spectrum frequencies within a frequency range of 300-1000 Hz; sequentially comparing the energy variance with a preset energy value n2;
when the energy value acquired when the frequency in the sampling frequency spectrum is located in the frequency band of 300-1000 Hz is greater than a preset energy value n1 and the acquired energy variance is greater than a preset energy value n2, judging that the sampling point corresponding to the sampling frequency spectrum is located in the range of effective voice;
when the energy value obtained when the frequency in the sampling frequency spectrum frequency is located in the frequency range of 300-1000 Hz is not greater than the preset energy value n1 or the obtained energy variance is not greater than the preset energy value n2, judging that the sampling point corresponding to the sampling frequency spectrum is located in the range of noise;
arranging all sampling points in the range of the complete voice according to a time sequence to obtain a sampling point sequence of the complete voice arranged according to the time sequence, and taking a first sampling point in the sampling point sequence of the effective voice as a starting point of the effective voice;
and arranging all sampling points which are positioned in the range of the noise and the sampling time of which is positioned behind the starting point of the effective voice according to a time sequence to obtain a noise sampling point sequence arranged according to the time sequence, wherein the first sampling point in the noise sampling point sequence is used as the end point of the effective voice.
Further, the preset sample size is 2048 pieces of audio data.
Further, the preset energy value n1 is 38000-60000J.
Further, the preset energy value n2 is 30-70J.
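The FFT step named above can be illustrated with a minimal radix-2 sketch. The text only says "FFT", so the decimation-in-time form shown here is an assumption, and the function names `fft_radix2` and `magnitudes` are hypothetical. The sketch splits a frame into its even- and odd-indexed halves, which is why a power-of-two length such as the preferred 2048 is convenient.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT (assumed form of the FFT step).

    len(x) must be a power of two, e.g. the preferred frame size of 2048.
    """
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])  # transform of the even-indexed samples
    odd = fft_radix2(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle           # X(k)       = X1(k) + W^k * X2(k)
        out[k + n // 2] = even[k] - twiddle  # X(k + N/2) = X1(k) - W^k * X2(k)
    return out

def magnitudes(spectrum):
    """Modulus of each complex frequency-domain value."""
    return [abs(c) for c in spectrum]

# A tone that completes 2 cycles in 8 samples peaks in bins 2 and 6.
tone = [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]
print([round(m, 6) for m in magnitudes(fft_radix2(tone))])
```

The mirrored peak in bin 6 shows the symmetry about N/2 that lets the method inspect only the first half of the spectrum.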
Compared with the prior art, the invention has the beneficial effects that:
the invention detects and processes the voice to be recognized by acquiring its starting point and end point and taking the complete voice that starts at the starting point and ends at the end point as the effective voice, thereby extracting the effective voice from the voice to be recognized so that the effective voice can be recognized. Furthermore, comparing the energy variance of the frequency band with the preset energy value n2 improves the accuracy of judging the starting point and the end point of the voice to be recognized.
Drawings
FIG. 1 is a flow chart of the effective voice obtaining method according to the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific embodiments, and it should be noted that, in the premise of no conflict, the following described embodiments or technical features may be arbitrarily combined to form a new embodiment:
Example:
Referring to FIG. 1, the effective voice obtaining method includes the following steps:
step S100, acquiring a starting point and an end point of a voice to be recognized;
step S200, acquiring effective voice of the voice to be recognized; the effective voice of the voice to be recognized is complete voice which starts from the starting point and ends at the ending point;
the method for acquiring the starting point and the end point of the speech to be recognized comprises the following steps:
step S1001, sampling the voice to be recognized in sequence according to the preset sampling frequency and the preset sampling size to obtain a plurality of sampled audio data, wherein the sampled audio data correspond to a plurality of sampling points of the voice to be recognized; and all the sampled audio data are subjected to FFT to obtain a plurality of sampled frequency spectrums. Specifically: the voice to be recognized is a finite-length discrete signal x(n), n = 0, 1, …, N-1, and the sampling size N preferred in the present invention is 2048. The finite-length discrete signal x(n) is divided into the sum of an even sequence and an odd sequence: x(n) = x1(n) + x2(n), where x1(n) and x2(n) are both of length N/2, x1(n) is the even sequence, and x2(n) is the odd sequence. The FFT calculation formula is:
Figure BDA0001772692070000041
N complex frequency-domain values X(k) are obtained, and taking the modulus of each X(k) gives N amplitudes complx(n), n = 0, 1, …, N-1;
step S1002, acquiring energy values of all sampling frequency spectrum frequencies at 100-1000 Hz; sequentially comparing the energy values with a preset energy value n1; the energy value calculation method is as follows: the frequency domain obtained by the FFT is symmetric about N/2, so only N/2 frequencies need to be calculated; the frequency to be calculated is i × FS / N, where i = 0, 1, …, N/2, N is the number of samples, and FS is the sampling frequency of the audio segment; this gives the N/2 frequencies of the spectrum, which correspond one-to-one with the amplitudes complx(n), so that the amplitude (energy) corresponding to each frequency is obtained;
step S1003, acquiring energy variances of all sampling frequency spectrum frequencies within the frequency range of 300-1000 Hz; sequentially comparing the energy variance with a preset energy value n2; specifically, the energy variance formula is:
Figure BDA0001772692070000051
(where S is the variance value, m is the number of amplitudes exceeding the preset energy value n1, complx(i) is an amplitude exceeding the preset energy value n1, and averageComplx is the average of all amplitudes exceeding the preset energy value n1);
step S1004, when the energy value acquired by the frequency in the frequency band of 300-1000 Hz in the sampling frequency spectrum is greater than a preset energy value n1 and the acquired energy variance is greater than a preset energy value n2, judging that the sampling point corresponding to the sampling frequency spectrum is located in the range of effective voice;
step S1005, arranging all sampling points in the range of the complete voice according to a time sequence to obtain a sampling point sequence of the complete voice arranged according to the time sequence, and taking a first sampling point in the sampling point sequence of the effective voice as a starting point of the effective voice;
step S1006, when the energy value obtained when the frequency in the sampling frequency spectrum is in the frequency band of 300-1000 Hz is not greater than a preset energy value n1 or the energy variance is not greater than a preset energy value n2, judging that the sampling point corresponding to the sampling frequency spectrum is in the range of noise;
and step S1007, arranging all sampling points which are positioned in the range of the noise and the sampling time of which is positioned behind the starting point of the effective voice according to the time sequence to obtain a noise sampling point sequence arranged according to the time sequence, and taking the first sampling point in the noise sampling point sequence as the end point of the effective voice. The time sequence arrangement refers to the time sequence of the appearance of the sampling points in the speech to be recognized. The sampling time sequence of the sampling points is also the time sequence of the appearance of the sampling points in the voice to be recognized.
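Steps S1005 to S1007 can be sketched as a search over per-frame decisions in time order. This is a minimal illustration, not the patented implementation: the names `find_endpoints` and `extract_effective_voice` are hypothetical, each sampling point is assumed to be one frame of `frame_size` samples, the end point is treated as exclusive, and the end of the recording is used as the end point when no noise frame follows the start.

```python
def find_endpoints(frame_is_voice):
    """Return (start, end) frame indices, or None if no frame is effective voice.

    start: first frame judged to be effective voice.
    end:   first frame judged to be noise after start (assumption: if there is
           none, the end of the recording is used).
    """
    start = next((i for i, v in enumerate(frame_is_voice) if v), None)
    if start is None:
        return None
    end = next(
        (i for i in range(start + 1, len(frame_is_voice)) if not frame_is_voice[i]),
        len(frame_is_voice),
    )
    return start, end

def extract_effective_voice(samples, frame_is_voice, frame_size=2048):
    """Slice the samples lying between the start frame and the end frame."""
    endpoints = find_endpoints(frame_is_voice)
    if endpoints is None:
        return []
    start, end = endpoints
    return samples[start * frame_size : end * frame_size]

flags = [False, False, True, True, False, True, False]
print(find_endpoints(flags))  # (2, 4)
```

Note that a later voice frame (index 5 above) is ignored, because the method takes the first noise frame after the start as the end point.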
Digitized sound data is audio data. Two metrics matter when digitizing sound: the sampling frequency and the sampling size. The sampling frequency is the number of samples taken per unit time; the higher it is, the smaller the interval between sampling points and the more faithful the digitized sound, but the data volume grows and processing becomes harder. The sampling size is the number of digits used to record each sample value, which determines the dynamic range of sampling; the more digits, the finer the recorded variation in the sound and the larger the resulting data volume. Preferably, the preset sampling size is 2048 audio data. If the sampling size is too small, the audio segment obtained is inaccurate and the frequency resolution is too low, so zero padding is needed for the FFT, and zero padding costs CPU resources and time; if it is too large, too many resources are consumed as well. A sampling size of 2048 audio data is therefore adopted, which preserves the resolution without consuming too many CPU resources.
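The link between the sampling size and the frequency resolution can be made concrete with a short calculation. The sketch below assumes a sampling frequency of 16000 Hz, which the text does not specify, and uses the hypothetical helper names `bin_frequency` and `band_bin_indices`; it maps FFT bin indices to frequencies with i × FS / N and lists the bins that fall inside the 300-1000 Hz band.

```python
def bin_frequency(i, n_samples, sample_rate):
    """Frequency in Hz of FFT bin i for a frame of n_samples samples."""
    return i * sample_rate / n_samples

def band_bin_indices(n_samples, sample_rate, f_lo, f_hi):
    """Indices of the bins in the first half of the spectrum within [f_lo, f_hi]."""
    return [
        i
        for i in range(n_samples // 2 + 1)
        if f_lo <= bin_frequency(i, n_samples, sample_rate) <= f_hi
    ]

N = 2048    # preferred sampling size from the text
FS = 16000  # assumed sampling frequency in Hz (not given in the text)

resolution = FS / N
bins = band_bin_indices(N, FS, 300.0, 1000.0)
print(resolution, bins[0], bins[-1], len(bins))  # 7.8125 39 128 90
```

Under this assumption, halving the sampling size to 1024 would double the bin spacing to 15.625 Hz and roughly halve the number of bins available in the band.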
A segment of voice is converted from the time domain to the frequency domain, at which point it has quantifiable parameters: the frequency range of human voice indicates whether the segment contains voice frequencies, together with the corresponding frequency energy values. The invention further compares the energy variance of the frequency band with the preset energy value n2, which improves the accuracy of judging the starting point and the end point of the voice to be recognized: for most noise in the 100-1000 Hz range, the energy values of the individual frequencies differ little, so the variance of noise is small.
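The variance argument can be checked on toy numbers. The formula image is not reproduced in the text, so this sketch assumes the population form S = (1/m) Σ (complx(i) − averageComplx)², computed only over amplitudes that exceed n1; a sample variance with m − 1 in the denominator is equally possible. The function name `energy_variance` and the amplitude values are illustrative only.

```python
def energy_variance(amplitudes, n1):
    """Variance of the amplitudes that exceed n1 (assumed population form)."""
    strong = [a for a in amplitudes if a > n1]
    m = len(strong)
    if m == 0:
        return 0.0  # assumption: no amplitude above n1 means no variance to report
    average = sum(strong) / m
    return sum((a - average) ** 2 for a in strong) / m

# Toy spectra, not measured data: noise is nearly flat, voice has distinct peaks.
flat_noise = [40.0, 41.0, 39.0, 40.0]
peaky_voice = [40.0, 90.0, 15.0, 60.0]

print(energy_variance(flat_noise, n1=10.0))   # 0.5
print(energy_variance(peaky_voice, n1=10.0))  # 754.6875
```

Both spectra have similar total energy, yet only the peaky one has a large variance, which is the property the n2 comparison relies on.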
The smaller the values of n1 and n2, the more sensitive the detection: the program more readily judges a sound to be human voice rather than noise, but the probability of false triggering is higher. According to the various tests of the project, when the preset energy value n1 is set to 38000-.
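The sensitivity trade-off can be illustrated with a small per-frame decision function. This is a sketch under stated assumptions: the hypothetical `is_effective_voice` treats a frame as voice when at least one band amplitude exceeds n1 and the variance of those amplitudes exceeds n2, and the amplitude values are invented for illustration. The thresholds 38000 and 30 are the lower bounds of the ranges given earlier in the document.

```python
from statistics import pvariance

def is_effective_voice(band_amplitudes, n1, n2):
    """Return True when a frame's band amplitudes pass both thresholds.

    Assumed interpretation: only amplitudes above n1 count, and their
    population variance must exceed n2.
    """
    strong = [a for a in band_amplitudes if a > n1]
    if not strong:
        return False
    return pvariance(strong) > n2

quiet_frame = [1000.0, 1200.0, 900.0]            # illustrative background noise
voice_frame = [39000.0, 52000.0, 45000.0, 1000.0]  # illustrative speech frame

print(is_effective_voice(quiet_frame, n1=38000, n2=30))  # False
print(is_effective_voice(voice_frame, n1=38000, n2=30))  # True
print(is_effective_voice(quiet_frame, n1=500, n2=30))    # True (false trigger)
```

Lowering n1 from 38000 to 500 makes the quiet frame pass, which is the false triggering the paragraph warns about.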
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.

Claims (4)

1. An effective voice obtaining method, comprising the steps of:
acquiring a starting point and an end point of a voice to be recognized;
obtaining effective voice of voice to be recognized; the effective voice of the voice to be recognized is a complete voice which starts from the starting point and ends at the ending point;
the method for acquiring the starting point and the end point of the speech to be recognized comprises the following steps:
sequentially sampling the voice to be recognized according to a preset sampling frequency and a preset sampling size to obtain a plurality of sampled audio data, wherein the sampled audio data correspond to a plurality of sampling points of the voice to be recognized; all sampled audio data are subjected to FFT to obtain a plurality of sampled frequency spectrums;
acquiring energy values of all sampling frequency spectrum frequencies at 100-1000 Hz; sequentially comparing the energy value with a preset energy value n1; dividing the finite-length discrete signal x(n) into the sum of two sequences of even and odd numbers, resulting in: x(n) = x1(n) + x2(n); x1(n) and x2(n) are both N/2 in length, x1(n) is an even sequence, and x2(n) is an odd sequence; calculation formula by FFT Fourier transform:
Figure FDA0003663058740000011
obtaining N complex frequency-domain values X(k), and taking the modulus of the obtained complex values X(k) to obtain N amplitudes complx(n), where n = 0, 1, …, N-1;
the energy value calculation method comprises the following steps: the frequency domain obtained by the FFT Fourier transform is symmetric about N/2, that is, only N/2 frequencies need to be calculated, that is, i × FS / N is the frequency to be calculated, where i = 0, 1, …, N/2, N is the number of samples, and FS is the sampling frequency of the section of audio; the N/2 frequencies of the section of spectrum are obtained, and the amplitude (energy) corresponding to each frequency is obtained in one-to-one correspondence with the amplitudes complx(n);
acquiring energy variances of all sampling frequency spectrum frequencies within a frequency range of 300-1000 Hz; sequentially comparing the energy variance with a preset energy value n2; in particular, the energy variance formula
Figure FDA0003663058740000021
wherein S is the variance value, m is the number of amplitudes exceeding the preset energy value n1, complx(i) is an amplitude exceeding the preset energy value n1, and averageComplx is the average of all amplitudes exceeding the preset energy value n1;
when the energy value acquired when the frequency in the sampling frequency spectrum is located in the frequency band of 300-1000 Hz is greater than a preset energy value n1 and the acquired energy variance is greater than a preset energy value n2, judging that the sampling point corresponding to the sampling frequency spectrum is located in the range of effective voice;
when the energy value obtained when the frequency in the sampling frequency spectrum frequency is located in the frequency range of 300-1000 Hz is not greater than the preset energy value n1 or the obtained energy variance is not greater than the preset energy value n2, judging that the sampling point corresponding to the sampling frequency spectrum is located in the range of noise;
arranging all sampling points in the range of the complete voice according to a time sequence to obtain a sampling point sequence of the complete voice arranged according to the time sequence, and taking a first sampling point in the sampling point sequence of the effective voice as a starting point of the effective voice;
and arranging all sampling points which are positioned in the range of the noise and the sampling time of which is positioned behind the starting point of the effective voice according to a time sequence to obtain a noise sampling point sequence arranged according to the time sequence, wherein the first sampling point in the noise sampling point sequence is used as the end point of the effective voice.
2. The effective voice obtaining method of claim 1, wherein the preset sampling size is 2048 audio data.
3. The method as claimed in claim 1, wherein the predetermined energy value n1 is 38000-.
4. The effective voice obtaining method of claim 3, wherein the preset energy value n2 is 30-70J.
CN201810956017.2A 2018-08-21 2018-08-21 Effective voice obtaining method Active CN109377982B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810956017.2A CN109377982B (en) 2018-08-21 2018-08-21 Effective voice obtaining method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810956017.2A CN109377982B (en) 2018-08-21 2018-08-21 Effective voice obtaining method

Publications (2)

Publication Number Publication Date
CN109377982A (en) 2019-02-22
CN109377982B (en) 2022-07-05

Family

ID=65404358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810956017.2A Active CN109377982B (en) 2018-08-21 2018-08-21 Effective voice obtaining method

Country Status (1)

Country Link
CN (1) CN109377982B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210893A (en) * 2019-05-09 2019-09-06 秒针信息技术有限公司 Generation method, device, storage medium and the electronic device of report
CN110365555B (en) * 2019-08-08 2021-12-10 广州虎牙科技有限公司 Audio delay testing method and device, electronic equipment and readable storage medium
CN110428853A (en) * 2019-08-30 2019-11-08 北京太极华保科技股份有限公司 Voice activity detection method, Voice activity detection device and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5579431A (en) * 1992-10-05 1996-11-26 Panasonic Technologies, Inc. Speech detection in presence of noise by determining variance over time of frequency band limited energy
CN101625857B (en) * 2008-07-10 2012-05-09 新奥特(北京)视频技术有限公司 Self-adaptive voice endpoint detection method
CN104021789A (en) * 2014-06-25 2014-09-03 厦门大学 Self-adaption endpoint detection method using short-time time-frequency value
US9672841B2 (en) * 2015-06-30 2017-06-06 Zte Corporation Voice activity detection method and method used for voice activity detection and apparatus thereof
CN105467428A (en) * 2015-11-17 2016-04-06 南京航空航天大学 Seismic wave warning method based on short-time energy detection and spectrum feature analysis
CN106601230B (en) * 2016-12-19 2020-06-02 苏州金峰物联网技术有限公司 Logistics sorting place name voice recognition method and system based on continuous Gaussian mixture HMM model and logistics sorting system

Also Published As

Publication number Publication date
CN109377982A (en) 2019-02-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant