CN109377982B

CN109377982B - Effective voice obtaining method

Info

Publication number: CN109377982B
Application number: CN201810956017.2A
Authority: CN
Inventors: 赵定金
Original assignee: Guangzhou Baolun Electronics Co Ltd
Current assignee: Guangzhou Baolun Electronics Co Ltd
Priority date: 2018-08-21
Filing date: 2018-08-21
Publication date: 2022-07-05
Anticipated expiration: 2038-08-21
Also published as: CN109377982A

Abstract

The invention discloses an effective voice acquisition method comprising the following steps: acquiring a starting point and an end point of a voice to be recognized; sequentially sampling the voice to be recognized according to a preset sampling frequency and a preset sampling size, wherein the sampled audio data correspond to a plurality of sampling points of the voice to be recognized; applying an FFT to all sampled audio data to obtain a plurality of sampled frequency spectra; when the energy value obtained for frequencies of a sampled spectrum within the 300-1000 Hz band is greater than a preset energy value n1 and the obtained energy variance is greater than a preset energy value n2, judging that the sampling point corresponding to that spectrum lies within the range of effective voice; otherwise, judging that the sampling point corresponding to that spectrum lies within the range of noise; taking the first sampling point in the sampling point sequence of the effective voice as the starting point of the effective voice; and taking the first sampling point in the sampling point sequence of the noise as the end point of the effective voice. The method enables the effective voice to be obtained accurately from the voice to be recognized.

Description

Effective voice obtaining method

Technical Field

The invention relates to the field of voice signal processing, in particular to a method for acquiring effective voice.

Background

In recent decades, key progress has been made in detailed model design, parameter extraction and optimization, and system adaptation techniques. Speech recognition technology has become increasingly mature, its accuracy has gradually improved, and corresponding speech products are available on the market.

In an intelligent recording and broadcasting system, the human-machine interaction experience is continuously improved so that a teacher does not need to manage the system: voice command words are recognized to control its common functions, and the teacher can forget that the system exists and concentrate on teaching. At the start of a class, the teacher only needs to say "start recording" and the system starts to record video. At the end of the class, saying "stop recording" ends the recording.

At present, command word recognition modules are available on the market, but most of them rely on a network connection to perform recognition, which hinders the use of command word recognition in embedded recording and broadcasting systems. Small and efficient command word recognition is therefore very promising for embedded systems.

A small and efficient command word recognition system first needs to detect and process a section of voice spoken by the teacher and extract the effective voice from it, so that the effective voice can be recognized.

Disclosure of Invention

In view of the above technical problems, an object of the present invention is to provide an effective voice obtaining method that can accurately obtain the effective voice from the voice to be recognized.

The invention adopts the following technical scheme:

an effective voice acquisition method, comprising the following steps:

acquiring a starting point and an end point of a voice to be recognized;

obtaining effective voice of voice to be recognized; the effective voice of the voice to be recognized is complete voice which starts from the starting point and ends at the ending point;

the method for acquiring the starting point and the end point of the speech to be recognized comprises the following steps:

sequentially sampling the voice to be recognized according to a preset sampling frequency and a preset sampling size to obtain a plurality of sampled audio data, wherein the sampled audio data correspond to a plurality of sampling points of the voice to be recognized; all sampled audio data are subjected to FFT to obtain a plurality of sampled frequency spectrums;

acquiring energy values of all sampling frequency spectrum frequencies at 100-1000 Hz; sequentially comparing the energy values with a preset energy value n1;

acquiring energy variances of all sampling frequency spectrum frequencies within a frequency range of 300-1000 Hz; sequentially comparing the energy variances with a preset energy value n2;

when the energy value acquired when the frequency in the sampling frequency spectrum is located in the frequency band of 300-1000 Hz is greater than a preset energy value n1 and the acquired energy variance is greater than a preset energy value n2, judging that the sampling point corresponding to the sampling frequency spectrum is located in the range of effective voice;

when the energy value obtained when the frequency in the sampling frequency spectrum is located in the frequency band of 300-1000 Hz is not greater than the preset energy value n1 or the obtained energy variance is not greater than the preset energy value n2, judging that the sampling point corresponding to the sampling frequency spectrum is located in the range of noise;

arranging all sampling points in the range of the effective voice according to a time sequence to obtain a sampling point sequence of the effective voice arranged according to the time sequence, and taking the first sampling point in the sampling point sequence of the effective voice as the starting point of the effective voice;

and arranging all sampling points which are positioned in the range of the noise and the sampling time of which is positioned behind the starting point of the effective voice according to a time sequence to obtain a noise sampling point sequence arranged according to the time sequence, wherein the first sampling point in the noise sampling point sequence is used as the end point of the effective voice.

Further, the preset sample size is 2048 pieces of audio data.

Further, the preset energy value n1 is 38000-60000J.

Further, the preset energy value n2 is 30-70J.

Compared with the prior art, the invention has the beneficial effects that:

the method detects and processes the voice to be recognized by acquiring its starting point and end point and taking the complete voice that starts at the starting point and ends at the end point as the effective voice, thereby extracting the effective voice from the voice to be recognized so that it can be recognized. Furthermore, comparing the energy variance of the frequency band with the preset energy value n2 improves the accuracy of judging the starting point and the end point of the voice to be recognized.

Drawings

FIG. 1 is a flow chart of an effective voice acquisition method according to the present invention.

Detailed Description

The present invention is further described below with reference to the accompanying drawings and specific embodiments. It should be noted that, provided there is no conflict, the embodiments or technical features described below may be combined arbitrarily to form new embodiments:

Example:

Referring to FIG. 1, the effective voice obtaining method includes the following steps:

step S100, acquiring a starting point and an end point of a voice to be recognized;

step S200, acquiring effective voice of the voice to be recognized; the effective voice of the voice to be recognized is the complete voice which starts from the starting point and ends at the end point;

step S1001, sampling the voice to be recognized in sequence according to the preset sampling frequency and sampling size to obtain a plurality of sampled audio data, wherein the sampled audio data correspond to a plurality of sampling points of the voice to be recognized; and applying an FFT to all the sampled audio data to obtain a plurality of sampled frequency spectra. Specifically: the voice to be recognized is a finite-length discrete signal x(n), n = 0, 1, …, N-1, and the sampling size N preferred in the present invention is 2048. The finite-length discrete signal x(n) is divided into the sum of an even sequence and an odd sequence, giving x(n) = x₁(n) + x₂(n), where x₁(n) and x₂(n) both have length N/2, x₁(n) is the even sequence, and x₂(n) is the odd sequence. Applying the FFT yields N complex frequency-domain values X(k), and taking the modulus of each X(k) gives N amplitudes complx(n) (n = 0, 1, …, N);
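The even/odd split in step S1001 matches the usual radix-2 decimation-in-time FFT. The following is a minimal sketch of that technique, not the patent's implementation: the function names `fft_radix2` and `magnitudes` are hypothetical, and the indexing x₁(r) = x(2r), x₂(r) = x(2r+1) is an assumption, since the text names the two subsequences without giving their indices.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])  # x1: even-indexed samples (assumed indexing)
    odd = fft_radix2(x[1::2])   # x2: odd-indexed samples (assumed indexing)
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor W_N^k applied to the odd half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def magnitudes(frame):
    """Modulus of each complex FFT output, i.e. the amplitudes complx(n)."""
    return [abs(c) for c in fft_radix2(frame)]
```

For a real-valued frame the amplitudes are mirrored about N/2, which is why the later steps only need the lower half of the spectrum.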

step S1002, acquiring energy values of all sampling frequency spectrum frequencies at 100-1000 Hz; sequentially comparing the energy values with a preset energy value n1. The energy value is calculated as follows: the frequency domain obtained by the FFT is symmetric about N/2, so only N/2 frequencies need to be calculated, namely F = i × FS / N, where F is the frequency to be calculated, i = 0, 1, …, N/2, N is the number of samples, and FS is the sampling frequency of the section of audio. This gives the N/2 frequencies of the section of frequency spectrum, which correspond one-to-one with the amplitudes complx(n), so that the amplitude (energy) corresponding to each frequency is obtained;
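The mapping from spectrum index to frequency follows the usual DFT relation, where index i corresponds to i × FS / N Hz. A minimal sketch of selecting the amplitudes inside a band is given below; the function names are hypothetical, and treating the band limits as inclusive is an assumption.

```python
def bin_frequency(i, fs, n):
    """Frequency in Hz of FFT bin i for sampling rate fs and frame length n."""
    return i * fs / n


def band_amplitudes(amplitudes, fs, low_hz, high_hz):
    """Amplitudes of the lower-half bins whose frequency lies in [low_hz, high_hz]."""
    n = len(amplitudes)
    return [
        amplitudes[i]
        for i in range(n // 2 + 1)  # upper half mirrors the lower half
        if low_hz <= bin_frequency(i, fs, n) <= high_hz
    ]
```

With a hypothetical sampling frequency of 16000 Hz (the text does not fix one) and N = 2048, adjacent bins are 7.8125 Hz apart, so the 300-1000 Hz band covers bins 39 to 128.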

step S1003, acquiring energy variances of all sampling frequency spectrum frequencies within the frequency range of 300-1000 Hz; sequentially comparing the energy variances with a preset energy value n2. Specifically, in the energy variance formula, S is the variance value, m is the number of amplitudes exceeding the preset energy value n1, complx(i) is an amplitude exceeding the preset energy value n1, and averageComplx is the average of all amplitudes exceeding the preset energy value n1;
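Using the symbols defined for step S1003, the variance can be written out as follows. This is a sketch under an assumption: the normalisation by m (population variance) is assumed, since only the symbols are defined in the text; a sample variance would divide by m - 1 instead.

```latex
\[
S = \frac{1}{m} \sum_{i=1}^{m} \bigl( \mathrm{complx}(i) - \mathrm{averageComplx} \bigr)^{2},
\qquad
\mathrm{averageComplx} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{complx}(i)
\]
```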

step S1004, when the energy value acquired by the frequency in the frequency band of 300-1000 Hz in the sampling frequency spectrum is greater than a preset energy value n1 and the acquired energy variance is greater than a preset energy value n2, judging that the sampling point corresponding to the sampling frequency spectrum is located in the range of effective voice;

step S1005, arranging all sampling points in the range of the effective voice according to a time sequence to obtain a sampling point sequence of the effective voice arranged according to the time sequence, and taking the first sampling point in the sampling point sequence of the effective voice as the starting point of the effective voice;

step S1006, when the energy value obtained when the frequency in the sampling frequency spectrum is in the frequency band of 300-1000 Hz is not greater than a preset energy value n1 or the energy variance is not greater than a preset energy value n2, judging that the sampling point corresponding to the sampling frequency spectrum is in the range of noise;
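The two-threshold decision of steps S1004 and S1006 can be sketched as below. Several points are assumptions rather than statements of the patent: the function names are hypothetical, the energy test is taken to pass when at least one band amplitude exceeds n1, and the variance divides by m.

```python
def energy_variance(values):
    """Population variance of the amplitudes (division by m is an assumption)."""
    m = len(values)
    mean = sum(values) / m
    return sum((v - mean) ** 2 for v in values) / m


def is_valid_voice(band_amps, n1, n2):
    """True if the frame passes both the energy test and the variance test."""
    above = [a for a in band_amps if a > n1]  # the m amplitudes exceeding n1
    if not above:
        return False  # energy test fails: the frame is treated as noise
    return energy_variance(above) > n2
```

A frame whose band amplitudes are nearly equal fails the variance test even when they exceed n1, which is how evenly spread noise is rejected.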

and step S1007, arranging all sampling points which are located in the range of the noise and whose sampling time is after the starting point of the effective voice according to the time sequence to obtain a noise sampling point sequence arranged according to the time sequence, and taking the first sampling point in the noise sampling point sequence as the end point of the effective voice. The time sequence refers to the order in which the sampling points appear in the voice to be recognized, which is also their sampling order.
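Given one valid/noise flag per sampled frame, the endpoint rules of steps S1005 and S1007 and the extraction of step S200 can be sketched as follows. The function names are hypothetical, and two behaviours are assumptions because the text does not specify them: the end frame is excluded from the extracted voice, and the end falls back to the end of the input when no noise frame follows the start.

```python
def find_endpoints(flags):
    """Return (start, end) frame indices, or None if no frame is valid voice.

    flags[i] is True when frame i was judged valid voice and False for noise.
    start is the first valid frame; end is the first noise frame after start.
    """
    try:
        start = flags.index(True)
    except ValueError:
        return None
    for i in range(start + 1, len(flags)):
        if not flags[i]:
            return start, i
    return start, len(flags)  # assumed fallback: voice runs to the end


def extract_valid_voice(samples, flags, frame_size=2048):
    """Slice the samples from the start frame up to (not including) the end frame."""
    endpoints = find_endpoints(flags)
    if endpoints is None:
        return []
    start, end = endpoints
    return samples[start * frame_size:end * frame_size]
```

Note that under this rule the first noise frame after the start closes the segment, so a short pause inside a command would end it early; any smoothing over several frames would be an extension beyond what the text describes.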

The digitized sound data is audio data. Two metrics are important when digitizing sound: the sampling frequency and the sampling size. The sampling frequency is the number of samples taken per unit time; the higher the sampling frequency, the smaller the interval between sampling points and the more faithful the digitized sound, but the data volume increases accordingly and processing becomes harder. The sampling size is the number of bits used to record each sample value and determines the dynamic range of the sampling; the more bits, the finer the recorded variation of the sound and the larger the resulting data volume. Preferably, the preset sampling size is 2048 audio data. If the sampling size is too small, the section of audio obtained is inaccurate and the frequency resolution is too low, so zero-padding is needed for the FFT, which consumes CPU resources and time; if the sampling size is too large, too many resources are consumed as well. A sampling size of 2048 audio data is therefore adopted, which guarantees the resolution without consuming excessive CPU resources.
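The trade-off behind the choice of 2048 can be made concrete with a little arithmetic: the spacing between adjacent FFT frequencies is FS / N, while the duration of one frame is N / FS. The sketch below uses hypothetical function names and a hypothetical sampling frequency of 16000 Hz, which the text does not specify.

```python
def frequency_resolution(fs, n):
    """Spacing in Hz between adjacent FFT bins for sampling rate fs and frame length n."""
    return fs / n


def frame_duration_ms(fs, n):
    """Duration in milliseconds of one frame of n samples."""
    return 1000.0 * n / fs


FS = 16000  # hypothetical sampling frequency; not given in the text
for n in (256, 1024, 2048):
    print(n, frequency_resolution(FS, n), frame_duration_ms(FS, n))
```

Under this assumed rate, a 256-sample frame resolves only 62.5 Hz steps, which leaves about a dozen bins in the 300-1000 Hz band, while 2048 samples give 7.8125 Hz steps at the cost of a 128 ms frame.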

Once a section of voice is converted from the time domain to the frequency domain, it has quantifiable parameters: within the frequency range of voice, it can be judged whether the section contains voice frequencies and what the corresponding energy values are. The invention further compares the energy variance of the frequency band with the preset energy value n2, which improves the accuracy of judging the starting point and the end point of the voice to be recognized: for most noise, the energy values of the frequencies within 100-1000 Hz differ little from one another, so the noise variance values are small.

The smaller the values of n1 and n2, the more sensitive the detection: the program more readily judges a sound to be human voice rather than noise, but the probability of false triggering is higher. According to the various tests of the project, when the preset energy value n1 is set to 38000-.

Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.

Claims

1. An effective voice acquisition method, comprising the steps of:

acquiring a starting point and an end point of a voice to be recognized;

obtaining effective voice of voice to be recognized; the effective voice of the voice to be recognized is a complete voice which starts from the starting point and ends at the ending point;

acquiring energy values of all sampling frequency spectrum frequencies at 100-1000 Hz; sequentially comparing the energy values with a preset energy value n1; dividing the finite-length discrete signal x(n) into the sum of an even sequence and an odd sequence, giving x(n) = x₁(n) + x₂(n), where x₁(n) and x₂(n) both have length N/2, x₁(n) is the even sequence, and x₂(n) is the odd sequence; applying the FFT to obtain N complex frequency-domain values X(k), and taking the modulus of each X(k) to obtain N amplitudes complx(n), where n = 0, 1, …, N;

the energy value is calculated as follows: the frequency domain obtained by the FFT is symmetric about N/2, so only N/2 frequencies need to be calculated, namely F = i × FS / N, where F is the frequency to be calculated, i = 0, 1, …, N/2, N is the number of samples, and FS is the sampling frequency of the section of audio; the N/2 frequencies of the section of spectrum are obtained, which correspond one-to-one with the amplitudes complx(n), so that the amplitude (energy) corresponding to each frequency is obtained;

acquiring energy variances of all sampling frequency spectrum frequencies within a frequency range of 300-1000 Hz; sequentially comparing the energy variances with a preset energy value n2; specifically, in the energy variance formula, S is the variance value, m is the number of amplitudes exceeding the preset energy value n1, complx(i) is an amplitude exceeding the preset energy value n1, and averageComplx is the average of all amplitudes exceeding the preset energy value n1;

2. The effective voice acquisition method of claim 1, wherein the preset sampling size is 2048 audio data.

3. The method as claimed in claim 1, wherein the predetermined energy value n1 is 38000-.

4. The effective voice acquisition method of claim 3, wherein the preset energy value n2 is 30-70J.