CN109377982A

CN109377982A - An efficient voice acquisition method

Info

Publication number: CN109377982A
Application number: CN201810956017.2A
Authority: CN
Inventors: 赵定金
Original assignee: Guangzhou Baolun Electronics Co Ltd
Current assignee: Guangzhou Baolun Electronics Co Ltd
Priority date: 2018-08-21
Filing date: 2018-08-21
Publication date: 2019-02-22
Anticipated expiration: 2038-08-21
Also published as: CN109377982B

Abstract

The invention discloses an efficient voice acquisition method comprising the following steps: obtaining the starting point and end point of the voice to be identified; sampling the voice to be identified in sequence according to a preset sample frequency and sample size to obtain sampled audio data corresponding to several sampling points of the voice to be identified; passing all sampled audio data in turn through an FFT (fast Fourier transform) to obtain several sampling frequency points; when, for a sampling frequency point, the energy value obtained in the 300~1000 Hz band is greater than a preset energy value n1 and the obtained energy variance is greater than a preset energy value n2, judging that the corresponding sample is in the range of efficient voice; otherwise, judging that the corresponding sample is in the range of noise; taking the first sampling point in the sampling point sequence of the efficient voice as the starting point of the efficient voice; and taking the first sampling point in the sampling point sequence of the noise as the end point of the efficient voice. The method makes it possible to accurately obtain the efficient voice from the voice to be identified.

Description

An efficient voice acquisition method

Technical field

The present invention relates to the field of voice signals, and in particular to a method for acquiring efficient voice.

Background technique

Over the past decade, key advances have been made in refined model design, parameter extraction and optimization, and system adaptation techniques. Speech recognition technology has become increasingly mature, its accuracy has steadily improved, and corresponding speech products are available on the market.

In intelligent recording and broadcasting systems, the human-machine interaction experience is continuously being improved so that teachers do not need to manage the system themselves. Voice command words are recognized and then used to control the common functions of the recording and broadcasting system, so the teacher can forget the presence of the system and concentrate on teaching. The teacher only needs to say "start recording" when class begins, and the system starts recording video; saying "stop recording" at the end of class completes the recording of one lesson.

Command word recognition modules are currently available on the market, but most applications must be networked to recognize command words, which hinders the use of command word recognition in embedded recording and broadcasting systems. Small, efficient command word recognition is very promising in embedded systems.

A small, efficient command word recognition system first needs to detect and process a segment of speech spoken by the teacher and extract the efficient voice from it, so that the efficient voice can be recognized.

Summary of the invention

In view of the above technical problem, the purpose of the present invention is to provide a method for acquiring efficient voice that can accurately obtain the efficient voice from the voice to be identified.

The invention adopts the following technical solution:

An efficient voice acquisition method, comprising the following steps:

Obtain the starting point and end point of the voice to be identified;

Obtain the efficient voice of the voice to be identified; the efficient voice of the voice to be identified is the complete speech that starts at the starting point and ends at the end point;

Obtaining the starting point and end point of the voice to be identified comprises the following steps:

Sample the voice to be identified in sequence according to a preset sample frequency and sample size to obtain several items of sampled audio data, the sampled audio data corresponding to several sampling points of the voice to be identified; pass all sampled audio data in turn through an FFT (fast Fourier transform) to obtain several sampling frequency points;

Obtain the energy values of all sampling frequency points whose frequency lies in 100~1000 Hz, and compare the energy values in turn with the preset energy value n1;

Obtain the energy variance of all sampling frequency points whose frequency lies in the 300~1000 Hz band, and compare the energy variance in turn with the preset energy value n2;

When, for a sampling frequency point, the energy value obtained in the 300~1000 Hz band is greater than the preset energy value n1 and the obtained energy variance is greater than the preset energy value n2, the sample corresponding to that sampling frequency point is judged to be in the range of efficient voice;

When, for a sampling frequency point, the energy value obtained in the 300~1000 Hz band is not greater than the preset energy value n1, or the obtained energy variance is not greater than the preset energy value n2, the sample corresponding to that sampling frequency point is judged to be in the range of noise;

All sampling points located in the range of the complete speech are arranged in chronological order to obtain a chronologically ordered sampling point sequence of the complete speech, and the first sampling point in the sampling point sequence of the efficient voice is taken as the starting point of the efficient voice;

The sampling points that are located in the range of noise and whose sampling time is after the starting point of the efficient voice are arranged in chronological order to obtain a chronologically ordered sampling point sequence of the noise, and the first sampling point in the sampling point sequence of the noise is taken as the end point of the efficient voice.

Further, the preset sample size is 2048 audio data items.

Further, the preset energy value n1 is 38000-60000J.

Further, the preset energy value n2 is 30-70J.

Compared with prior art, the beneficial effects of the present invention are:

By obtaining the starting point and end point of the voice to be identified, the present invention takes the complete speech that starts at the starting point and ends at the end point as the efficient voice. The voice to be identified is thereby detected and processed, and the efficient voice is extracted from it so that it can be recognized. Further, comparing the energy variance of the frequency band with the preset energy value n2 improves the accuracy of judging the starting point and end point of the voice to be identified.

Brief description of the drawings

Fig. 1 is a flow diagram of the efficient voice acquisition method of the present invention.

Specific embodiment

The present invention is further described below with reference to the accompanying drawing and specific embodiments. It should be noted that, provided they do not conflict, the embodiments or technical features described below may be combined arbitrarily to form new embodiments:

Embodiment:

Referring to Fig. 1, the efficient voice acquisition method comprises the following steps:

Step S100: obtain the starting point and end point of the voice to be identified;

Step S200: obtain the efficient voice of the voice to be identified; the efficient voice of the voice to be identified is the complete speech that starts at the starting point and ends at the end point;

Step S1001: sample the voice to be identified in sequence according to a preset sample frequency and sample size to obtain several items of sampled audio data, the sampled audio data corresponding to several sampling points of the voice to be identified; and pass all sampled audio data in turn through an FFT (fast Fourier transform) to obtain several sampling frequency points. Specifically: the voice to be identified is taken as a finite-length discrete signal x(n), n = 0, 1, ..., N-1; in the present invention the sample size N is preferably 2048. The finite-length discrete signal x(n) is divided into the sum of two sequences, even and odd: x(n) = x1(n) + x2(n), where x1(n) and x2(n) both have length N/2, x1(n) being the even-indexed sequence and x2(n) the odd-indexed sequence. Applying the FFT calculation formula (the formula itself is not reproduced in this text) yields N complex numbers X(k) in the frequency domain. Taking the modulus of each complex number X(k) gives N amplitudes complx(i) (i = 0, 1, ..., N-1);
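The even/odd decomposition described in step S1001 matches the standard radix-2 decimation-in-time FFT. The following is a minimal sketch of that standard algorithm in pure Python, offered as an illustration rather than as the patent's own formula, which is not reproduced in this text. The function names `fft_radix2` and `magnitudes` are hypothetical.

```python
import cmath


def fft_radix2(x):
    """Recursive decimation-in-time FFT. len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])  # transform of the even-indexed sequence x1(n)
    odd = fft_radix2(x[1::2])   # transform of the odd-indexed sequence x2(n)
    half = n // 2
    out = [0j] * n
    for k in range(half):
        # Butterfly: combine the two half-length transforms.
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + half] = even[k] - twiddle
    return out


def magnitudes(x):
    """Modulus of each FFT output X(k), i.e. the amplitudes complx(i)."""
    return [abs(v) for v in fft_radix2(x)]
```

For a frame of N = 2048 samples, `magnitudes(frame)` returns 2048 amplitudes, of which only the first half carries independent information for a real-valued signal.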

Step S1002: obtain the energy values of all sampling frequency points whose frequency lies in 100~1000 Hz, and compare the energy values in turn with the preset energy value n1. The energy value is calculated as follows: the frequency domain obtained by the FFT is symmetric about N/2, so only N/2 frequency points need to be calculated. The frequency of each point is given by fs = i * (FS / N), where fs is the frequency to be calculated, i = 0, 1, ..., N/2, N is the number of samples, and FS is the sample frequency of the audio segment. This gives the N/2 frequencies of the spectrum of the segment, which correspond one-to-one with the amplitudes complx(i), so the amplitude (energy) corresponding to each frequency is obtained;
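The mapping fs = i * (FS / N) from step S1002 can be sketched as follows. The helper names `bin_frequencies` and `band_amplitudes` are hypothetical, and the band limits are left as parameters because the text mentions both 100~1000 Hz and 300~1000 Hz.

```python
def bin_frequencies(n_samples, sample_rate):
    """Frequency fs = i * (FS / N) for i = 0 .. N/2."""
    return [i * sample_rate / n_samples for i in range(n_samples // 2 + 1)]


def band_amplitudes(amplitudes, sample_rate, low_hz, high_hz):
    """Amplitudes complx(i) whose frequency lies in [low_hz, high_hz].

    `amplitudes` is the full list of N amplitudes from the FFT. Only the
    first N/2 + 1 entries are inspected because the spectrum is symmetric.
    """
    n_samples = len(amplitudes)
    return [
        amplitudes[i]
        for i, f in enumerate(bin_frequencies(n_samples, sample_rate))
        if low_hz <= f <= high_hz
    ]
```

As an illustration, with N = 2048 and an assumed sample frequency of 16000 Hz (the source does not state FS), adjacent frequency points are 7.8125 Hz apart.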

Step S1003: obtain the energy variance of all sampling frequency points whose frequency lies in the 300~1000 Hz band, and compare the energy variance in turn with the preset energy value n2. Specifically, in the energy variance formula (the formula itself is not reproduced in this text), S is the variance value, m is the number of amplitudes exceeding the preset energy value n1, complx(i) is an amplitude exceeding the preset energy value n1, and averageComplx is the average of all amplitudes exceeding the preset energy value n1;
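Since the variance formula itself is missing from this text, the sketch below assumes the ordinary population variance (division by m) over the amplitudes that exceed n1, using the quantities S, m, complx(i) and averageComplx as defined in step S1003. The function name `energy_variance` and the choice of returning 0.0 when no amplitude exceeds n1 are assumptions.

```python
def energy_variance(band_amps, n1):
    """Population variance S of the amplitudes in band_amps that exceed n1.

    Assumption: S = sum((complx(i) - averageComplx) ** 2) / m.
    Returns 0.0 when no amplitude exceeds n1.
    """
    above = [a for a in band_amps if a > n1]
    m = len(above)
    if m == 0:
        return 0.0
    average = sum(above) / m  # averageComplx
    return sum((a - average) ** 2 for a in above) / m
```

A flat spectrum, as is typical of the noise described in this document, gives a variance near zero, while a spectrum with pronounced peaks gives a large one.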

Step S1004: when, for a sampling frequency point, the energy value obtained in the 300~1000 Hz band is greater than the preset energy value n1 and the obtained energy variance is greater than the preset energy value n2, the sample corresponding to that sampling frequency point is judged to be in the range of efficient voice;

Step S1005: all sampling points located in the range of the complete speech are arranged in chronological order to obtain a chronologically ordered sampling point sequence of the complete speech, and the first sampling point in the sampling point sequence of the efficient voice is taken as the starting point of the efficient voice;

Step S1006: when, for a sampling frequency point, the energy value obtained in the 300~1000 Hz band is not greater than the preset energy value n1, or the obtained energy variance is not greater than the preset energy value n2, the sample corresponding to that sampling frequency point is judged to be in the range of noise;

Step S1007: the sampling points that are located in the range of noise and whose sampling time is after the starting point of the efficient voice are arranged in chronological order to obtain a chronologically ordered sampling point sequence of the noise, and the first sampling point in the sampling point sequence of the noise is taken as the end point of the efficient voice. Chronological order here means the order in which the sampling points occur in the voice to be identified. The sampling itself is likewise performed in the order in which the sampling points occur in the voice to be identified.
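Steps S1004 to S1007 can be sketched as a decision per sampled segment followed by a scan over the segments in time order. This is one possible reading of the text, not a definitive implementation: it assumes a segment counts as voice when at least one band amplitude exceeds n1 and the population variance of those amplitudes exceeds n2. The names `is_voice` and `find_endpoints` are hypothetical.

```python
def is_voice(band_amps, n1, n2):
    """True if some amplitude exceeds n1 and their variance exceeds n2."""
    above = [a for a in band_amps if a > n1]
    if not above:
        return False
    mean = sum(above) / len(above)
    variance = sum((a - mean) ** 2 for a in above) / len(above)
    return variance > n2


def find_endpoints(voice_flags):
    """Return (start, end) indices into a time-ordered list of voice flags.

    start: index of the first segment judged to be voice.
    end:   index of the first segment judged to be noise after start.
    Either value is None if no such segment exists.
    """
    start = next((i for i, v in enumerate(voice_flags) if v), None)
    if start is None:
        return None, None
    end = next(
        (i for i in range(start + 1, len(voice_flags)) if not voice_flags[i]),
        None,
    )
    return start, end
```

Under this reading, the first noise segment after the starting point ends the efficient voice, so a pause inside a command word would also terminate it.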

Digitized voice data is audio data. Digitized voice has two important indicators: sample frequency and sample size. The sample frequency is the number of samples taken per unit time. The higher the sample frequency, the smaller the interval between sampling points and the more faithful the digitized sound, but the data volume increases correspondingly and processing becomes harder. The sample size is the number of bits used to record the value of each sample; it determines the dynamic range of the sampling. The more bits, the more finely variations in the sound can be recorded, and the larger the resulting data volume. Preferably, the preset sample size is 2048 audio data items. If the sample size is too small, the audio segment obtained is inaccurate and the frequency resolution is too low, so the FFT requires zero-padding, which consumes CPU resources and time; an excessively large sample is also time-consuming. Therefore, using a sample size of 2048 audio data items both guarantees the precision of the resolution and avoids excessive consumption of CPU resources.
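The trade-off behind the choice of 2048 can be made concrete with a little arithmetic. The sketch below assumes a sample frequency of 16000 Hz purely for illustration, since the source does not state one, and the function names are hypothetical.

```python
def frequency_resolution(sample_rate, n_samples):
    """Spacing in Hz between adjacent FFT frequency points: FS / N."""
    return sample_rate / n_samples


def segment_duration_ms(sample_rate, n_samples):
    """Duration in milliseconds covered by one block of n_samples."""
    return 1000.0 * n_samples / sample_rate


# Assumed sample frequency of 16000 Hz (not given in the source).
resolution = frequency_resolution(16000, 2048)   # 7.8125 Hz per point
duration = segment_duration_ms(16000, 2048)      # 128.0 ms per block
```

Halving the block to 1024 samples would double the spacing between frequency points, while doubling it to 4096 would double both the time covered by each block and the work per transform.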

Once a segment of speech is transformed from the time domain to the frequency domain, it has quantifiable parameters: based on the frequency range of the human voice, it can be judged whether the segment contains voice frequencies and what the corresponding frequency energy values are. A further inventive point of the present invention is that comparing the energy variance of the frequency band with the preset energy value n2 improves the accuracy of judging the starting point and end point of the voice to be identified. For most noise, the energy values across the frequencies in 100-1000 Hz differ little, so the variance of such noise is small.

The values of n1 and n2 can be adjusted. The smaller the values, the more sensitive the detection: a segment is more easily judged to be voice rather than noise, but the probability of false triggering is higher. According to various tests in the project, setting the preset energy value n1 to 38000-60000 J and the preset energy value n2 to 30-70 J greatly improves the accuracy of starting point and end point detection.

It will be apparent to those skilled in the art that various other changes and modifications can be made based on the technical solution and concept described above, and all such changes and modifications shall fall within the protection scope of the claims of the present invention.

Claims

1. An efficient voice acquisition method, characterized by comprising the following steps:

Obtain the starting point and end point of the voice to be identified;

Obtain the efficient voice of the voice to be identified; the efficient voice of the voice to be identified is the complete speech that starts at the starting point and ends at the end point;

Sample the voice to be identified in sequence according to a preset sample frequency and sample size to obtain several items of sampled audio data, the sampled audio data corresponding to several sampling points of the voice to be identified; pass all sampled audio data in turn through an FFT (fast Fourier transform) to obtain several sampling frequency points;

Obtain the energy values of all sampling frequency points whose frequency lies in 100~1000 Hz, and compare the energy values in turn with the preset energy value n1;

Obtain the energy variance of all sampling frequency points whose frequency lies in the 300~1000 Hz band, and compare the energy variance in turn with the preset energy value n2;

When, for a sampling frequency point, the energy value obtained in the 300~1000 Hz band is greater than the preset energy value n1 and the obtained energy variance is greater than the preset energy value n2, the sample corresponding to that sampling frequency point is judged to be in the range of efficient voice;

When, for a sampling frequency point, the energy value obtained in the 300~1000 Hz band is not greater than the preset energy value n1, or the obtained energy variance is not greater than the preset energy value n2, the sample corresponding to that sampling frequency point is judged to be in the range of noise;

All sampling points located in the range of the complete speech are arranged in chronological order to obtain a chronologically ordered sampling point sequence of the complete speech, and the first sampling point in the sampling point sequence of the efficient voice is taken as the starting point of the efficient voice;

The sampling points that are located in the range of noise and whose sampling time is after the starting point of the efficient voice are arranged in chronological order to obtain a chronologically ordered sampling point sequence of the noise, and the first sampling point in the sampling point sequence of the noise is taken as the end point of the efficient voice.

2. The efficient voice acquisition method according to claim 1, characterized in that the preset sample size is 2048 audio data items.

3. The efficient voice acquisition method according to claim 1, characterized in that the preset energy value n1 is 38000-60000 J.

4. The efficient voice acquisition method according to claim 3, characterized in that the preset energy value n2 is 30-70 J.