CN106373592A - Audio noise tolerance punctuation processing method and system - Google Patents


Info

Publication number
CN106373592A
CN106373592A (application CN201610799384.7A)
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610799384.7A
Other languages
Chinese (zh)
Other versions
CN106373592B (en)
Inventor
胡飞 (Hu Fei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HUAKEFEIYANG Co Ltd
Original Assignee
HUAKEFEIYANG Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HUAKEFEIYANG Co Ltd filed Critical HUAKEFEIYANG Co Ltd
Priority to CN201610799384.7A priority Critical patent/CN106373592B/en
Publication of CN106373592A publication Critical patent/CN106373592A/en
Application granted granted Critical
Publication of CN106373592B publication Critical patent/CN106373592B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00: Details of television systems
    • H04N5/222: Studio circuitry; studio devices; studio equipment
    • H04N5/262: Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; cameras specially adapted for the electronic generation of special effects
    • H04N5/278: Subtitling

Abstract

The invention relates to a noise-tolerant audio punctuation (sentence-breaking) processing method and system. The method comprises the steps of: obtaining multiple frame segments from an audio signal; deriving an energy threshold et from the energy value of each frame segment; selecting, from the frame segments, those whose energy exceeds et; taking each such frame as a sentence-interior frame and scanning its preceding and succeeding frames; if the energy of a preceding or succeeding frame is below the set threshold et, merging that frame with the sentence-interior frames, in frame order, into an independent sentence; and then performing spectral-entropy analysis on each independent sentence to obtain the final sentence segmentation. The method solves the problem in the prior art that subtitles cannot be punctuated automatically during subtitle alignment. It can process not only recorded audio and video but also audio and video currently being played; for network live streams, the speech can be cut automatically, so that subsequent steps such as transcription can be processed in parallel and the total processing time is shortened.

Description

Noise-tolerant audio punctuation processing method and system
Technical field
The present invention relates to the field of speech and subtitle processing technology, and in particular to a noise-tolerant audio punctuation processing method and system.
Background technology
In current subtitle production, speech is mainly punctuated manually: the operator listens through the entire recording and marks the start and end point of each sentence by pressing a shortcut key while transcribing. Because of reaction delay, the marked start and end points are misaligned and must be adjusted by hand. The whole process is very time-consuming; for example, a 30-minute recording typically requires 40 minutes to one hour just for punctuation, so productivity is extremely low. In the field of network live broadcasting, if the speech is not segmented first, manual transcription is hard to parallelize, and since people transcribe more slowly than the live speech is produced, real-time broadcasting with synchronized text is impossible. Relying on manual punctuation likewise prevents real-time broadcasting, because manual punctuation is also slower than playback.
Content of the invention
In view of the above defects in the prior art, it is an object of the present invention to provide a noise-tolerant audio punctuation processing method and system, thereby solving the problems that, in the existing subtitle-alignment process, punctuation cannot be performed automatically and noise levels are high.
Aimed at classroom recording/broadcast and network live streaming, the present invention proposes an intelligent speech punctuation method. Using speech-analysis techniques, it can quickly and automatically analyze recorded or captured audio data and detect speech segments that satisfy subtitle specifications, saving the time needed to produce audio/video subtitles.
To achieve the above object, the present invention provides the following technical solution:
A noise-tolerant audio punctuation processing method, comprising:
Step s101: obtaining multiple frame segments from the audio;
Step s102: obtaining the energy value ek of each frame segment and an energy threshold et;
Step s103: according to the energy values ek, selecting from the frame segments those whose energy exceeds the threshold et; taking each such segment as a sentence-interior frame and scanning its preceding and succeeding frames; if the energy of a preceding or succeeding frame is below the set threshold et, merging that frame with the sentence-interior frames, in frame order, into an independent sentence;
Step s104: searching outward, frame by frame, from the two frames immediately before and after each sentence. If the frame found belongs to another sentence, the two sentences are merged. If the energy of the frame is below et and it belongs to no other sentence, a Fourier transform is applied to the frame, the amplitudes in the 0-4000 Hz band are divided into z spectral bands of fixed width, the intensity of band i being vi (i = 1, 2, …, z) and the total intensity being vsum. The probability pi of each band is computed as

pi = vi / vsum

and the spectral entropy of the frame is then

h = -∑i=1..z pi·log(pi)

The ratio of the frame's energy to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy-ratio threshold rt is set; if the frame's ratio is not less than rt, the frame is added to the sentence. The scan stops when the beginning or end of the speech stream is reached;
Step s105: judging whether the frame length of an independent sentence falls within a set short-sentence length range; if so, comparing the short independent-sentence samples stored in history with the current independent sentence, and if the matching degree is below a set value, marking the independent sentence as a noise sentence;
Step s106: taking the independent sentences obtained from the frame segments of the audio that are not marked as noise sentences as the punctuation of the audio.
In a preferred embodiment, step s101 comprises:
Step s1011: receiving an audio file;
Step s1012: splitting the audio file according to a set slice duration to obtain multiple frame segments.
In a preferred embodiment, step s102 comprises: obtaining the energy threshold et from the mean of the frame energy values.
In a preferred embodiment, the sub-step of step s103, "if the energy of a preceding or succeeding frame is below the set energy threshold et, merging that frame with the sentence-interior frames in frame order into an independent sentence", comprises:
if the energy of the preceding or succeeding frame is below et, judging whether the time interval between the current frame and the next frame is less than a set interval; if so, merging the frames in frame order into an independent sentence.
In a preferred embodiment, the method further comprises, after step s103:
Step s1031: if the frame length of an independent sentence exceeds a set maximum frame length, computing the spectral-entropy ratio of every frame of the sentence and splitting it into two independent sentences at the frame with the lowest spectral-entropy ratio.
The present invention also provides an automatic segmentation system for audio punctuation, comprising: a framing unit, an energy-threshold acquiring unit, an independent-sentence acquiring unit, a spectral-entropy analysis unit, a noise-sentence judging unit and a punctuation acquiring unit;
the framing unit is configured to obtain multiple frame segments from the audio;
the energy-threshold acquiring unit is configured to obtain the energy value ek of each frame segment and the energy threshold et;
the independent-sentence acquiring unit is configured to select, according to the energy values ek, the frame segments whose energy exceeds the threshold et, to take each such segment as a sentence-interior frame and scan its preceding and succeeding frames, and, if the energy of a preceding or succeeding frame is below the set threshold et, to merge that frame with the sentence-interior frames, in frame order, into an independent sentence;
the spectral-entropy analysis unit is configured to search outward, frame by frame, from the two frames immediately before and after each sentence; if the frame found belongs to another sentence, the two sentences are merged; if the energy of the frame is below et and it belongs to no other sentence, a Fourier transform is applied to the frame, the amplitudes in the 0-4000 Hz band are divided into z spectral bands of fixed width, the intensity of band i being vi (i = 1, 2, …, z) and the total intensity vsum; the probability pi of each band is computed as

pi = vi / vsum

and the spectral entropy of the frame is then

h = -∑i=1..z pi·log(pi)

The ratio of the frame's energy to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy-ratio threshold rt is set; if the frame's ratio is not less than rt, the frame is added to the sentence; the scan stops at the beginning or end of the speech stream;
the noise-sentence judging unit is configured to judge whether the frame length of an independent sentence falls within the set short-sentence length range and, if so, to compare stored short independent-sentence samples with the current independent sentence; if the matching degree is below the set value, the independent sentence is marked as a noise sentence;
the punctuation acquiring unit is configured to take the independent sentences obtained from the frame segments of the audio that are not marked as noise sentences as the punctuation of the audio.
In a preferred embodiment, the framing unit is further configured to: receive an audio file and split it according to the set slice duration to obtain multiple frame segments.
In a preferred embodiment, the energy-threshold acquiring unit is further configured to obtain the energy threshold et from the mean of the frame energy values.
In a preferred embodiment, the independent-sentence acquiring unit is further configured to: if the energy of a preceding or succeeding frame is below et, judge whether the time interval between the current frame and the next frame is less than the set interval and, if so, merge the frames in frame order into an independent sentence.
In a preferred embodiment, the system further comprises a long-sentence judging unit configured to: if the frame length of an independent sentence exceeds the set maximum frame length, compute the spectral-entropy ratio of every frame of the sentence and split it into two independent sentences at the frame with the lowest spectral-entropy ratio.
The beneficial effects of the invention are as follows. The main computation of the method is carried out in the time domain, so it is fast. For ambiguous local regions that may be either consonants or noise, time-domain and frequency-domain analyses are combined, increasing segmentation accuracy. Time-consuming spectral analysis is needed for only a few frames, so segmentation is both fast and accurate while remaining robust to noise. By automatically generating speech cut points, the workload of audio/video subtitle editing is reduced. A long-sentence splitting method is designed that directly reuses existing computation results without a second feature-extraction pass; it quickly splits long sentences and guarantees that no over-long sentence remains, meeting subtitle-production requirements. Short sentences are examined with a machine-learning classifier to decide whether they are human voice or noise, and noise is discarded, further improving accuracy. The method can process both recorded audio/video and audio/video that is currently live; for network live streams, the speech can be cut automatically, so that subsequent steps such as transcription can run in parallel, shortening processing time.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the noise-tolerant audio punctuation processing method in one embodiment of the present invention;
Fig. 2 is a logical connection diagram of the noise-tolerant audio punctuation processing system in one embodiment of the present invention.
Detailed description of the embodiments
The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art without creative effort, based on the embodiments of the present invention, fall within the protection scope of the present invention.
The noise-tolerant audio punctuation processing method of the present invention, as shown in Fig. 1, comprises:
Step s101: obtaining multiple frame segments from the audio.
The present invention may be installed on a server, a personal computer, or a mobile computing device; "computing terminal" below refers to any of these. First, the audio/video file is uploaded to the server, or opened on the personal computer or mobile computing device. The computing device then extracts the audio stream from the file and converts it to signed single-channel data at a fixed sampling rate. Finally, the data are divided into frames using preset framing parameters.
Step s1011: receiving an audio file; step s1012: splitting the audio file according to the set slice duration to obtain multiple frame segments.
The audio is divided into frames, each 10 ms to 500 ms long. In speech recognition, adjacent frames must overlap for accurate recognition. Since the purpose of the present invention is not speech recognition, frames may overlap, be contiguous, or even be separated by gaps of 0 ms to 500 ms. The number of frames obtained from the speech is therefore smaller than that required for speech recognition, which reduces computation and increases speed. The frames are denoted f1, f2, …, fm; each frame has n samples sk1, sk2, …, skn, with amplitude values fk1, fk2, …, fkn. The start time and end time of each frame are recorded.
Speech data are the sequence of real numbers obtained by sampling sound at a fixed rate; a 16 kHz sampling rate means 16000 samples per second. Framing means treating fixed time slices of this sequence as the units of analysis. For example, at a 16 kHz sampling rate with 100-ms frames, one frame contains 1600 samples. Framing determines the granularity of control. In this patent, 100-ms frames are generally used, so an n-second video is divided into 10n frames. Frames need not be adjacent: with, say, a 100-ms gap between frames, an n-second video yields 5n frames. Increasing the gap between frames reduces the total frame count and speeds up analysis, at the cost of reduced time accuracy.
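The framing scheme just described (fixed-length frames, an optional gap between consecutive frames, recorded start/end times) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation; the function name and parameter defaults are inventions of this sketch.

```python
import numpy as np

def make_frames(samples, sample_rate=16000, frame_ms=100, gap_ms=0):
    """Split a 1-D signal into fixed-length frames, optionally leaving a
    gap between consecutive frames (allowed here because the goal is
    segmentation rather than speech recognition). Returns a list of
    (start_time_s, end_time_s, frame_samples) tuples."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = frame_len + int(sample_rate * gap_ms / 1000)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frames.append((start / sample_rate,
                       (start + frame_len) / sample_rate,
                       samples[start:start + frame_len]))
    return frames

audio = np.zeros(16000 * 3)                  # 3 s of audio at 16 kHz
print(len(make_frames(audio)))               # contiguous 100-ms frames -> 30
print(len(make_frames(audio, gap_ms=100)))   # 100-ms gap between frames -> 15
```

The two counts match the text: an n-second signal gives 10n contiguous 100-ms frames, or 5n frames when a 100-ms gap is left between them.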
Step s102: obtaining the energy value ek of each frame segment and the energy threshold et.
In this step:
The energy ek of each frame is calculated. The energy may be defined, among other ways, as the sum of squared amplitudes or as the sum of absolute amplitudes.
Under the sum-of-squares definition, the energy is

ek = ∑i=1..n fki²

Under the absolute-value definition, it is

ek = ∑i=1..n |fki|

An energy threshold et is set, and maximal runs of adjacent frames whose energies all exceed et are found; these are the speech sentences s1, s2, …, sj. That is:

si = { fk | k = a, a+1, a+2, …, a+b, with ek ≥ et, e(a-1) < et and e(a+b+1) < et }.
In another embodiment, step s102 comprises: obtaining the energy threshold from the mean of the energy values. That is, the energy obtained in the previous step is divided by the number of samples to give the per-frame average energy. The energy threshold is a threshold on this average energy; it is usually set empirically to some value between 0.001 and 0.01, and the user may adjust it manually.
Step s103: merging into independent sentences.
According to the energy values ek, the frame segments whose energy exceeds the threshold et are selected; each such segment is taken as a sentence-interior frame and its preceding and succeeding frames are scanned; if the energy of a preceding or succeeding frame is below the set threshold et, that frame is merged with the sentence-interior frames, in frame order, into an independent sentence.
The sub-step of step s103, "if the energy of a preceding or succeeding frame is below the set energy threshold et, merging that frame with the sentence-interior frames in frame order into an independent sentence", comprises: if the energy of the preceding or succeeding frame is below et, judging whether the time interval between the current frame and the next frame is less than the set interval; if so, merging the frames in frame order into an independent sentence.
The search then proceeds outward, frame by frame, from the two frames immediately before and after each sentence. If the frame found belongs to another sentence, the two sentences are merged. If the energy of the frame is below et and it belongs to no other sentence, a Fourier transform is applied to the frame, the amplitudes in the 0-4000 Hz band are divided into z spectral bands of fixed width, the intensity of band i being vi (i = 1, 2, …, z) and the total intensity vsum. The probability pi of each band is computed as

pi = vi / vsum

and the spectral entropy of the frame is then

h = -∑i=1..z pi·log(pi)

The ratio of the frame's energy to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy-ratio threshold rt is set; if the frame's ratio is not less than rt, the frame is added to the sentence. The scan stops at the beginning or end of the speech stream.
For example, suppose there are 10 speech frames with energies:
0.05, 0.12, 0.002, 0.004, 0.1, 0.2, 0.4, 0.5, 0.001, 0.12
With 0.003 as the threshold, the third step yields three sentences:
Sentence 1 comprises: 0.05, 0.12
Sentence 2 comprises: 0.004, 0.1, 0.2, 0.4, 0.5
Sentence 3 comprises: 0.12
Take sentence 2 as an example and scan forward, toward the start of the stream. The frame before it has energy 0.002; it belongs to no sentence and its energy is below the threshold 0.003, so a Fourier transform is applied to it and its energy-entropy ratio is computed. If the ratio is below the ratio threshold, the frame is considered not to belong to sentence 2 and the forward scan ends. If the ratio is not below the threshold, the frame is considered to belong to sentence 2 and the scan continues with the next frame forward. That frame has energy 0.12 and belongs to sentence 1, so sentences 1 and 2 are merged. After the merge, the front-most frame (0.05) is the first frame of the stream, so no further forward scan is possible and the forward scan ends. Backward scanning follows the same logic as forward scanning: when a frame with energy below the energy threshold is met, its energy-entropy ratio is computed; if the ratio is below the ratio threshold the scan ends, otherwise it continues. When another sentence is met, the two sentences are merged and the scan continues.
Afterwards, close sentences are merged. For adjacent sentences, the time interval between them is computed; if it is less than a specified time threshold, the two sentences are merged.
This step is a further merge. For example, assume every frame is 100 ms long, sentence 1 comprises frames 22-26 (5 frames in total) and sentence 2 comprises frames 29-35 (7 frames in total), with no other sentence between them. The two sentences are separated by 2 frames, i.e. 200 ms. Assume the specified time threshold is 300 ms; since 200 ms is less than 300 ms, sentences 1 and 2 are merged into one sentence. The frames 27 and 28 between them are absorbed as well, and the merged sentence comprises frames 22-35, 14 frames in total.
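The close-sentence merging just described can be sketched as follows, assuming sentences are represented as sorted (first, last) frame-index pairs; this representation, the function name, and the 300-ms default are illustrative assumptions of this sketch.

```python
def merge_close(sentences, frame_ms=100, max_gap_ms=300):
    """Merge adjacent sentences whose time gap is shorter than
    max_gap_ms; the gap frames are absorbed into the merged sentence.
    `sentences` is a sorted list of (first, last) frame-index pairs."""
    merged = [sentences[0]]
    for first, last in sentences[1:]:
        prev_first, prev_last = merged[-1]
        gap_ms = (first - prev_last - 1) * frame_ms
        if gap_ms < max_gap_ms:
            merged[-1] = (prev_first, last)   # absorb the gap frames
        else:
            merged.append((first, last))
    return merged

# Sentence 1 = frames 22-26, sentence 2 = frames 29-35; the 200-ms gap
# (frames 27 and 28) is below the 300-ms threshold, so they merge:
print(merge_close([(22, 26), (29, 35)]))   # [(22, 35)]
```

With a wider gap the sentences stay separate, e.g. frames 0-2 and 10-12 (a 700-ms gap) are left unmerged.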
Step s104: performing spectral-entropy analysis on each sentence.
In this step, the search proceeds outward, frame by frame, from the two frames immediately before and after each sentence. If the frame found belongs to another sentence, the two sentences are merged. If the energy of the frame is below et and it belongs to no other sentence, a Fourier transform is applied to the frame, the amplitudes in the 0-4000 Hz band are divided into z spectral bands of fixed width, the intensity of band i being vi (i = 1, 2, …, z) and the total intensity vsum. The probability pi of each band is computed as

pi = vi / vsum

and the spectral entropy of the frame is then

h = -∑i=1..z pi·log(pi)

The ratio of the frame's energy to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy-ratio threshold rt is set; if the frame's ratio is not less than rt, the frame is added to the sentence. The scan stops at the beginning or end of the speech stream;
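The spectral-entropy computation of step s104 can be sketched as follows. The number of bands z = 25 and the FFT details are assumptions (the description leaves them open), and energy_entropy_ratio is an illustrative name, not the patent's.

```python
import numpy as np

def energy_entropy_ratio(frame, sample_rate=16000, z=25, f_max=4000.0):
    """Energy-to-spectral-entropy ratio r of one frame: FFT amplitudes
    up to f_max Hz are grouped into z fixed-width bands, band
    probabilities p_i = v_i / v_sum give the spectral entropy
    h = -sum(p_i * log p_i), and r = energy / h."""
    amps = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    amps = amps[freqs <= f_max]
    v = amps[: len(amps) // z * z].reshape(z, -1).sum(axis=1)  # band intensities
    p = v / v.sum()
    p = p[p > 0]                        # drop empty bands to avoid log(0)
    h = max(-np.sum(p * np.log(p)), 1e-12)  # guard a degenerate spectrum
    energy = float(np.sum(frame ** 2))
    return energy / h

rng = np.random.default_rng(0)
t = np.arange(1600) / 16000.0
tone = np.sin(2 * np.pi * 440 * t)            # voiced-like 100-ms frame
noise = rng.normal(scale=0.1, size=1600)      # weak noise-like frame
# A tonal frame concentrates energy in few bands (low entropy), so its
# energy-entropy ratio is far larger than that of a weak noise frame:
print(energy_entropy_ratio(tone) > energy_entropy_ratio(noise))  # True
```

This illustrates why thresholding r separates voiced boundary frames from noise: voiced speech has high energy and low spectral entropy, noise the opposite.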
Step s105: identifying noise sentences. Whether the frame length of an independent sentence falls within the set short-sentence length range is judged; if so, the short independent-sentence samples stored in history are compared with the current independent sentence, and if the matching degree is below the set value the independent sentence is marked as a noise sentence. A machine-learning method is used to examine short sentences and decide whether they are human voice or noise; noise is discarded, further improving accuracy.
Step s106: obtaining the punctuation. The independent sentences obtained from the frame segments of the audio that are not marked as noise sentences are taken as the punctuation of the audio.
In a preferred embodiment, the method further comprises, after step s103:
Step s1031: if the frame length of an independent sentence exceeds the set maximum frame length, computing the spectral-entropy ratio of every frame of the sentence and splitting it into two independent sentences at the frame with the lowest spectral-entropy ratio.
Splitting long sentences. If the duration of a sentence exceeds a specified time threshold, the sentence is split. The procedure is as follows: a certain proportion of frames at the head and at the tail of the sentence is ignored, and the remaining frames are traversed. If a frame's spectral-entropy ratio has already been computed, it is used as the weight w; otherwise the frame's energy is used as the weight w. For each frame, let nleft be the number of frames to its left within the sentence and nright the number to its right, and define a split coefficient ws from these quantities. By traversal, the frame that minimizes ws is found, and the sentence is split into a left and a right sentence at that frame. If either of the resulting sentences is still over-long, the same method is applied again, until no long sentence remains. Overly short, meaningless sentences are then filtered out: a time threshold is specified, and a sentence shorter than it may not be human speech. For such a sentence, the frame with the highest energy is taken and its Mel-frequency cepstral coefficients are computed. A pre-trained support vector machine (SVM) classifier then judges whether it is a human voice; if not, the sentence is discarded. The SVM classifier is trained as follows: human-voice samples collected from lecture videos and network live-broadcast videos serve as positive samples, and typical non-human sound samples serve as negative samples; the Mel cepstral coefficients are used as features to train the model and obtain its parameters. Other machine-learning methods, such as deep neural networks, may also be used for this classification.
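The recursive long-sentence splitting loop can be sketched as follows. Note that the exact formula for the split coefficient ws is not reproduced in this text, so the expression used here, ws = w / (nleft · nright), is purely a hypothetical stand-in: it favors a low-weight (quiet) frame near the middle of the sentence, which matches the stated intent but is not the patented formula.

```python
def split_long_sentence(weights, max_len, edge_ratio=0.2):
    """Recursively split an over-long sentence at the frame minimizing a
    split coefficient ws. `weights` holds one weight per frame (spectral
    entropy ratio if available, else frame energy). Frames within
    edge_ratio of each end are excluded as cut points, as described."""
    if len(weights) <= max_len:
        return [weights]
    edge = max(1, int(len(weights) * edge_ratio))
    best_i, best_ws = None, None
    for i in range(edge, len(weights) - edge):
        nleft, nright = i, len(weights) - 1 - i
        ws = weights[i] / (nleft * nright)   # hypothetical ws formula
        if best_ws is None or ws < best_ws:
            best_i, best_ws = i, ws
    left, right = weights[:best_i], weights[best_i:]
    return split_long_sentence(left, max_len) + split_long_sentence(right, max_len)

# 12 frames with one quiet frame (0.01) near the middle; max length 8:
w = [0.5, 0.6, 0.4, 0.5, 0.7, 0.3, 0.01, 0.6, 0.5, 0.4, 0.6, 0.5]
print([len(p) for p in split_long_sentence(w, max_len=8)])   # [6, 6]
```

The cut lands on the quiet middle frame, and both halves are short enough, so recursion stops, which is the guarantee the description claims.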
The present invention also provides an automatic segmentation system for audio punctuation, as shown in Fig. 2, comprising: a framing unit 101, an energy-threshold acquiring unit 201, an independent-sentence acquiring unit 301, a spectral-entropy analysis unit 401, a noise-sentence judging unit 501 and a punctuation acquiring unit 601.
The framing unit 101 is configured to obtain multiple frame segments from the audio;
the energy-threshold acquiring unit 201 is configured to obtain the energy value ek of each frame segment and the energy threshold et;
the independent-sentence acquiring unit 301 is configured to select, according to the energy values ek, the frame segments whose energy exceeds the threshold et, to take each such segment as a sentence-interior frame and scan its preceding and succeeding frames, and, if the energy of a preceding or succeeding frame is below the set threshold et, to merge that frame with the sentence-interior frames, in frame order, into an independent sentence.
The spectral-entropy analysis unit 401 is configured to search outward, frame by frame, from the two frames immediately before and after each sentence; if the frame found belongs to another sentence, the two sentences are merged; if the energy of the frame is below et and it belongs to no other sentence, a Fourier transform is applied to the frame, the amplitudes in the 0-4000 Hz band are divided into z spectral bands of fixed width, the intensity of band i being vi (i = 1, 2, …, z) and the total intensity vsum; the probability pi of each band is

pi = vi / vsum

and the spectral entropy of the frame is then

h = -∑i=1..z pi·log(pi)

The ratio of the frame's energy to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy-ratio threshold rt is set; if the frame's ratio is not less than rt, the frame is added to the sentence. The scan stops at the beginning or end of the speech stream.
The noise-sentence judging unit 501 is configured to judge whether the frame length of an independent sentence falls within the set short-sentence length range and, if so, to compare stored short independent-sentence samples with the current independent sentence; if the matching degree is below the set value, the independent sentence is marked as a noise sentence.
The punctuation acquiring unit 601 is configured to take the independent sentences obtained from the frame segments of the audio that are not marked as noise sentences as the punctuation of the audio.
In a preferred embodiment, the framing unit 101 is further configured to: receive an audio file; and split the audio file according to a set slicing time to obtain a plurality of frame segments.
In a preferred embodiment, the energy threshold acquiring unit 201 is further configured to obtain the energy threshold ek from the mean of the energy values of the frame segments.
In a preferred embodiment, the independent sentence acquiring unit 301 is further configured to: if the energy of a preceding or following frame is less than the set energy et, judge whether the interval between the current frame and the next frame is less than a set interval time; if so, merge the sentence middle frames in frame-start order into an independent sentence.
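The interval-time check in this embodiment can be sketched as follows (a minimal sketch; the 0.25 s gap is an illustrative value, not one fixed by the patent):

```python
def merge_by_gap(sentences, max_gap=0.25):
    """Merge adjacent sentences whose time gap is under the set interval.

    `sentences` is a list of (start_time, end_time) pairs in seconds,
    sorted by start time.  Sketch of the patent's interval-time merge."""
    merged = [list(sentences[0])]
    for start, end in sentences[1:]:
        if start - merged[-1][1] < max_gap:   # gap below set interval: merge
            merged[-1][1] = end
        else:                                 # gap too large: new sentence
            merged.append([start, end])
    return [tuple(s) for s in merged]
```

For example, sentences at (0.0, 1.0) and (1.1, 2.0) merge (0.1 s gap), while one starting at 3.0 s stays separate.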
In a preferred embodiment, the system further comprises: a long sentence judging unit 3011.
The long sentence judging unit is configured to: if the frame length of the independent sentence exceeds a set independent frame length, calculate the energy-entropy ratio of each frame of this independent sentence, take the frame with the lowest energy-entropy ratio as a split point, and split the independent sentence into two independent sentences.
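The long-sentence split can be sketched as follows (a minimal sketch; the maximum frame length and the names used are illustrative assumptions):

```python
def split_long_sentence(sentence, entropy_ratios, max_frames=40):
    """Split an over-long independent sentence at its weakest frame.

    `sentence` is a (start, end) frame-index range, end exclusive;
    `entropy_ratios[i]` is the energy-entropy ratio of frame i.  If the
    sentence exceeds max_frames, the frame with the lowest energy-entropy
    ratio becomes the split point, per the patent's long-sentence rule."""
    start, end = sentence
    if end - start <= max_frames:
        return [sentence]                  # short enough: leave intact
    # index of the lowest energy-entropy ratio inside the sentence
    cut = start + min(range(end - start),
                      key=lambda i: entropy_ratios[start + i])
    return [(start, cut), (cut, end)]
```

The lowest energy-entropy ratio marks the most pause-like (least speech-like) frame, which is why it is the natural cut point.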
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present invention, and these should all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the scope of the claims.

Claims (10)

1. An audio noise-tolerant punctuation processing method, comprising:
Step s101: obtaining a plurality of frame segments from the audio;
Step s102: obtaining an energy threshold ek according to the energy value of each frame segment;
Step s103: according to the energy threshold ek, obtaining from the frame segments a frame segment whose energy value exceeds the energy threshold et, and taking that frame segment as a sentence middle frame; scanning the preceding or following frames of this frame, and if the energy of a preceding or following frame is less than the set energy threshold et, merging this frame and the sentence middle frame in frame-start order into an independent sentence;
Step s104: searching forward and backward from the first and last frames of each sentence respectively; if the next frame found belongs to another sentence, merging the two sentences; if the energy of the next frame is less than et and the frame does not belong to another sentence, applying a Fourier transform to this frame, taking the amplitudes in the 0-4000 Hz range and dividing them into z spectral bands of fixed width, where the intensity of each band is vi, i = 1, 2, …, z, the total intensity is vsum, and pi is the probability of each band; the formula for pi is:
pi = vi / vsum
Then, the spectral entropy of this frame is:
h = -Σ(i=1..z) pi · log pi
The ratio of the energy of each frame to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy ratio threshold rt is set; if the energy-entropy ratio of this frame is not less than rt, this frame is grouped into the sentence; if the scan reaches the beginning or end of the audio stream, the scan is terminated;
Step s105: judging whether the frame length of the independent sentence falls within the set short-sentence frame-length range; if so, comparing the historically stored short independent sentence samples with the current independent sentence, and if the matching degree is less than a set value, marking the independent sentence as a noise sentence;
Step s106: taking the independent sentences obtained for each frame segment of the audio that are not marked as noise sentences as the punctuation of the audio.
2. The audio noise-tolerant punctuation processing method according to claim 1, characterized in that step s101 comprises:
Step s1011: receive audio file;
Step s1012: splitting the audio file according to a set slicing time to obtain a plurality of frame segments.
3. The audio noise-tolerant punctuation processing method according to claim 1 or 2, characterized in that step s102 comprises: obtaining the energy threshold ek from the mean of the energy values of the frame segments.
4. The audio noise-tolerant punctuation processing method according to claim 1, characterized in that in step s103, the step of "if the energy of a preceding or following frame is less than the set energy threshold et, merging this frame and the sentence middle frame in frame-start order into an independent sentence" comprises:
if the energy of a preceding or following frame is less than the set energy et, judging whether the interval between the current frame and the next frame is less than a set interval time; if so, merging the sentence middle frames in frame-start order into an independent sentence.
5. The audio noise-tolerant punctuation processing method according to claim 1 or 4, characterized in that after step s103 the method further comprises:
Step s1031: if the frame length of the independent sentence exceeds a set independent frame length, calculating the energy-entropy ratio of each frame of this independent sentence, taking the frame with the lowest energy-entropy ratio as a split point, and splitting the independent sentence into two independent sentences.
6. An automatic splitting system for audio punctuation, comprising: a framing unit, an energy threshold acquiring unit, an independent sentence acquiring unit, a noise sentence judging unit, a punctuation acquiring unit, and a spectral entropy analysis unit;
The framing unit is configured to obtain a plurality of frame segments from the audio;
The energy threshold acquiring unit is configured to obtain the energy threshold ek according to the energy value of each frame segment;
The independent sentence acquiring unit is configured to: according to the energy threshold ek, obtain from the frame segments a frame segment whose energy value exceeds the energy threshold et, and take that frame segment as a sentence middle frame; the preceding or following frames of this frame are then scanned, and if the energy of a preceding or following frame is less than the set energy threshold et, this frame and the sentence middle frame are merged in frame-start order into an independent sentence;
The spectral entropy analysis unit is configured to search forward and backward from the first and last frames of each sentence respectively; if the next frame found belongs to another sentence, the two sentences are merged; if the energy of the next frame is less than et and the frame does not belong to another sentence, a Fourier transform is applied to this frame, the amplitudes in the 0-4000 Hz range are taken and divided into z spectral bands of fixed width, where the intensity of each band is vi, i = 1, 2, …, z, the total intensity is vsum, and pi is the probability of each band; the formula for pi is:
pi = vi / vsum
Then, the spectral entropy of this frame is:
h = -Σ(i=1..z) pi · log pi
The ratio of the energy of each frame to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy ratio threshold rt is set; if the energy-entropy ratio of this frame is not less than rt, this frame is grouped into the sentence; if the scan reaches the beginning or end of the audio stream, the scan is terminated;
The noise sentence judging unit is configured to judge whether the frame length of the independent sentence falls within the set short-sentence frame-length range; if so, compare the historically stored short independent sentence samples with the current independent sentence, and if the matching degree is less than a set value, mark the independent sentence as a noise sentence;
The punctuation acquiring unit is configured to take the independent sentences obtained for each frame segment of the audio that are not marked as noise sentences as the punctuation of the audio.
7. The automatic splitting system for audio punctuation according to claim 6, characterized in that the framing unit is further configured to: receive an audio file; and split the audio file according to a set slicing time to obtain a plurality of frame segments.
8. The automatic splitting system for audio punctuation according to claim 6 or 7, characterized in that the energy threshold acquiring unit is further configured to obtain the energy threshold ek from the mean of the energy values of the frame segments.
9. The automatic splitting system for audio punctuation according to claim 6, characterized in that the independent sentence acquiring unit is further configured to: if the energy of a preceding or following frame is less than the set energy et, judge whether the interval between the current frame and the next frame is less than a set interval time; if so, merge the sentence middle frames in frame-start order into an independent sentence.
10. The automatic splitting system for audio punctuation according to claim 6 or 9, characterized in that the system further comprises: a long sentence judging unit;
The long sentence judging unit is configured to: if the frame length of the independent sentence exceeds a set independent frame length, calculate the energy-entropy ratio of each frame of this independent sentence, take the frame with the lowest energy-entropy ratio as a split point, and split the independent sentence into two independent sentences.
CN201610799384.7A 2016-08-31 2016-08-31 Audio noise-tolerant punctuation processing method and system Active CN106373592B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610799384.7A CN106373592B (en) 2016-08-31 2016-08-31 Audio noise-tolerant punctuation processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610799384.7A CN106373592B (en) 2016-08-31 2016-08-31 Audio noise-tolerant punctuation processing method and system

Publications (2)

Publication Number Publication Date
CN106373592A true CN106373592A (en) 2017-02-01
CN106373592B CN106373592B (en) 2019-04-23

Family

ID=57899361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610799384.7A Active CN106373592B (en) 2016-08-31 2016-08-31 Audio noise-tolerant punctuation processing method and system

Country Status (1)

Country Link
CN (1) CN106373592B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107424628A (en) * 2017-08-08 2017-12-01 哈尔滨理工大学 A kind of method that specific objective sound end is searched under noisy environment
CN109389999A (en) * 2018-09-28 2019-02-26 北京亿幕信息技术有限公司 A kind of high performance audio-video is made pauses in reading unpunctuated ancient writings method and system automatically

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000132177A (en) * 1998-10-20 2000-05-12 Canon Inc Device and method for processing voice
CN1622193A (en) * 2004-12-24 2005-06-01 北京中星微电子有限公司 Voice signal detection method
CN101625862A (en) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Method for detecting voice interval in automatic caption generating system
CN103345922A (en) * 2013-07-05 2013-10-09 张巍 Large-length voice full-automatic segmentation method
CN103426440A (en) * 2013-08-22 2013-12-04 厦门大学 Voice endpoint detection device and voice endpoint detection method utilizing energy spectrum entropy spatial information
CN107424628A (en) * 2017-08-08 2017-12-01 哈尔滨理工大学 A kind of method that specific objective sound end is searched under noisy environment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000132177A (en) * 1998-10-20 2000-05-12 Canon Inc Device and method for processing voice
CN1622193A (en) * 2004-12-24 2005-06-01 北京中星微电子有限公司 Voice signal detection method
CN101625862A (en) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Method for detecting voice interval in automatic caption generating system
CN103345922A (en) * 2013-07-05 2013-10-09 张巍 Large-length voice full-automatic segmentation method
CN103426440A (en) * 2013-08-22 2013-12-04 厦门大学 Voice endpoint detection device and voice endpoint detection method utilizing energy spectrum entropy spatial information
CN107424628A (en) * 2017-08-08 2017-12-01 哈尔滨理工大学 A kind of method that specific objective sound end is searched under noisy environment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUN YIMING et al.: "Voice activity detection based on the improved dual-threshold method", 2015 International Conference on Intelligent Transportation *
WANG Yang et al.: "An endpoint detection algorithm for noisy speech based on combined time-frequency features", Journal of Natural Science of Heilongjiang University *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107424628A (en) * 2017-08-08 2017-12-01 哈尔滨理工大学 A kind of method that specific objective sound end is searched under noisy environment
CN109389999A (en) * 2018-09-28 2019-02-26 北京亿幕信息技术有限公司 A kind of high performance audio-video is made pauses in reading unpunctuated ancient writings method and system automatically

Also Published As

Publication number Publication date
CN106373592B (en) 2019-04-23

Similar Documents

Publication Publication Date Title
CN106157951B (en) Carry out the automatic method for splitting and system of audio punctuate
CN108564942B (en) Voice emotion recognition method and system based on adjustable sensitivity
CN103345922B (en) A kind of large-length voice full-automatic segmentation method
CN101685634B (en) Children speech emotion recognition method
CN100514446C (en) Pronunciation evaluating method based on voice identification and voice analysis
CN105427858A (en) Method and system for achieving automatic voice classification
CN104200804A (en) Various-information coupling emotion recognition method for human-computer interaction
CN105374352A (en) Voice activation method and system
CN101625857A (en) Self-adaptive voice endpoint detection method
CN101751919A (en) Spoken Chinese stress automatic detection method
CN103617799A (en) Method for detecting English statement pronunciation quality suitable for mobile device
CN105825852A (en) Oral English reading test scoring method
CN104517605B (en) A kind of sound bite splicing system and method for phonetic synthesis
CN106875943A (en) A kind of speech recognition system for big data analysis
CN105261246A (en) Spoken English error correcting system based on big data mining technology
CN101625862B (en) Method for detecting voice interval in automatic caption generating system
CN106303695A (en) Audio translation multiple language characters processing method and system
CN110176228A (en) A kind of small corpus audio recognition method and system
CN106373592A (en) Audio noise tolerance punctuation processing method and system
CN103035252B (en) Chinese speech signal processing method, Chinese speech signal processing device and hearing aid device
Amir et al. Unresolved anger: Prosodic analysis and classification of speech from a therapeutic setting
DE60318450T2 (en) Apparatus and method for segmentation of audio data in meta-patterns
CN112231440A (en) Voice search method based on artificial intelligence
CN101419796A (en) Device and method for automatically splitting speech signal of single character
CN110299133A (en) The method for determining illegally to broadcast based on keyword

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant