CN106373592A - Audio noise tolerance punctuation processing method and system - Google Patents


Info

Publication number
CN106373592A
CN106373592A (application CN201610799384.7A)
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610799384.7A
Other languages
Chinese (zh)
Other versions
CN106373592B (en)
Inventor
胡飞 (Hu Fei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HUAKEFEIYANG Co Ltd
Original Assignee
HUAKEFEIYANG Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HUAKEFEIYANG Co Ltd filed Critical HUAKEFEIYANG Co Ltd
Priority to CN201610799384.7A priority Critical patent/CN106373592B/en
Publication of CN106373592A publication Critical patent/CN106373592A/en
Application granted granted Critical
Publication of CN106373592B publication Critical patent/CN106373592B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00: Details of television systems
    • H04N5/222: Studio circuitry; studio devices; studio equipment
    • H04N5/262: Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; cameras specially adapted for the electronic generation of special effects
    • H04N5/278: Subtitling

Abstract

The invention relates to a noise-tolerant audio punctuation (sentence-breaking) processing method and system. The method comprises the steps of: obtaining multiple frame segments from an audio signal; deriving an energy threshold et from the energy value of each frame segment; selecting, from the frame segments, those whose energy exceeds et; taking each such frame as a sentence-interior frame and scanning its preceding and succeeding frames; if the energy of a preceding or succeeding frame is below the set threshold et, merging that frame with the sentence-interior frames, in frame order, into an independent sentence; and then performing spectral-entropy analysis on each independent sentence to obtain the final sentence segmentation. The method solves the problem in the prior art that subtitles cannot be punctuated automatically during subtitle alignment. It can process not only recorded audio and video but also audio and video currently being played; for network live streams, the speech can be cut automatically, so that subsequent steps such as transcription can be processed in parallel and the total processing time is shortened.

Description

Noise-tolerant audio punctuation processing method and system
Technical field
The present invention relates to the field of speech and subtitle processing technology, and in particular to a noise-tolerant audio punctuation processing method and system.
Background technology
In current subtitle production, speech is mainly punctuated manually: the operator listens through the entire recording and marks the start and end point of each sentence by pressing a shortcut key while transcribing. Because of reaction delay, the marked start and end points are misaligned and must be adjusted by hand. The whole process is very time-consuming; for example, a 30-minute recording typically requires 40 minutes to one hour just for punctuation, so productivity is extremely low. In the field of network live broadcasting, if the speech is not segmented first, manual transcription is hard to parallelize, and since people transcribe more slowly than the live speech is produced, real-time broadcasting with synchronized text is impossible. Relying on manual punctuation likewise prevents real-time broadcasting, because manual punctuation is also slower than playback.
Content of the invention
In view of the above defects in the prior art, it is an object of the present invention to provide a noise-tolerant audio punctuation processing method and system, thereby solving the problems that, in the existing subtitle-alignment process, punctuation cannot be performed automatically and noise levels are high.
Aimed at classroom recording/broadcast and network live streaming, the present invention proposes an intelligent speech punctuation method. Using speech-analysis techniques, it can quickly and automatically analyze recorded or captured audio data and detect speech segments that satisfy subtitle specifications, saving the time needed to produce audio/video subtitles.
To achieve the above object, the present invention provides the following technical solution:
A noise-tolerant audio punctuation processing method, comprising:
Step s101: obtaining multiple frame segments from the audio;
Step s102: obtaining the energy value ek of each frame segment and an energy threshold et;
Step s103: according to the energy values ek, selecting from the frame segments those whose energy exceeds the threshold et; taking each such segment as a sentence-interior frame and scanning its preceding and succeeding frames; if the energy of a preceding or succeeding frame is below the set threshold et, merging that frame with the sentence-interior frames, in frame order, into an independent sentence;
Step s104: searching outward, frame by frame, from the two frames immediately before and after each sentence. If the frame found belongs to another sentence, the two sentences are merged. If the energy of the frame is below et and it belongs to no other sentence, a Fourier transform is applied to the frame, the amplitudes in the 0-4000 Hz band are divided into z spectral bands of fixed width, the intensity of band i being vi (i = 1, 2, …, z) and the total intensity being vsum. The probability pi of each band is computed as

pi = vi / vsum

and the spectral entropy of the frame is then

h = -∑i=1..z pi·log(pi)

The ratio of the frame's energy to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy-ratio threshold rt is set; if the frame's ratio is not less than rt, the frame is added to the sentence. The scan stops when the beginning or end of the speech stream is reached;
Step s105: judging whether the frame length of an independent sentence falls within a set short-sentence length range; if so, comparing the short independent-sentence samples stored in history with the current independent sentence, and if the matching degree is below a set value, marking the independent sentence as a noise sentence;
Step s106: taking the independent sentences obtained from the frame segments of the audio that are not marked as noise sentences as the punctuation of the audio.
In a preferred embodiment, step s101 comprises:
Step s1011: receiving an audio file;
Step s1012: splitting the audio file according to a set slice duration to obtain multiple frame segments.
In a preferred embodiment, step s102 comprises: obtaining the energy threshold et from the mean of the frame energy values.
In a preferred embodiment, the sub-step of step s103, "if the energy of a preceding or succeeding frame is below the set energy threshold et, merging that frame with the sentence-interior frames in frame order into an independent sentence", comprises:
if the energy of the preceding or succeeding frame is below et, judging whether the time interval between the current frame and the next frame is less than a set interval; if so, merging the frames in frame order into an independent sentence.
In a preferred embodiment, the method further comprises, after step s103:
Step s1031: if the frame length of an independent sentence exceeds a set maximum frame length, computing the spectral-entropy ratio of every frame of the sentence and splitting it into two independent sentences at the frame with the lowest spectral-entropy ratio.
The present invention also provides an automatic segmentation system for audio punctuation, comprising: a framing unit, an energy-threshold acquiring unit, an independent-sentence acquiring unit, a spectral-entropy analysis unit, a noise-sentence judging unit and a punctuation acquiring unit;
the framing unit is configured to obtain multiple frame segments from the audio;
the energy-threshold acquiring unit is configured to obtain the energy value ek of each frame segment and the energy threshold et;
the independent-sentence acquiring unit is configured to select, according to the energy values ek, the frame segments whose energy exceeds the threshold et, to take each such segment as a sentence-interior frame and scan its preceding and succeeding frames, and, if the energy of a preceding or succeeding frame is below the set threshold et, to merge that frame with the sentence-interior frames, in frame order, into an independent sentence;
the spectral-entropy analysis unit is configured to search outward, frame by frame, from the two frames immediately before and after each sentence; if the frame found belongs to another sentence, the two sentences are merged; if the energy of the frame is below et and it belongs to no other sentence, a Fourier transform is applied to the frame, the amplitudes in the 0-4000 Hz band are divided into z spectral bands of fixed width, the intensity of band i being vi (i = 1, 2, …, z) and the total intensity vsum; the probability pi of each band is computed as

pi = vi / vsum

and the spectral entropy of the frame is then

h = -∑i=1..z pi·log(pi)

The ratio of the frame's energy to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy-ratio threshold rt is set; if the frame's ratio is not less than rt, the frame is added to the sentence; the scan stops at the beginning or end of the speech stream;
the noise-sentence judging unit is configured to judge whether the frame length of an independent sentence falls within the set short-sentence length range and, if so, to compare stored short independent-sentence samples with the current independent sentence; if the matching degree is below the set value, the independent sentence is marked as a noise sentence;
the punctuation acquiring unit is configured to take the independent sentences obtained from the frame segments of the audio that are not marked as noise sentences as the punctuation of the audio.
In a preferred embodiment, the framing unit is further configured to: receive an audio file and split it according to the set slice duration to obtain multiple frame segments.
In a preferred embodiment, the energy-threshold acquiring unit is further configured to obtain the energy threshold et from the mean of the frame energy values.
In a preferred embodiment, the independent-sentence acquiring unit is further configured to: if the energy of a preceding or succeeding frame is below et, judge whether the time interval between the current frame and the next frame is less than the set interval and, if so, merge the frames in frame order into an independent sentence.
In a preferred embodiment, the system further comprises a long-sentence judging unit configured to: if the frame length of an independent sentence exceeds the set maximum frame length, compute the spectral-entropy ratio of every frame of the sentence and split it into two independent sentences at the frame with the lowest spectral-entropy ratio.
The beneficial effects of the invention are as follows. The main computation of the method is carried out in the time domain, so it is fast. For ambiguous local regions that may be either consonants or noise, time-domain and frequency-domain analyses are combined, increasing segmentation accuracy. Time-consuming spectral analysis is needed for only a few frames, so segmentation is both fast and accurate while remaining robust to noise. By automatically generating speech cut points, the workload of audio/video subtitle editing is reduced. A long-sentence splitting method is designed that directly reuses existing computation results without a second feature-extraction pass; it quickly splits long sentences and guarantees that no over-long sentence remains, meeting subtitle-production requirements. Short sentences are examined with a machine-learning classifier to decide whether they are human voice or noise, and noise is discarded, further improving accuracy. The method can process both recorded audio/video and audio/video that is currently live; for network live streams, the speech can be cut automatically, so that subsequent steps such as transcription can run in parallel, shortening processing time.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the noise-tolerant audio punctuation processing method in one embodiment of the present invention;
Fig. 2 is a logical connection diagram of the noise-tolerant audio punctuation processing system in one embodiment of the present invention.
Detailed description of the embodiments
The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art without creative effort, based on the embodiments of the present invention, fall within the protection scope of the present invention.
The noise-tolerant audio punctuation processing method of the present invention, as shown in Fig. 1, comprises:
Step s101: obtaining multiple frame segments from the audio.
The present invention may be installed on a server, a personal computer, or a mobile computing device; "computing terminal" below refers to any of these. First, the audio/video file is uploaded to the server, or opened on the personal computer or mobile computing device. The computing device then extracts the audio stream from the file and converts it to signed single-channel data at a fixed sampling rate. Finally, the data are divided into frames using preset framing parameters.
Step s1011: receiving an audio file; step s1012: splitting the audio file according to the set slice duration to obtain multiple frame segments.
The audio is divided into frames, each 10 ms to 500 ms long. In speech recognition, adjacent frames must overlap for accurate recognition. Since the purpose of the present invention is not speech recognition, frames may overlap, be contiguous, or even be separated by gaps of 0 ms to 500 ms. The number of frames obtained from the speech is therefore smaller than that required for speech recognition, which reduces computation and increases speed. The frames are denoted f1, f2, …, fm; each frame has n samples sk1, sk2, …, skn, with amplitude values fk1, fk2, …, fkn. The start time and end time of each frame are recorded.
Speech data are the sequence of real numbers obtained by sampling sound at a fixed rate; a 16 kHz sampling rate means 16000 samples per second. Framing means treating fixed time slices of this sequence as the units of analysis. For example, at a 16 kHz sampling rate with 100-ms frames, one frame contains 1600 samples. Framing determines the granularity of control. In this patent, 100-ms frames are generally used, so an n-second video is divided into 10n frames. Frames need not be adjacent: with, say, a 100-ms gap between frames, an n-second video yields 5n frames. Increasing the gap between frames reduces the total frame count and speeds up analysis, at the cost of reduced time accuracy.
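The framing scheme just described (fixed-length frames, an optional gap between consecutive frames, recorded start/end times) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation; the function name and parameter defaults are inventions of this sketch.

```python
import numpy as np

def make_frames(samples, sample_rate=16000, frame_ms=100, gap_ms=0):
    """Split a 1-D signal into fixed-length frames, optionally leaving a
    gap between consecutive frames (allowed here because the goal is
    segmentation rather than speech recognition). Returns a list of
    (start_time_s, end_time_s, frame_samples) tuples."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = frame_len + int(sample_rate * gap_ms / 1000)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frames.append((start / sample_rate,
                       (start + frame_len) / sample_rate,
                       samples[start:start + frame_len]))
    return frames

audio = np.zeros(16000 * 3)                  # 3 s of audio at 16 kHz
print(len(make_frames(audio)))               # contiguous 100-ms frames -> 30
print(len(make_frames(audio, gap_ms=100)))   # 100-ms gap between frames -> 15
```

The two counts match the text: an n-second signal gives 10n contiguous 100-ms frames, or 5n frames when a 100-ms gap is left between them.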
Step s102: obtaining the energy value ek of each frame segment and the energy threshold et.
In this step:
The energy ek of each frame is calculated. The energy may be defined, among other ways, as the sum of squared amplitudes or as the sum of absolute amplitudes.
Under the sum-of-squares definition, the energy is

ek = ∑i=1..n fki²

Under the absolute-value definition, it is

ek = ∑i=1..n |fki|

An energy threshold et is set, and maximal runs of adjacent frames whose energies all exceed et are found; these are the speech sentences s1, s2, …, sj. That is:

si = { fk | k = a, a+1, a+2, …, a+b, with ek ≥ et, e(a-1) < et and e(a+b+1) < et }.
In another embodiment, step s102 comprises: obtaining the energy threshold from the mean of the energy values. That is, the energy obtained in the previous step is divided by the number of samples to give the per-frame average energy. The energy threshold is a threshold on this average energy; it is usually set empirically to some value between 0.001 and 0.01, and the user may adjust it manually.
Step s103: merging into independent sentences.
According to the energy values ek, the frame segments whose energy exceeds the threshold et are selected; each such segment is taken as a sentence-interior frame and its preceding and succeeding frames are scanned; if the energy of a preceding or succeeding frame is below the set threshold et, that frame is merged with the sentence-interior frames, in frame order, into an independent sentence.
The sub-step of step s103, "if the energy of a preceding or succeeding frame is below the set energy threshold et, merging that frame with the sentence-interior frames in frame order into an independent sentence", comprises: if the energy of the preceding or succeeding frame is below et, judging whether the time interval between the current frame and the next frame is less than the set interval; if so, merging the frames in frame order into an independent sentence.
The search then proceeds outward, frame by frame, from the two frames immediately before and after each sentence. If the frame found belongs to another sentence, the two sentences are merged. If the energy of the frame is below et and it belongs to no other sentence, a Fourier transform is applied to the frame, the amplitudes in the 0-4000 Hz band are divided into z spectral bands of fixed width, the intensity of band i being vi (i = 1, 2, …, z) and the total intensity vsum. The probability pi of each band is computed as

pi = vi / vsum

and the spectral entropy of the frame is then

h = -∑i=1..z pi·log(pi)

The ratio of the frame's energy to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy-ratio threshold rt is set; if the frame's ratio is not less than rt, the frame is added to the sentence. The scan stops at the beginning or end of the speech stream.
For example, suppose there are 10 speech frames with energies:
0.05, 0.12, 0.002, 0.004, 0.1, 0.2, 0.4, 0.5, 0.001, 0.12
With 0.003 as the threshold, the third step yields three sentences:
Sentence 1 comprises: 0.05, 0.12
Sentence 2 comprises: 0.004, 0.1, 0.2, 0.4, 0.5
Sentence 3 comprises: 0.12
Take sentence 2 as an example and scan forward, toward the start of the stream. The frame before it has energy 0.002; it belongs to no sentence and its energy is below the threshold 0.003, so a Fourier transform is applied to it and its energy-entropy ratio is computed. If the ratio is below the ratio threshold, the frame is considered not to belong to sentence 2 and the forward scan ends. If the ratio is not below the threshold, the frame is considered to belong to sentence 2 and the scan continues with the next frame forward. That frame has energy 0.12 and belongs to sentence 1, so sentences 1 and 2 are merged. After the merge, the front-most frame (0.05) is the first frame of the stream, so no further forward scan is possible and the forward scan ends. Backward scanning follows the same logic as forward scanning: when a frame with energy below the energy threshold is met, its energy-entropy ratio is computed; if the ratio is below the ratio threshold the scan ends, otherwise it continues. When another sentence is met, the two sentences are merged and the scan continues.
Afterwards, close sentences are merged. For adjacent sentences, the time interval between them is computed; if it is less than a specified time threshold, the two sentences are merged.
This step is a further merge. For example, assume every frame is 100 ms long, sentence 1 comprises frames 22-26 (5 frames in total) and sentence 2 comprises frames 29-35 (7 frames in total), with no other sentence between them. The two sentences are separated by 2 frames, i.e. 200 ms. Assume the specified time threshold is 300 ms; since 200 ms is less than 300 ms, sentences 1 and 2 are merged into one sentence. The frames 27 and 28 between them are absorbed as well, and the merged sentence comprises frames 22-35, 14 frames in total.
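The close-sentence merging just described can be sketched as follows, assuming sentences are represented as sorted (first, last) frame-index pairs; this representation, the function name, and the 300-ms default are illustrative assumptions of this sketch.

```python
def merge_close(sentences, frame_ms=100, max_gap_ms=300):
    """Merge adjacent sentences whose time gap is shorter than
    max_gap_ms; the gap frames are absorbed into the merged sentence.
    `sentences` is a sorted list of (first, last) frame-index pairs."""
    merged = [sentences[0]]
    for first, last in sentences[1:]:
        prev_first, prev_last = merged[-1]
        gap_ms = (first - prev_last - 1) * frame_ms
        if gap_ms < max_gap_ms:
            merged[-1] = (prev_first, last)   # absorb the gap frames
        else:
            merged.append((first, last))
    return merged

# Sentence 1 = frames 22-26, sentence 2 = frames 29-35; the 200-ms gap
# (frames 27 and 28) is below the 300-ms threshold, so they merge:
print(merge_close([(22, 26), (29, 35)]))   # [(22, 35)]
```

With a wider gap the sentences stay separate, e.g. frames 0-2 and 10-12 (a 700-ms gap) are left unmerged.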
Step s104: performing spectral-entropy analysis on each sentence.
In this step, the search proceeds outward, frame by frame, from the two frames immediately before and after each sentence. If the frame found belongs to another sentence, the two sentences are merged. If the energy of the frame is below et and it belongs to no other sentence, a Fourier transform is applied to the frame, the amplitudes in the 0-4000 Hz band are divided into z spectral bands of fixed width, the intensity of band i being vi (i = 1, 2, …, z) and the total intensity vsum. The probability pi of each band is computed as

pi = vi / vsum

and the spectral entropy of the frame is then

h = -∑i=1..z pi·log(pi)

The ratio of the frame's energy to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy-ratio threshold rt is set; if the frame's ratio is not less than rt, the frame is added to the sentence. The scan stops at the beginning or end of the speech stream;
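The spectral-entropy computation of step s104 can be sketched as follows. The number of bands z = 25 and the FFT details are assumptions (the description leaves them open), and energy_entropy_ratio is an illustrative name, not the patent's.

```python
import numpy as np

def energy_entropy_ratio(frame, sample_rate=16000, z=25, f_max=4000.0):
    """Energy-to-spectral-entropy ratio r of one frame: FFT amplitudes
    up to f_max Hz are grouped into z fixed-width bands, band
    probabilities p_i = v_i / v_sum give the spectral entropy
    h = -sum(p_i * log p_i), and r = energy / h."""
    amps = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    amps = amps[freqs <= f_max]
    v = amps[: len(amps) // z * z].reshape(z, -1).sum(axis=1)  # band intensities
    p = v / v.sum()
    p = p[p > 0]                        # drop empty bands to avoid log(0)
    h = max(-np.sum(p * np.log(p)), 1e-12)  # guard a degenerate spectrum
    energy = float(np.sum(frame ** 2))
    return energy / h

rng = np.random.default_rng(0)
t = np.arange(1600) / 16000.0
tone = np.sin(2 * np.pi * 440 * t)            # voiced-like 100-ms frame
noise = rng.normal(scale=0.1, size=1600)      # weak noise-like frame
# A tonal frame concentrates energy in few bands (low entropy), so its
# energy-entropy ratio is far larger than that of a weak noise frame:
print(energy_entropy_ratio(tone) > energy_entropy_ratio(noise))  # True
```

This illustrates why thresholding r separates voiced boundary frames from noise: voiced speech has high energy and low spectral entropy, noise the opposite.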
Step s105: identifying noise sentences. Whether the frame length of an independent sentence falls within the set short-sentence length range is judged; if so, the short independent-sentence samples stored in history are compared with the current independent sentence, and if the matching degree is below the set value the independent sentence is marked as a noise sentence. A machine-learning method is used to examine short sentences and decide whether they are human voice or noise; noise is discarded, further improving accuracy.
Step s106: obtaining the punctuation. The independent sentences obtained from the frame segments of the audio that are not marked as noise sentences are taken as the punctuation of the audio.
In a preferred embodiment, the method further comprises, after step s103:
Step s1031: if the frame length of an independent sentence exceeds the set maximum frame length, computing the spectral-entropy ratio of every frame of the sentence and splitting it into two independent sentences at the frame with the lowest spectral-entropy ratio.
Splitting long sentences. If the duration of a sentence exceeds a specified time threshold, the sentence is split. The procedure is as follows: a certain proportion of frames at the head and at the tail of the sentence is ignored, and the remaining frames are traversed. If a frame's spectral-entropy ratio has already been computed, it is used as the weight w; otherwise the frame's energy is used as the weight w. For each frame, let nleft be the number of frames to its left within the sentence and nright the number to its right, and define a split coefficient ws from these quantities. By traversal, the frame that minimizes ws is found, and the sentence is split into a left and a right sentence at that frame. If either of the resulting sentences is still over-long, the same method is applied again, until no long sentence remains. Overly short, meaningless sentences are then filtered out: a time threshold is specified, and a sentence shorter than it may not be human speech. For such a sentence, the frame with the highest energy is taken and its Mel-frequency cepstral coefficients are computed. A pre-trained support vector machine (SVM) classifier then judges whether it is a human voice; if not, the sentence is discarded. The SVM classifier is trained as follows: human-voice samples collected from lecture videos and network live-broadcast videos serve as positive samples, and typical non-human sound samples serve as negative samples; the Mel cepstral coefficients are used as features to train the model and obtain its parameters. Other machine-learning methods, such as deep neural networks, may also be used for this classification.
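The recursive long-sentence splitting loop can be sketched as follows. Note that the exact formula for the split coefficient ws is not reproduced in this text, so the expression used here, ws = w / (nleft · nright), is purely a hypothetical stand-in: it favors a low-weight (quiet) frame near the middle of the sentence, which matches the stated intent but is not the patented formula.

```python
def split_long_sentence(weights, max_len, edge_ratio=0.2):
    """Recursively split an over-long sentence at the frame minimizing a
    split coefficient ws. `weights` holds one weight per frame (spectral
    entropy ratio if available, else frame energy). Frames within
    edge_ratio of each end are excluded as cut points, as described."""
    if len(weights) <= max_len:
        return [weights]
    edge = max(1, int(len(weights) * edge_ratio))
    best_i, best_ws = None, None
    for i in range(edge, len(weights) - edge):
        nleft, nright = i, len(weights) - 1 - i
        ws = weights[i] / (nleft * nright)   # hypothetical ws formula
        if best_ws is None or ws < best_ws:
            best_i, best_ws = i, ws
    left, right = weights[:best_i], weights[best_i:]
    return split_long_sentence(left, max_len) + split_long_sentence(right, max_len)

# 12 frames with one quiet frame (0.01) near the middle; max length 8:
w = [0.5, 0.6, 0.4, 0.5, 0.7, 0.3, 0.01, 0.6, 0.5, 0.4, 0.6, 0.5]
print([len(p) for p in split_long_sentence(w, max_len=8)])   # [6, 6]
```

The cut lands on the quiet middle frame, and both halves are short enough, so recursion stops, which is the guarantee the description claims.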
The present invention also provides an automatic segmentation system for audio punctuation, as shown in Fig. 2, comprising: a framing unit 101, an energy-threshold acquiring unit 201, an independent-sentence acquiring unit 301, a spectral-entropy analysis unit 401, a noise-sentence judging unit 501 and a punctuation acquiring unit 601.
The framing unit 101 is configured to obtain multiple frame segments from the audio;
the energy-threshold acquiring unit 201 is configured to obtain the energy value ek of each frame segment and the energy threshold et;
the independent-sentence acquiring unit 301 is configured to select, according to the energy values ek, the frame segments whose energy exceeds the threshold et, to take each such segment as a sentence-interior frame and scan its preceding and succeeding frames, and, if the energy of a preceding or succeeding frame is below the set threshold et, to merge that frame with the sentence-interior frames, in frame order, into an independent sentence.
The spectral-entropy analysis unit 401 is configured to search outward, frame by frame, from the two frames immediately before and after each sentence; if the frame found belongs to another sentence, the two sentences are merged; if the energy of the frame is below et and it belongs to no other sentence, a Fourier transform is applied to the frame, the amplitudes in the 0-4000 Hz band are divided into z spectral bands of fixed width, the intensity of band i being vi (i = 1, 2, …, z) and the total intensity vsum; the probability pi of each band is

pi = vi / vsum

and the spectral entropy of the frame is then

h = -∑i=1..z pi·log(pi)

The ratio of the frame's energy to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy-ratio threshold rt is set; if the frame's ratio is not less than rt, the frame is added to the sentence. The scan stops at the beginning or end of the speech stream.
The noise-sentence judging unit 501 is configured to judge whether the frame length of an independent sentence falls within the set short-sentence length range and, if so, to compare stored short independent-sentence samples with the current independent sentence; if the matching degree is below the set value, the independent sentence is marked as a noise sentence.
The punctuation acquiring unit 601 is configured to take the independent sentences obtained from the frame segments of the audio that are not marked as noise sentences as the punctuation of the audio.
In a preferred embodiment, the framing unit 101 is further configured to: receive an audio file; and split the audio file according to a set slicing time to obtain a plurality of frame segments.
In a preferred embodiment, the energy threshold acquiring unit 201 is further configured to obtain the energy threshold ek from the mean of the energy values of the frame segments.
In a preferred embodiment, the independent sentence acquiring unit 301 is further configured to: if the energy of a preceding or following frame is less than the set energy et, judge whether the interval between the current frame and the next frame is less than a set interval time; if so, merge the sentence middle frames in frame-start order into an independent sentence.
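The interval-time check in this embodiment can be sketched as follows (a minimal sketch; the 0.25 s gap is an illustrative value, not one fixed by the patent):

```python
def merge_by_gap(sentences, max_gap=0.25):
    """Merge adjacent sentences whose time gap is under the set interval.

    `sentences` is a list of (start_time, end_time) pairs in seconds,
    sorted by start time.  Sketch of the patent's interval-time merge."""
    merged = [list(sentences[0])]
    for start, end in sentences[1:]:
        if start - merged[-1][1] < max_gap:   # gap below set interval: merge
            merged[-1][1] = end
        else:                                 # gap too large: new sentence
            merged.append([start, end])
    return [tuple(s) for s in merged]
```

For example, sentences at (0.0, 1.0) and (1.1, 2.0) merge (0.1 s gap), while one starting at 3.0 s stays separate.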
In a preferred embodiment, the system further comprises: a long sentence judging unit 3011.
The long sentence judging unit is configured to: if the frame length of the independent sentence exceeds a set independent frame length, calculate the energy-entropy ratio of each frame of this independent sentence, take the frame with the lowest energy-entropy ratio as a split point, and split the independent sentence into two independent sentences.
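The long-sentence split can be sketched as follows (a minimal sketch; the maximum frame length and the names used are illustrative assumptions):

```python
def split_long_sentence(sentence, entropy_ratios, max_frames=40):
    """Split an over-long independent sentence at its weakest frame.

    `sentence` is a (start, end) frame-index range, end exclusive;
    `entropy_ratios[i]` is the energy-entropy ratio of frame i.  If the
    sentence exceeds max_frames, the frame with the lowest energy-entropy
    ratio becomes the split point, per the patent's long-sentence rule."""
    start, end = sentence
    if end - start <= max_frames:
        return [sentence]                  # short enough: leave intact
    # index of the lowest energy-entropy ratio inside the sentence
    cut = start + min(range(end - start),
                      key=lambda i: entropy_ratios[start + i])
    return [(start, cut), (cut, end)]
```

The lowest energy-entropy ratio marks the most pause-like (least speech-like) frame, which is why it is the natural cut point.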
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present invention, and these should all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the scope of the claims.

Claims (10)

1. An audio noise-tolerant punctuation processing method, comprising:
Step s101: obtaining a plurality of frame segments from the audio;
Step s102: obtaining an energy threshold ek according to the energy value of each frame segment;
Step s103: according to the energy threshold ek, obtaining from the frame segments a frame segment whose energy value exceeds the energy threshold et, and taking that frame segment as a sentence middle frame; scanning the preceding or following frames of this frame, and if the energy of a preceding or following frame is less than the set energy threshold et, merging this frame and the sentence middle frame in frame-start order into an independent sentence;
Step s104: searching forward and backward from the first and last frames of each sentence respectively; if the next frame found belongs to another sentence, merging the two sentences; if the energy of the next frame is less than et and the frame does not belong to another sentence, applying a Fourier transform to this frame, taking the amplitudes in the 0-4000 Hz range and dividing them into z spectral bands of fixed width, where the intensity of each band is vi, i = 1, 2, …, z, the total intensity is vsum, and pi is the probability of each band; the formula for pi is:
pi = vi / vsum
Then, the spectral entropy of this frame is:
h = -Σ(i=1..z) pi · log pi
The ratio of the energy of each frame to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy ratio threshold rt is set; if the energy-entropy ratio of this frame is not less than rt, this frame is grouped into the sentence; if the scan reaches the beginning or end of the audio stream, the scan is terminated;
Step s105: judging whether the frame length of the independent sentence falls within the set short-sentence frame-length range; if so, comparing the historically stored short independent sentence samples with the current independent sentence, and if the matching degree is less than a set value, marking the independent sentence as a noise sentence;
Step s106: taking the independent sentences obtained for each frame segment of the audio that are not marked as noise sentences as the punctuation of the audio.
2. The audio noise-tolerant punctuation processing method according to claim 1, characterized in that step s101 comprises:
Step s1011: receive audio file;
Step s1012: splitting the audio file according to a set slicing time to obtain a plurality of frame segments.
3. The audio noise-tolerant punctuation processing method according to claim 1 or 2, characterized in that step s102 comprises: obtaining the energy threshold ek from the mean of the energy values of the frame segments.
4. The audio noise-tolerant punctuation processing method according to claim 1, characterized in that in step s103, the step of "if the energy of a preceding or following frame is less than the set energy threshold et, merging this frame and the sentence middle frame in frame-start order into an independent sentence" comprises:
if the energy of a preceding or following frame is less than the set energy et, judging whether the interval between the current frame and the next frame is less than a set interval time; if so, merging the sentence middle frames in frame-start order into an independent sentence.
5. The audio noise-tolerant punctuation processing method according to claim 1 or 4, characterized in that after step s103 the method further comprises:
Step s1031: if the frame length of the independent sentence exceeds a set independent frame length, calculating the energy-entropy ratio of each frame of this independent sentence, taking the frame with the lowest energy-entropy ratio as a split point, and splitting the independent sentence into two independent sentences.
6. An automatic splitting system for audio punctuation, comprising: a framing unit, an energy threshold acquiring unit, an independent sentence acquiring unit, a noise sentence judging unit, a punctuation acquiring unit, and a spectral entropy analysis unit;
The framing unit is configured to obtain a plurality of frame segments from the audio;
The energy threshold acquiring unit is configured to obtain the energy threshold ek according to the energy value of each frame segment;
The independent sentence acquiring unit is configured to: according to the energy threshold ek, obtain from the frame segments a frame segment whose energy value exceeds the energy threshold et, and take that frame segment as a sentence middle frame; the preceding or following frames of this frame are then scanned, and if the energy of a preceding or following frame is less than the set energy threshold et, this frame and the sentence middle frame are merged in frame-start order into an independent sentence;
The spectral entropy analysis unit is configured to search forward and backward from the first and last frames of each sentence respectively; if the next frame found belongs to another sentence, the two sentences are merged; if the energy of the next frame is less than et and the frame does not belong to another sentence, a Fourier transform is applied to this frame, the amplitudes in the 0-4000 Hz range are taken and divided into z spectral bands of fixed width, where the intensity of each band is vi, i = 1, 2, …, z, the total intensity is vsum, and pi is the probability of each band; the formula for pi is:
pi = vi / vsum
Then, the spectral entropy of this frame is:
h = -Σ(i=1..z) pi · log pi
The ratio of the energy of each frame to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy ratio threshold rt is set; if the energy-entropy ratio of this frame is not less than rt, this frame is grouped into the sentence; if the scan reaches the beginning or end of the audio stream, the scan is terminated;
The noise sentence judging unit is configured to judge whether the frame length of the independent sentence falls within the set short-sentence frame-length range; if so, compare the historically stored short independent sentence samples with the current independent sentence, and if the matching degree is less than a set value, mark the independent sentence as a noise sentence;
The punctuation acquiring unit is configured to take the independent sentences obtained for each frame segment of the audio that are not marked as noise sentences as the punctuation of the audio.
7. The automatic splitting system for audio punctuation according to claim 6, characterized in that the framing unit is further configured to: receive an audio file; and split the audio file according to a set slicing time to obtain a plurality of frame segments.
8. The automatic splitting system for audio punctuation according to claim 6 or 7, characterized in that the energy threshold acquiring unit is further configured to obtain the energy threshold ek from the mean of the energy values of the frame segments.
9. The automatic splitting system for audio punctuation according to claim 6, characterized in that the independent sentence acquiring unit is further configured to: if the energy of a preceding or following frame is less than the set energy et, judge whether the interval between the current frame and the next frame is less than a set interval time; if so, merge the sentence middle frames in frame-start order into an independent sentence.
10. The automatic splitting system for audio punctuation according to claim 6 or 9, characterized in that the system further comprises: a long sentence judging unit;
The long sentence judging unit is configured to: if the frame length of the independent sentence exceeds a set independent frame length, calculate the energy-entropy ratio of each frame of this independent sentence, take the frame with the lowest energy-entropy ratio as a split point, and split the independent sentence into two independent sentences.
CN201610799384.7A 2016-08-31 2016-08-31 Audio noise-tolerant punctuation processing method and system Active CN106373592B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610799384.7A CN106373592B (en) 2016-08-31 2016-08-31 Audio noise-tolerant punctuation processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610799384.7A CN106373592B (en) 2016-08-31 2016-08-31 Audio noise-tolerant punctuation processing method and system

Publications (2)

Publication Number Publication Date
CN106373592A true CN106373592A (en) 2017-02-01
CN106373592B CN106373592B (en) 2019-04-23

Family

ID=57899361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610799384.7A Active CN106373592B (en) 2016-08-31 2016-08-31 Audio noise-tolerant punctuation processing method and system

Country Status (1)

Country Link
CN (1) CN106373592B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107424628A (en) * 2017-08-08 2017-12-01 哈尔滨理工大学 A kind of method that specific objective sound end is searched under noisy environment
CN109389999A (en) * 2018-09-28 2019-02-26 北京亿幕信息技术有限公司 A kind of high performance audio-video is made pauses in reading unpunctuated ancient writings method and system automatically

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000132177A (en) * 1998-10-20 2000-05-12 Canon Inc Device and method for processing voice
CN1622193A (en) * 2004-12-24 2005-06-01 北京中星微电子有限公司 Voice signal detection method
CN101625862A (en) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Method for detecting voice interval in automatic caption generating system
CN103345922A (en) * 2013-07-05 2013-10-09 张巍 Large-length voice full-automatic segmentation method
CN103426440A (en) * 2013-08-22 2013-12-04 厦门大学 Voice endpoint detection device and voice endpoint detection method utilizing energy spectrum entropy spatial information
CN107424628A (en) * 2017-08-08 2017-12-01 哈尔滨理工大学 A kind of method that specific objective sound end is searched under noisy environment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000132177A (en) * 1998-10-20 2000-05-12 Canon Inc Device and method for processing voice
CN1622193A (en) * 2004-12-24 2005-06-01 北京中星微电子有限公司 Voice signal detection method
CN101625862A (en) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Method for detecting voice interval in automatic caption generating system
CN103345922A (en) * 2013-07-05 2013-10-09 张巍 Large-length voice full-automatic segmentation method
CN103426440A (en) * 2013-08-22 2013-12-04 厦门大学 Voice endpoint detection device and voice endpoint detection method utilizing energy spectrum entropy spatial information
CN107424628A (en) * 2017-08-08 2017-12-01 哈尔滨理工大学 A kind of method that specific objective sound end is searched under noisy environment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUN YIMING et al.: "Voice activity detection based on the improved dual-threshold method", 2015 International Conference on Intelligent Transportation *
WANG Yang et al.: "An endpoint detection algorithm for noisy speech based on combined time-frequency features", Journal of Natural Science of Heilongjiang University *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107424628A (en) * 2017-08-08 2017-12-01 哈尔滨理工大学 A kind of method that specific objective sound end is searched under noisy environment
CN109389999A (en) * 2018-09-28 2019-02-26 北京亿幕信息技术有限公司 A kind of high performance audio-video is made pauses in reading unpunctuated ancient writings method and system automatically

Also Published As

Publication number Publication date
CN106373592B (en) 2019-04-23

Similar Documents

Publication Publication Date Title
CN106157951B (en) Carry out the automatic method for splitting and system of audio punctuate
CN108564942B (en) Voice emotion recognition method and system based on adjustable sensitivity
CN103345922B (en) A kind of large-length voice full-automatic segmentation method
CN101685634B (en) Children speech emotion recognition method
CN100514446C (en) Pronunciation evaluating method based on voice identification and voice analysis
CN105427858A (en) Method and system for achieving automatic voice classification
CN104200804A (en) Various-information coupling emotion recognition method for human-computer interaction
CN105374352A (en) Voice activation method and system
CN101625857A (en) Self-adaptive voice endpoint detection method
CN101751919A (en) Spoken Chinese stress automatic detection method
CN103617799A (en) Method for detecting English statement pronunciation quality suitable for mobile device
CN105825852A (en) Oral English reading test scoring method
CN104517605B (en) A kind of sound bite splicing system and method for phonetic synthesis
CN106875943A (en) A kind of speech recognition system for big data analysis
CN105261246A (en) Spoken English error correcting system based on big data mining technology
CN101625862B (en) Method for detecting voice interval in automatic caption generating system
CN106303695A (en) Audio translation multiple language characters processing method and system
CN110176228A (en) A kind of small corpus audio recognition method and system
CN106373592A (en) Audio noise tolerance punctuation processing method and system
CN103035252B (en) Chinese speech signal processing method, Chinese speech signal processing device and hearing aid device
Amir et al. Unresolved anger: Prosodic analysis and classification of speech from a therapeutic setting
DE60318450T2 (en) Apparatus and method for segmentation of audio data in meta-patterns
CN112231440A (en) Voice search method based on artificial intelligence
CN101419796A (en) Device and method for automatically splitting speech signal of single character
CN110299133A (en) The method for determining illegally to broadcast based on keyword

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant