CN106373592A - Audio noise tolerance punctuation processing method and system - Google Patents
Audio noise tolerance punctuation processing method and system
- Publication number
- CN106373592A CN106373592A CN201610799384.7A CN201610799384A CN106373592A CN 106373592 A CN106373592 A CN 106373592A CN 201610799384 A CN201610799384 A CN 201610799384A CN 106373592 A CN106373592 A CN 106373592A
- Authority
- CN
- China
- Prior art keywords
- frame
- sentence
- energy
- independent
- energy threshold
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/278—Subtitling
Abstract
The invention relates to an audio noise-tolerant sentence-breaking (punctuation) processing method and system. The method comprises the following steps: a plurality of frame segments is obtained from an audio signal; an energy threshold is derived from the energy value of each frame segment; frame segments whose energy value exceeds the threshold e_t are selected and, taking each such segment as a mid-sentence frame, its preceding and following frames are scanned; if the energy of a preceding or following frame is below the set threshold e_t, that frame is merged with the mid-sentence frame, in start order, into an independent sentence; spectral-entropy analysis is then performed on each independent sentence to obtain the final sentences. The method solves the problem of the prior art that sentence breaks cannot be generated automatically during subtitle alignment. It can process not only recorded audio and video but also audio and video currently being played; for a network live-broadcast stream, the speech can be cut automatically, so that subsequent steps such as transcription can be processed in parallel and the processing time is shortened.
Description
Technical field
The present invention relates to the field of speech and subtitle processing technology, and more particularly to an audio noise-tolerant sentence-breaking processing method and system.
Background art
In current subtitle production, speech is mainly segmented into sentences manually. The operator listens to the entire recording and, while transcribing, marks the start point and end point of each sentence by pressing a shortcut key. Because of the delay of the key press, the obtained start and end points are misaligned and must be adjusted by hand. The whole process consumes a great deal of time: for example, 30 minutes of audio requires 40 minutes to 1 hour of segmentation work, so productivity is extremely low. In the field of network live broadcasting, if the speech is not segmented first, manual transcription is difficult to parallelize, and since a person transcribes more slowly than the live speech is produced, real-time text-and-image broadcasting is impossible. Relying on manual segmentation likewise makes real-time broadcasting difficult, because manual segmentation is slower than the playback speed.
Summary of the invention
In view of the above defects in the prior art, the object of the present invention is to provide an audio noise-tolerant sentence-breaking processing method and system, thereby solving the problem that, in the existing subtitle-alignment process, sentences cannot be broken automatically and noise levels are high.
The present invention targets classroom recording/broadcasting and network live broadcasting and proposes an intelligent speech sentence-breaking method. Through speech-analysis techniques, this method can quickly and automatically analyze recorded or captured audio data and detect speech segments that meet subtitle specifications, saving time in video and audio subtitle production.
In order to achieve the above object, the present invention provides the following technical scheme:
An audio noise-tolerant sentence-breaking processing method, comprising:
Step s101: obtain a plurality of frame segments from the audio.
Step s102: obtain an energy threshold e_t from the energy value e_k of each frame segment.
Step s103: according to the energy threshold e_t, select from the frame segments those whose energy value exceeds e_t; taking such a segment as a mid-sentence frame, scan its preceding or following frames; if the energy of a preceding or following frame is below the set threshold e_t, merge that frame with the mid-sentence frame, in frame-start order, into an independent sentence.
Step s104: from the two frames before and after each sentence, search forward and backward respectively. If the next frame found belongs to another sentence, merge the two sentences. If the energy of the next frame is below e_t and the frame does not belong to another sentence, apply a Fourier transform to the frame, take the amplitudes in 0-4000 Hz, and divide them into z spectral bands of fixed width. The intensity of band i is v_i, i = 1, 2, …, z, and the total intensity is v_sum. The probability p_i of each band is computed as:
p_i = v_i / v_sum
The spectral entropy of the frame is then:
H = -(p_1·log p_1 + p_2·log p_2 + … + p_z·log p_z)
The ratio of the frame's energy to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy-ratio threshold r_t is set; if the frame's energy-entropy ratio is not below r_t, the frame is merged into the sentence. If the scan reaches the beginning or the end of the speech stream, it stops.
Step s105: judge whether the frame length of the independent sentence falls within the set short-sentence range; if so, compare the short independent-sentence samples stored in history with the current independent sentence, and if the matching degree is below a set value, mark the independent sentence as a noise sentence.
Step s106: take the independent sentences, obtained from the frame segments of the audio, that are not marked as noise sentences as the sentence breaks of the audio.
In a preferred embodiment, step s101 includes:
Step s1011: receive an audio file.
Step s1012: split the audio file according to a set slicing time, obtaining a plurality of frame segments.
In a preferred embodiment, step s102 includes: obtaining the energy threshold e_t from the mean of the energy values of the frame segments.
In a preferred embodiment, the step in s103 of "if the energy of a preceding or following frame is below the set energy threshold e_t, merging that frame with the mid-sentence frame in frame-start order into an independent sentence" includes: if the energy of the preceding or following frame is below the set threshold e_t, judging whether the interval between the current frame and the next frame is shorter than a set interval time, and if so, merging the frames with the mid-sentence frame in frame-start order into an independent sentence.
In a preferred embodiment, after step s103 the method also includes:
Step s1031: if the frame length of the independent sentence exceeds a set independent-frame length, compute the spectral entropy ratio of every frame of the independent sentence, take the frame with the lowest spectral entropy ratio as a split point, and split the independent sentence into two independent sentences.
The present invention also provides an automatic splitting system for audio sentence-breaking, comprising: a framing unit, an energy-threshold acquiring unit, an independent-sentence acquiring unit, a spectral-entropy analysis unit, a noise-sentence judging unit, and a sentence-break acquiring unit.
The framing unit is configured to obtain a plurality of frame segments from the audio.
The energy-threshold acquiring unit is configured to obtain the energy threshold e_t from the energy value e_k of each frame segment.
The independent-sentence acquiring unit is configured to select, according to the energy threshold e_t, the frame segments whose energy value exceeds e_t; taking such a segment as a mid-sentence frame, scan its preceding or following frames; and, if the energy of a preceding or following frame is below the set threshold e_t, merge that frame with the mid-sentence frame, in frame-start order, into an independent sentence.
The spectral-entropy analysis unit is configured to search forward and backward from the two frames before and after each sentence. If the next frame found belongs to another sentence, the two sentences are merged. If the energy of the next frame is below e_t and the frame does not belong to another sentence, a Fourier transform is applied to the frame, the amplitudes in 0-4000 Hz are taken and divided into z spectral bands of fixed width; the intensity of band i is v_i, i = 1, 2, …, z, the total intensity is v_sum, and the probability p_i of each band is computed as:
p_i = v_i / v_sum
The spectral entropy of the frame is then:
H = -(p_1·log p_1 + p_2·log p_2 + … + p_z·log p_z)
The ratio of the frame's energy to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy-ratio threshold r_t is set; if the frame's energy-entropy ratio is not below r_t, the frame is merged into the sentence. If the scan reaches the beginning or the end of the speech stream, it stops.
The noise-sentence judging unit is configured to judge whether the frame length of the independent sentence falls within the set short-sentence range; if so, the short independent-sentence samples stored in history are compared with the current independent sentence, and if the matching degree is below the set value, the independent sentence is marked as a noise sentence.
The sentence-break acquiring unit is configured to take the independent sentences, obtained from the frame segments of the audio, that are not marked as noise sentences as the sentence breaks of the audio.
In a preferred embodiment, the framing unit is further configured to receive an audio file and split it according to the set slicing time into a plurality of frame segments.
In a preferred embodiment, the energy-threshold acquiring unit is further configured to obtain the energy threshold e_t from the mean of the energy values of the frame segments.
In a preferred embodiment, the independent-sentence acquiring unit is further configured to: if the energy of a preceding or following frame is below the set threshold e_t, judge whether the interval between the current frame and the next frame is shorter than the set interval time, and if so, merge the frames with the mid-sentence frame in frame-start order into an independent sentence.
In a preferred embodiment, the system also includes a long-sentence judging unit, configured to: if the frame length of an independent sentence exceeds the set independent-frame length, compute the spectral entropy ratio of every frame of the sentence, take the frame with the lowest spectral entropy ratio as a split point, and split the sentence into two independent sentences.
The beneficial effects of the invention are as follows. The main computation of the method is performed in the time domain, so computation is fast. For the limited local regions that may be either consonants or noise, analysis combines the time domain and the frequency domain, increasing cutting accuracy. Time-consuming spectral analysis is needed only for a small number of frames, so cutting is both fast and accurate while remaining strongly robust to noise. The method automatically generates the time points for speech cutting, saving labor in audio/video subtitle editing. A cutting method is designed that directly reuses existing computation results without computing secondary features, so long sentences can be cut quickly, guaranteeing that no over-long sentence remains and meeting the requirements of subtitle production. Short sentences are examined with a machine-learning method to judge whether they are human voice or noise, and noise is discarded, further improving accuracy. The method can process both recorded audio/video and audio/video currently being broadcast live. For a network live-broadcast stream, the speech can be cut automatically, so subsequent steps such as transcription can be processed in parallel, shortening the processing time.
Brief description
To illustrate the embodiments of the present invention or the technical schemes of the prior art more clearly, the accompanying drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of the audio noise-tolerant sentence-breaking processing method in one embodiment of the present invention.
Fig. 2 is a logical connection diagram of the audio noise-tolerant sentence-breaking processing system in one embodiment of the present invention.
Specific embodiments
The technical scheme of the present invention is described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The audio noise-tolerant sentence-breaking processing method of the present invention, as shown in Fig. 1, comprises:
Step s101: obtain a plurality of frame segments from the audio.
The present invention may be installed on a server, a personal computer, or a mobile computing device; the computing terminal referred to below may be any of these. First, an audio/video file is uploaded to the server, or opened on the personal computer or mobile computing device. The computing device then extracts the audio stream from the audio/video file and converts it to signed single-channel data at a fixed sampling frequency. Finally, the data is divided into frames using preset framing parameters.
Step s1011: receive an audio file. Step s1012: split the audio file according to the set slicing time, obtaining a plurality of frame segments.
The audio is divided into frames, each 10 ms to 500 ms long. In speech recognition, adjacent frames must overlap for accurate recognition. Since the purpose of the present invention is not speech recognition, frames may overlap, be contiguous, or even be separated by gaps of 0 ms to 500 ms between adjacent frames. The number of frames obtained for speech segmentation is therefore smaller than that required for speech recognition, which reduces the amount of computation and increases computing speed. The frames obtained are denoted f_1, f_2, …, f_m; each frame f_k has n samples s_k1, s_k2, …, s_kn, whose amplitude values are recorded, and each frame records its start time and end time.
Speech data is a string of real numbers obtained by sampling sound at a fixed sampling rate; a sampling rate of 16 kHz means 16000 samples per second. The meaning of framing is to take fixed time slices of this data stream as analysis units. For example, at a 16 kHz sampling rate, if each frame is 100 milliseconds long, one frame contains 1600 speech samples. Framing determines the granularity of control. In this patent, frames are generally 100 milliseconds; that is, a video of n seconds is divided into 10n frames. Frames need not be adjacent: for example, with a 100-millisecond gap between two frames, a video of n seconds is divided into 5n frames. Increasing the gap between frames reduces the total number of frames and increases analysis speed, but at the cost of reduced time accuracy.
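The framing step described above can be sketched as follows (a minimal illustration with hypothetical helper names; frame length and gap are free parameters, since the method allows non-overlapping frames with a 0-500 ms gap):

```python
import numpy as np

def make_frames(samples, sample_rate, frame_ms=100, gap_ms=0):
    """Split a 1-D signal into non-overlapping frames of frame_ms,
    separated by gap_ms of skipped samples, recording each frame's
    start and end times as the method requires."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = frame_len + int(sample_rate * gap_ms / 1000)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frames.append({
            "data": samples[start:start + frame_len],
            "t_start": start / sample_rate,             # seconds
            "t_end": (start + frame_len) / sample_rate,
        })
    return frames

# 16 kHz, 2 s of audio: 100 ms frames with no gap -> 20 frames;
# with a 100 ms gap -> 10 frames (the 10n / 5n counts above).
sig = np.zeros(32000)
assert len(make_frames(sig, 16000, 100, 0)) == 20
assert len(make_frames(sig, 16000, 100, 100)) == 10
```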
Step s102: obtain the energy threshold from the energy value of each frame segment.
In this step, the energy e_k of each frame is computed. The definition of energy includes, but is not limited to, the sum of squared amplitudes and the sum of absolute amplitudes.
The energy formula defined as the sum of squared amplitudes is:
e_k = s_k1^2 + s_k2^2 + … + s_kn^2
The energy formula defined as the sum of absolute values is:
e_k = |s_k1| + |s_k2| + … + |s_kn|
An energy threshold e_t is set, and adjacent speech frames whose energies all exceed e_t are searched for, yielding the speech sentences s_1, s_2, …, s_j. That is:
s_i = { f_k | k = a, a+1, a+2, …, a+b; e_k >= e_t; e_(a-1) < e_t; e_(a+b+1) < e_t }
In another embodiment, step s102 includes: obtaining the energy threshold e_t from the mean of the energy values of the frame segments. That is, the energy value obtained in the previous step is divided by the number of samples to obtain the average energy. The energy threshold is a threshold on the per-frame average energy, usually set empirically to some value between 0.001 and 0.01, and the user can adjust it manually.
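The two energy definitions and the threshold search can be sketched as follows (an illustrative sketch; `find_sentences` groups maximal runs of consecutive frames with e_k >= e_t, matching the set definition of s_i above):

```python
import numpy as np

def frame_energy(frame, mode="square"):
    # e_k = sum of squared amplitudes, or sum of absolute values
    f = np.asarray(frame, dtype=float)
    return float(np.sum(f * f)) if mode == "square" else float(np.sum(np.abs(f)))

def find_sentences(energies, e_t):
    """Return (start, end) frame-index pairs of maximal runs with e_k >= e_t."""
    sentences, start = [], None
    for k, e in enumerate(energies):
        if e >= e_t and start is None:
            start = k
        elif e < e_t and start is not None:
            sentences.append((start, k - 1))
            start = None
    if start is not None:
        sentences.append((start, len(energies) - 1))
    return sentences

# The 10-frame worked example from the description, threshold 0.003:
e = [0.05, 0.12, 0.002, 0.004, 0.1, 0.2, 0.4, 0.5, 0.001, 0.12]
assert find_sentences(e, 0.003) == [(0, 1), (3, 7), (9, 9)]
```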
Step s103: merge into independent sentences.
According to the energy threshold e_t, select from the frame segments those whose energy value exceeds e_t; taking such a segment as a mid-sentence frame, scan its preceding or following frames; if the energy of a preceding or following frame is below the set threshold e_t, merge that frame with the mid-sentence frame, in frame-start order, into an independent sentence.
In step s103, the step of "if the energy of a preceding or following frame is below the set energy threshold e_t, merging that frame with the mid-sentence frame in frame-start order into an independent sentence" includes: if the energy of the preceding or following frame is below the set threshold e_t, judging whether the interval between the current frame and the next frame is shorter than the set interval time, and if so, merging the frames with the mid-sentence frame in frame-start order into an independent sentence.
From the two frames before and after each sentence, search forward and backward respectively. If the next frame found belongs to another sentence, merge the two sentences. If the energy of the next frame is below e_t and the frame does not belong to another sentence, apply a Fourier transform to the frame, take the amplitudes in 0-4000 Hz, and divide them into z spectral bands of fixed width. The intensity of band i is v_i, i = 1, 2, …, z, and the total intensity is v_sum. The probability p_i of each band is computed as:
p_i = v_i / v_sum
The spectral entropy of the frame is then:
H = -(p_1·log p_1 + p_2·log p_2 + … + p_z·log p_z)
The ratio of the frame's energy to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy-ratio threshold r_t is set; if the frame's energy-entropy ratio is not below r_t, the frame is merged into the sentence. If the scan reaches the beginning or the end of the speech stream, it stops.
For example, suppose there are 10 speech frames with energies:
0.05, 0.12, 0.002, 0.004, 0.1, 0.2, 0.4, 0.5, 0.001, 0.12
With 0.003 as the threshold, the third step yields three sentences:
Sentence 1 comprises: 0.05, 0.12
Sentence 2 comprises: 0.004, 0.1, 0.2, 0.4, 0.5
Sentence 3 comprises: 0.12
Take sentence 2 as an example and scan forward. The frame before it has energy 0.002; it belongs to no sentence and its energy is below the threshold 0.003, so a Fourier transform is applied to it and its energy-entropy ratio is computed. If the energy-entropy ratio is below the ratio threshold, the frame is judged not to belong to sentence 2 and the forward scan ends. If the energy-entropy ratio is not below the threshold, the frame is judged to belong to sentence 2 and the forward scan continues with the next frame. That next frame has energy 0.12 and belongs to sentence 1, so sentence 1 and sentence 2 are merged. After the merge, the front-most frame is 0.05, which is already the first frame, so the forward scan cannot continue and ends. Backward scanning follows the same logic as forward scanning: when a frame with energy below the energy threshold is met, its energy-entropy ratio is computed; if the ratio is below the energy-entropy-ratio threshold, the scan ends, otherwise it continues. When another sentence is met, the two sentences are merged and scanning continues after the merge.
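The per-frame spectral entropy and energy-entropy ratio used during this scan can be sketched as follows (an illustrative sketch; the helper name, the band count z, and the small epsilon guarding the pure-tone limit are assumptions, while the 0-4000 Hz cap and the p_i and entropy formulas come from the text). A near-pure tone concentrates energy in few bands (low entropy, high ratio), while white noise spreads it (high entropy, low ratio):

```python
import numpy as np

def energy_entropy_ratio(frame, sample_rate=16000, z=40):
    """Energy / spectral entropy of one frame (0-4000 Hz, z fixed-width bands)."""
    f = np.asarray(frame, dtype=float)
    energy = float(np.sum(f * f))
    spec = np.abs(np.fft.rfft(f))
    freqs = np.fft.rfftfreq(len(f), d=1.0 / sample_rate)
    amp = spec[freqs <= 4000.0]
    bands = np.array_split(amp, z)          # z bands of (near-)fixed width
    v = np.array([b.sum() for b in bands])  # v_i = band intensity
    p = v / v.sum()                         # p_i = v_i / v_sum
    p = p[p > 0]
    h = float(-np.sum(p * np.log(p))) + 1e-12  # spectral entropy (+eps)
    return energy / h

t = np.arange(1600) / 16000.0
tone = np.sin(2 * np.pi * 440 * t)                  # voiced-like frame
rng = np.random.default_rng(0)
noise = rng.standard_normal(1600) * np.sqrt(0.5)    # similar energy scale
assert energy_entropy_ratio(tone) > energy_entropy_ratio(noise)
```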
Afterwards, close sentences are merged. For adjacent sentences, the interval time between them is computed; if it is below the specified time threshold, the two sentences are merged.
This step is a further merge. For example, suppose each frame is 100 milliseconds long, sentence 1 comprises frames 22, 23, 24, 25, 26 (5 frames), sentence 2 comprises frames 29, 30, 31, 32, 33, 34, 35 (7 frames), and no other sentence lies between them. The gap between the two sentences is 2 frames, i.e. 200 milliseconds. Suppose the specified time threshold is 300 milliseconds; since 200 milliseconds is less than 300 milliseconds, sentence 1 and sentence 2 are merged into one sentence. Frames 27 and 28 between sentence 1 and sentence 2 are absorbed as well, so the new merged sentence comprises frames 22 through 35, 14 frames in total.
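The gap-based merge above can be sketched as follows (frame indices as in the worked example; the 300 ms threshold is expressed in frames, and the helper name is illustrative):

```python
def merge_close(sentences, max_gap_frames):
    """Merge adjacent (start, end) frame-index sentences whose gap,
    in frames, is at most max_gap_frames; gap frames are absorbed
    into the merged sentence."""
    merged = [list(sentences[0])]
    for start, end in sentences[1:]:
        if start - merged[-1][1] - 1 <= max_gap_frames:
            merged[-1][1] = end          # absorb the gap and the sentence
        else:
            merged.append([start, end])
    return [tuple(s) for s in merged]

# Sentence 1 = frames 22-26, sentence 2 = frames 29-35: the 2-frame gap
# (200 ms at 100 ms/frame) is under a 300 ms (3-frame) threshold.
assert merge_close([(22, 26), (29, 35)], 3) == [(22, 35)]
```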
Step s104: perform spectral-entropy analysis on each sentence.
In this step, from the two frames before and after each sentence, search forward and backward respectively. If the next frame found belongs to another sentence, merge the two sentences. If the energy of the next frame is below e_t and the frame does not belong to another sentence, apply a Fourier transform to the frame, take the amplitudes in 0-4000 Hz, and divide them into z spectral bands of fixed width. The intensity of band i is v_i, i = 1, 2, …, z, and the total intensity is v_sum. The probability p_i of each band is computed as:
p_i = v_i / v_sum
The spectral entropy of the frame is then:
H = -(p_1·log p_1 + p_2·log p_2 + … + p_z·log p_z)
The ratio of the frame's energy to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy-ratio threshold r_t is set; if the frame's energy-entropy ratio is not below r_t, the frame is merged into the sentence. If the scan reaches the beginning or the end of the speech stream, it stops.
Step s105: identify noise sentences. Judge whether the frame length of the independent sentence falls within the set short-sentence range; if so, compare the short independent-sentence samples stored in history with the current independent sentence, and if the matching degree is below the set value, mark the independent sentence as a noise sentence. A machine-learning method is used to examine short sentences and judge whether they are human voice or noise; noise is discarded, further improving accuracy.
Step s106: obtain the sentence breaks. The independent sentences, obtained from the frame segments of the audio, that are not marked as noise sentences are taken as the sentence breaks of the audio.
In a preferred embodiment, after step s103 the method also includes:
Step s1031: if the frame length of the independent sentence exceeds the set independent-frame length, compute the spectral entropy ratio of every frame of the independent sentence, take the frame with the lowest spectral entropy ratio as a split point, and split the independent sentence into two independent sentences.
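Step s1031 can be sketched as follows (an illustrative helper under the assumption that the per-frame energy-entropy ratios are already computed; which side of the split receives the cut frame is a free choice here). The interior frame with the lowest ratio becomes the split point, applied recursively until no over-long piece remains:

```python
def split_long_sentence(entropy_ratios, max_len, edge_skip=1):
    """Recursively split a sentence (given its per-frame energy-entropy
    ratios) at the interior frame with the lowest ratio, until every
    piece is at most max_len frames. Returns lists of frame indices."""
    n = len(entropy_ratios)
    if n <= max_len:
        return [list(range(n))]
    interior = entropy_ratios[edge_skip:n - edge_skip]  # ignore head/tail
    cut = edge_skip + min(range(len(interior)), key=interior.__getitem__)
    left = split_long_sentence(entropy_ratios[:cut], max_len, edge_skip)
    right = split_long_sentence(entropy_ratios[cut:], max_len, edge_skip)
    # re-offset the right half's frame indices
    return left + [[i + cut for i in idx] for idx in right]

r = [5.0, 4.0, 0.5, 4.0, 5.0, 4.5]       # lowest ratio at frame 2
assert split_long_sentence(r, 4) == [[0, 1], [2, 3, 4, 5]]
```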
Long sentences are split. If the length of a sentence exceeds the specified time threshold, the sentence is split as follows: a certain proportion of speech frames at the head and the tail of the sentence is ignored, and the remaining speech frames are traversed. If the spectral entropy ratio of a frame has already been computed, it is used as the weight w; otherwise the frame energy is used as the weight w. For each frame, with nleft frames to its left and nright frames to its right within the sentence, a split-coefficient value ws is defined. By traversal, the frame minimizing the split value ws of the sentence is found, and the sentence is divided into a left and a right sentence at that frame. If a long sentence still exists among the two resulting sentences, the method is applied again until no long sentence remains.
Too-short meaningless sentences are filtered. A time threshold is specified; a sentence shorter than this length may not be a person speaking. For such a sentence, its highest-energy frame is taken and its Mel cepstral coefficients are computed. A pre-trained support vector machine (SVM) classifier is used to classify it and judge whether it is a human voice; if not, the sentence is discarded. The SVM classifier is trained as follows: human-voice samples are collected from lecture videos and network live-broadcast videos as positive samples, and typical non-human sound samples as negative samples; Mel cepstral coefficients are used as features for training, yielding the model parameters. Other machine-learning methods, such as deep neural networks, may also be used for this classification.
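As a hedged illustration of this short-sentence filter (not the patent's exact pipeline): a crude cepstrum-like feature computed with NumPy stands in for proper Mel cepstral coefficients, scikit-learn's `SVC` (assumed available) stands in for the trained classifier, and the synthetic "voiced" and "noise" frames stand in for real collected samples:

```python
import numpy as np
from sklearn.svm import SVC

def crude_cepstrum(frame, n_coef=12):
    """Rough cepstrum substitute: low-order coefficients of the inverse
    FFT of the log magnitude spectrum (NOT true Mel cepstral coefficients)."""
    spec = np.abs(np.fft.rfft(frame)) + 1e-9
    return np.real(np.fft.irfft(np.log(spec))[:n_coef])

rng = np.random.default_rng(1)
t = np.arange(1600) / 16000.0

def voiced():   # harmonic stack, loosely voice-like
    f0 = rng.uniform(100, 250)
    return sum(np.sin(2 * np.pi * f0 * k * t) / k for k in (1, 2, 3))

def noisy():    # stand-in for typical non-human sound
    return rng.standard_normal(1600)

X = [crude_cepstrum(voiced()) for _ in range(30)] + \
    [crude_cepstrum(noisy()) for _ in range(30)]
y = [1] * 30 + [0] * 30                    # 1 = human-voice-like
clf = SVC(kernel="rbf").fit(X, y)
assert clf.score(X, y) > 0.9               # separable toy classes
```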
The present invention also provides the automatic split system carrying out audio frequency punctuate simultaneously, as shown in Figure 2, comprising: framing unit
101st, energy threshold acquiring unit 201, independent sentence acquiring unit 301;Spectrum entropy analytic unit 401, noise sentence judging unit 501 and
Punctuate acquiring unit 601.
Described framing unit 101, is configured to obtain multiple framing sections according to audio frequency;
Described energy threshold acquiring unit 201, is configured to the energy value according to each framing section and obtains energy threshold ek;
Described independent sentence acquiring unit 301, is configured to according to described energy threshold ek, from described each framing section, obtain it
Energy value exceedes energy threshold et;Framing section, then the preamble frame of this frame or postorder frame are carried out with this framing section for sentence intermediate frame
Scanning, if the energy threshold of preamble frame or postorder frame is less than sets energy threshold et, then this frame and described sentence intermediate frame are pressed frame
Start sequence merges becomes independent sentence.
Spectrum entropy analytic unit 401, is configured to two frames before and after each sentence and respectively forwardly searches for afterwards, if search
Next frame belongs to other sentences, then two sentences are merged;If the energy of next frame is less than et, and it is not belonging to other sentences
Son, then carry out Fourier transform to this frame, takes the amplitude of 0-4000hz, is divided into z bar bands of a spectrum according to fixed width, every bands of a spectrum
Intensity is vi, i=1,2 ... z.Overall strength is vsum, piProbability for every bands of a spectrum.piComputing formula be:
Then, the spectrum entropy of this frame is:
The energy of each frame is energy entropy ratio with the ratio of spectrum entropy, is designated as r.Set an energy entropy than threshold value rtIf, this frame
Can entropy than not less than rt, then this frame is grouped in sentence.If scanning beginning or the end of voice flow, scan abort.
The noise sentence judging unit 501 is configured to judge whether the frame length of an independent sentence falls within the set short-sentence frame-length range; if so, the historically stored short independent sentence samples are compared with the current independent sentence, and if the matching degree is below the set value, the independent sentence is marked as a noise sentence;
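The specification does not fix how the "matching degree" against the stored short-sentence samples is computed. The sketch below assumes cosine similarity of per-frame energy contours, so the measure, the 0.6 default and the function names are hypothetical:

```python
import numpy as np

def is_noise_sentence(candidate, stored_samples, min_match=0.6):
    """Flag a short independent sentence as noise when its energy contour
    fails to match every historically stored short-sentence sample.
    'Matching degree' is taken here as cosine similarity -- an assumption,
    since the specification leaves the matching measure open."""
    best = 0.0
    for sample in stored_samples:
        n = min(len(candidate), len(sample))
        a = np.asarray(candidate[:n], float)
        b = np.asarray(sample[:n], float)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0.0:
            best = max(best, float(a @ b) / denom)
    return best < min_match        # below the set value -> noise sentence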
The punctuation acquiring unit 601 is configured to take the independent sentences obtained for each framing section of the audio that are not marked as noise sentences as the punctuation of the audio.
In a preferred embodiment, the framing unit 101 is further configured to: receive an audio file; and split the audio file according to a set slice time to obtain multiple framing sections.
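A minimal sketch of splitting decoded audio by a set slice time, assuming the samples are already a 1-D array; the 25 ms default and dropping the trailing partial frame are assumptions:

```python
import numpy as np

def frame_audio(samples, sample_rate, slice_ms=25):
    """Split a 1-D sample array into fixed-duration framing sections;
    a trailing partial frame is dropped for simplicity."""
    frame_len = int(sample_rate * slice_ms / 1000)
    n_frames = len(samples) // frame_len
    return samples[:n_frames * frame_len].reshape(n_frames, frame_len)
```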
In a preferred embodiment, the energy threshold acquiring unit 201 is further configured to obtain the energy threshold ek from the mean of the energy values of the framing sections.
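Under this preferred embodiment, ek can be read as the mean short-time energy over all framing sections; a sketch (a scale factor on the mean could equally be applied):

```python
import numpy as np

def energy_threshold(framing_sections):
    """Energy threshold ek as the mean short-time energy over all
    framing sections (one reading of this preferred embodiment)."""
    energies = np.sum(np.asarray(framing_sections, float) ** 2, axis=1)
    return float(energies.mean())
```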
In a preferred embodiment, the independent sentence acquiring unit 301 is further configured to, if the energy of a preceding or following frame is below the set energy et, judge whether the interval between the current frame and the next frame is less than the set interval time and, if so, merge the sentence middle frames in frame-start order into an independent sentence.
In a preferred embodiment, the system further comprises a long sentence judging unit 3011. The long sentence judging unit is configured to, if the frame length of an independent sentence exceeds the set independent frame length, calculate the spectral entropy ratio of every frame of that independent sentence and, taking the frame with the lowest spectral entropy ratio as the split point, split the independent sentence into two independent sentences.
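A sketch of the long-sentence split, assuming per-frame energy-entropy ratios have already been computed; the (first, last) index representation is illustrative:

```python
import numpy as np

def split_long_sentence(ratios, lo, hi, max_frames):
    """Split the sentence spanning frames lo..hi at the frame with the
    lowest energy-entropy ratio when it exceeds max_frames frames."""
    if hi - lo + 1 <= max_frames:
        return [(lo, hi)]
    cut = lo + int(np.argmin(ratios[lo:hi + 1]))   # lowest-ratio frame = split point
    return [(lo, cut), (cut + 1, hi)]
```

The lowest-ratio frame is the most silence-like point inside the sentence, which is why it is chosen as the boundary between the two resulting sentences.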
The above is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that would readily occur to those skilled in the art within the technical scope disclosed by the invention shall be covered by the protection scope of the present invention. Accordingly, the protection scope of the present invention shall be defined by the claims.
Claims (10)
1. An audio noise-tolerant punctuation processing method, comprising:
Step s101: obtaining multiple framing sections from audio;
Step s102: obtaining an energy threshold ek from the energy value of each framing section;
Step s103: according to the energy threshold ek, obtaining from the framing sections a framing section whose energy value exceeds the energy threshold et, taking that framing section as a sentence middle frame and scanning its preceding and following frames; if the energy of a preceding or following frame is below the set energy threshold et, merging that frame and the sentence middle frame in frame-start order into an independent sentence;
Step s104: searching backward from the first frame and forward from the last frame of each sentence; if the next frame found belongs to another sentence, merging the two sentences; if the energy of the next frame is below et and the frame does not belong to another sentence, applying a Fourier transform to it, taking the amplitudes from 0 to 4000 Hz and dividing them into z spectral bands of fixed width, the intensity of band i being v_i, i = 1, 2, ..., z, the total intensity being v_sum, and the probability p_i of each band being computed as:
p_i = v_i / v_sum
the spectral entropy of the frame then being:
H = -Σ_{i=1}^{z} p_i · log p_i
the ratio of a frame's energy to its spectral entropy being its energy-entropy ratio, denoted r; setting an energy-entropy ratio threshold r_t and, if the frame's energy-entropy ratio is not below r_t, merging the frame into the sentence; aborting the scan if it reaches the beginning or end of the voice stream;
Step s105: judging whether the frame length of an independent sentence falls within the set short-sentence frame-length range; if so, comparing historically stored short independent sentence samples with the current independent sentence and, if the matching degree is below the set value, marking the independent sentence as a noise sentence;
Step s106: taking the independent sentences obtained for each framing section of the audio that are not marked as noise sentences as the punctuation of the audio.
2. The audio noise-tolerant punctuation processing method according to claim 1, characterised in that step s101 comprises:
Step s1011: receiving an audio file;
Step s1012: splitting the audio file according to a set slice time to obtain multiple framing sections.
3. The audio noise-tolerant punctuation processing method according to claim 1 or 2, characterised in that step s102 comprises: obtaining the energy threshold ek from the mean of the energy values of the framing sections.
4. The audio noise-tolerant punctuation processing method according to claim 1, characterised in that in step s103 the step of "if the energy of a preceding or following frame is below the set energy threshold et, merging that frame and the sentence middle frame in frame-start order into an independent sentence" comprises:
if the energy of a preceding or following frame is below the set energy et, judging whether the interval between the current frame and the next frame is less than the set interval time and, if so, merging the sentence middle frames in frame-start order into an independent sentence.
5. The audio noise-tolerant punctuation processing method according to claim 1 or 4, characterised by further comprising, after step s103:
Step s1031: if the frame length of an independent sentence exceeds the set independent frame length, calculating the spectral entropy ratio of every frame of that independent sentence and, taking the frame with the lowest spectral entropy ratio as the split point, splitting the independent sentence into two independent sentences.
6. An automatic splitting system for audio punctuation, comprising: a framing unit, an energy threshold acquiring unit, an independent sentence acquiring unit, a spectral entropy analysis unit, a noise sentence judging unit and a punctuation acquiring unit;
the framing unit being configured to obtain multiple framing sections from audio;
the energy threshold acquiring unit being configured to obtain an energy threshold ek from the energy value of each framing section;
the independent sentence acquiring unit being configured to, according to the energy threshold ek, obtain from the framing sections a framing section whose energy value exceeds the energy threshold et, take that framing section as a sentence middle frame and scan its preceding and following frames, and, if the energy of a preceding or following frame is below the set energy threshold et, merge that frame and the sentence middle frame in frame-start order into an independent sentence;
the spectral entropy analysis unit being configured to search backward from the first frame and forward from the last frame of each sentence; if the next frame found belongs to another sentence, merge the two sentences; if the energy of the next frame is below et and the frame does not belong to another sentence, apply a Fourier transform to it, take the amplitudes from 0 to 4000 Hz and divide them into z spectral bands of fixed width, the intensity of band i being v_i, i = 1, 2, ..., z, the total intensity being v_sum and the probability p_i of each band being computed as:
p_i = v_i / v_sum
the spectral entropy of the frame then being:
H = -Σ_{i=1}^{z} p_i · log p_i
the ratio of a frame's energy to its spectral entropy being its energy-entropy ratio, denoted r; an energy-entropy ratio threshold r_t being set and, if the frame's energy-entropy ratio is not below r_t, the frame being merged into the sentence; the scan aborting if it reaches the beginning or end of the voice stream;
the noise sentence judging unit being configured to judge whether the frame length of an independent sentence falls within the set short-sentence frame-length range and, if so, compare historically stored short independent sentence samples with the current independent sentence and, if the matching degree is below the set value, mark the independent sentence as a noise sentence;
the punctuation acquiring unit being configured to take the independent sentences obtained for each framing section of the audio that are not marked as noise sentences as the punctuation of the audio.
7. The automatic splitting system for audio punctuation according to claim 6, characterised in that the framing unit is further configured to: receive an audio file; and split the audio file according to a set slice time to obtain multiple framing sections.
8. The automatic splitting system for audio punctuation according to claim 6 or 7, characterised in that the energy threshold acquiring unit is further configured to obtain the energy threshold ek from the mean of the energy values of the framing sections.
9. The automatic splitting system for audio punctuation according to claim 6, characterised in that the independent sentence acquiring unit is further configured to, if the energy of a preceding or following frame is below the set energy et, judge whether the interval between the current frame and the next frame is less than the set interval time and, if so, merge the sentence middle frames in frame-start order into an independent sentence.
10. The automatic splitting system for audio punctuation according to claim 6 or 9, characterised by further comprising: a long sentence judging unit;
the long sentence judging unit being configured to, if the frame length of an independent sentence exceeds the set independent frame length, calculate the spectral entropy ratio of every frame of that independent sentence and, taking the frame with the lowest spectral entropy ratio as the split point, split the independent sentence into two independent sentences.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610799384.7A CN106373592B (en) | 2016-08-31 | 2016-08-31 | Audio noise-tolerant punctuation processing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106373592A true CN106373592A (en) | 2017-02-01 |
CN106373592B CN106373592B (en) | 2019-04-23 |
Family
ID=57899361
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610799384.7A Active CN106373592B (en) | 2016-08-31 | 2016-08-31 | Audio noise-tolerant punctuation processing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106373592B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107424628A (en) * | 2017-08-08 | 2017-12-01 | 哈尔滨理工大学 | A method for searching specific-target speech endpoints in a noisy environment |
CN109389999A (en) * | 2018-09-28 | 2019-02-26 | 北京亿幕信息技术有限公司 | A high-performance automatic audio/video punctuation method and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000132177A (en) * | 1998-10-20 | 2000-05-12 | Canon Inc | Device and method for processing voice |
CN1622193A (en) * | 2004-12-24 | 2005-06-01 | 北京中星微电子有限公司 | Voice signal detection method |
CN101625862A (en) * | 2008-07-10 | 2010-01-13 | 新奥特(北京)视频技术有限公司 | Method for detecting voice interval in automatic caption generating system |
CN103345922A (en) * | 2013-07-05 | 2013-10-09 | 张巍 | Large-length voice full-automatic segmentation method |
CN103426440A (en) * | 2013-08-22 | 2013-12-04 | 厦门大学 | Voice endpoint detection device and voice endpoint detection method utilizing energy spectrum entropy spatial information |
CN107424628A (en) * | 2017-08-08 | 2017-12-01 | 哈尔滨理工大学 | A method for searching specific-target speech endpoints in a noisy environment |
Non-Patent Citations (2)
Title |
---|
SUN YIMING et al.: "Voice activity detection based on the improved dual-threshold method", 2015 International Conference on Intelligent Transportation * |
WANG Yang et al.: "A time-frequency combined endpoint detection algorithm for noisy speech", Journal of Natural Science of Heilongjiang University * |
Also Published As
Publication number | Publication date |
---|---|
CN106373592B (en) | 2019-04-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106157951B (en) | Automatic splitting method and system for audio punctuation | |
CN101685634B (en) | Children speech emotion recognition method | |
CN101751919B (en) | Spoken Chinese stress automatic detection method | |
CN103345922B (en) | A kind of large-length voice full-automatic segmentation method | |
CN100514446C (en) | Pronunciation evaluating method based on voice identification and voice analysis | |
CN104200804A (en) | Various-information coupling emotion recognition method for human-computer interaction | |
CN105374352A (en) | Voice activation method and system | |
CN101625857A (en) | Self-adaptive voice endpoint detection method | |
CN105825852A (en) | Oral English reading test scoring method | |
CN104517605B (en) | A kind of sound bite splicing system and method for phonetic synthesis | |
CN101625862B (en) | Method for detecting voice interval in automatic caption generating system | |
CN106875943A (en) | A kind of speech recognition system for big data analysis | |
CN106297765B (en) | Phoneme synthesizing method and system | |
CN110176228A (en) | A kind of small corpus audio recognition method and system | |
CN106303695A (en) | Audio translation multiple language characters processing method and system | |
CN103035252B (en) | Chinese speech signal processing method, Chinese speech signal processing device and hearing aid device | |
CN106373592A (en) | Audio noise tolerance punctuation processing method and system | |
Amir et al. | Unresolved anger: Prosodic analysis and classification of speech from a therapeutic setting | |
DE60318450T2 (en) | Apparatus and method for segmentation of audio data in meta-patterns | |
CN112231440A (en) | Voice search method based on artificial intelligence | |
CN101419796A (en) | Device and method for automatically splitting speech signal of single character | |
CN110299133A (en) | The method for determining illegally to broadcast based on keyword | |
Cahyaningtyas et al. | Development of under-resourced Bahasa Indonesia speech corpus | |
CN111210845B (en) | Pathological voice detection device based on improved autocorrelation characteristics | |
Vlaj et al. | Quick and efficient definition of hangbefore and hangover criteria for voice activity detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||