CN105845129A - Method and system for dividing sentences in audio and automatic caption generation method and system for video files - Google Patents

Method and system for dividing sentences in audio and automatic caption generation method and system for video files

Info

Publication number
CN105845129A
CN105845129A (application CN201610178500.3A)
Authority
CN
China
Prior art keywords
sentence
audio
audio frequency
pause
cutting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610178500.3A
Other languages
Chinese (zh)
Inventor
蔡炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Leshi Zhixin Electronic Technology Tianjin Co Ltd
LeTV Holding Beijing Co Ltd
Original Assignee
Leshi Zhixin Electronic Technology Tianjin Co Ltd
LeTV Holding Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Leshi Zhixin Electronic Technology Tianjin Co Ltd, LeTV Holding Beijing Co Ltd filed Critical Leshi Zhixin Electronic Technology Tianjin Co Ltd
Priority to CN201610178500.3A priority Critical patent/CN105845129A/en
Publication of CN105845129A publication Critical patent/CN105845129A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 - Processing of audio elementary streams
    • H04N21/4394 - Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 - End-user applications
    • H04N21/488 - Data services, e.g. news ticker
    • H04N21/4884 - Data services, e.g. news ticker for displaying subtitles

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiment of the invention discloses a method and system for dividing sentences in audio and an automatic caption generation method and system for video files. The method for dividing sentences in audio includes the steps of identifying a first pause, identifying a first sentence, identifying a second pause, and determining whether the audio is finished; if not, the sentence/pause identification steps are repeated until the audio is finished. The pause has a minimum length restriction, and the sentence has both a minimum length restriction and a maximum length restriction. The speech recognition rate is thus increased, which makes fully automatic caption production possible.

Description

Method and system for dividing sentences in audio and automatic caption generation method and system for video files
Technical field
The present invention relates to the field of electronic technology, and in particular to a method and system for segmenting sentences in audio, and to a method and system for automatically generating captions for video files.
Background art
Captions (subtitles) display in written form the non-visual content of film and television programs, such as dialogue; the term also refers to text added during the post-production of such works, and captions are indispensable for film and television programs. Existing caption production is done mainly by hand by caption producers and involves processes such as transcription, translation, polishing, time-axis alignment and post-processing; it is inefficient, procedurally complex, and requires a great deal of manpower and material resources.
Summary of the invention
Therefore, the technical problem to be solved by the present invention is that existing caption production is inefficient, procedurally complex, and requires a great deal of manpower and material resources.
To this end, an embodiment of the present invention provides a method for segmenting sentences in audio, comprising:
S1, identifying a first pause, the pause comprising a silent segment and/or a non-speech segment, and recording the start time and end time of the first pause;
S2, identifying a first sentence, the sentence comprising speech segments, and setting the start time of the first sentence to the end time of the first pause;
S3, identifying a second pause, recording the start time and end time of the second pause, setting the end time of the first sentence to the start time of the second pause, and thereby completing the segmentation of the first sentence;
S4, judging whether the audio has ended; if not, repeating steps S2-S3; if so, proceeding to step S5;
S5, ending;
wherein the pause has a minimum length restriction, used to ignore short sounds; the sentence has a minimum length restriction, used to filter out short-lived invalid information in the audio; and the sentence also has a maximum length restriction, used to limit the length of the sentence and improve recognition accuracy.
Preferably, the minimum length restriction of the pause is 2 audio sections.
Preferably, the minimum length restriction of the sentence is 3 audio sections.
Preferably, the maximum length restriction of the sentence is 50 audio sections.
An embodiment of the present invention also provides a method for automatically generating captions for a video file, comprising the following steps:
S1, extracting the audio from the video file to be processed;
S2, classifying the audio sections in the audio, the classes comprising silence, speech and non-speech;
S3, segmenting sentences in the audio using any of the aforementioned methods for segmenting sentences in audio;
S4, performing speech recognition on the sentences, and recording the corresponding text and the start and end time information of each sentence;
S5, generating captions according to the text and the start and end time information.
Preferably, in step S1, ffmpeg is used to extract the audio, and a corresponding decoder decodes the audio into PCM data.
Preferably, in step S2, Marsyas is used to classify the audio sections.
Preferably, in step S4, HTK is used as the recognition tool to perform speech recognition on the sentences.
An embodiment of the present invention also provides a system for segmenting sentences in audio, comprising:
a pause identification module, configured to identify pauses comprising silent segments and/or non-speech segments, and to record the start time and end time of each pause;
a sentence identification module, configured to identify sentences comprising speech segments, to set the start time of a sentence to the end time of the adjacent preceding pause, and to set the end time of the sentence to the start time of the adjacent following pause;
an audio end judging module, configured to judge whether the audio has ended;
wherein the pause has a minimum length restriction, used to ignore short sounds; the sentence has a minimum length restriction, used to filter out short-lived invalid information in the audio; and the sentence also has a maximum length restriction, used to limit the length of the sentence and improve recognition accuracy.
An embodiment of the present invention also provides a system for automatically generating captions for a video file, comprising:
an audio extraction module, configured to extract the audio from the video file;
an audio section classification module, configured to classify the audio sections in the audio, the classes comprising silence, speech and non-speech;
a sentence segmentation module, configured to segment sentences in the audio using the system for segmenting sentences in audio according to claim 9;
a speech recognition module, configured to perform speech recognition on the sentences and to record the corresponding text and the start and end time information of each sentence;
a caption generation module, configured to generate captions according to the text corresponding to the sentences and the start and end time information.
By introducing three variables, namely a minimum pause length restriction, a minimum sentence length restriction and a maximum sentence length restriction, the method and system for segmenting sentences in audio and the method and system for automatically generating captions for video files according to the embodiments of the present invention improve the speech recognition rate and make fully automatic caption production possible.
Brief description of the drawings
In order to explain the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for describing the specific embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of the method for segmenting sentences in audio according to an embodiment of the present invention;
Fig. 2 is a flowchart of the method for automatically generating captions for a video file according to an embodiment of the present invention;
Fig. 3 is a structural block diagram of the system for segmenting sentences in audio according to an embodiment of the present invention;
Fig. 4 is a structural block diagram of the system for automatically generating captions for a video file according to an embodiment of the present invention.
Detailed description of the invention
The technical solutions of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.
The technical solutions of the present invention are described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1, an embodiment of the present invention provides a method for segmenting sentences in audio, comprising:
S1, identifying a first pause, the pause comprising a silent segment and/or a non-speech segment, and recording the start time and end time of the first pause.
Specifically, the start time of the first pause may be the start time of the audio, and its end time may be the time at which the first speech segment begins.
S2, identifying a first sentence, a sentence comprising speech segments, and setting the start time of the first sentence to the end time of the first pause.
S3, identifying a second pause, recording the start time and end time of the second pause, setting the end time of the first sentence to the start time of the second pause, and completing the segmentation of the first sentence.
S4, judging whether the audio has ended; if not, repeating steps S2-S3; if so, proceeding to step S5.
S5, ending.
Here, the pause has a minimum length restriction, used to ignore short sounds; the sentence has a minimum length restriction, used to filter out short-lived invalid information in the audio; and the sentence also has a maximum length restriction, used to limit the length of the sentence and improve recognition accuracy.
The purpose of sentence segmentation is to obtain short sentences that are easy to recognize. Accurately detecting the start time and end time of each sentence is crucial, because only with sufficiently high endpoint-detection precision can sentence length be controlled in a targeted way. However, detecting sentence breakpoints tends to produce two extreme cases. The first is that many extremely short sentences appear, some only one or two audio sections long; such sentences usually contain only one or two words, or even no valid speech information at all. The second is that some very long sentences appear, some lasting ten or even tens of seconds and containing several semantically complete units. Both cases seriously degrade the recognition rate.
The sentence segmentation method of the embodiment of the present invention introduces the three variables mentioned above, namely the minimum pause length restriction, the minimum sentence length restriction and the maximum sentence length restriction, which effectively prevent both extreme cases and thereby improve the speech recognition rate.
Preferably, the minimum length restriction of the pause is 2 audio sections.
As described above, the minimum pause length restriction is set in order to ignore short sounds, such as a speaker's momentary breath, and thereby preserve the integrity of a sentence. Through repeated research and experiments, the applicant found that setting the minimum pause length to 2 audio sections ensures that a single non-speech unit inside a continuous speech unit is not treated as a pause, which preserves the integrity of the sentence.
Preferably, the minimum length restriction of the sentence is 3 audio sections.
Specifically, the minimum sentence length is the number of speech sections that a sentence must contain. The minimum sentence length restriction serves to filter out short-lived invalid information in the audio, such as a speaker's slight cough. It has been found that setting the minimum sentence length to 3 audio sections, i.e. ignoring speech units with a total length of less than 0.48 seconds, effectively filters out short-lived invalid information such as coughs, sighs and breaths.
Preferably, the maximum length restriction of the sentence is 50 audio sections.
An overly long sentence increases the difficulty of speech recognition and lowers the recognition rate. Therefore, when the number of speech sections in a sentence reaches a certain limit, measures should be taken to end the sentence as soon as possible. By setting the maximum sentence length to 50 audio sections, once this limit is reached even a single non-speech unit is treated as a pause, which effectively limits the length of the sentence and improves its recognition accuracy.
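The following Python sketch is offered only as an illustration of the segmentation logic described above; it is not the claimed method itself. It assumes the audio has already been classified into a per-section label sequence (as in step S2 of the caption generation method below), and the function name, the labels argument and the returned (start, end) pairs are illustrative choices. The three limits use the preferred values of the embodiments, and one audio section is assumed to last 0.16 s as described later.

    # Illustrative sketch only; names and data layout are assumptions, not the patented method.
    MIN_PAUSE_SECTIONS = 2      # preferred minimum pause length
    MIN_SENTENCE_SECTIONS = 3   # preferred minimum sentence length
    MAX_SENTENCE_SECTIONS = 50  # preferred maximum sentence length
    SECTION_SEC = 0.16          # one audio section = 5 frames x 32 ms

    def segment_sentences(labels):
        """labels: per-section classes, e.g. ['silence', 'speech', 'non-speech', ...].
        Returns a list of (start_sec, end_sec) sentence boundaries."""
        sentences = []
        sent_start = None   # index of the first section of the current sentence
        pause_run = 0       # length of the current run of silent/non-speech sections

        for i, label in enumerate(labels):
            if label == "speech":
                if sent_start is None:
                    sent_start = i          # a sentence starts where the pause ended
                pause_run = 0
                continue

            pause_run += 1                  # silent or non-speech section
            if sent_start is None:
                continue                    # still inside a leading or inter-sentence pause

            pause_start = i - pause_run + 1
            sent_len = pause_start - sent_start
            # Once the sentence is long enough, even a single non-speech section ends it.
            needed = 1 if sent_len >= MAX_SENTENCE_SECTIONS else MIN_PAUSE_SECTIONS
            if pause_run >= needed:
                if sent_len >= MIN_SENTENCE_SECTIONS:   # drop coughs, sighs, breaths
                    sentences.append((sent_start * SECTION_SEC, pause_start * SECTION_SEC))
                sent_start = None

        if sent_start is not None:                      # audio ended inside a sentence
            end = len(labels) - pause_run
            if end - sent_start >= MIN_SENTENCE_SECTIONS:
                sentences.append((sent_start * SECTION_SEC, end * SECTION_SEC))
        return sentences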
As shown in Fig. 2, an embodiment of the present invention also provides a method for automatically generating captions for a video file, comprising the following steps:
S1, extracting the audio from the video file to be processed.
S2, classifying the audio sections in the audio, the classes comprising silence, speech and non-speech.
S3, segmenting sentences in the audio using any of the above methods for segmenting sentences in audio.
S4, performing speech recognition on the sentences, and recording the corresponding text and the start and end time information of each sentence.
S5, generating captions according to the text and the start and end time information.
Specifically, the captions are srt text subtitles. There are many kinds of subtitles, and the most popular subtitle formats fall into two classes: graphical formats and text formats. Compared with graphical subtitles, text subtitles are small, simple in format, and easy to create and modify. Among them, the srt text subtitle format is the most widely used and is compatible with all common media players.
Preferably, in order to optimize the display and make the captions easier for viewers to read, longer sentences in the recognition result are split across multiple display lines.
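As an illustration of step S5, the sketch below writes recognized sentences to an srt file and splits long sentences across several display lines. The 20-character line limit and the fixed-width wrapping are arbitrary illustrative choices; the patent does not prescribe how the split is performed.

    def fmt_ts(seconds):
        """Format a time offset as the srt timestamp HH:MM:SS,mmm."""
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    def wrap(text, max_chars=20):
        """Split a long recognized sentence into display lines (naive fixed-width wrap)."""
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [""]

    def write_srt(sentences, path):
        """sentences: list of (start_sec, end_sec, text) produced by recognition."""
        with open(path, "w", encoding="utf-8") as f:
            for idx, (start, end, text) in enumerate(sentences, 1):
                f.write(f"{idx}\n{fmt_ts(start)} --> {fmt_ts(end)}\n")
                f.write("\n".join(wrap(text)) + "\n\n")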
Preferably, in step S1, ffmpeg is used to extract the audio, and a corresponding decoder decodes the audio into PCM data.
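A minimal sketch of this extraction step, assuming the ffmpeg command-line tool is available on the path; the 16 kHz mono output format is an assumption made for the recognizer, the patent only requires decoding to PCM.

    import subprocess

    def extract_pcm(video_path, wav_path, sample_rate=16000):
        """Extract the audio track of a video file and decode it to 16-bit mono PCM (WAV)."""
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path,
             "-vn",                   # drop the video stream
             "-acodec", "pcm_s16le",  # decode to signed 16-bit little-endian PCM
             "-ac", "1",              # mono
             "-ar", str(sample_rate), # resample for the recognizer (assumed rate)
             wav_path],
            check=True)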
Preferably, in step S2, Marsyas is used to classify the audio sections.
Specifically, using the interface provided by Marsyas, the frame length is set to 32 ms and the section length to 0.16 s, i.e. one audio section contains 5 audio frames.
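With these settings the relation between frames, audio sections and time is fixed, which is what allows the length limits above to be expressed in seconds. The small sketch below only restates that arithmetic; the constant names are illustrative.

    FRAME_MS = 32            # Marsyas analysis frame length used in the embodiment
    FRAMES_PER_SECTION = 5   # one audio section = 5 frames
    SECTION_SEC = FRAME_MS * FRAMES_PER_SECTION / 1000.0   # = 0.16 s

    def section_to_seconds(section_index):
        """Time offset of the start of a given audio section, in seconds."""
        return section_index * SECTION_SEC

    # With these values the preferred limits of the embodiments correspond to:
    #   minimum pause    = 2 sections  = 0.32 s
    #   minimum sentence = 3 sections  = 0.48 s (the 0.48 s figure mentioned above)
    #   maximum sentence = 50 sections = 8.0 s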
Preferably, in step S4, HTK is used as the recognition tool to perform speech recognition on the sentences.
Specifically, HTK is used as a large-vocabulary continuous speech recognition tool to recognize the sentences, producing a piece of text for each sentence; the recognized text of each sentence and the corresponding start and end time information are stored.
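The patent does not spell out how HTK is invoked. The sketch below shells out to HTK's HVite decoder using its commonly documented options; the model, word-network, dictionary and HMM-list file names are placeholders, and a trained model set is assumed to exist.

    import subprocess

    def recognise_with_htk(scp_file, out_mlf,
                           hmmdefs="hmmdefs", wdnet="wdnet",
                           dictionary="dict", hmmlist="tiedlist"):
        """Decode the feature files listed in scp_file (one per segmented sentence) and
        write the recognised words with their start/end times to the MLF out_mlf."""
        subprocess.run(
            ["HVite",
             "-H", hmmdefs,   # acoustic model definitions
             "-S", scp_file,  # script file listing the feature files to decode
             "-i", out_mlf,   # output master label file (words + times)
             "-w", wdnet,     # word network / language model
             dictionary, hmmlist],
            check=True)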
As shown in Fig. 3, an embodiment of the present invention also provides a system 1 for segmenting sentences in audio, comprising:
a pause identification module 2, configured to identify pauses comprising silent segments and/or non-speech segments, and to record the start time and end time of each pause;
a sentence identification module 3, configured to identify sentences comprising speech segments, to set the start time of a sentence to the end time of the adjacent preceding pause, and to set the end time of the sentence to the start time of the adjacent following pause;
an audio end judging module 4, configured to judge whether the audio has ended;
wherein the pause has a minimum length restriction, used to ignore short sounds; the sentence has a minimum length restriction, used to filter out short-lived invalid information in the audio; and the sentence also has a maximum length restriction, used to limit the length of the sentence and improve recognition accuracy.
As shown in Fig. 4, an embodiment of the present invention also provides a system 11 for automatically generating captions for a video file, comprising:
an audio extraction module 12, configured to extract the audio from the video file;
an audio section classification module 13, configured to classify the audio sections in the audio, the classes comprising silence, speech and non-speech;
a sentence segmentation module 14, configured to segment sentences in the audio using the above system for segmenting sentences in audio;
a speech recognition module 15, configured to perform speech recognition on the sentences and to record the corresponding text and the start and end time information of each sentence;
a caption generation module 16, configured to generate captions according to the text corresponding to the sentences and the start and end time information.
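To show how the modules of Fig. 4 fit together, the sketch below chains the helpers sketched earlier in this description; classify_sections() and recognise() are hypothetical stand-ins for the Marsyas classification and HTK recognition steps and are not defined here.

    def generate_captions(video_path, srt_path, wav_path="audio.wav"):
        """End-to-end illustration of the caption pipeline (not the claimed system itself)."""
        extract_pcm(video_path, wav_path)          # audio extraction module
        labels = classify_sections(wav_path)       # audio section classification module (stand-in)
        spans = segment_sentences(labels)          # sentence segmentation module
        results = [(start, end, recognise(wav_path, start, end))   # speech recognition (stand-in)
                   for start, end in spans]
        write_srt(results, srt_path)               # caption generation module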
Obviously, the above embodiments are merely examples given for the sake of clear illustration and are not intended to limit the embodiments. For those of ordinary skill in the art, other changes or variations in different forms can also be made on the basis of the above description. It is neither necessary nor possible to exhaustively list all embodiments here. Obvious changes or variations derived therefrom still fall within the scope of protection of the present invention.

Claims (10)

1. A method for segmenting sentences in audio, characterized by comprising the following steps:
S1, identifying a first pause, the pause comprising a silent segment and/or a non-speech segment, and recording the start time and end time of the first pause;
S2, identifying a first sentence, the sentence comprising speech segments, and setting the start time of the first sentence to the end time of the first pause;
S3, identifying a second pause, recording the start time and end time of the second pause, setting the end time of the first sentence to the start time of the second pause, and thereby completing the segmentation of the first sentence;
S4, judging whether the audio has ended; if not, repeating steps S2-S3; if so, proceeding to step S5;
S5, ending;
wherein the pause has a minimum length restriction, used to ignore short sounds; the sentence has a minimum length restriction, used to filter out short-lived invalid information in the audio; and the sentence also has a maximum length restriction, used to limit the length of the sentence and improve recognition accuracy.
2. The method according to claim 1, characterized in that the minimum length restriction of the pause is 2 audio sections.
3. The method according to any one of claims 1-2, characterized in that the minimum length restriction of the sentence is 3 audio sections.
4. The method according to any one of claims 1-3, characterized in that the maximum length restriction of the sentence is 50 audio sections.
5. A method for automatically generating captions for a video file, characterized by comprising the following steps:
S1, extracting the audio from the video file to be processed;
S2, classifying the audio sections in the audio, the classes comprising silence, speech and non-speech;
S3, segmenting sentences in the audio using the method for segmenting sentences in audio according to any one of claims 1-4;
S4, performing speech recognition on the sentences, and recording the corresponding text and the start and end time information of each sentence;
S5, generating captions according to the text and the start and end time information.
6. The method according to claim 5, characterized in that in step S1, ffmpeg is used to extract the audio, and a corresponding decoder decodes the audio into PCM data.
7. The method according to any one of claims 5-6, characterized in that in step S2, Marsyas is used to classify the audio sections.
8. The method according to any one of claims 5-7, characterized in that in step S4, HTK is used as the recognition tool to perform speech recognition on the sentences.
9. A system for segmenting sentences in audio, characterized by comprising:
a pause identification module, configured to identify pauses comprising silent segments and/or non-speech segments, and to record the start time and end time of each pause;
a sentence identification module, configured to identify sentences comprising speech segments, to set the start time of a sentence to the end time of the adjacent preceding pause, and to set the end time of the sentence to the start time of the adjacent following pause;
an audio end judging module, configured to judge whether the audio has ended;
wherein the pause has a minimum length restriction, used to ignore short sounds; the sentence has a minimum length restriction, used to filter out short-lived invalid information in the audio; and the sentence also has a maximum length restriction, used to limit the length of the sentence and improve recognition accuracy.
10. A system for automatically generating captions for a video file, characterized by comprising:
an audio extraction module, configured to extract the audio from the video file;
an audio section classification module, configured to classify the audio sections in the audio, the classes comprising silence, speech and non-speech;
a sentence segmentation module, configured to segment sentences in the audio using the system for segmenting sentences in audio according to claim 9;
a speech recognition module, configured to perform speech recognition on the sentences and to record the corresponding text and the start and end time information of each sentence;
a caption generation module, configured to generate captions according to the text corresponding to the sentences and the start and end time information.
CN201610178500.3A 2016-03-25 2016-03-25 Method and system for dividing sentences in audio and automatic caption generation method and system for video files Pending CN105845129A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610178500.3A CN105845129A (en) 2016-03-25 2016-03-25 Method and system for dividing sentences in audio and automatic caption generation method and system for video files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610178500.3A CN105845129A (en) 2016-03-25 2016-03-25 Method and system for dividing sentences in audio and automatic caption generation method and system for video files

Publications (1)

Publication Number Publication Date
CN105845129A true CN105845129A (en) 2016-08-10

Family

ID=56583579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610178500.3A Pending CN105845129A (en) 2016-03-25 2016-03-25 Method and system for dividing sentences in audio and automatic caption generation method and system for video files

Country Status (1)

Country Link
CN (1) CN105845129A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105280206A (en) * 2014-06-23 2016-01-27 广东小天才科技有限公司 Audio playing method and device
CN105159870A (en) * 2015-06-26 2015-12-16 徐信 Processing system for precisely completing continuous natural speech textualization and method for precisely completing continuous natural speech textualization

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106331844A (en) * 2016-08-17 2017-01-11 北京金山安全软件有限公司 Method and device for generating subtitles of media file and electronic equipment
CN106504754A (en) * 2016-09-29 2017-03-15 浙江大学 A kind of real-time method for generating captions according to audio output
CN106528715A (en) * 2016-10-27 2017-03-22 广东小天才科技有限公司 Method and device for checking audio content
CN106504773A (en) * 2016-11-08 2017-03-15 上海贝生医疗设备有限公司 A kind of wearable device and voice and activities monitoring system
CN106506335B (en) * 2016-11-10 2019-08-30 北京小米移动软件有限公司 The method and device of sharing video frequency file
CN106506335A (en) * 2016-11-10 2017-03-15 北京小米移动软件有限公司 The method and device of sharing video frequency file
CN106782506A (en) * 2016-11-23 2017-05-31 语联网(武汉)信息技术有限公司 A kind of method that recorded audio is divided into section
CN106792145A (en) * 2017-02-22 2017-05-31 杭州当虹科技有限公司 A kind of method and apparatus of the automatic overlapping text of audio frequency and video
CN107291676A (en) * 2017-06-20 2017-10-24 广东小天才科技有限公司 Block method, terminal device and the computer-readable storage medium of voice document
CN107766325A (en) * 2017-09-27 2018-03-06 百度在线网络技术(北京)有限公司 Text joining method and its device
US11024332B2 (en) 2017-11-06 2021-06-01 Baidu Online Network Technology (Beijing) Co., Ltd. Cloud-based speech processing method and apparatus
CN107919130A (en) * 2017-11-06 2018-04-17 百度在线网络技术(北京)有限公司 Method of speech processing and device based on high in the clouds
CN107919130B (en) * 2017-11-06 2021-12-17 百度在线网络技术(北京)有限公司 Cloud-based voice processing method and device
CN108062955A (en) * 2017-12-12 2018-05-22 深圳证券信息有限公司 A kind of intelligence report-generating method, system and equipment
CN108597521A (en) * 2018-05-04 2018-09-28 徐涌 Audio role divides interactive system, method, terminal and the medium with identification word
CN110473519B (en) * 2018-05-11 2022-05-27 北京国双科技有限公司 Voice processing method and device
CN110473519A (en) * 2018-05-11 2019-11-19 北京国双科技有限公司 A kind of method of speech processing and device
CN109005445A (en) * 2018-06-26 2018-12-14 卫军征 Multi-medium play method, system, storage medium and playback equipment
CN109166570A (en) * 2018-07-24 2019-01-08 百度在线网络技术(北京)有限公司 A kind of method, apparatus of phonetic segmentation, equipment and computer storage medium
CN109389999A (en) * 2018-09-28 2019-02-26 北京亿幕信息技术有限公司 A kind of high performance audio-video is made pauses in reading unpunctuated ancient writings method and system automatically
CN110418208A (en) * 2018-11-14 2019-11-05 腾讯科技(深圳)有限公司 A kind of subtitle based on artificial intelligence determines method and apparatus
CN110381388B (en) * 2018-11-14 2021-04-13 腾讯科技(深圳)有限公司 Subtitle generating method and device based on artificial intelligence
CN110381389A (en) * 2018-11-14 2019-10-25 腾讯科技(深圳)有限公司 A kind of method for generating captions and device based on artificial intelligence
CN110381388A (en) * 2018-11-14 2019-10-25 腾讯科技(深圳)有限公司 A kind of method for generating captions and device based on artificial intelligence
CN110381389B (en) * 2018-11-14 2022-02-25 腾讯科技(深圳)有限公司 Subtitle generating method and device based on artificial intelligence
CN109379641A (en) * 2018-11-14 2019-02-22 腾讯科技(深圳)有限公司 A kind of method for generating captions and device
CN110223697A (en) * 2019-06-13 2019-09-10 苏州思必驰信息科技有限公司 Interactive method and system
CN110263313A (en) * 2019-06-19 2019-09-20 安徽声讯信息技术有限公司 A kind of man-machine coordination edit methods for meeting shorthand
CN110265026A (en) * 2019-06-19 2019-09-20 安徽声讯信息技术有限公司 A kind of meeting shorthand system and meeting stenography method
CN110265027A (en) * 2019-06-19 2019-09-20 安徽声讯信息技术有限公司 A kind of audio frequency transmission method for meeting shorthand system
CN110263313B (en) * 2019-06-19 2021-08-24 安徽声讯信息技术有限公司 Man-machine collaborative editing method for conference shorthand
CN110265026B (en) * 2019-06-19 2021-07-27 安徽声讯信息技术有限公司 Conference shorthand system and conference shorthand method
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition
CN110246500A (en) * 2019-07-12 2019-09-17 携程旅游信息技术(上海)有限公司 Audio recognition method and system based on recording file
CN110942764B (en) * 2019-11-15 2022-04-22 北京达佳互联信息技术有限公司 Stream type voice recognition method
CN110942764A (en) * 2019-11-15 2020-03-31 北京达佳互联信息技术有限公司 Stream type voice recognition method
WO2022037419A1 (en) * 2020-08-18 2022-02-24 北京字节跳动网络技术有限公司 Audio content recognition method and apparatus, and device and computer-readable medium
US11783808B2 (en) 2020-08-18 2023-10-10 Beijing Bytedance Network Technology Co., Ltd. Audio content recognition method and apparatus, and device and computer-readable medium
CN111986655B (en) * 2020-08-18 2022-04-01 北京字节跳动网络技术有限公司 Audio content identification method, device, equipment and computer readable medium
CN111986655A (en) * 2020-08-18 2020-11-24 北京字节跳动网络技术有限公司 Audio content identification method, device, equipment and computer readable medium
CN111970311A (en) * 2020-10-23 2020-11-20 北京世纪好未来教育科技有限公司 Session segmentation method, electronic device and computer readable medium
CN112287914A (en) * 2020-12-27 2021-01-29 平安科技(深圳)有限公司 PPT video segment extraction method, device, equipment and medium
CN112287914B (en) * 2020-12-27 2021-04-02 平安科技(深圳)有限公司 PPT video segment extraction method, device, equipment and medium
CN112820293A (en) * 2020-12-31 2021-05-18 讯飞智元信息科技有限公司 Voice recognition method and related device
CN113207032A (en) * 2021-04-29 2021-08-03 读书郎教育科技有限公司 System and method for increasing subtitles by recording videos in intelligent classroom
CN113225618A (en) * 2021-05-06 2021-08-06 阿里巴巴新加坡控股有限公司 Video editing method and device


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20160810)