CN105898556A - Plug-in subtitle automatic synchronization method and device - Google Patents

Plug-in subtitle automatic synchronization method and device

Info

Publication number
CN105898556A
CN105898556A (application CN201511018280.XA)
Authority
CN
China
Prior art keywords
plug
time
audio
initial time
short sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201511018280.XA
Other languages
Chinese (zh)
Inventor
蔡炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Leshi Zhixin Electronic Technology Tianjin Co Ltd
Original Assignee
Leshi Zhixin Electronic Technology Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Leshi Zhixin Electronic Technology Tianjin Co Ltd filed Critical Leshi Zhixin Electronic Technology Tianjin Co Ltd
Priority to CN201511018280.XA priority Critical patent/CN105898556A/en
Publication of CN105898556A publication Critical patent/CN105898556A/en
Pending legal-status Critical Current

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to the technical field of video playback and discloses a method and device for automatic synchronization of plug-in subtitles. The method comprises the following steps: extracting the audio portion of a video file and decoding it to obtain pulse code modulation data; cutting the pulse code modulation data into audio fragments and classifying the audio fragments; dividing the audio fragments classified as speech into short sentences and determining the start time and end time of each short sentence; searching for a match item in a plug-in subtitle file according to the determined start time and end time of a short sentence; and changing the start time of the match item to the presentation time stamp (PTS) of the current video and, according to the presentation time stamp, updating the start time of each item in the plug-in subtitle file whose start time is greater than that of the match item. The display time of the subtitle file is thus made consistent with the playback time of the audio and video, realizing automatic synchronization of plug-in subtitles and improving the user's viewing experience.

Description

Method and device for automatic synchronization of plug-in subtitles
Technical field
The present invention relates to the technical field of video playback, and in particular to a method and device for automatic synchronization of plug-in subtitles.
Background art
Subtitles refer to the non-image content, such as dialogue, of television programs, films, and stage productions displayed in written form, and also to the text added in the post-production of film and television works. When a video work such as a film is produced, the video file and the subtitle file may be merged into one, so that the subtitles cannot be changed or removed during playback; such subtitles are called embedded subtitles. In other works, the video file and the subtitle file exist separately, and a subtitle file of the desired version can be imported at playback time; such subtitle files are called plug-in subtitles. Compared with embedded subtitles, plug-in subtitles are versatile and flexible, convenient to import, and do not compromise video quality.
Plug-in subtitles are typically produced with dedicated subtitling software. This production method first requires a person to listen to all the lines and type the complete dialogue into an electronic text according to the content of each line. Then, using the subtitling software, the producer listens to the dialogue again while manually marking breaks, so as to determine the start time and duration of each line of dialogue, the so-called "timeline". When the whole subtitle production is complete, the software exports a plug-in subtitle file in one or several formats. When a playback system recognizes and supports the plug-in subtitle playback mode, the subtitle file can be loaded during video playback. However, because of the way plug-in subtitle files are produced, their time markers are of poor accuracy, so they synchronize poorly with the audio and video during playback; manually adjusting the subtitle display time is cumbersome for the user and seriously affects normal viewing.
Summary of the invention
An object of the present invention is to provide a method and device for automatic synchronization of plug-in subtitles, so that the display time of the subtitle file is consistent with the playback time of the audio and video, thereby realizing automatic synchronization of plug-in subtitles and improving the user's viewing experience.
To solve the above technical problem, an embodiment of the present invention provides a method for automatic synchronization of plug-in subtitles, comprising the following steps: extracting the audio portion of a video file and decoding the audio portion to obtain pulse code modulation data; cutting the pulse code modulation data into audio fragments and classifying the audio fragments, wherein the classification categories comprise silence, speech, and non-speech; dividing the audio fragments classified as speech into short sentences and determining the start time and end time of each short sentence; searching for a match item in a plug-in subtitle file according to the determined start time and end time of a short sentence; and changing the start time of the match item to the presentation time stamp (PTS) of the current video and, according to the presentation time stamp, updating the start time of each item in the plug-in subtitle file whose start time is greater than that of the match item.
An embodiment of the present invention further provides a device for automatic synchronization of plug-in subtitles, comprising an extraction module, a cutting module, a division module, a search module, and an update module. The extraction module is configured to extract the audio portion of a video file and decode the audio portion to obtain pulse code modulation data. The cutting module is configured to cut the pulse code modulation data into audio fragments and classify the audio fragments, wherein the classification categories comprise silence, speech, and non-speech. The division module is configured to divide the audio fragments classified as speech into short sentences and determine the start time and end time of each short sentence. The search module is configured to search for a match item in a plug-in subtitle file according to the determined start time and end time of a short sentence. The update module is configured to change the start time of the match item to the presentation time stamp (PTS) of the current video and, according to the presentation time stamp, update the start time of each item in the plug-in subtitle file whose start time is greater than that of the match item.
Compared with the prior art, the embodiments of the present invention extract the audio portion of a video file, decode it to obtain pulse code modulation data, cut the pulse code modulation data into audio fragments, and classify the fragments as speech, silence, or non-speech; the fragments classified as speech are then divided into short sentences, and the start time and end time of each short sentence are determined. According to the determined start time and end time of a short sentence, a match item is searched for in the plug-in subtitle file, the start time of the match item is changed to the presentation time stamp (PTS) of the current video, and, according to the presentation time stamp, the start time of each item in the plug-in subtitle file whose start time is greater than that of the match item is updated. As a result, the display time of each line of dialogue in the subtitle file is automatically synchronized with video playback, improving the user's viewing experience.
Preferably, the step of searching for a match item in the plug-in subtitle file according to the determined start time and end time of the short sentence comprises the following sub-steps: within a preset duration before and after the start time, finding corresponding items in the plug-in subtitle file; among the corresponding items found, finding all items whose dialogue duration is within an allowed error range of the dialogue duration of the short sentence; and, if more than one item is found, comparing the previous record of the determined short sentence with the previous record of each found item until the most similar one is found as the match item. This improves the efficiency and accuracy of matching subtitles to the audio and video.
Preferably, in the step of dividing the audio fragments into short sentences, the division is performed according to speech pauses, wherein a speech pause comprises at least a first preset number of audio segments. This improves the efficiency of sentence division.
Preferably, the first preset number is 2. Shorter sound events can thus be ignored, better protecting the integrity of a sentence.
Preferably, a short sentence comprises at least a second preset number of audio segments, and the second preset number is 3. Transient invalid information in the audio can thus be filtered out, improving the efficiency of sentence division.
Brief description of the drawings
Fig. 1 is a flowchart of the method for automatic synchronization of plug-in subtitles according to the first embodiment of the present invention;
Fig. 2 is a schematic diagram of the short sentence and subtitle item matching algorithm according to the first embodiment of the present invention;
Fig. 3 is a structural block diagram of the device for automatic synchronization of plug-in subtitles according to the second embodiment of the present invention.
Detailed description of the invention
To make the objects, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are explained in detail below with reference to the accompanying drawings. Those skilled in the art will understand, however, that many technical details are set forth in the embodiments merely to help the reader better understand the present application; the technical solutions claimed in the present application can be realized even without these technical details and with various changes and modifications based on the following embodiments.
The first embodiment of the present invention relates to a method for automatic synchronization of plug-in subtitles. The specific flow is shown in Fig. 1 and comprises the following steps.
Step 10: extract the audio portion of the video file and decode the audio portion to obtain pulse code modulation data.
A video file is a synthesis of a video stream and an audio stream. When a video is played online, the audio stream is first extracted from the video file. The open-source library ffmpeg can be used to extract the audio portion of the video file, and the corresponding decoder then decodes the audio portion into PCM (Pulse Code Modulation) data.
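The extraction and decoding step can be sketched as follows. This is a minimal sketch built on the ffmpeg command line: the description only names the ffmpeg library, so the exact flags, the 16 kHz mono output, and the function names here are illustrative assumptions.

```python
import subprocess

def pcm_command(video_path, sample_rate=16000):
    """Build an ffmpeg command line that decodes the audio track of
    `video_path` to raw 16-bit mono PCM on stdout. One common choice
    of flags; the patent itself does not fix the invocation."""
    return [
        "ffmpeg", "-i", video_path,  # input video file
        "-vn",                       # drop the video stream
        "-f", "s16le",               # raw signed 16-bit little-endian PCM
        "-acodec", "pcm_s16le",
        "-ac", "1",                  # mono
        "-ar", str(sample_rate),     # target sample rate
        "-",                         # write to stdout
    ]

def extract_pcm(video_path, sample_rate=16000):
    """Run ffmpeg and return the decoded PCM bytes."""
    return subprocess.run(pcm_command(video_path, sample_rate),
                          capture_output=True, check=True).stdout
```

In practice the raw bytes returned by `extract_pcm` would be handed to the classification step below as 16-bit samples.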
Step 11: cut the pulse code modulation data into audio fragments and classify the audio fragments.
In this embodiment, the Marsyas software can be used to classify the extracted audio (the pulse code modulation data); for example, Marsyas can determine the category of the audio data: silence, speech, or non-speech. Through the interface provided by Marsyas, the audio frame length can be set to 32 ms, and 5 audio frames can be treated as one audio segment, i.e., an audio segment length of 0.16 s. During classification, classifying one audio segment at a time improves efficiency. This embodiment does not limit the classification method for audio fragments, as long as speech and non-speech can be distinguished. The classification in this step yields the start time and end time of the speech fragments within the audio, laying the foundation for extracting speech sentences from the audio fragments.
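The frame-to-segment aggregation described above can be sketched as follows. Marsyas (or any classifier that separates speech from non-speech) is assumed to supply the per-frame labels; the majority-vote aggregation is an illustrative choice, not specified by the description.

```python
from collections import Counter

FRAME_MS = 32           # frame length set through the classifier interface
FRAMES_PER_SEGMENT = 5  # five 32 ms frames form one 0.16 s audio segment

def segment_labels(frame_labels):
    """Aggregate per-frame labels ("silence", "speech", "non-speech")
    into one label per 0.16 s audio segment by majority vote."""
    out = []
    for i in range(0, len(frame_labels) - FRAMES_PER_SEGMENT + 1,
                   FRAMES_PER_SEGMENT):
        window = frame_labels[i:i + FRAMES_PER_SEGMENT]
        out.append(Counter(window).most_common(1)[0][0])
    return out
```

Working at segment granularity rather than frame granularity is what gives the efficiency gain the description mentions.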
Step 12: divide the audio fragments classified as speech into short sentences, and determine the start time and end time of each short sentence. Through the classification in step 11, the start times and end times of speech, non-speech, and silence can be determined, and the speech can then be divided into short sentences according to speech pauses.
Detecting the beginning and end of a sentence is the key to short sentence division in this embodiment: only with sufficiently high endpoint detection accuracy can sentence length and sentence count be controlled as intended. Based on the classification information obtained in step 11, this step cuts speech units (i.e., short sentences) out of the audio with a preset segmentation algorithm. Specifically, the following strategy can be used to cut the audio: the time point at which the silence segment or non-speech segment preceding a continuous speech segment ends is taken as the start time of a sentence, and the time point of the last speech segment at the end of a continuous speech segment is taken as the end time of the sentence. Cutting the audio with speech pauses of a certain length as sentence boundaries thus yields semantically relatively complete "sentence-like" units, i.e., the short sentences of this embodiment.
However, detecting sentence endpoints with the above cutting strategy can produce some extreme cases: for example, some extremely short sentences, only one or two audio segments long, may be marked off. Such sentences usually contain only one or two words, or even no valid speech information at all; they must therefore be filtered out and cannot serve as valid sentences for subtitle display.
To improve cutting efficiency, the cutting strategy requires a speech pause to comprise at least a first preset number of audio segments; preferably, the first preset number is, for example, 2 audio segments. By setting a minimum length for speech pauses, shorter sound events, such as a speaker's momentary intake of breath, can be ignored, thereby protecting the integrity of a sentence.
Further, a cut short sentence comprises at least a second preset number of audio segments; preferably, the second preset number can be, for example, 3 audio segments, i.e., speech units with a total length of less than 0.48 seconds are ignored. By limiting the minimum length of a sentence, transient invalid information in the audio, such as a speaker's cough, can be filtered out.
It should be understood that this embodiment does not limit the specific values of the first preset number or the second preset number; in practical applications, they can be adjusted according to the characteristics of the language so as to determine the start time and end time of a sentence unit more accurately and reliably.
Through step 12, the extracted audio is cut into mutually independent sentences, and the start time and end time of each sentence are obtained, from which the playback duration of the sentence can be determined.
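The cutting strategy of step 12 can be sketched as follows, using the preferred values (pauses of at least 2 segments, sentences of at least 3 segments, 0.16 s per segment). The representation of labels and the return value as segment indices are illustrative assumptions.

```python
SEGMENT_SEC = 0.16   # one audio segment, as in step 11
MIN_PAUSE = 2        # first preset number: a pause spans >= 2 segments
MIN_SENTENCE = 3     # second preset number: a sentence spans >= 3 segments

def split_sentences(labels):
    """Cut a list of per-segment labels into short sentences.

    Returns (start, end) segment indices (end exclusive); multiply by
    SEGMENT_SEC for times in seconds. Pauses shorter than MIN_PAUSE are
    bridged, and sentences shorter than MIN_SENTENCE are filtered out."""
    sentences = []
    start, gap = None, 0
    for i, lab in enumerate(labels + ["silence"] * MIN_PAUSE):  # flush tail
        if lab == "speech":
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= MIN_PAUSE:       # a real pause: close the sentence
                end = i - gap + 1      # index just past the last speech segment
                if end - start >= MIN_SENTENCE:
                    sentences.append((start, end))
                start, gap = None, 0
    return sentences
```

For example, a one-segment pause inside a sentence is bridged, while a trailing two-segment run (under 0.48 s) is discarded as invalid.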
Step 13: search for a match item in the plug-in subtitle file according to the determined start time and end time of the short sentence.
In general, a plug-in subtitle file includes information such as the start time and dialogue duration of each item. In this embodiment, the plug-in subtitle file is obtained at playback time, and a <start time, dialogue duration> data structure, datastruct1, is created from it, so that the start time and dialogue duration of each line of dialogue can be looked up conveniently. Then, according to the start time and end time of a short sentence (i.e., a line of dialogue in the video) divided off in step 12, a match item is searched for in datastruct1.
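Building datastruct1 might look like the following sketch. The patent does not fix a subtitle format, so SRT-style timing lines are an assumption here, and only the timing information is kept.

```python
import re

# matches SRT-style timing lines such as "00:00:01,000 --> 00:00:04,000"
TIMING = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*"
                    r"(\d+):(\d+):(\d+)[,.](\d+)")

def _seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def load_datastruct1(subtitle_text):
    """Build the <start time, dialogue duration> list the description
    calls datastruct1. A real implementation would also keep the
    dialogue text and support the other plug-in subtitle formats."""
    items = []
    for m in TIMING.finditer(subtitle_text):
        start = _seconds(*m.groups()[:4])
        end = _seconds(*m.groups()[4:])
        items.append({"start": start, "duration": end - start})
    return items
```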
Specifically, step 13 comprises the following sub-steps.
Sub-step 130: within a preset duration before and after the start time, find corresponding items in the plug-in subtitle file.
Ideally, the start time and end time of each line of dialogue in the audio portion (similar to the short sentences of this embodiment) are synchronized with the start time and end time of the corresponding dialogue in the subtitle file (i.e., the corresponding items of this embodiment). Because of the way subtitle files are produced in the prior art, however, the start times and end times of the corresponding items in the subtitle file deviate from those of the dialogue in the audio portion. This step therefore needs to find corresponding items in the plug-in subtitles within a preset duration (i.e., the largest likely difference between the start time of a corresponding subtitle item and the start time of the audio dialogue). The preset duration in this embodiment can be 1 minute, i.e., corresponding items are searched for in the plug-in subtitles within 1 minute before and after the start time of the short sentence extracted from the video file. It should be understood that the preset duration can be set according to the actual characteristics of the subtitle file; this embodiment does not limit its specific size.
Sub-step 131: among the corresponding items found, find all items whose dialogue duration is within the allowed error range of the dialogue duration of the short sentence.
For example, within 1 minute before and after the start time of the short sentence, datastruct1 is searched for all items whose dialogue duration is within an error of 3 seconds of that of the short sentence. For instance, if the dialogue duration of the short sentence is 4 seconds and, within the 1-minute window, 3 subtitle items with dialogue durations between 2.5 and 5.5 seconds are found, these 3 corresponding items are extracted. It should be understood that the specific values of the allowed error range in this embodiment are given merely for ease of understanding and do not limit the scope of the present invention.
Sub-step 132: judge whether more than one item has been found. If exactly one item has been found, that corresponding item is taken as the match item of the corresponding audio, and step 14 is executed; if more than one item has been found, the closest match must be screened out further, so sub-step 133 is executed.
Sub-step 133: compare the previous record of the determined short sentence with the previous record of each found item, until the most similar one is found as the match item.
An example is given here. As shown in Fig. 2, suppose that in sub-step 131 short sentence P finds 3 subtitle items in datastruct1 (subtitle item A, subtitle item B, and subtitle item C). Then the previous record of short sentence P, short sentence P-1, is matched against the previous subtitle items of A, B, and C, namely subtitle items A-1, B-1, and C-1; the matching algorithm can compare start times, dialogue durations, and so on. If short sentence P-1 still matches 2 or more subtitle items, the previous record of short sentence P-1, short sentence P-2, is matched against the previous records of the remaining subtitle items, and so on, until the subtitle item matching the short sentence is found.
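Sub-steps 130-133 can be sketched together as follows. The window of 1 minute and the 3-second duration tolerance are the example values from the description; comparing durations of the previous records is one concrete choice of the "most similar" criterion, which the description leaves open.

```python
SEARCH_WINDOW = 60.0  # preset duration: +/- 1 minute around the sentence start
DURATION_TOL = 3.0    # allowed error on the dialogue duration, in seconds

def find_match(sentences, idx, items):
    """Locate the subtitle item matching sentences[idx].

    sentences: list of (start, duration) pairs from the audio cutting
    items:     datastruct1-style list of {"start", "duration"} dicts
    Filters by time window, then by dialogue duration; while several
    candidates remain, steps back one record at a time and keeps the
    candidates whose previous items best match the previous sentences."""
    s_start, s_dur = sentences[idx]
    candidates = [i for i, it in enumerate(items)
                  if abs(it["start"] - s_start) <= SEARCH_WINDOW
                  and abs(it["duration"] - s_dur) <= DURATION_TOL]
    back = 0
    while len(candidates) > 1 and idx - back - 1 >= 0:
        back += 1
        prev_dur = sentences[idx - back][1]
        # score each candidate by how well its `back`-th previous item
        # matches the `back`-th previous sentence
        scores = {c: abs(items[c - back]["duration"] - prev_dur)
                  for c in candidates if c - back >= 0}
        if not scores:
            break
        best = min(scores.values())
        candidates = [c for c in scores if scores[c] == best]
    return items[candidates[0]] if candidates else None
```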
Step 14: change the start time of the match item to the presentation time stamp (PTS) of the current video and, according to the presentation time stamp, update the start time of each item in the plug-in subtitle file whose start time is greater than that of the match item.
Specifically, the start time of the match item is first changed to the presentation time stamp (PTS) of the current video. The start time of each item in the plug-in subtitle file whose start time is greater than that of the match item can then be updated by the following formula:
start time 2 = start time 1 − (item.start time − video.pts)
where item.start time is the start time of the current match item, and video.pts is the time of the current video frame, so that (item.start time − video.pts) represents the time difference between the current match item and the audio and video. Start time 1 is the start time of a subtitle item in datastruct1 before correction, and start time 2 is its start time after correction.
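Applying the correction formula might look like the following sketch. Items are shifted in place; using `>=` also moves the match item itself onto the video PTS in the same pass.

```python
def resync(items, match_index, video_pts):
    """Apply start time 2 = start time 1 - (item.start - video.pts).

    The matched item's start time becomes the current video PTS, and
    every item starting at or after it is shifted by the same offset;
    items before the match are left unchanged."""
    match_start = items[match_index]["start"]
    offset = match_start - video_pts      # item.start time - video.pts
    for it in items:
        if it["start"] >= match_start:
            it["start"] -= offset
    return items
```

For example, if the matched subtitle item starts at 22.0 s while the video PTS is 20.0 s, every later item is moved 2.0 s earlier.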
This embodiment can be embedded in playback software. During video playback, the method of this embodiment is executed at the start of playback and at preset intervals thereafter (for example, every 10 minutes): audio data of a certain length is acquired and decoded to obtain pulse code modulation data; this audio data is classified and cut into short sentences; the match item of a short sentence is found in the subtitle file; and the start times of the match item and of all subtitles displayed after it are then updated. Alternatively, the start times of all lines of dialogue in the audio data can be matched, so that the plug-in subtitles are fully synchronized with the audio and video, achieving a better viewing effect.
Compared with the prior art, this embodiment extracts the audio portion of a video file and decodes it to obtain pulse code modulation data, so that the speech information in the audio can be analyzed. The pulse code modulation data is then cut into audio fragments, which can be classified as speech, silence, or non-speech by analysis; the fragments classified as speech can further be divided into short sentences, whose start times and end times are determined together with the presentation time stamp (PTS) of the current video frame. According to the determined start time and end time of a short sentence, a match item is searched for in the plug-in subtitle file, so that the start time of the match item can be changed to the PTS of the current video and, according to the presentation time stamp, the start time of each item in the plug-in subtitle file whose start time is greater than that of the match item can be updated. Through the above steps, this embodiment can automatically correct the display time of subtitle items according to the dialogue, making subtitle display consistent with audio and video playback, so that the plug-in subtitles synchronize automatically with the audio and video, achieving a better viewing effect and improving the user experience.
The division of the steps of the above methods is made merely for clarity of description. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is preserved, such variations fall within the scope of protection of this patent. Adding insignificant modifications to the algorithm or flow, or introducing insignificant designs, without changing the core design of the algorithm and flow, also falls within the scope of protection of this patent.
The second embodiment of the present invention relates to a device for automatic synchronization of plug-in subtitles. As shown in Fig. 3, it comprises an extraction module, a cutting module, a division module, a search module, and an update module.
The extraction module is configured to extract the audio portion of a video file and decode the audio portion to obtain pulse code modulation data.
The cutting module is configured to cut the pulse code modulation data into audio fragments and classify the audio fragments, wherein the classification categories comprise silence, speech, and non-speech.
The division module is configured to divide the audio fragments classified as speech into short sentences and determine the start time and end time of each short sentence. Specifically, the division module performs short sentence division according to speech pauses, wherein a speech pause comprises at least a first preset number of audio segments, and the audio fragments are divided into short sentences comprising at least a second preset number of audio segments. Here, the first preset number of audio segments defines the minimum length of a speech pause, and the second preset number of audio segments defines the minimum length of a short sentence. It should be understood that the first preset number and the second preset number are set according to the characteristics of the audio data and the subtitle file; this embodiment does not limit their specific values.
The search module further comprises a start matching sub-module, a dialogue matching sub-module, and a comparison matching sub-module. The start matching sub-module is configured to find corresponding items in the plug-in subtitle file within a preset duration before and after the start time, so as to search for a match item in the plug-in subtitle file according to the determined start time and end time of the short sentence. The dialogue matching sub-module is configured to find, among the corresponding items found by the start matching sub-module, all items whose dialogue duration is within the allowed error range of the dialogue duration of the short sentence. The comparison matching sub-module is configured to, when the dialogue matching sub-module finds more than one item, compare the previous record of the determined short sentence with the previous record of each found item until the most similar one is found as the match item.
The update module is configured to change the start time of the match item to the presentation time stamp (PTS) of the current video and, according to the presentation time stamp, update the start time of each item in the plug-in subtitle file whose start time is greater than that of the match item.
Compared with the prior art, this embodiment extracts the audio data from a video file, classifies the audio data, and cuts it into sentences, thereby obtaining accurate sentence start times and end times; it then finds the corresponding match items in the subtitle file and modifies the start times of the match items accordingly, so that the subtitle file is synchronized with the audio and video. This embodiment therefore enables the plug-in subtitles to synchronize automatically with the audio and video without manual adjustment by the user, achieving a better viewing effect and improving the user experience.
It can be seen that this embodiment is the device embodiment corresponding to the first embodiment, and can be implemented in cooperation with the first embodiment. The relevant technical details mentioned in the first embodiment remain valid in this embodiment and, to reduce repetition, are not described again here. Correspondingly, the relevant technical details mentioned in this embodiment also apply to the first embodiment.
It is worth mentioning that each module involved in this embodiment is a logic module. In practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, to highlight the innovative parts of the present invention, units not closely related to solving the technical problem proposed by the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
Those skilled in the art will understand that the above embodiments are specific embodiments for realizing the present invention, and that in practical applications various changes in form and detail may be made to them without departing from the spirit and scope of the present invention.

Claims (11)

1. An automatic synchronization method for plug-in subtitles, characterized in that it comprises the following steps:
extracting the audio portion of a video file and decoding the audio portion to obtain pulse code modulation (PCM) data;
cutting the pulse code modulation data into audio segments and classifying the audio segments, wherein the classification categories comprise: silence, speech, and non-speech;
dividing the audio segments classified as speech into short sentences, and determining the start time and end time of each short sentence;
searching a plug-in subtitle file for a matching entry according to the determined start time and end time of the short sentence;
changing the start time of the matching entry to the presentation time stamp (PTS) of the current video, and, according to the presentation time stamp, updating the start time of every entry in the plug-in subtitle file whose start time is later than the start time of the matching entry.
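The final step of claim 1 can be illustrated with a short sketch. The entry format (dicts with start/end/text in seconds) and the function name are assumptions for illustration; applying the same offset to every later entry is one plausible reading of "updating according to the presentation time stamp", not necessarily the patent's exact rule.

```python
def resync(entries, match_index, current_pts):
    """Shift a subtitle track so the matched entry starts at current_pts.

    entries: list of dicts with 'start', 'end', 'text' (times in seconds).
    The matched entry and every entry starting at or after it are shifted
    by the same offset; earlier entries are left untouched.
    """
    old_match_start = entries[match_index]["start"]
    offset = current_pts - old_match_start
    resynced = []
    for e in entries:
        if e["start"] >= old_match_start:
            resynced.append({"start": e["start"] + offset,
                             "end": e["end"] + offset,
                             "text": e["text"]})
        else:
            resynced.append(dict(e))
    return resynced
```

A usage example: if the third entry of a drifting track is heard at PTS 7.0 s but its file time is 5.0 s, `resync(subs, 2, 7.0)` moves it and everything after it forward by 2.0 s.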
2. The automatic synchronization method for plug-in subtitles according to claim 1, characterized in that the step of searching a plug-in subtitle file for a matching entry according to the determined start time and end time of the short sentence comprises the following sub-steps:
finding candidate entries in the plug-in subtitle file within a preset duration before and after the start time;
among the candidate entries found, selecting all entries whose dialogue duration is within an allowed error range of the duration of the short sentence;
if more than one entry is selected, comparing the previous record of the determined short sentence with the previous record of each selected entry until the most similar entry is found and taken as the matching entry.
3. The automatic synchronization method for plug-in subtitles according to claim 1 or 2, characterized in that in the step of dividing the audio segments into short sentences, the division is performed according to speech pauses;
wherein a speech pause comprises at least a first preset number of audio sections.
4. The automatic synchronization method for plug-in subtitles according to claim 3, characterized in that the first preset number is 2.
5. The automatic synchronization method for plug-in subtitles according to claim 3, characterized in that a short sentence comprises at least a second preset number of audio sections.
6. The automatic synchronization method for plug-in subtitles according to claim 5, characterized in that the second preset number is 3.
7. The automatic synchronization method for plug-in subtitles according to claim 1, characterized in that
in the step of determining the start time and end time of the short sentence, the time point of the silent or non-speech section immediately preceding a continuous speech section is taken as the start time of the sentence, and the time point of the last speech section that ends the continuous speech section is taken as the end time of the sentence.
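The boundary rule of claim 7, together with the minimum pause and sentence lengths of claims 3 to 6, can be sketched as follows. The label list and fixed segment duration come from the earlier classification step; the exact time-point convention (start of the preceding segment, end of the last speech segment) is an assumption for the example.

```python
def find_sentences(labels, seg_dur, min_pause=2, min_speech=3):
    """Return (start_time, end_time) pairs from a list of segment labels.

    A run of 'speech' segments ends once min_pause consecutive non-speech
    segments follow it (claim 4: pause >= 2 segments); runs shorter than
    min_speech segments are discarded (claim 6: sentence >= 3 segments).
    """
    sentences = []
    run_start = None  # index of the first speech segment of the current run
    pause = 0         # consecutive silence/non-speech segments seen so far
    # Pad with silence so a run at the very end is still closed off.
    for i, label in enumerate(labels + ["silence"] * min_pause):
        if label == "speech":
            if run_start is None:
                run_start = i
            pause = 0
        else:
            pause += 1
            if run_start is not None and pause >= min_pause:
                run_end = i - pause  # index of the run's last speech segment
                if run_end - run_start + 1 >= min_speech:
                    # Claim 7: start at the segment just before the run,
                    # end at the end of the run's last speech segment.
                    sentences.append((max(run_start - 1, 0) * seg_dur,
                                      (run_end + 1) * seg_dur))
                run_start = None
    return sentences
```

For example, with 1-second segments, `["silence", "speech", "speech", "speech", "silence", "silence"]` yields one sentence spanning 0 to 4 seconds, while a two-segment speech run is rejected as too short.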
8. An automatic synchronization apparatus for plug-in subtitles, characterized in that it comprises: an extraction module, a cutting module, a division module, a search module, and an update module;
the extraction module is configured to extract the audio portion of a video file and decode the audio portion to obtain pulse code modulation data;
the cutting module is configured to cut the pulse code modulation data into audio segments and classify the audio segments, wherein the classification categories comprise: silence, speech, and non-speech;
the division module is configured to divide the audio segments classified as speech into short sentences and to determine the start time and end time of each short sentence;
the search module is configured to search a plug-in subtitle file for a matching entry according to the determined start time and end time of the short sentence;
the update module is configured to change the start time of the matching entry to the presentation time stamp (PTS) of the current video and, according to the presentation time stamp, to update the start time of every entry in the plug-in subtitle file whose start time is later than the start time of the matching entry.
9. The automatic synchronization apparatus for plug-in subtitles according to claim 8, characterized in that the search module comprises: an initial matching sub-module, a dialogue matching sub-module, and a comparison matching sub-module;
the initial matching sub-module is configured to find candidate entries in the plug-in subtitle file within a preset duration before and after the start time;
the dialogue matching sub-module is configured to select, among the candidate entries found by the initial matching sub-module, all entries whose dialogue duration is within an allowed error range of the duration of the short sentence;
the comparison matching sub-module is configured to, when more than one entry is selected by the dialogue matching sub-module, compare the previous record of the determined short sentence with the previous record of each selected entry until the most similar entry is found and taken as the matching entry.
10. The automatic synchronization apparatus for plug-in subtitles according to claim 8 or 9, characterized in that the division module is further configured to perform the division according to speech pauses;
wherein a speech pause comprises at least a first preset number of audio sections.
11. The automatic synchronization apparatus for plug-in subtitles according to claim 10, characterized in that the division module is further configured to divide the audio segments into short sentences each comprising at least a second preset number of audio sections.
CN201511018280.XA 2015-12-30 2015-12-30 Plug-in subtitle automatic synchronization method and device Pending CN105898556A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511018280.XA CN105898556A (en) 2015-12-30 2015-12-30 Plug-in subtitle automatic synchronization method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511018280.XA CN105898556A (en) 2015-12-30 2015-12-30 Plug-in subtitle automatic synchronization method and device

Publications (1)

Publication Number Publication Date
CN105898556A true CN105898556A (en) 2016-08-24

Family

ID=57002208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511018280.XA Pending CN105898556A (en) 2015-12-30 2015-12-30 Plug-in subtitle automatic synchronization method and device

Country Status (1)

Country Link
CN (1) CN105898556A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021854A (en) * 2006-10-11 2007-08-22 鲍东山 Audio analysis system based on content
US20090213924A1 (en) * 2008-02-22 2009-08-27 Sheng-Nan Sun Method and Related Device for Converting Transport Stream into File
CN103647909A (en) * 2013-12-16 2014-03-19 宇龙计算机通信科技(深圳)有限公司 Caption adjusting method and caption adjusting device

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106504773A (en) * 2016-11-08 2017-03-15 上海贝生医疗设备有限公司 Wearable device and voice and activity monitoring system
CN109413475A (en) * 2017-05-09 2019-03-01 北京嘀嘀无限科技发展有限公司 Method, device and server for adjusting subtitles in a video
CN109005444A (en) * 2017-06-07 2018-12-14 纳宝株式会社 Content providing server, content providing terminal and content providing method
CN107562737A (en) * 2017-09-05 2018-01-09 语联网(武汉)信息技术有限公司 Video segmentation method and system for translation
CN107402530A (en) * 2017-09-20 2017-11-28 淮安市维达科技有限公司 Computer control system using line captions as the core for coordinated linkage of stage equipment
CN108305636A (en) * 2017-11-06 2018-07-20 腾讯科技(深圳)有限公司 Audio file processing method and device
WO2019086044A1 (en) * 2017-11-06 2019-05-09 腾讯科技(深圳)有限公司 Audio file processing method, electronic device and storage medium
US11538456B2 (en) 2017-11-06 2022-12-27 Tencent Technology (Shenzhen) Company Limited Audio file processing method, electronic device, and storage medium
CN108924664B (en) * 2018-07-26 2021-06-08 海信视像科技股份有限公司 Synchronous display method and terminal for program subtitles
CN108924664A (en) * 2018-07-26 2018-11-30 青岛海信电器股份有限公司 Synchronous display method and terminal for program subtitles
CN110781649A (en) * 2019-10-30 2020-02-11 中央电视台 Subtitle editing method and device, computer storage medium and electronic equipment
CN110781649B (en) * 2019-10-30 2023-09-15 中央电视台 Subtitle editing method and device, computer storage medium and electronic equipment
CN111050201B (en) * 2019-12-10 2022-06-14 Oppo广东移动通信有限公司 Data processing method and device, electronic equipment and storage medium
CN111050201A (en) * 2019-12-10 2020-04-21 Oppo广东移动通信有限公司 Data processing method and device, electronic equipment and storage medium
WO2023015416A1 (en) * 2021-08-09 2023-02-16 深圳Tcl新技术有限公司 Subtitle processing method and apparatus, and storage medium
CN113992940A (en) * 2021-12-27 2022-01-28 北京美摄网络科技有限公司 Web end character video editing method, system, electronic equipment and storage medium
CN113992940B (en) * 2021-12-27 2022-03-29 北京美摄网络科技有限公司 Web end character video editing method, system, electronic equipment and storage medium
CN114640874A (en) * 2022-03-09 2022-06-17 湖南国科微电子股份有限公司 Subtitle synchronization method and device, set top box and computer readable storage medium
WO2023169240A1 (en) * 2022-03-09 2023-09-14 湖南国科微电子股份有限公司 Subtitle synchronization method and apparatus, set-top box and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN105898556A (en) Plug-in subtitle automatic synchronization method and device
CN108780643B (en) Automatic dubbing method and device
US8281231B2 (en) Timeline alignment for closed-caption text using speech recognition transcripts
US8179475B2 (en) Apparatus and method for synchronizing a secondary audio track to the audio track of a video source
US20080219641A1 (en) Apparatus and method for synchronizing a secondary audio track to the audio track of a video source
CN106792145A (en) Method and apparatus for automatically superimposing subtitles on audio and video
US8958013B2 (en) Aligning video clips to closed caption files
US9609397B1 (en) Automatic synchronization of subtitles based on audio fingerprinting
US8564721B1 (en) Timeline alignment and coordination for closed-caption text using speech recognition transcripts
US20200126559A1 (en) Creating multi-media from transcript-aligned media recordings
CN105635782A (en) Subtitle output method and device
KR20150057591A (en) Method and apparatus for controlling playing video
CN106162293B (en) Method and device for synchronizing video sound and image
US11064245B1 (en) Piecewise hybrid video and audio synchronization
US20210151082A1 (en) Systems and methods for mixing synthetic voice with original audio tracks
US10692497B1 (en) Synchronized captioning system and methods for synchronizing captioning with scripted live performances
KR102308651B1 (en) Media environment-oriented content distribution platform
WO2017062961A1 (en) Methods and systems for interactive multimedia creation
Federico et al. An automatic caption alignment mechanism for off-the-shelf speech recognition technologies
US9905221B2 (en) Automatic generation of a database for speech recognition from video captions
CN109963092B (en) Subtitle processing method and device and terminal
EP3839953A1 (en) Automatic caption synchronization and positioning
CN106162323A (en) Video data processing method and device
CN112714348A (en) Intelligent audio and video synchronization method
CN103152607B (en) Ultra-fast rough-cut method for video

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20160824)