CN107181986A - Video and subtitle matching method and device - Google Patents

Video and subtitle matching method and device

Info

Publication number
CN107181986A
CN107181986A (application CN201610139767.1A)
Authority
CN
China
Prior art keywords
video
information
vector
subtitles
video segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610139767.1A
Other languages
Chinese (zh)
Inventor
刘青
谢涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201610139767.1A priority Critical patent/CN107181986A/en
Publication of CN107181986A publication Critical patent/CN107181986A/en
Pending legal-status Critical Current


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)
  • Studio Circuits (AREA)

Abstract

The invention discloses a video and subtitle matching method and device. The method includes: obtaining a video segment to be matched and one or more candidate subtitle files; extracting the associated speech time information of each speech segment from the obtained video segment, and extracting subtitle time information from the one or more candidate subtitle files; generating, based on the same rule, a video feature vector for the video segment from the extracted associated speech time information and a subtitle feature vector for each of the one or more subtitle files from the extracted subtitle time information; and determining, based on the generated video feature vector and subtitle feature vectors, the subtitle file that matches the video segment. By matching subtitle feature vectors against the video feature vector, the technical scheme of the invention determines the subtitle file that matches a video segment, removes the confusion caused to users by mismatched subtitles, and fundamentally guarantees that the video segment is paired with the correct subtitle file.

Description

Video and subtitle matching method and device
Technical field
Embodiments of the present invention relate to the field of multimedia technology, and in particular to a video and subtitle matching method and device.
Background art
With the continuing development of Internet and multimedia technology, video, as an expressive and appealing carrier of information, has become popular with a large number of users. To present video content better, subtitles corresponding to the video are usually displayed while the user watches it, which helps the user understand the video content and improves the viewing experience.
In the prior art, videos and subtitles are matched by filename: the filename of the target video is extracted and compared with the filename of each subtitle file in a set of subtitle files, the subtitle file whose filename matches best is found, and that subtitle file is played with the video. However, if a subtitle file is named inaccurately or incorrectly, the selection of the subtitle file is directly affected and the accuracy of the selected subtitle file becomes very unreliable. Moreover, subtitle filenames are very easy to modify; if an arbitrary subtitle file is renamed to something related to the video, a wrong match may result, which confuses the user when watching the video.
Summary of the invention
The present invention provides a video and subtitle matching method and device, to solve the problem that videos are easily matched with the wrong subtitles and to achieve accurate matching between videos and subtitles.
In a first aspect, an embodiment of the invention provides a video and subtitle matching method, which includes:
obtaining a video segment to be matched and one or more candidate subtitle files;
extracting the associated speech time information of each speech segment from the obtained video segment, and extracting subtitle time information from the one or more candidate subtitle files;
generating, based on the same rule, a video feature vector for the video segment from the extracted associated speech time information and a subtitle feature vector for each of the one or more subtitle files from the extracted subtitle time information; and
determining, based on the generated video feature vector and subtitle feature vectors, the subtitle file that matches the video segment.
In a second aspect, an embodiment of the invention further provides a video and subtitle matching device, which includes:
an acquiring unit, configured to obtain a video segment to be matched and one or more candidate subtitle files;
a video feature extraction unit, configured to extract the associated speech time information of each speech segment from the obtained video segment;
a subtitle feature extraction unit, configured to extract subtitle time information from the one or more candidate subtitle files;
a feature vector generation unit, configured to generate, based on the same rule, a video feature vector for the video segment from the extracted associated speech time information and a subtitle feature vector for each of the one or more subtitle files from the extracted subtitle time information; and
a determining unit, configured to determine, based on the generated video feature vector and subtitle feature vectors, the subtitle file that matches the video segment.
With the technical scheme adopted by the present invention, the associated speech time information of each speech segment is extracted from the obtained video segment and used to generate a video feature vector for the video segment; subtitle time information is extracted from the one or more obtained subtitle files and used to generate a subtitle feature vector for each of them; the subtitle file that matches the video segment is then determined from the video feature vector and the subtitle feature vectors. This removes the confusion caused to users by mismatched subtitles and fundamentally guarantees that the correct subtitles are matched to the video.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read in conjunction with the accompanying drawings:
Fig. 1 is a flowchart of the video and subtitle matching method provided by Embodiment one of the present invention;
Fig. 2 is a flowchart of the video and subtitle matching method provided by Embodiment two of the present invention;
Fig. 3 is a flowchart of the video and subtitle matching method provided by Embodiment three of the present invention;
Fig. 4 is a flowchart of the video and subtitle matching method provided by Embodiment four of the present invention;
Fig. 5 is a structural diagram of the video and subtitle matching device provided by Embodiment five of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than all of the content. Before the exemplary embodiments are discussed in more detail, it should be mentioned that some of them are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations (or steps) as sequential processing, many of the operations may be performed in parallel, concurrently or simultaneously, and the order of the operations may be rearranged. The processing may be terminated when its operations are completed, and it may also include additional steps not shown in the drawings. The processing may correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on.
Embodiment one
Fig. 1 is a flowchart of the video and subtitle matching method provided by Embodiment one of the present invention. The method of this embodiment may be performed by a video and subtitle matching device, which may be implemented in hardware and/or software, is typically integrated in a client that needs to obtain matching subtitle files, and is used in cooperation with a server that provides subtitle files and/or video segments.
The method of this embodiment specifically includes:
S110: obtain a video segment to be matched and one or more candidate subtitle files.
In this operation, the video segment may be a complete film, cartoon, variety show or training course, any fragment of such a video, or a fragment produced by editing and splicing several fragments of a video together. The speech information in the video segment is usually a human voice, or a sound that can be recognized as speech after processing, including dubbed or synthesized voices in cartoons; it may be speech, singing, and so on.
In general, the video segment to be matched is the target video segment the user chooses to play, and the subtitle file matching the target video segment usually has to be looked up in a local or online subtitle file library, which often contains one or more candidate subtitle files.
Considering the differences in the length and content of video segments, it should be understood that one or more subtitle files may match the video segment to be matched. For example, a long or spliced video segment may correspond to several candidate subtitle files.
S120: extract the associated speech time information of each speech segment from the obtained video segment, and extract subtitle time information from the one or more candidate subtitle files.
In this operation, the associated speech time information of each speech segment may be the time node information or time period information associated with that speech segment; specifically, it may include the start time node and end time node of each speech segment, the time interval between adjacent speech segments, the duration of each speech segment, and so on. Likewise, the subtitle time information may be the time node information or time period information associated with the subtitle content in the subtitle file.
Compared with recognizing the speech content of the video segment through speech recognition, obtaining the associated speech time information of each speech segment in the video segment is relatively simple, and a subtitle file usually contains only subtitle content and the corresponding subtitle time information. It is therefore preferable to extract the associated speech time information of each speech segment from the obtained video segment and the subtitle time information from the one or more candidate subtitle files, and to use them to characterize the features of the video segment and the subtitle files.
S130: based on the same rule, generate a video feature vector for the video segment from the extracted associated speech time information and a subtitle feature vector for each of the one or more subtitle files from the extracted subtitle time information.
In this operation, the video feature vector of the video segment may be generated from all of the extracted associated speech time information, so that the feature information of the video segment is characterized in more detail and more fully; alternatively, it may be generated from part of the extracted associated speech time information, which reduces the dimension of the video feature vector and allows the subtitle file matching the video segment to be determined more quickly while accuracy is preserved. It should be understood that the subtitle feature vectors of the one or more subtitle files may, based on the same rule, be generated from all or part of the extracted subtitle time information. A subtitle feature vector generated under the same rule generally has the same number of elements, and thus the same dimension, as the video feature vector.
Generating the video feature vector of the video segment from the extracted associated speech time information and the subtitle feature vectors of the one or more subtitle files from the extracted subtitle time information, both based on the same rule, has the advantage of fundamentally guaranteeing the accuracy of matching the video feature vector against the subtitle feature vectors.
S140: based on the generated video feature vector and subtitle feature vectors, determine the subtitle file that matches the video segment.
In this operation, each subtitle feature vector may be compared with the video feature vector, and the subtitle file matching the video segment may be determined from the comparison result. In a preferred implementation of this embodiment, determining the subtitle file that matches the video segment based on the generated video feature vector and subtitle feature vectors may specifically include: calculating the spatial similarity between the generated video feature vector and each subtitle feature vector; and determining the target subtitle file corresponding to the video segment according to the calculated spatial similarities.
Here, the similarity in vector space can express how similar or how closely related the subtitle file and the video segment are in their temporal characteristics. For example, the spatial similarity may be judged from quantities such as the cosine distance, the Euclidean distance or the Pearson correlation coefficient between a subtitle feature vector and the video feature vector. Preferably, the target subtitle file corresponding to the video segment is determined from the subtitle feature vector with the highest spatial similarity to the video feature vector.
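As an illustration only, the following minimal sketch (not taken from the patent) shows one way the cosine-similarity variant of this comparison could be implemented; the vector contents and the helper name match_subtitle are assumptions of the example, not part of the disclosed method.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity between two equal-length numeric feature vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def match_subtitle(video_vector, subtitle_vectors):
    """Return the index of the subtitle feature vector most similar to the video feature vector."""
    scores = [cosine_similarity(video_vector, sv) for sv in subtitle_vectors]
    return max(range(len(scores)), key=scores.__getitem__)
```

The Euclidean-distance or Pearson-correlation variants mentioned above would only change the scoring function; the selection of the highest-scoring subtitle file stays the same.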
With the technical scheme adopted by the present invention, the associated speech time information of each speech segment is extracted from the obtained video segment and used to generate a video feature vector for the video segment; subtitle time information is extracted from the one or more obtained subtitle files and used to generate a subtitle feature vector for each of them; the subtitle file that matches the video segment is then determined from the video feature vector and the subtitle feature vectors. This removes the confusion caused to users by mismatched subtitles and fundamentally guarantees that the correct subtitles are matched to the video.
Embodiment two
Fig. 2 is a flowchart of the video and subtitle matching method provided by Embodiment two of the present invention. This embodiment is further optimized on the basis of Embodiment one above. In this embodiment, extracting the associated speech time information of each speech segment from the obtained video segment is refined as:
extracting audio data from the obtained video segment; performing spectrum analysis on the extracted audio data, and taking the audio data that conforms to speech spectrum characteristics as speech data; and obtaining, based on the resulting speech data, each speech segment and the corresponding associated speech time information.
Accordingly, the method for the present embodiment is specifically included:
S110, acquisition video segment to be matched and one or more subtitle files to be matched.
S220, from acquired video segment extract voice data.
In a video segment, the video content is usually presented through scenes, while the subtitle file that helps viewers understand the video generally has no content corresponding one-to-one to what those scenes show; instead, a subtitle file generally corresponds to the audio data of the video segment. Audio data can therefore be extracted from the obtained video segment and used to characterize the video segment, so that the subtitle file can be matched to the video segment.
S230: perform spectrum analysis on the extracted audio data, and take the audio data that conforms to speech spectrum characteristics as speech data.
To make a video segment more expressive, it usually also contains a mix of other audio, such as traffic noise, background music that sets the mood, or natural sounds such as wind and rain, whereas a subtitle file generally contains the characters' dialogue or lyrics. The speech data in the audio data can therefore be further extracted, so that the video segment is better matched with the subtitle file.
In general, extracting the speech data from the audio data requires analyzing the extracted audio data; time-domain and/or frequency-domain analysis may be used to obtain the acoustic characteristics contained in the audio data. Frequency-domain analysis represents the audio data with frequency as the coordinate axis; the analysis process is more concise and problems can be examined more deeply and conveniently. In this embodiment, therefore, spectrum analysis is preferably performed on the extracted audio data. Specifically, the extracted audio data preferably includes frequency distribution information, so that spectrum analysis can be performed on it to obtain the speech data. Spectrum analysis yields the frequency components and frequency distribution range of the audio data, and further the amplitude distribution and energy distribution of each frequency component, as well as the frequency values where the amplitude and energy are mainly concentrated. According to the result of the spectrum analysis, the audio data that conforms to speech spectrum characteristics can be taken as speech data, where the speech spectrum characteristics include the frequency components of speech, the frequency distribution range of speech, and so on.
In the above operation, taking the audio data that conforms to speech spectrum characteristics as speech data may specifically be: clustering the audio data by spectral characteristics according to the spectrum analysis result, and taking the audio data that conforms to speech spectrum characteristics as speech data. For example, the audio data may be clustered by spectral characteristics according to the differences in the frequency distribution of different audio types, such as differences in amplitude distribution and energy distribution; the clusters are then distinguished, and the audio data in the clustering result that conforms to speech spectrum characteristics is taken as speech data.
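A minimal sketch of screening audio by spectral characteristics is shown below. It assumes mono PCM samples in a NumPy array and uses a simple band-energy criterion (energy concentrated roughly between 300 Hz and 3400 Hz) as a stand-in for the speech spectrum characteristics described above; the frame length, band edges and threshold are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def speech_frames(samples, sample_rate, frame_len=0.03, band=(300.0, 3400.0), ratio=0.6):
    """Return one boolean per frame: True if most spectral energy lies in the speech band."""
    frame_size = int(frame_len * sample_rate)
    flags = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        spectrum = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum of the frame
        freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
        total = spectrum.sum() + 1e-12
        in_band = spectrum[(freqs >= band[0]) & (freqs <= band[1])].sum()
        flags.append(in_band / total >= ratio)                # speech-like if band energy dominates
    return flags
```

A clustering-based variant, as described above, would replace the fixed threshold with a clustering of per-frame spectral features and keep the cluster whose centroid matches the speech band.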
S240: based on the resulting speech data, obtain each speech segment and the corresponding associated speech time information.
After the speech data has been obtained, each speech segment and the associated speech time information corresponding to it can be obtained from the speech data more precisely than from the raw audio data. Since the subtitle content in a subtitle file generally corresponds to the speech information in the video segment, such as the characters' dialogue or a voice-over, the associated speech time information extracted for the speech data serves as the basis for generating the video feature vector of the video segment, so that matching against the subtitle files is more accurate.
S250: extract subtitle time information from the one or more candidate subtitle files.
In this embodiment, an existing subtitle time extraction method may be used to extract the subtitle time information from the one or more candidate subtitle files according to the characteristics of the time data.
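For concreteness, the following sketch shows how subtitle time information could be pulled from a subtitle file in the common SRT format. The patent does not mandate any particular subtitle format, so the regular expression and the seconds-based representation are assumptions of this example.

```python
import re

TIME_LINE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2})[,.](\d{3})")

def subtitle_times(path):
    """Return a list of (start_seconds, end_seconds) for each subtitle segment in an SRT file."""
    segments = []
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            m = TIME_LINE.search(line)
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000.0
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000.0
                segments.append((start, end))
    return segments
```

The (start, end) pairs produced here are the subtitle-side counterpart of the speech segments obtained in S240 and feed the feature-vector generation in S130.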
S130: based on the same rule, generate a video feature vector for the video segment from the extracted associated speech time information and a subtitle feature vector for each of the one or more subtitle files from the extracted subtitle time information.
S140: based on the generated video feature vector and subtitle feature vectors, determine the subtitle file that matches the video segment.
In the technical scheme provided by this embodiment, the audio data extracted from the video segment that conforms to speech spectrum characteristics is taken as speech data, and each speech segment and its corresponding associated speech time information are obtained from this speech data. This characterizes the features of the video segment more accurately, so that analyzing the video feature vector and subtitle feature vectors generated from the extracted associated speech time information and subtitle time information determines the subtitle file matching the video segment more accurately.
Embodiment three
Fig. 3 is a flowchart of the video and subtitle matching method provided by Embodiment three of the present invention. This embodiment is optimized on the basis of Embodiment two above. In this embodiment, the associated speech time information is refined to be the time interval information between adjacent speech segments, the subtitle time information is refined to be the time interval information between adjacent subtitle segments, and generating, based on the same rule, the video feature vector of the video segment and the subtitle feature vectors of the one or more subtitle files from the extracted associated speech time information and subtitle time information is refined as:
based on the same rule, generating the video feature vector of the video segment from the time interval information between adjacent speech segments, and generating the subtitle feature vectors of the one or more subtitle files from the time interval information between adjacent subtitle segments.
Accordingly, the method for the present embodiment is specifically included:
S110, acquisition video segment to be matched and one or more subtitle files to be matched.
S220, from acquired video segment extract voice data.
S230, the voice data progress spectrum analysis to being extracted, will meet the audio of voice spectrum characteristic Data are used as speech data.
S340: based on the resulting speech data, obtain each speech segment and the corresponding time interval information between adjacent speech segments.
Specifically, each speech segment may be obtained from the resulting speech data, and the time interval information between adjacent speech segments may be calculated from the start time node information and end time node information of each speech segment. For example, the time interval between the current speech segment and the previous adjacent speech segment may be obtained by subtracting the end time node value of the previous adjacent speech segment from the start time node value of the current speech segment.
S350: extract the time interval information between adjacent subtitle segments from the one or more candidate subtitle files.
Similarly, the time interval information between adjacent subtitle segments may be calculated from the start time node information and end time node information of each subtitle segment, and the subtitle feature vectors of the one or more subtitle files are generated from the time interval information between adjacent subtitle segments.
S360: based on the same rule, generate the video feature vector of the video segment from the time interval information between adjacent speech segments, and generate the subtitle feature vectors of the one or more subtitle files from the time interval information between adjacent subtitle segments.
In this operation, generating the video feature vector of the video segment from the time interval information between adjacent speech segments may be: obtaining the time intervals between the adjacent speech segments extracted from the video segment, and using those intervals as the elements of the video feature vector of the video segment. Similarly, the time intervals between the adjacent subtitle segments extracted from each subtitle file may be used as the elements of that subtitle file's subtitle feature vector, generating the subtitle feature vectors of the one or more subtitle files.
Further, in order to reduce the dimension of the video feature vector and achieve fast matching between the video segment and the subtitle files, the same selection rule may be applied to both sides: a subset of the obtained time intervals between adjacent speech segments is selected and used to generate the video feature vector of the video segment, and a subset of the time intervals between adjacent subtitle segments is selected in the same way and used to generate the subtitle feature vectors of the one or more subtitle files. For example, a set number of the obtained time intervals between adjacent speech segments may be selected as the elements of the video feature vector, and, based on the same selection rule, the same set number of time intervals between adjacent subtitle segments may be selected as the elements of each subtitle feature vector.
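A minimal sketch of this interval-based construction is given below. It assumes the speech segments and subtitle segments are available as (start, end) pairs in seconds, that keeping the first dim gaps is the shared selection rule, and that zero-padding short files is acceptable; all three are assumptions of the example rather than choices fixed by the patent.

```python
def gap_vector(segments, dim):
    """Feature vector of time gaps between adjacent segments, cut/padded to a fixed dimension.

    segments: list of (start, end) pairs sorted by start time.
    """
    gaps = [segments[i][0] - segments[i - 1][1] for i in range(1, len(segments))]
    gaps = gaps[:dim]                        # same selection rule on both sides
    return gaps + [0.0] * (dim - len(gaps))  # pad so vectors stay comparable in length
```

The same function would be applied both to the speech segments of the video and to the subtitle segments of every candidate file, so the video feature vector and the subtitle feature vectors are built under the same rule and can be compared directly.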
S140: based on the generated video feature vector and subtitle feature vectors, determine the subtitle file that matches the video segment.
In the technical scheme provided by this embodiment, the video feature vector is generated from the time intervals between adjacent speech segments, the subtitle feature vectors are generated from the time intervals between adjacent subtitle segments, and the subtitle file matching the video segment is then determined from the video feature vector and the subtitle feature vectors. This effectively reduces the dimension of the video feature vector and the subtitle feature vectors, and also effectively solves the problem of inaccurate matching caused by an overall time offset between a subtitle file and the video segment, greatly improving the matching efficiency and accuracy between the video segment and the subtitle files.
It should be clear that, in a preferred example of this embodiment, steps S220 and S230 in Fig. 3 may also be replaced by the following step: extract audio data from the obtained video segment and use it as the speech data; that is, in this preferred example the spectrum analysis may be omitted.
Embodiment four
Fig. 4 is a flowchart of the video and subtitle matching method provided by Embodiment four of the present invention. This embodiment is optimized on the basis of Embodiment two above. In this embodiment, the associated speech time information is refined to be the duration information of each speech segment, the subtitle time information is refined to be the duration information of each subtitle segment, and generating, based on the same rule, the video feature vector of the video segment and the subtitle feature vectors of the one or more subtitle files from the extracted associated speech time information and subtitle time information is refined as:
based on the same rule, generating the video feature vector of the video segment from the duration information of each speech segment, and generating the subtitle feature vectors of the one or more subtitle files from the duration information of each subtitle segment.
Accordingly, the method for the present embodiment is specifically included:
S110, acquisition video segment to be matched and one or more subtitle files to be matched.
S220, from acquired video segment extract voice data.
S230, the voice data progress spectrum analysis to being extracted, will meet the audio of voice spectrum characteristic Data are used as speech data.
S440: based on the resulting speech data, obtain each speech segment and the duration information corresponding to each speech segment.
Specifically, the duration information of each speech segment may be extracted from the obtained video segment based on the resulting speech data; the duration of each speech segment may be calculated from its time node information, for example by subtracting the start time node value of the current speech segment from its end time node value.
S450: extract the duration information of each subtitle segment from the one or more candidate subtitle files.
Similarly, extracting the duration information of each subtitle segment from the one or more candidate subtitle files may be calculating the duration of each subtitle segment from its start time node information and end time node information.
S460: based on the same rule, generate the video feature vector of the video segment from the duration information of each speech segment, and generate the subtitle feature vectors of the one or more subtitle files from the duration information of each subtitle segment.
In this operation, generating the video feature vector of the video segment from the duration information of each speech segment may be: obtaining the duration of each speech segment extracted from the video segment, and using all of the obtained speech segment durations as the elements of the video feature vector of the video segment. Similarly, generating the subtitle feature vectors of the one or more subtitle files from the duration information of each subtitle segment may be: obtaining the duration of each extracted subtitle segment, and using all of the subtitle segment durations as the elements of that subtitle file's subtitle feature vector.
Further, in order to reduce the dimension of the video feature vector and achieve fast matching between the video segment and the subtitle files, the same selection rule may again be applied to both sides: a subset of the obtained speech segment durations is selected and used to generate the video feature vector of the video segment, and a subset of the obtained subtitle segment durations is selected in the same way and used to generate the subtitle feature vectors of the one or more subtitle files. For example, a set number of the obtained speech segment durations may be selected as the elements of the video feature vector, and, based on the same selection rule, the same set number of subtitle segment durations may be selected as the elements of each subtitle feature vector.
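Under the same assumptions as the interval sketch in Embodiment three (segments as (start, end) pairs in seconds, the first dim values kept, zero-padding of short files), the duration-based variant of this embodiment could look like the following; none of these choices is prescribed by the patent.

```python
def duration_vector(segments, dim):
    """Feature vector of segment durations, cut/padded to a fixed dimension."""
    durations = [end - start for start, end in segments]
    durations = durations[:dim]
    return durations + [0.0] * (dim - len(durations))
```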
S140: based on the generated video feature vector and subtitle feature vectors, determine the subtitle file that matches the video segment.
In the technical scheme provided by this embodiment, the video feature vector is generated from the duration information of each speech segment, the subtitle feature vectors are generated from the duration information of each subtitle segment, and the subtitle file matching the video segment is then determined from the video feature vector and the subtitle feature vectors. This effectively solves the problem of inaccurate matching caused by an overall time offset between a subtitle file and the video segment, and at the same time effectively reduces the dimension of the video feature vector and the subtitle feature vectors, greatly improving the matching efficiency and accuracy between the video segment and the subtitle files.
It should be clear that, in a preferred example of this embodiment, steps S220 and S230 in Fig. 4 may also be replaced by the following step: extract audio data from the obtained video segment and use it as the speech data; that is, in this preferred example the spectrum analysis may be omitted.
On the basis of the above embodiments, extracting the associated speech time information of each speech segment from the obtained video segment may specifically include: extracting each speech segment from the obtained video segment, and obtaining the associated speech time information corresponding to each speech segment.
Generally, a video segment contains rich speech information, and each speech segment can be extracted from the obtained video segment according to that speech information. Depending on the chosen time interval threshold, the speech information can be divided into syllables, words and/or sentences. For example, extracting each speech segment from the obtained video segment according to the speech information may specifically be: judging whether the time interval between the current syllable and the next syllable in the video segment exceeds a set silence duration threshold; if it does, determining that the time corresponding to the current syllable is the end time node of the current speech segment and the time corresponding to the next syllable is the start time node of the next speech segment; if it does not, repeating the above operation. The silence duration threshold may be set according to actual requirements in combination with the length of the video segment to be matched, for example 30 milliseconds, 1 second, 2 seconds, 5 seconds or 5 minutes; the present invention does not limit this.
For example, when the silence duration threshold in a video segment is set to 2 seconds, the time intervals between successive syllables in the video segment are checked in turn; that is, it is detected whether a syllable occurs within 2 seconds after the current syllable. If so, this step is repeated; if not, the time interval between the current syllable and the next syllable exceeds 2 seconds, so the time corresponding to the current syllable is taken as the end time node of the current speech segment, and when the next syllable occurs its time is recorded as the start time node of the next speech segment. The above steps are repeated to extract each speech segment from the obtained video segment.
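A minimal sketch of this silence-threshold segmentation is given below; it assumes the syllable (or voiced-frame) start times have already been detected and are given in seconds, which is an assumption of the example rather than something the patent fixes.

```python
def split_speech_segments(syllable_times, silence_threshold=2.0):
    """Group syllable timestamps into speech segments separated by silences longer than the threshold."""
    if not syllable_times:
        return []
    segments = []
    start = prev = syllable_times[0]
    for t in syllable_times[1:]:
        if t - prev > silence_threshold:   # gap exceeds the silence threshold: close current segment
            segments.append((start, prev))
            start = t
        prev = t
    segments.append((start, prev))
    return segments
```

The resulting (start, end) pairs are exactly the speech segments from which the interval-based or duration-based feature vectors of Embodiments three and four are built.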
With the above technical scheme, the speech segments in the video segment can be extracted effectively, so that the associated speech time information of each speech segment is extracted from the obtained video segment more accurately and effectively, achieving accurate matching between the video segment and the subtitle file.
Embodiment five
Fig. 5 shows a structural diagram of the video and subtitle matching device provided by Embodiment five of the present invention. As shown in Fig. 5, the device includes: an acquiring unit 510, a video feature extraction unit 520, a subtitle feature extraction unit 530, a feature vector generation unit 540 and a determining unit 550.
The acquiring unit 510 is configured to obtain a video segment to be matched and one or more candidate subtitle files; the video feature extraction unit 520 is configured to extract the associated speech time information of each speech segment from the obtained video segment; the subtitle feature extraction unit 530 is configured to extract subtitle time information from the one or more candidate subtitle files; the feature vector generation unit 540 is configured to generate, based on the same rule, a video feature vector for the video segment from the extracted associated speech time information and a subtitle feature vector for each of the one or more subtitle files from the extracted subtitle time information; and the determining unit 550 is configured to determine, based on the generated video feature vector and subtitle feature vectors, the subtitle file that matches the video segment.
With the technical scheme adopted by the present invention, the associated speech time information of each speech segment is extracted from the obtained video segment and used to generate a video feature vector for the video segment; subtitle time information is extracted from the one or more obtained subtitle files and used to generate a subtitle feature vector for each of them; the subtitle file that matches the video segment is then determined from the video feature vector and the subtitle feature vectors. This removes the confusion caused to users by mismatched subtitles and fundamentally guarantees that the correct subtitles are matched to the video.
On the basis of the above embodiments, the video feature extraction unit may include: an audio data extraction module, a speech data acquisition module and an associated speech time information acquisition module.
The audio data extraction module is configured to extract audio data from the obtained video segment; the speech data acquisition module is configured to perform spectrum analysis on the extracted audio data and take the audio data that conforms to speech spectrum characteristics as speech data; and the associated speech time information acquisition module is configured to obtain each speech segment and the corresponding associated speech time information based on the resulting speech data.
On the basis of the above embodiments, the associated speech time information may be the time interval information between adjacent speech segments, and the subtitle time information may be the time interval information between adjacent subtitle segments; the feature vector generation unit may then be specifically configured to: based on the same rule, generate the video feature vector of the video segment from the time interval information between adjacent speech segments, and generate the subtitle feature vectors of the one or more subtitle files from the time interval information between adjacent subtitle segments.
On the basis of the above embodiments, the associated speech time information may also be the duration information of each speech segment, and the subtitle time information may also be the duration information of each subtitle segment; the feature vector generation unit may then be specifically configured to: based on the same rule, generate the video feature vector of the video segment from the duration information of each speech segment, and generate the subtitle feature vectors of the one or more subtitle files from the duration information of each subtitle segment.
On the basis of the above embodiments, the determining unit may include a calculating module and a determining module.
The calculating module is configured to calculate the spatial similarity between the generated video feature vector and each subtitle feature vector; and the determining module is configured to determine the target subtitle file corresponding to the video segment according to the calculated spatial similarities.
Embodiment six
Embodiment six of the present invention provides a terminal device which integrates the video and subtitle matching device of the embodiments of the present invention and can match videos with subtitles by performing the video and subtitle matching method.
For example, the terminal device in this embodiment may specifically be a mobile phone, a tablet computer or another terminal device equipped with a video playing apparatus.
With the technical scheme adopted by the present invention, the associated speech time information of each speech segment is extracted from the obtained video segment and used to generate a video feature vector for the video segment; subtitle time information is extracted from the one or more obtained subtitle files and used to generate a subtitle feature vector for each of them; the subtitle file that matches the video segment is then determined from the video feature vector and the subtitle feature vectors. This removes the confusion caused to users by mismatched subtitles and fundamentally guarantees that the correct subtitles are matched to the video.
The video and subtitle matching device provided by this embodiment and the video and subtitle matching method provided by any embodiment of the present invention belong to the same inventive concept; the device can perform the video and subtitle matching method provided by any embodiment of the present invention and has the corresponding functional modules and beneficial effects. For technical details not described in detail in this embodiment, reference may be made to the video and subtitle matching method provided by any embodiment of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to the above embodiments and may include other equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.

Claims (11)

1. A video and subtitle matching method, characterized by comprising:
obtaining a video segment to be matched and one or more candidate subtitle files;
extracting the associated speech time information of each speech segment from the obtained video segment, and extracting subtitle time information from the one or more candidate subtitle files;
generating, based on the same rule, a video feature vector for the video segment from the extracted associated speech time information and a subtitle feature vector for each of the one or more subtitle files from the extracted subtitle time information; and
determining, based on the generated video feature vector and subtitle feature vectors, the subtitle file that matches the video segment.
2. The method according to claim 1, characterized in that extracting the associated speech time information of each speech segment from the obtained video segment comprises:
extracting audio data from the obtained video segment;
performing spectrum analysis on the extracted audio data, and taking the audio data that conforms to speech spectrum characteristics as speech data; and
obtaining each speech segment and the corresponding associated speech time information based on the resulting speech data.
3. The method according to claim 1 or 2, characterized in that the associated speech time information is the time interval information between adjacent speech segments, the subtitle time information is the time interval information between adjacent subtitle segments, and
generating, based on the same rule, the video feature vector of the video segment and the subtitle feature vectors of the one or more subtitle files from the extracted associated speech time information and subtitle time information comprises:
based on the same rule, generating the video feature vector of the video segment from the time interval information between adjacent speech segments, and generating the subtitle feature vectors of the one or more subtitle files from the time interval information between adjacent subtitle segments.
4. The method according to claim 1 or 2, characterized in that the associated speech time information is the duration information of each speech segment, the subtitle time information is the duration information of each subtitle segment, and
generating, based on the same rule, the video feature vector of the video segment and the subtitle feature vectors of the one or more subtitle files from the extracted associated speech time information and subtitle time information comprises:
based on the same rule, generating the video feature vector of the video segment from the duration information of each speech segment, and generating the subtitle feature vectors of the one or more subtitle files from the duration information of each subtitle segment.
5. The method according to any one of claims 1-4, characterized in that determining, based on the generated video feature vector and subtitle feature vectors, the subtitle file that matches the video segment comprises:
calculating the spatial similarity between the generated video feature vector and each subtitle feature vector; and
determining the target subtitle file corresponding to the video segment according to the calculated spatial similarities.
6. A video and subtitle matching device, characterized by comprising:
an acquiring unit, configured to obtain a video segment to be matched and one or more candidate subtitle files;
a video feature extraction unit, configured to extract the associated speech time information of each speech segment from the obtained video segment;
a subtitle feature extraction unit, configured to extract subtitle time information from the one or more candidate subtitle files;
a feature vector generation unit, configured to generate, based on the same rule, a video feature vector for the video segment from the extracted associated speech time information and a subtitle feature vector for each of the one or more subtitle files from the extracted subtitle time information; and
a determining unit, configured to determine, based on the generated video feature vector and subtitle feature vectors, the subtitle file that matches the video segment.
7. The device according to claim 6, characterized in that the video feature extraction unit comprises:
an audio data extraction module, configured to extract audio data from the obtained video segment;
a speech data acquisition module, configured to perform spectrum analysis on the extracted audio data and take the audio data that conforms to speech spectrum characteristics as speech data; and
an associated speech time information acquisition module, configured to obtain each speech segment and the corresponding associated speech time information based on the resulting speech data.
8. The device according to claim 6 or 7, characterized in that the associated speech time information is the time interval information between adjacent speech segments, the subtitle time information is the time interval information between adjacent subtitle segments, and
the feature vector generation unit is configured to, based on the same rule, generate the video feature vector of the video segment from the time interval information between adjacent speech segments, and generate the subtitle feature vectors of the one or more subtitle files from the time interval information between adjacent subtitle segments.
9. The device according to claim 6 or 7, characterized in that the associated speech time information is the duration information of each speech segment, the subtitle time information is the duration information of each subtitle segment, and
the feature vector generation unit is configured to, based on the same rule, generate the video feature vector of the video segment from the duration information of each speech segment, and generate the subtitle feature vectors of the one or more subtitle files from the duration information of each subtitle segment.
10. The device according to any one of claims 6-9, characterized in that the determining unit comprises:
a calculating module, configured to calculate the spatial similarity between the generated video feature vector and each subtitle feature vector; and
a determining module, configured to determine the target subtitle file corresponding to the video segment according to the calculated spatial similarities.
11. A terminal device, comprising the video and subtitle matching device according to any one of claims 6-10.
CN201610139767.1A 2016-03-11 2016-03-11 Video and subtitle matching method and device Pending CN107181986A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610139767.1A CN107181986A (en) 2016-03-11 2016-03-11 Video and subtitle matching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610139767.1A CN107181986A (en) 2016-03-11 2016-03-11 Video and subtitle matching method and device

Publications (1)

Publication Number Publication Date
CN107181986A true CN107181986A (en) 2017-09-19

Family

ID=59829813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610139767.1A Pending CN107181986A (en) Video and subtitle matching method and device

Country Status (1)

Country Link
CN (1) CN107181986A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108538309A (en) * 2018-03-01 2018-09-14 杭州趣维科技有限公司 A kind of method of song detecting
CN109587543A (en) * 2018-12-27 2019-04-05 秒针信息技术有限公司 Audio synchronization method and device and storage medium
CN109743617A (en) * 2018-12-03 2019-05-10 清华大学 A kind of video playing jumps air navigation aid and equipment
CN109754783A (en) * 2019-03-05 2019-05-14 百度在线网络技术(北京)有限公司 Method and apparatus for determining the boundary of audio sentence
CN114051154A (en) * 2021-11-05 2022-02-15 新华智云科技有限公司 News video strip splitting method and system
CN116471436A (en) * 2023-04-12 2023-07-21 央视国际网络有限公司 Information processing method and device, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103647909A (en) * 2013-12-16 2014-03-19 宇龙计算机通信科技(深圳)有限公司 Caption adjusting method and caption adjusting device
CN104038827A (en) * 2014-06-06 2014-09-10 小米科技有限责任公司 Multimedia playing method and device
US20150100981A1 (en) * 2012-06-29 2015-04-09 Huawei Device Co., Ltd. Video Processing Method, Terminal, and Caption Server
CN104853257A (en) * 2015-04-30 2015-08-19 北京奇艺世纪科技有限公司 Subtitle display method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150100981A1 (en) * 2012-06-29 2015-04-09 Huawei Device Co., Ltd. Video Processing Method, Terminal, and Caption Server
CN103647909A (en) * 2013-12-16 2014-03-19 宇龙计算机通信科技(深圳)有限公司 Caption adjusting method and caption adjusting device
CN104038827A (en) * 2014-06-06 2014-09-10 小米科技有限责任公司 Multimedia playing method and device
CN104853257A (en) * 2015-04-30 2015-08-19 北京奇艺世纪科技有限公司 Subtitle display method and device

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108538309A (en) * 2018-03-01 2018-09-14 杭州趣维科技有限公司 A kind of method of song detecting
CN108538309B (en) * 2018-03-01 2021-09-21 杭州小影创新科技股份有限公司 Singing voice detection method
CN109743617A (en) * 2018-12-03 2019-05-10 清华大学 A kind of video playing jumps air navigation aid and equipment
CN109743617B (en) * 2018-12-03 2020-11-24 清华大学 Skip navigation method and device for video playing
CN109587543A (en) * 2018-12-27 2019-04-05 秒针信息技术有限公司 Audio synchronization method and device and storage medium
CN109587543B (en) * 2018-12-27 2021-04-02 秒针信息技术有限公司 Audio synchronization method and apparatus and storage medium
CN109754783A (en) * 2019-03-05 2019-05-14 百度在线网络技术(北京)有限公司 Method and apparatus for determining the boundary of audio sentence
CN114051154A (en) * 2021-11-05 2022-02-15 新华智云科技有限公司 News video strip splitting method and system
CN116471436A (en) * 2023-04-12 2023-07-21 央视国际网络有限公司 Information processing method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN107181986A (en) Video and subtitle matching method and device
Fan et al. Cn-celeb: a challenging chinese speaker recognition dataset
CN108780643B (en) Automatic dubbing method and device
CN106297776B (en) A kind of voice keyword retrieval method based on audio template
US10037313B2 (en) Automatic smoothed captioning of non-speech sounds from audio
CN108305643B (en) Method and device for determining emotion information
US10964330B2 (en) Matching speakers to meeting audio
CN107507626B (en) Mobile phone source identification method based on voice frequency spectrum fusion characteristics
Pan et al. Selective listening by synchronizing speech with lips
US11511200B2 (en) Game playing method and system based on a multimedia file
CN106898339B (en) Song chorusing method and terminal
CN106816151B (en) Subtitle alignment method and device
US20200013389A1 (en) Word extraction device, related conference extraction system, and word extraction method
CN110750996B (en) Method and device for generating multimedia information and readable storage medium
CN105989839B (en) Speech recognition method and device
CN106982344A (en) video information processing method and device
CN111147871B (en) Singing recognition method and device in live broadcast room, server and storage medium
CN114598933B (en) Video content processing method, system, terminal and storage medium
CN114065720A (en) Conference summary generation method and device, storage medium and electronic equipment
CN111737515B (en) Audio fingerprint extraction method and device, computer equipment and readable storage medium
CN110555117B (en) Data processing method and device and electronic equipment
CN113270112A (en) Electronic camouflage voice automatic distinguishing and restoring method and system
CN117059123A (en) Small-sample digital human voice-driven action replay method based on gesture action graph
CN116017088A (en) Video subtitle processing method, device, electronic equipment and storage medium
US20230169988A1 (en) Method and apparatus for performing speaker diarization based on language identification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170919

RJ01 Rejection of invention patent application after publication