CN107181986A - Method and device for matching video with subtitles - Google Patents
- Publication number
- CN107181986A CN107181986A CN201610139767.1A CN201610139767A CN107181986A CN 107181986 A CN107181986 A CN 107181986A CN 201610139767 A CN201610139767 A CN 201610139767A CN 107181986 A CN107181986 A CN 107181986A
- Authority
- CN
- China
- Prior art keywords
- video
- information
- vector
- captions
- video segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4884—Data services, e.g. news ticker for displaying subtitles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/233—Processing of audio elementary streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
- H04N21/23418—Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Television Signal Processing For Recording (AREA)
- Studio Circuits (AREA)
Abstract
The invention discloses a method and device for matching video with subtitles. The method includes: obtaining a video segment to be matched and one or more candidate subtitle files; extracting the associated speech timing information of each speech segment from the obtained video segment, and extracting subtitle timing information from the one or more candidate subtitle files; generating, based on the same rule, a video feature vector for the video segment and a subtitle feature vector for each of the one or more subtitle files from the extracted speech timing information and subtitle timing information; and determining, based on the generated video feature vector and subtitle feature vectors, the subtitle file that matches the video segment. By determining the matching subtitle file through the subtitle feature vectors and the video feature vector, the technical scheme removes the confusion that subtitle mismatches cause for users and fundamentally guarantees that the video segment is matched with the correct subtitle file.
Description
Technical field
The embodiments of the present invention relate to the field of multimedia technology, and in particular to a method and device for matching video with subtitles.
Background art
With the continuing development of Internet and multimedia technology, video, an expressive and engaging carrier of information, has become popular with a great many users. To present video content better, the subtitles corresponding to a video are usually displayed while the user watches it, making the content easier to understand and improving the viewing experience.
In the prior art, video and subtitles are matched by filename: the filename of the target video is extracted and compared against the filename of each subtitle file in a set of subtitle files; the subtitle file whose filename matches best is selected and played with the video. If a subtitle file is named inaccurately or incorrectly, however, the selection is directly affected and the accuracy of the selected subtitle file becomes very unreliable. Moreover, subtitle filenames are extremely easy to modify: if an arbitrary subtitle file is renamed to something related to the video, it may be matched by mistake, causing confusion for the viewer.
Summary of the invention
The present invention provides a method and device for matching video with subtitles, to solve the problem that a video is easily matched with the wrong subtitles and to achieve accurate matching of video and subtitles.
In a first aspect, an embodiment of the invention provides a method for matching video with subtitles, the method including:
obtaining a video segment to be matched and one or more candidate subtitle files;
extracting the associated speech timing information of each speech segment from the obtained video segment, and extracting subtitle timing information from the one or more candidate subtitle files;
generating, based on the same rule, a video feature vector for the video segment and a subtitle feature vector for each of the one or more subtitle files from the extracted speech timing information and subtitle timing information;
determining, based on the generated video feature vector and subtitle feature vectors, the subtitle file that matches the video segment.
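As an illustration, the four steps of the first aspect can be sketched as follows. This is a minimal sketch under our own assumptions, not the patent's code: `video_times` and the candidate dictionary stand in for the extraction results described in the embodiments below, and fixed-dimension gap vectors with cosine similarity are one concrete choice of "same rule" and similarity measure.

```python
# Minimal sketch of the four-step method. The extraction of time points
# from the audio and the subtitle files is assumed to have been done already.

def build_vector(times, dim=8):
    # Same rule on both sides: gaps between adjacent time points,
    # truncated or zero-padded to a fixed dimension.
    gaps = [b - a for a, b in zip(times, times[1:])]
    return (gaps + [0.0] * dim)[:dim]

def cosine(u, v):
    # Spatial similarity between the two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def match(video_times, subtitle_candidates):
    # subtitle_candidates maps a file name to its subtitle start times;
    # the file whose feature vector is closest to the video's wins.
    vvec = build_vector(video_times)
    return max(subtitle_candidates,
               key=lambda name: cosine(vvec, build_vector(subtitle_candidates[name])))
```

Because the gap rule is applied identically on both sides, a subtitle track that is merely shifted in time relative to the video can still match perfectly, which is the property embodiment three builds on.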
In a second aspect, an embodiment of the present invention further provides a device for matching video with subtitles, the device including:
an acquiring unit, configured to obtain a video segment to be matched and one or more candidate subtitle files;
a video feature extraction unit, configured to extract the associated speech timing information of each speech segment from the obtained video segment;
a subtitle feature extraction unit, configured to extract subtitle timing information from the one or more candidate subtitle files;
a feature vector generation unit, configured to generate, based on the same rule, a video feature vector for the video segment and a subtitle feature vector for each of the one or more subtitle files from the extracted speech timing information and subtitle timing information;
a determining unit, configured to determine, based on the generated video feature vector and subtitle feature vectors, the subtitle file that matches the video segment.
In the technical scheme adopted by the present invention, the associated speech timing information of each speech segment is extracted from the obtained video segment and used to generate the video feature vector of the video segment; subtitle timing information is extracted from the obtained one or more subtitle files and used to generate a subtitle feature vector for each of them; the subtitle file matching the video segment is then determined from the video feature vector and the subtitle feature vectors. This removes the confusion that subtitle mismatches cause for users and fundamentally guarantees that the correct subtitles are matched to the video.
Brief description of the drawings
By reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings, other features, objects and advantages of the present invention will become more apparent:
Fig. 1 is a flowchart of the method for matching video with subtitles provided by embodiment one of the present invention;
Fig. 2 is a flowchart of the method for matching video with subtitles provided by embodiment two of the present invention;
Fig. 3 is a flowchart of the method for matching video with subtitles provided by embodiment three of the present invention;
Fig. 4 is a flowchart of the method for matching video with subtitles provided by embodiment four of the present invention;
Fig. 5 is a structural diagram of the device for matching video with subtitles provided by embodiment five of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, specific embodiments of the invention are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here only explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the full content. Before the exemplary embodiments are discussed in greater detail, it should be mentioned that some of them are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations (or steps) as sequential, many of the operations can be performed in parallel, concurrently or simultaneously, and the order of the operations can be rearranged. A process may be terminated when its operations are completed, and may also include additional steps not shown in the drawings. A process may correspond to a method, a function, code, a subroutine, a subprogram, and so on.
Embodiment one
Fig. 1 is a flowchart of the method for matching video with subtitles provided by embodiment one of the present invention. The method of this embodiment can be performed by the matching device for video and subtitles, which can be implemented in hardware and/or software, is typically integrated in a client that needs to obtain a matching subtitle file, and is used in cooperation with a server that provides subtitle files and/or video segments.
The method of this embodiment specifically includes:
S110, obtaining a video segment to be matched and one or more candidate subtitle files.
In this operation, the video segment can be a complete film, cartoon, variety show or training course, any fragment of such a video, or a fragment generated by splicing several fragments of a video. The speech information in the video segment is typically a human voice, or a sound that can be recognized as speech after processing, including voices dubbed or synthesized in cartoons; it can be spoken dialogue, singing, and so on.
Generally, the video segment to be matched is the target video segment that the user selects to play, and the subtitle file matching the target video segment usually has to be searched for in a local or online subtitle library, which often contains one or more candidate subtitle files.
Considering differences in the length and content of video segments, it should be understood that one or more subtitle files may match the video segment to be matched. For example, a long or spliced video segment may correspond to multiple candidate subtitle files.
S120, extracting the associated speech timing information of each speech segment from the obtained video segment, and extracting subtitle timing information from the one or more candidate subtitle files.
In this operation, the associated speech timing information of each speech segment can be the time-node or time-interval information associated with that segment; specifically, it can include the start and end time nodes of each speech segment, the time gap between adjacent speech segments, the duration of each speech segment, and so on. Similarly, the subtitle timing information can be the time-node or time-interval information associated with the subtitle content in the subtitle file.
Compared with recognizing the spoken content of the video segment through speech recognition, extracting the associated speech timing information of each speech segment is relatively easy, and a subtitle file generally contains only the subtitle text and the corresponding subtitle timing information. It is therefore preferable to extract the associated speech timing information of each speech segment from the obtained video segment and the subtitle timing information from the one or more candidate subtitle files, and to use them to characterize the features of the video segment and the subtitle files.
S130, generating, based on the same rule, a video feature vector for the video segment and a subtitle feature vector for each of the one or more subtitle files from the extracted speech timing information and subtitle timing information.
In this operation, the video feature vector of the video segment can be generated from all of the extracted speech timing information, so that the features of the video segment are characterized in more detail; it can also be generated from only part of the extracted speech timing information, which reduces the dimension of the video feature vector and allows the matching subtitle file to be determined more quickly while accuracy is preserved. Correspondingly, the subtitle feature vector of each of the one or more subtitle files can be generated, based on the same rule, from all or part of the extracted subtitle timing information. A subtitle feature vector generated under the same rule generally has the same number of elements, that is, the same dimension, as the video feature vector.
Generating the video feature vector from the extracted speech timing information and the subtitle feature vectors from the extracted subtitle timing information under the same rule has the advantage of fundamentally ensuring that the video feature vector and the subtitle feature vectors can be matched against each other accurately.
S140, determining, based on the generated video feature vector and subtitle feature vectors, the subtitle file that matches the video segment.
In this operation, each subtitle feature vector can be compared with the video feature vector, and the subtitle file matching the video segment determined from the comparison result. In a preferred implementation of this embodiment, determining the matching subtitle file based on the generated vectors can specifically include: calculating the spatial similarity between the generated video feature vector and each subtitle feature vector; and determining, according to the calculated spatial similarity, the target subtitle file corresponding to the video segment.
The similarity in vector space expresses how similar or how closely related the subtitle file and the video segment are in their temporal features. For example, the spatial similarity can be judged from numerical measures such as the cosine distance, the Euclidean distance or the Pearson correlation coefficient between the subtitle feature vector and the video feature vector. Preferably, the target subtitle file corresponding to the video segment is determined from the subtitle feature vector with the highest spatial similarity to the video feature vector.
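The three similarity measures named above can be computed as in the following sketch (plain Python, not the patent's own code); whichever measure is used, the candidate subtitle file closest to the video feature vector is selected.

```python
import math

def cosine_similarity(u, v):
    # 1.0 means the vectors point in the same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def euclidean_distance(u, v):
    # Smaller means more similar.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pearson_correlation(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0
```

With cosine similarity or Pearson correlation the highest value wins; with Euclidean distance the lowest does.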
In the technical scheme adopted by the present invention, the associated speech timing information of each speech segment is extracted from the obtained video segment and used to generate the video feature vector of the video segment; subtitle timing information is extracted from the obtained one or more subtitle files and used to generate a subtitle feature vector for each of them; the subtitle file matching the video segment is then determined from the video feature vector and the subtitle feature vectors. This removes the confusion that subtitle mismatches cause for users and fundamentally guarantees that the correct subtitles are matched to the video.
Embodiment two
Fig. 2 is a flowchart of a method for matching video with subtitles provided by embodiment two of the present invention. This embodiment is further optimized on the basis of embodiment one. In this embodiment, the step of extracting the associated speech timing information of each speech segment from the obtained video segment is refined into:
extracting audio data from the obtained video segment; performing spectrum analysis on the extracted audio data, and taking the audio data that conforms to the spectral characteristics of speech as speech data; and obtaining, based on the resulting speech data, each speech segment and the corresponding associated speech timing information.
Accordingly, the method for the present embodiment is specifically included:
S110, acquisition video segment to be matched and one or more subtitle files to be matched.
S220, extracting audio data from the obtained video segment.
In a video segment, the video content is usually presented through scenes, whereas the subtitle file, which assists the understanding of the video, generally has no entries corresponding to what those scenes show; subtitle entries generally correspond to the audio of the video segment. Audio data can therefore be extracted from the obtained video segment as a feature characterizing it, so that the subtitle file can be matched with the video segment.
S230, performing spectrum analysis on the extracted audio data, and taking the audio data that conforms to the spectral characteristics of speech as speech data.
To make a video segment more expressive, it usually also contains various other sounds, such as cars passing on a road, light background music that sets the atmosphere, or natural sounds such as wind and rain, whereas a subtitle file generally contains the dialogue of the characters or song lyrics. The speech data can therefore be further extracted from the audio data, so that the video segment is matched with the subtitle file more accurately.
Generally, extracting the speech data from the audio data requires analyzing the extracted audio data, in the time domain and/or the frequency domain, to obtain its acoustic characteristics. Frequency-domain analysis represents the audio data along a frequency axis; the analysis process is more concise and examines the problem more deeply and conveniently. In this embodiment, therefore, spectrum analysis of the extracted audio data is preferred. Specifically, the extracted audio data preferably includes frequency distribution information, so that spectrum analysis can be performed on it to obtain the speech data. Spectrum analysis yields the frequency components of the audio data and their distribution range, from which the amplitude distribution and energy distribution of each frequency component can be obtained, as well as the frequency values at which the main amplitude and energy lie. According to the result of the spectrum analysis, the audio data that conforms to the spectral characteristics of speech, namely the frequency components and frequency range of the human voice, can be taken as speech data.
In the above operation, taking the audio data that conforms to the spectral characteristics of speech as speech data can specifically mean clustering the audio data by spectral characteristics according to the result of the analysis. For example, the audio data can be clustered according to the differences in frequency distribution between different audio types, such as differences in amplitude distribution and energy distribution, and the audio data in the cluster that conforms to the spectral characteristics of speech is taken as speech data.
S240, obtaining, based on the resulting speech data, each speech segment and the corresponding associated speech timing information.
After the speech data is obtained, each speech segment and the associated speech timing information corresponding to it can be obtained from the audio data more precisely. Since the subtitle content in a subtitle file generally corresponds to the speech information in the video segment, such as the dialogue of the characters or narration, the associated speech timing information extracted from the speech data serves as the basis for generating the video feature vector of the video segment, making the match with the subtitle file more accurate.
S250, extracting subtitle timing information from the one or more candidate subtitle files.
In this embodiment, an existing subtitle-time extraction method can be used to extract the subtitle timing information from the one or more candidate subtitle files according to the characteristics of the time data.
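As an example of such an extraction method, subtitle files in the widespread SubRip (.srt) layout carry their timing in `HH:MM:SS,mmm --> HH:MM:SS,mmm` cue lines, which can be read with a regular expression. The patent does not name a subtitle format; the .srt assumption is ours.

```python
import re

CUE = re.compile(r"(\d+):(\d+):(\d+),(\d+)\s*-->\s*(\d+):(\d+):(\d+),(\d+)")

def subtitle_times(srt_text):
    """Return (start, end) pairs in seconds, one per subtitle fragment."""
    times = []
    for m in CUE.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in m.groups())
        times.append((h1 * 3600 + m1 * 60 + s1 + ms1 / 1000.0,
                      h2 * 3600 + m2 * 60 + s2 + ms2 / 1000.0))
    return times
```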
S130, generating, based on the same rule, a video feature vector for the video segment and a subtitle feature vector for each of the one or more subtitle files from the extracted speech timing information and subtitle timing information.
S140, determining, based on the generated video feature vector and subtitle feature vectors, the subtitle file that matches the video segment.
In the technical scheme provided by this embodiment, the audio data that conforms to the spectral characteristics of speech is taken as speech data from the audio extracted from the video segment. The speech segments and associated speech timing information obtained from this speech data characterize the features of the video segment more accurately, so that analyzing the video feature vector and subtitle feature vectors generated from the extracted timing information determines the subtitle file matching the video segment more accurately.
Embodiment three
Fig. 3 is a flowchart of a method for matching video with subtitles provided by embodiment three of the present invention. This embodiment is optimized on the basis of embodiment two. In this embodiment, the associated speech timing information is refined to be the time-gap information between adjacent speech segments, the subtitle timing information is refined to be the time-gap information between adjacent subtitle fragments, and the step of generating the feature vectors based on the same rule is refined into:
generating, based on the same rule, the video feature vector of the video segment from the time gaps between adjacent speech segments, and the subtitle feature vector of each of the one or more subtitle files from the time gaps between adjacent subtitle fragments.
Accordingly, the method for the present embodiment is specifically included:
S110, acquisition video segment to be matched and one or more subtitle files to be matched.
S220, from acquired video segment extract voice data.
S230, the voice data progress spectrum analysis to being extracted, will meet the audio of voice spectrum characteristic
Data are used as speech data.
S340, obtaining, based on the resulting speech data, each speech segment and the time gaps between adjacent speech segments.
Specifically, each speech segment can be obtained from the resulting speech data, and the time gap between adjacent speech segments calculated from the start and end time nodes of each segment. For example, the time gap between the current speech segment and the preceding adjacent segment can be obtained by subtracting the end time node value of the preceding segment from the start time node value of the current segment.
S350, extracting the time gaps between adjacent subtitle fragments from the one or more candidate subtitle files.
Similarly, the time gap between adjacent subtitle fragments can be calculated from the start and end time nodes of each subtitle fragment, and the subtitle feature vector of each of the one or more subtitle files generated from the time gaps between adjacent subtitle fragments.
S360, generating, based on the same rule, the video feature vector of the video segment from the time gaps between adjacent speech segments, and the subtitle feature vector of each of the one or more subtitle files from the time gaps between adjacent subtitle fragments.
In this operation, the time gaps between each extracted speech segment and its adjacent segment can be obtained and used as the elements of the video feature vector of the video segment. Similarly, the time gaps between each extracted subtitle fragment and its adjacent fragment can be used as the elements of the subtitle feature vector, generating the subtitle feature vector of each of the one or more subtitle files.
Further, to reduce the dimension of the video feature vector and achieve fast matching between the video segment and the subtitle files, the same selection rule can be applied on both sides: a subset of the obtained time gaps between adjacent speech segments is selected and used to generate the video feature vector of the video segment, and a subset of the time gaps between adjacent subtitle fragments is selected and used to generate the subtitle feature vectors of the one or more subtitle files. For example, the time gaps of a set number of adjacent speech segments can be chosen as the elements of the video feature vector, and, based on the same selection rule, the time gaps of the same set number of adjacent subtitle fragments chosen as the elements of the subtitle feature vector.
S140, determining, based on the generated video feature vector and subtitle feature vectors, the subtitle file that matches the video segment.
In the technical scheme provided by this embodiment, the video feature vector is generated from the time gaps between adjacent speech segments, the subtitle feature vectors are generated from the time gaps between adjacent subtitle fragments, and the subtitle file matching the video segment is determined from these vectors. This effectively reduces the dimension of the video feature vector and the subtitle feature vectors, and also effectively solves the problem of inaccurate matching caused by a global time offset between the subtitle file and the video segment, substantially improving the matching efficiency and accuracy.
It should be clear that in a preferred example of this embodiment, steps S220 and S230 in Fig. 3 can also be replaced by the following step: extracting audio data from the obtained video segment and using it directly as speech data; that is, the spectrum analysis can be omitted in this example.
Embodiment four
Fig. 4 is a flow chart of a video and subtitle matching method provided by Embodiment Four of the present invention. This embodiment is optimized on the basis of Embodiment Two above. In this embodiment, the "associated speech time information" is optimized to: the associated speech time information is the duration information of each speech segment; the "subtitle time information" is optimized to: the subtitle time information is the duration information of each subtitle segment; and "generating, based on the same rule, the video feature vector of the video segment and the subtitle feature vector of the one or more subtitle files according to the extracted associated speech time information and subtitle time information" is optimized to:
based on the same rule, generating the video feature vector of the video segment according to the duration information of each speech segment, and generating the subtitle feature vector of the one or more subtitle files according to the duration information of each subtitle segment.
Accordingly, the method of this embodiment specifically includes:
S110: acquire a video segment to be matched and one or more subtitle files to be matched.
S220: extract audio data from the acquired video segment.
S230: perform spectrum analysis on the extracted audio data, and take the audio data that meets voice spectrum characteristics as the speech data.
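The patent does not pin down which "voice spectrum characteristics" are tested in S230. As one illustrative criterion, the fraction of spectral energy falling inside the 300 to 3400 Hz speech band could be thresholded; the band limits, function names, and signal values below are assumptions, not part of the patent:

```python
import math

def speech_band_ratio(samples, rate, lo=300.0, hi=3400.0):
    """Fraction of spectral energy inside the telephone speech band,
    computed with a naive DFT (illustrative, not optimized)."""
    n = len(samples)
    total = band = 0.0
    for k in range(1, n // 2):  # skip DC, positive frequencies only
        re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        power = re * re + im * im
        total += power
        if lo <= k * rate / n <= hi:
            band += power
    return band / total if total else 0.0

rate = 8000
# A 1 kHz tone lies inside the speech band; a 125 Hz hum lies below it.
tone = [math.sin(2 * math.pi * 1000 * i / rate) for i in range(64)]
hum = [math.sin(2 * math.pi * 125 * i / rate) for i in range(64)]
```

Audio windows whose ratio exceeds a chosen threshold would be kept as speech data; the rest would be discarded.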
S440: based on the obtained speech data, obtain each speech segment and the duration information corresponding to each speech segment.
Specifically, the duration information of each speech segment may be extracted from the acquired video segment based on the obtained speech data, and may be calculated from the node time information of each speech segment; for example, the duration information of the current speech segment may be obtained by subtracting its starting time node value from its terminating time node value.
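This subtraction can be sketched as follows; representing each segment as a (start, end) pair of node values in seconds is an assumption for illustration:

```python
def segment_durations(segments):
    """Duration of each segment: terminating node value minus starting node value."""
    return [end - start for start, end in segments]

# Node time information of three speech segments, in seconds (illustrative).
speech = [(0.0, 1.2), (2.0, 3.5), (4.1, 6.0)]
durations = segment_durations(speech)
```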
S450: extract the duration information of each subtitle segment from the one or more subtitle files to be matched.
Similarly, extracting the duration information of each subtitle segment from the one or more subtitle files to be matched may be done by calculating the duration of each subtitle segment from its starting time node information and terminating time node information.
S460: based on the same rule, generate the video feature vector of the video segment according to the duration information of each speech segment, and generate the subtitle feature vector of the one or more subtitle files according to the duration information of each subtitle segment.
In this operation, generating the video feature vector of the video segment according to the duration information of each speech segment may be: obtaining the duration information of each speech segment extracted from the video segment, and taking the duration information of all the obtained speech segments as the elements of the video feature vector. Similarly, generating the subtitle feature vector of the one or more subtitle files according to the duration information of each subtitle segment may be: obtaining the duration information of each extracted subtitle segment, and taking the duration information of all the subtitle segments as the elements of the subtitle feature vector, thereby generating the subtitle feature vector of the one or more subtitle files.
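Concretely, the duration values can simply be collected, in order, as the vector elements; a minimal sketch assuming (start, end) pairs in seconds, with file names and values that are illustrative only:

```python
def duration_vector(segments):
    """Feature vector whose elements are the segment durations, in order."""
    return [end - start for start, end in segments]

speech = [(0.0, 1.2), (2.0, 3.5), (4.1, 6.0), (6.9, 8.0)]
subtitle_files = {
    "ep1.srt": [(0.1, 1.3), (2.1, 3.6), (4.2, 6.1), (7.0, 8.1)],
    "ep2.srt": [(0.0, 0.4), (1.0, 3.0), (5.0, 5.5), (6.0, 7.8)],
}

video_vec = duration_vector(speech)           # vector for the video segment
sub_vecs = {name: duration_vector(segs)       # one vector per candidate subtitle file
            for name, segs in subtitle_files.items()}
```

Because durations are unaffected by a constant shift of all timestamps, this representation is insensitive to an overall time offset between the subtitle file and the video.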
Further, in order to reduce the dimension of the video feature vector and realize rapid matching of the video segment and the subtitle files, it is also possible, based on the same selection rule, to select a subset of the values of the obtained duration information of the speech segments and generate the video feature vector of the video segment from the selected values, and to select a subset of the values of the obtained duration information of the subtitle segments and generate the subtitle feature vector of the one or more subtitle files from them. For example, the duration information of a set number of speech segments may be chosen as the elements of the video feature vector and, based on the same selection rule, the duration information of a set number of subtitle segments may be chosen as the elements of the subtitle feature vector.
S470: based on the generated video feature vector and subtitle feature vector, determine the subtitle file that matches the video segment.
In the technical scheme provided by this embodiment, a video feature vector is generated from the duration information of each speech segment, a subtitle feature vector is generated from the duration information of each subtitle segment, and the subtitle file matching the video segment is then determined from these two vectors. This effectively resolves the inaccurate matching between subtitles and timeline caused by an overall time offset between the subtitle file and the video segment, while also effectively reducing the dimension of the video feature vector and the subtitle feature vector, thereby greatly improving the matching efficiency and accuracy of video segments and subtitle files.
It should be clear that, in a preferred example of this embodiment, steps S220 and S230 in Fig. 4 may also be replaced by the following step: extract the audio data from the acquired video segment and use it directly as the speech data; that is, no spectrum analysis is required in this preferred example.
On the basis of the above embodiments, extracting the associated speech time information of each speech segment from the acquired video segment may specifically include: extracting each speech segment from the acquired video segment, and obtaining the associated speech time information corresponding to each speech segment.
Generally, a video segment contains rich voice information, and each speech segment can be extracted from the acquired video segment according to that voice information. Depending on the time interval threshold that is set, the voice information may be divided into syllables, words and/or sentences, etc. Exemplarily, extracting each speech segment from the acquired video segment according to the voice information may specifically be: judging whether the time interval between the current syllable and the next syllable in the video segment exceeds a set silent duration threshold; if so, determining that the time information corresponding to the current syllable is the terminating time node information of the current speech segment, and that the time information corresponding to the next syllable is the starting time node information of the next speech segment; if not, repeating the above operation. The silent duration threshold may be set according to actual requirements in combination with the length of the video segment to be matched, for example 30 milliseconds, 1 second, 2 seconds, 5 seconds or 5 minutes, etc.; the present invention is not limited in this respect.
For example, when the silent duration threshold in a video segment is set to 2 seconds, the time intervals between successive syllables in the video segment are examined in turn; that is, it is detected whether another syllable occurs within 2 seconds after the current syllable. If so, this step is repeated; if not, the time interval between the current syllable and the next syllable exceeds 2 seconds, so the time information corresponding to the current syllable is taken as the terminating time node information of the current speech segment, the time corresponding to the next syllable is recorded, when that syllable occurs, as the starting time node information of the next speech segment, and the above steps are repeated until every speech segment has been extracted from the acquired video segment.
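The loop just described can be sketched as follows; representing the voice information as a list of syllable onset timestamps in seconds is a simplifying assumption, and the 2-second threshold matches the example above:

```python
def split_into_segments(syllable_times, silence_threshold=2.0):
    """Group syllable timestamps into speech segments: a gap longer than
    the silence threshold terminates the current segment and the next
    syllable starts a new one. Returns (start_node, end_node) pairs."""
    segments = []
    start = prev = syllable_times[0]
    for t in syllable_times[1:]:
        if t - prev > silence_threshold:
            segments.append((start, prev))  # prev syllable ends the current segment
            start = t                       # next syllable opens a new segment
        prev = t
    segments.append((start, prev))
    return segments

syllables = [0.0, 0.4, 0.9, 3.5, 3.9, 8.0]  # illustrative timestamps in seconds
segs = split_into_segments(syllables, 2.0)
# segs == [(0.0, 0.9), (3.5, 3.9), (8.0, 8.0)]
```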
With the above technical scheme, the speech segments in a video segment can be extracted effectively, so that the associated speech time information of each speech segment is extracted from the acquired video segment more accurately, realizing accurate matching of the video segment and the subtitle file.
Embodiment five
Fig. 5 shows the structure of a video and subtitle matching device provided by Embodiment Five of the present invention. As shown in Fig. 5, the device includes: an acquiring unit 510, a video feature extraction unit 520, a subtitle feature extraction unit 530, a feature vector generation unit 540 and a determining unit 550.
The acquiring unit 510 is used to acquire a video segment to be matched and one or more subtitle files to be matched; the video feature extraction unit 520 is used to extract the associated speech time information of each speech segment from the acquired video segment; the subtitle feature extraction unit 530 is used to extract subtitle time information from the one or more subtitle files to be matched; the feature vector generation unit 540 is used to generate, based on the same rule, the video feature vector of the video segment and the subtitle feature vector of the one or more subtitle files according to the extracted associated speech time information and subtitle time information; and the determining unit 550 is used to determine, based on the generated video feature vector and subtitle feature vector, the subtitle file that matches the video segment.
In the technical solution adopted by the present invention, the associated speech time information of each speech segment is extracted from the acquired video segment, the video feature vector of the video segment is generated according to that associated speech time information, subtitle time information is extracted from the acquired one or more subtitle files, the subtitle feature vector of the one or more subtitle files is generated according to that subtitle time information, and the subtitle file matching the video segment is then determined based on the video feature vector and the subtitle feature vector. This resolves the inconvenience caused to users by mismatched video subtitles, and fundamentally guarantees that subtitles are matched to the video correctly.
On the basis of the above embodiments, the video feature extraction unit may include: an audio data extraction module, a speech data acquisition module and an associated speech time information acquisition module.
The audio data extraction module is used to extract audio data from the acquired video segment; the speech data acquisition module is used to perform spectrum analysis on the extracted audio data and take the audio data that meets voice spectrum characteristics as the speech data; and the associated speech time information acquisition module is used to obtain, based on the obtained speech data, each speech segment and the corresponding associated speech time information.
On the basis of the above embodiments, the associated speech time information may be the time interval information between adjacent speech segments, the subtitle time information may be the time interval information between adjacent subtitle segments, and the feature vector generation unit may specifically be used to: based on the same rule, generate the video feature vector of the video segment according to the time interval information between adjacent speech segments, and generate the subtitle feature vector of the one or more subtitle files according to the time interval information between adjacent subtitle segments.
On the basis of the above embodiments, the associated speech time information may also be the duration information of each speech segment, the subtitle time information may also be the duration information of each subtitle segment, and the feature vector generation unit may specifically be used to: based on the same rule, generate the video feature vector of the video segment according to the duration information of each speech segment, and generate the subtitle feature vector of the one or more subtitle files according to the duration information of each subtitle segment.
On the basis of the above embodiments, the determining unit may include a computing module and a determining module. The computing module is used to calculate the space similarity between the generated video feature vector and subtitle feature vector, and the determining module is used to determine, according to the calculated space similarity, the target subtitle file corresponding to the video segment.
Embodiment six
Embodiment Six of the present invention provides a terminal device. The terminal device integrates the video and subtitle matching device of the embodiments of the present invention, and can match a video with subtitles by performing the video and subtitle matching method.
Exemplarily, the terminal device in this embodiment may specifically be a mobile phone, a tablet computer or another terminal device equipped with a video playing apparatus.
In the technical solution adopted by the present invention, the associated speech time information of each speech segment is extracted from the acquired video segment, the video feature vector of the video segment is generated according to that associated speech time information, subtitle time information is extracted from the acquired one or more subtitle files, the subtitle feature vector of the one or more subtitle files is generated according to that subtitle time information, and the subtitle file matching the video segment is then determined based on the video feature vector and the subtitle feature vector. This resolves the inconvenience caused to users by mismatched video subtitles, and fundamentally guarantees that subtitles are matched to the video correctly.
The video and subtitle matching device provided by this embodiment belongs to the same inventive concept as the video and subtitle matching method provided by any embodiment of the present invention; it can perform the video and subtitle matching method provided by any embodiment of the present invention, and possesses the corresponding functional modules and beneficial effects. For technical details not described in detail in this embodiment, reference may be made to the video and subtitle matching method provided by any embodiment of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the invention is not restricted to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to those embodiments; without departing from the inventive concept, it may also include other, equivalent embodiments, and the scope of the present invention is determined by the scope of the appended claims.
Claims (11)
1. A video and subtitle matching method, characterised by including:
acquiring a video segment to be matched and one or more subtitle files to be matched;
extracting associated speech time information of each speech segment from the acquired video segment, and extracting subtitle time information from the one or more subtitle files to be matched;
generating, based on the same rule, a video feature vector of the video segment and a subtitle feature vector of the one or more subtitle files according to the extracted associated speech time information and subtitle time information;
determining, based on the generated video feature vector and subtitle feature vector, the subtitle file that matches the video segment.
2. The method according to claim 1, characterised in that extracting the associated speech time information of each speech segment from the acquired video segment includes:
extracting audio data from the acquired video segment;
performing spectrum analysis on the extracted audio data, and taking the audio data that meets voice spectrum characteristics as speech data;
obtaining, based on the obtained speech data, each speech segment and the corresponding associated speech time information.
3. The method according to claim 1 or 2, characterised in that the associated speech time information is the time interval information between adjacent speech segments, the subtitle time information is the time interval information between adjacent subtitle segments, and
generating, based on the same rule, the video feature vector of the video segment and the subtitle feature vector of the one or more subtitle files according to the extracted associated speech time information and subtitle time information includes:
based on the same rule, generating the video feature vector of the video segment according to the time interval information between adjacent speech segments, and generating the subtitle feature vector of the one or more subtitle files according to the time interval information between adjacent subtitle segments.
4. The method according to claim 1 or 2, characterised in that the associated speech time information is the duration information of each speech segment, the subtitle time information is the duration information of each subtitle segment, and
generating, based on the same rule, the video feature vector of the video segment and the subtitle feature vector of the one or more subtitle files according to the extracted associated speech time information and subtitle time information includes:
based on the same rule, generating the video feature vector of the video segment according to the duration information of each speech segment, and generating the subtitle feature vector of the one or more subtitle files according to the duration information of each subtitle segment.
5. The method according to any one of claims 1-4, characterised in that determining, based on the generated video feature vector and subtitle feature vector, the subtitle file that matches the video segment includes:
calculating the space similarity between the generated video feature vector and subtitle feature vector; and
determining, according to the calculated space similarity, the target subtitle file corresponding to the video segment.
6. A video and subtitle matching device, characterised by including:
an acquiring unit, for acquiring a video segment to be matched and one or more subtitle files to be matched;
a video feature extraction unit, for extracting associated speech time information of each speech segment from the acquired video segment;
a subtitle feature extraction unit, for extracting subtitle time information from the one or more subtitle files to be matched;
a feature vector generation unit, for generating, based on the same rule, a video feature vector of the video segment and a subtitle feature vector of the one or more subtitle files according to the extracted associated speech time information and subtitle time information;
a determining unit, for determining, based on the generated video feature vector and subtitle feature vector, the subtitle file that matches the video segment.
7. The device according to claim 6, characterised in that the video feature extraction unit includes:
an audio data extraction module, for extracting audio data from the acquired video segment;
a speech data acquisition module, for performing spectrum analysis on the extracted audio data and taking the audio data that meets voice spectrum characteristics as speech data;
an associated speech time information acquisition module, for obtaining, based on the obtained speech data, each speech segment and the corresponding associated speech time information.
8. The device according to claim 6 or 7, characterised in that the associated speech time information is the time interval information between adjacent speech segments, the subtitle time information is the time interval information between adjacent subtitle segments, and
the feature vector generation unit is used to, based on the same rule, generate the video feature vector of the video segment according to the time interval information between adjacent speech segments, and generate the subtitle feature vector of the one or more subtitle files according to the time interval information between adjacent subtitle segments.
9. The device according to claim 6 or 7, characterised in that the associated speech time information is the duration information of each speech segment, the subtitle time information is the duration information of each subtitle segment, and
the feature vector generation unit is used to, based on the same rule, generate the video feature vector of the video segment according to the duration information of each speech segment, and generate the subtitle feature vector of the one or more subtitle files according to the duration information of each subtitle segment.
10. The device according to any one of claims 6-9, characterised in that the determining unit includes:
a computing module, for calculating the space similarity between the generated video feature vector and subtitle feature vector; and
a determining module, for determining, according to the calculated space similarity, the target subtitle file corresponding to the video segment.
11. A terminal device, including the video and subtitle matching device according to any one of claims 6-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610139767.1A CN107181986A (en) | 2016-03-11 | 2016-03-11 | The matching process and device of video and captions |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107181986A true CN107181986A (en) | 2017-09-19 |
Family
ID=59829813
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103647909A (en) * | 2013-12-16 | 2014-03-19 | 宇龙计算机通信科技(深圳)有限公司 | Caption adjusting method and caption adjusting device |
CN104038827A (en) * | 2014-06-06 | 2014-09-10 | 小米科技有限责任公司 | Multimedia playing method and device |
US20150100981A1 (en) * | 2012-06-29 | 2015-04-09 | Huawei Device Co., Ltd. | Video Processing Method, Terminal, and Caption Server |
CN104853257A (en) * | 2015-04-30 | 2015-08-19 | 北京奇艺世纪科技有限公司 | Subtitle display method and device |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108538309A (en) * | 2018-03-01 | 2018-09-14 | 杭州趣维科技有限公司 | A kind of method of song detecting |
CN108538309B (en) * | 2018-03-01 | 2021-09-21 | 杭州小影创新科技股份有限公司 | Singing voice detection method |
CN109743617A (en) * | 2018-12-03 | 2019-05-10 | 清华大学 | A kind of video playing jumps air navigation aid and equipment |
CN109743617B (en) * | 2018-12-03 | 2020-11-24 | 清华大学 | Skip navigation method and device for video playing |
CN109587543A (en) * | 2018-12-27 | 2019-04-05 | 秒针信息技术有限公司 | Audio synchronization method and device and storage medium |
CN109587543B (en) * | 2018-12-27 | 2021-04-02 | 秒针信息技术有限公司 | Audio synchronization method and apparatus and storage medium |
CN109754783A (en) * | 2019-03-05 | 2019-05-14 | 百度在线网络技术(北京)有限公司 | Method and apparatus for determining the boundary of audio sentence |
CN114051154A (en) * | 2021-11-05 | 2022-02-15 | 新华智云科技有限公司 | News video strip splitting method and system |
CN116471436A (en) * | 2023-04-12 | 2023-07-21 | 央视国际网络有限公司 | Information processing method and device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170919 |