CN110035326A - Subtitle generation method, subtitle-based video retrieval method, apparatus, and electronic device - Google Patents

Subtitle generation method, subtitle-based video retrieval method, apparatus, and electronic device

Info

Publication number
CN110035326A
CN110035326A CN201910272387.9A
Authority
CN
China
Prior art keywords
video
data
subtitle
target video
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910272387.9A
Other languages
Chinese (zh)
Inventor
汪忠超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN201910272387.9A priority Critical patent/CN110035326A/en
Publication of CN110035326A publication Critical patent/CN110035326A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278Subtitling

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Studio Circuits (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Embodiments of the present invention disclose a subtitle generation method, a subtitle-based video retrieval method, an apparatus, and an electronic device. One specific embodiment of the method includes: extracting audio data from the video data of a target video; performing speech recognition on the audio data and generating caption data from the speech recognition result; and combining the caption data with the video data of the target video to generate video data of the target video that includes subtitles. This allows the subtitles corresponding to the audio data of the target video to be generated by a video editor and combined with the video data, which on the one hand reduces the cost of generating subtitles for a video and on the other hand improves the speed of subtitle generation.

Description

Subtitle generation method, subtitle-based video retrieval method, apparatus, and electronic device
Technical field
The present invention relates to the field of multimedia technology, and in particular to a subtitle generation method, a subtitle-based video retrieval method, an apparatus, and an electronic device.
Background
Subtitles are the explanatory text displayed in the playback interface while a video plays, and may include dialogue, narration, or other information. Subtitles help viewers understand the content of a program. Subtitles are usually produced in post-production, after the video program has been completed.
In current subtitle-adding workflows, third-party speech-to-text software is used after video recording is finished to generate the text corresponding to the audio content of the video. The text is then pasted into a video editor and placed frame by frame to achieve a continuous subtitle display effect. This way of adding subtitles makes subtitle production costly.
Summary of the invention
Embodiments of the present invention provide a subtitle generation method, a subtitle-based video retrieval method, an apparatus, and an electronic device, which generate caption data from the audio data of a video within a video editor, with the aim of reducing the cost of subtitle production.
In a first aspect, an embodiment of the present invention provides a subtitle generation method, the method comprising: extracting audio data from the video data of a target video; performing speech recognition on the audio data to generate caption data; and combining the caption data with the video data of the target video to generate video data of the target video including subtitles.
Optionally, before combining the caption data with the video data of the target video to generate the video data of the target video including subtitles, the method further comprises: acquiring time synchronization information generated during the shooting of the target video; and the combining the caption data with the video data of the target video to generate the video data of the target video including subtitles comprises: combining, based on the time synchronization information, the caption data corresponding to the audio data with the video data of the target video to generate the target video including subtitles.
Optionally, combining the caption data with the video data of the target video based on the time synchronization information comprises: determining at least one audio data frame included in the audio data; for each audio data frame, determining a start time point and an end time point of the audio data frame, and determining the video image key frames corresponding to the start time point and the end time point of the audio data frame; and combining, based on the start time point and the end time point, the caption data corresponding to the audio data frame with the video image key frames.
Optionally, performing speech recognition on the audio data to generate caption data comprises: generating, according to the speech recognition result, first-language caption data consistent with the language category of the speech; and generating at least one piece of second-language caption data corresponding to the first-language caption data, the language category of the second language being different from that of the first language. Combining the caption data with the video data of the target video to generate the video data of the target video including subtitles then comprises: combining the first-language caption data and the at least one piece of second-language caption data with the video data of the target video to generate the video data of the target video including subtitles.
Optionally, the method further comprises: receiving subtitle setting parameters input by a user; and the combining the caption data with the video data of the target video to generate the video data of the target video including subtitles comprises: combining the caption data to which the subtitle setting parameters have been applied with the video data of the target video to generate the video data of the target video including subtitles.
Optionally, before extracting the audio data from the video data of the target video, the method further comprises: receiving a subtitle generation instruction input by a user; and extracting the audio data corresponding to the target video comprises: extracting the audio data corresponding to the video data of the target video according to the subtitle generation instruction.
In a second aspect, an embodiment of the present invention provides a subtitle-based video retrieval method, comprising: receiving a video search keyword input by a user; matching the video search keyword against a preset database, in which multiple videos and the caption data corresponding to each video are stored in association in advance, and determining, according to the matching result, the search target video corresponding to the video search keyword; and sending the search target video to the user's terminal device; wherein the caption data corresponding to any video in the preset database is generated by the subtitle generation method of any implementation of the first aspect.
In a third aspect, an embodiment of the present invention provides a subtitle generation apparatus, comprising: an extraction unit configured to extract audio data from the video data of a target video; a first generation unit configured to perform speech recognition on the audio data and generate caption data; and a second generation unit configured to combine the caption data with the video data of the target video to generate video data of the target video including subtitles.
Optionally, the apparatus further includes a synchronization information acquiring unit configured to acquire the time synchronization information generated during the shooting of the target video; and the second generation unit is further configured to combine, based on the time synchronization information, the caption data with the video data of the target video to generate the target video including subtitles.
Optionally, the second generation unit is further configured to: determine at least one audio data frame included in the audio data; for each audio data frame, determine a start time point and an end time point of the audio data frame, and determine the video image key frames corresponding to the start time point and the end time point of the audio data frame; and combine, based on the start time point and the end time point, the caption data corresponding to the audio data frame with the video image key frames.
Optionally, the first generation unit is further configured to: generate, according to the speech recognition result, first-language caption data consistent with the language category of the speech; and generate, according to a preset method, at least one piece of second-language caption data corresponding to the first-language caption data, the language category of the second language being different from that of the first language; and the second generation unit is further configured to combine the first-language caption data and the at least one piece of second-language caption data with the video data of the target video to generate the video data of the target video including subtitles.
Optionally, the apparatus further includes a first receiving unit configured to receive the subtitle setting parameters input by the user; and the second generation unit is further configured to combine the caption data to which the subtitle setting parameters have been applied with the video data of the target video to generate the video data of the target video including subtitles.
Optionally, the apparatus further includes a second receiving unit configured to receive the subtitle generation instruction input by the user before the extraction unit extracts the audio data of the target video; and the extraction unit is further configured to extract the audio data from the video data of the target video according to the subtitle generation instruction.
In a fourth aspect, an embodiment of the present invention provides a subtitle-based video retrieval apparatus, comprising: a receiving unit configured to receive the video search keyword input by a user; a determination unit configured to match the video search keyword against a preset database, in which multiple videos and the caption data corresponding to each video are pre-saved, and to determine, according to the matching result, the search target video corresponding to the video search keyword; and a transmission unit configured to send the search target video to the user's terminal device; wherein the caption data corresponding to any video in the preset database is generated by the subtitle generation apparatus of any implementation of the third aspect.
In a fifth aspect, an embodiment of the present invention provides an electronic device, comprising: one or more processors; and a storage device configured to store one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the steps of any one of the above subtitle generation methods.
In a sixth aspect, an embodiment of the present invention provides an electronic device, comprising: one or more processors; and a storage device configured to store one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the above subtitle-based video retrieval method.
In a seventh aspect, an embodiment of the present invention provides a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the steps of any one of the above subtitle generation methods.
In an eighth aspect, an embodiment of the present invention provides a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the steps of the above subtitle-based video retrieval method.
The subtitle generation method, subtitle-based video retrieval method, apparatus, and electronic device provided by the embodiments of the present invention extract the audio data of the target video, perform speech recognition on the audio data to generate caption data, and finally combine the caption data with the video data of the target video to generate the video data of the target video including subtitles. This scheme generates the subtitles corresponding to the audio data of the target video through a video editor and combines the subtitles with the video data, which on the one hand reduces the cost of generating subtitles for a video and on the other hand improves the speed of subtitle generation.
Brief description of the drawings
The accompanying drawings are provided for a better understanding of the present invention and do not constitute an undue limitation on the present invention. In the drawings:
Fig. 1 is a flowchart of one embodiment of the subtitle generation method according to the present invention;
Fig. 2 is a flowchart of another embodiment of the subtitle generation method according to the present invention;
Fig. 3 is a flowchart of yet another embodiment of the subtitle generation method according to the present invention;
Fig. 4 is a flowchart of one embodiment of the subtitle-based video retrieval method according to the present invention;
Fig. 5 is a structural schematic diagram of one embodiment of the subtitle generation apparatus according to the present invention;
Fig. 6 is a structural schematic diagram of one embodiment of the subtitle-based video retrieval apparatus according to the present invention;
Fig. 7 shows an exemplary system architecture to which the subtitle generation method of one embodiment of the present invention can be applied;
Fig. 8 is a schematic diagram of the basic structure of the electronic device provided according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding; these details should be regarded as merely exemplary. Those of ordinary skill in the art will therefore appreciate that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention.
It should be noted that, provided there is no conflict, the embodiments of the present invention and the features in the embodiments may be combined with one another.
Referring to Fig. 1, which illustrates the flow of one embodiment of the subtitle generation method according to the present invention. As shown in Fig. 1, the subtitle generation method comprises the following steps:
Step 101: extract audio data from the video data of the target video.
In this embodiment, the target video may be a video shot in real time, a locally stored pre-recorded video, or a video obtained from another electronic device. The video data of the target video may include a video stream composed of multiple image frames and an audio stream; that is, the video stream and the audio stream are encapsulated in the video data corresponding to the target video according to a preset container format.
In some application scenarios, the audio data can be extracted from the video data of the target video in which the video stream and the audio stream are encapsulated; that is, the video data of the target video is demultiplexed and the audio stream is separated from the video data. The video stream and the audio stream are multiplexed on the same time axis.
In other application scenarios, the audio content of the video can be recorded while the video is played in advance, thereby obtaining the audio data corresponding to the target video.
The audio data of the target video can be extracted in one pass. Alternatively, the video data of the target video can be decomposed into multiple segments, and the audio data corresponding to each segment can be extracted concurrently using multiple threads.
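The patent does not prescribe a particular demultiplexing tool. As a minimal sketch of step 101, the following extracts the audio stream from a video container, assuming the ffmpeg command-line tool is installed; the file names, sample rate, and codec choices are illustrative assumptions, not part of the disclosure.

```python
import subprocess

def extract_audio(video_path: str, audio_path: str) -> None:
    """Demultiplex the audio stream of a video file into a mono 16 kHz WAV."""
    subprocess.run(
        [
            "ffmpeg",
            "-i", video_path,        # input container with muxed video/audio streams
            "-vn",                   # drop the video stream, keep only audio
            "-acodec", "pcm_s16le",  # decode to 16-bit PCM, a common ASR input
            "-ar", "16000",          # 16 kHz sample rate
            "-ac", "1",              # mono
            audio_path,
        ],
        check=True,
    )

extract_audio("target_video.mp4", "target_audio.wav")
```

For the multi-threaded variant described above, the same command can be run once per segment using ffmpeg's -ss and -t options, with one worker thread per segment.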
Step 102: perform speech recognition on the audio data to generate caption data.
In this embodiment, speech recognition can be performed on the audio data, and caption data can be generated from the speech recognition result. Specifically, the text content corresponding to the audio data is first obtained through speech recognition, and the text content is then compiled into subtitles, thereby obtaining the caption data of the target video.
In some optional implementations of this embodiment, a preset speech library can be used to perform speech recognition on the audio data and obtain the text content corresponding to the audio data. In some application scenarios, the preset speech library is a locally pre-stored speech library, and the local speech library is used to perform speech recognition on the audio data. In other application scenarios, the preset speech library is stored on a remote electronic device; in these scenarios, a communication connection with that electronic device can be established by wire or wirelessly, and the remote speech library can be accessed to perform speech recognition on the audio data.
In some application scenarios, the speech library may include multiple texts and at least one standard pronunciation corresponding to each text.
In other application scenarios, the speech library is a database in which speech, text, and semantics are matched to one another; in such a speech library, coherent sentences composed of the corresponding text can be found for the input speech according to the specific context and semantics and matched to the input speech.
In other optional implementations of this embodiment, performing speech recognition on the audio data to generate caption data in step 102 may include the following steps:
First, determine the speech information in the audio data.
Second, decompose the audio data into multiple speech data frames.
Specifically, the speech information may include multiple speech data frames, each of which is a series of closely related words or sentences. In practice, the spectrum of the speech information can be obtained, and the start position and end position of each speech segment can be determined from the spectrum, thereby decomposing the speech information into multiple speech segments.
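The patent leaves the segmentation algorithm open. As one illustration, the following sketch splits an audio signal into speech segments using a simple short-time-energy threshold, a stand-in for the spectrum-based start/end detection described above; the frame length and threshold are assumed values.

```python
import numpy as np

def split_speech_segments(samples: np.ndarray, rate: int = 16000,
                          frame_ms: int = 30, energy_thresh: float = 1e-3):
    """Return (start_sec, end_sec) pairs of contiguous voiced regions.

    samples is expected as a float array normalized to [-1.0, 1.0].
    """
    frame_len = rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    # short-time energy of each frame
    energy = np.array([
        np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)
        for i in range(n_frames)
    ])
    voiced = energy > energy_thresh
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments
```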
Then, recognize the speech data in each speech segment using a speech recognition method, and generate the text content corresponding to each speech segment.
Existing speech recognition technology can be used here. Speech recognition, also known as automatic speech recognition (ASR), aims to convert the vocabulary content of human speech into computer-readable input such as keystrokes, binary codes, or character strings, using approaches such as neural networks and adaptive methods. Through speech recognition, the vocabulary information in a speech segment can be recognized and converted into text, yielding the text content of the audio data.
Compiling the text content into subtitles includes concatenating the text content corresponding to each speech segment, in the order of the speech segments, into the subtitles of the target video.
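A minimal sketch of this per-segment recognition step follows, using the open-source SpeechRecognition package and its Google Web Speech backend as a stand-in for the preset speech library; the patent itself does not name a recognizer, so the library and language code are assumptions.

```python
import speech_recognition as sr  # pip install SpeechRecognition

def transcribe_segments(audio_path: str, segments, language: str = "zh-CN"):
    """Recognize each (start_sec, end_sec) segment, keeping its timing."""
    recognizer = sr.Recognizer()
    captions = []
    for start, end in segments:
        # reopen the file for each segment so offsets are absolute
        with sr.AudioFile(audio_path) as source:
            chunk = recognizer.record(source, offset=start, duration=end - start)
        try:
            text = recognizer.recognize_google(chunk, language=language)
        except sr.UnknownValueError:
            continue  # no recognizable speech in this segment
        captions.append({"start": start, "end": end, "text": text})
    # concatenating the texts in segment order yields the full subtitle track
    return captions
```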
Step 103: combine the caption data with the video data of the target video to generate the video data of the target video including subtitles.
The video stream and audio stream encapsulated in the video data of the video are multiplexed on the same time axis. The caption data can be combined with the video data of the target video according to this time axis, generating the video data of the target video including subtitles.
In some application scenarios, the caption data can be embedded into the video data of the target video, thereby producing the video data of the target video including subtitles.
In other application scenarios, the caption data can instead be written to a separate file, and the subtitle data file and the video data can be packaged together into the video data of the target video including subtitles.
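As a sketch of the separate-file variant, the following writes the recognized captions to a standard SRT file and muxes it into the container as a subtitle stream; the SRT format and the ffmpeg invocation are illustrative choices, not mandated by the patent.

```python
import subprocess

def fmt_time(t: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:01:02,500."""
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    ms = int(round((t - int(t)) * 1000))
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(captions, srt_path: str) -> None:
    """Serialize caption dicts ({"start", "end", "text"}) as an SRT file."""
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, c in enumerate(captions, start=1):
            f.write(f"{i}\n{fmt_time(c['start'])} --> {fmt_time(c['end'])}\n")
            f.write(f"{c['text']}\n\n")

def mux_subtitles(video_path: str, srt_path: str, out_path: str) -> None:
    """Package the subtitle file with the video data without re-encoding."""
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-i", srt_path,
         "-c", "copy",         # copy the existing video/audio streams
         "-c:s", "mov_text",   # MP4-compatible subtitle codec
         out_path],
        check=True,
    )
```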
In some optional implementations of this embodiment, before step 103, the subtitle generation method further includes: acquiring the time synchronization information generated during the shooting of the target video. The time synchronization information here can be the timestamp corresponding to each video image frame and the timestamp corresponding to each audio frame. A timestamp usually includes a start time and an end time.
In these optional implementations, combining the caption data with the video data of the target video in step 103 to generate the video data of the target video including subtitles comprises: combining, based on the time synchronization information, the caption data corresponding to the audio data with the video data of the target video to generate the target video including subtitles. Combining the caption data with the video data of the target video based on the time synchronization information ensures that, when the target video is played, its video data and caption data are played simultaneously.
In practice, combining the caption data corresponding to the audio data with the video data of the target video based on the time synchronization information may include the following steps:
First, determine at least one audio data frame included in the audio data.
The multiple audio data frames obtained in step 102 can be used here.
Second, for each audio data frame, determine the start time point and end time point of the audio data frame, and determine the video image key frames corresponding to the start time point and end time point of the audio data frame.
Finally, combine, based on the start time point and end time point, the caption data corresponding to the audio data frame with the video image key frames.
That is, the caption data corresponding to an audio data frame is combined, according to the start time point and end time point of that audio data frame, with the video image frames corresponding to that audio data frame, which ensures that the video stream and the subtitle stream data of the target video remain synchronized. Thus, when the target video is played, its video data and caption data can be played simultaneously.
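A small sketch of this alignment step follows; it assumes the key-frame presentation timestamps have already been read out of the container (for example with ffprobe) and pairs each caption with the key frames whose timestamps fall between its start and end points.

```python
import bisect

def align_captions_to_keyframes(captions, keyframe_times):
    """Attach to each caption the key-frame timestamps it spans.

    captions: list of {"start": float, "end": float, "text": str}
    keyframe_times: sorted list of key-frame presentation times in seconds
    """
    aligned = []
    for c in captions:
        lo = bisect.bisect_left(keyframe_times, c["start"])
        hi = bisect.bisect_right(keyframe_times, c["end"])
        aligned.append({**c, "keyframes": keyframe_times[lo:hi]})
    return aligned
```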
It should be noted that the method provided in this embodiment can be executed by a terminal device. In practice, it can be executed by a video editor installed on the terminal device, by a video shooting tool installed on the terminal device, or by a video publishing tool installed on the terminal device.
The method provided by the above embodiment of the present invention extracts audio data from the video data of the target video, then performs speech recognition on the audio data to generate caption data, and finally combines the caption data with the video data of the target video to generate the video data of the target video including subtitles. This scheme generates the subtitles corresponding to the audio data of the target video through a video editor and combines the subtitles with the video data, which on the one hand reduces the cost of generating subtitles for a video and on the other hand improves the speed of subtitle generation.
With further reference to Fig. 2, which illustrates the flow of another embodiment of the subtitle generation method. As shown in Fig. 2, this subtitle generation method comprises the following steps:
Step 201: extract audio data from the video data of the target video.
Step 201 is identical to step 101 of the embodiment shown in Fig. 1 and is not repeated here.
Step 202: perform speech recognition on the audio data, and generate, according to the speech recognition result, first-language caption data consistent with the language category of the speech data.
The detailed steps of performing speech recognition on the audio data can be found in step 102 shown in Fig. 1 and are not repeated here.
In this embodiment, the speech library used for speech recognition may cover multiple language categories, such as Chinese, English, French, and Japanese.
When speech recognition is performed, first-language caption data consistent with the language category of the speech data can be generated.
The first language here can be any of the above languages, such as Chinese, English, French, or Japanese.
Step 203: generate at least one piece of second-language caption data corresponding to the first-language caption data.
In some application scenarios, existing real-time translation methods can be used to generate the at least one piece of second-language caption data corresponding to the first-language caption data, as sketched below.
It should be noted that real-time translation from one language into another is a widely researched and applied technique and is not described further here.
Step 204: combine the first-language caption data and the at least one piece of second-language caption data with the video data of the target video to generate the video data of the target video including subtitles.
In this embodiment, step 204 can be identical or similar to step 103 of the embodiment shown in Fig. 1 and is not repeated here.
In addition, separate display modes, such as separate display positions, can be set for the first-language caption data and the at least one piece of second-language caption data, so that when a user watches the target video, the caption data of each language is displayed according to these settings.
As can be seen from Fig. 2, compared with the embodiment corresponding to Fig. 1, the flow of the subtitle generation method in this embodiment highlights the steps of generating first-language caption data from the speech recognition result and generating second-language caption data from the first-language caption data. This extends the available subtitle types and helps improve the user experience.
With further reference to Fig. 3, which illustrates the flow of yet another embodiment of the subtitle generation method. As shown in Fig. 3, this subtitle generation method comprises the following steps:
Step 301: extract audio data from the video data of the target video.
Step 301 is identical to step 101 of the embodiment shown in Fig. 1 and is not repeated here.
Step 302: perform speech recognition on the audio data to generate caption data.
Step 302 can be identical or similar to step 102 of the embodiment shown in Fig. 1 and is not repeated here.
Step 303: receive the subtitle setting parameters input by the user.
In this embodiment, the user can configure the parameters of the subtitles, such as the font size, the display position of the subtitles, the background color of the subtitle region, the font style, and the font color.
The user can be prompted in the video editing interface of the target video to configure the subtitle parameters, and can input the subtitle setting parameters in the video editing interface according to the prompt, either as text or by voice.
In this embodiment, step 303 and step 302 can be performed in either order.
Step 304: combine the caption data to which the subtitle setting parameters have been applied with the video data of the target video to generate the video data of the target video including subtitles.
In this embodiment, the video editing tool can apply the subtitle setting parameters input by the user to the caption data, for example by writing the subtitle setting parameters into the header of the subtitle data file.
For the specific method of combining the caption data to which the subtitle setting parameters have been applied with the video data of the target video, reference can be made to step 103 of the embodiment shown in Fig. 1, which is not repeated here.
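One concrete way to realize writing the parameters into the header of the subtitle data file is the style section of the SubStation Alpha (ASS) subtitle format, sketched below. The choice of ASS, the reduced Format lines, and the default field values are all assumptions for illustration, not part of the disclosure.

```python
ASS_HEADER = """[Script Info]
ScriptType: v4.00+

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, Alignment, MarginV
Style: Default,{font},{size},{colour},{alignment},{margin_v}

[Events]
Format: Layer, Start, End, Style, Text
"""

def write_ass(captions, path, font="SimHei", size=28,
              colour="&H00FFFFFF", alignment=2, margin_v=20):
    """Write captions to an ASS file whose header carries the user's
    subtitle setting parameters (font, size, colour, vertical position)."""
    def ts(t):
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        cs = int((t - int(t)) * 100)  # ASS uses centiseconds
        return f"{h:d}:{m:02d}:{s:02d}.{cs:02d}"
    with open(path, "w", encoding="utf-8") as f:
        f.write(ASS_HEADER.format(font=font, size=size, colour=colour,
                                  alignment=alignment, margin_v=margin_v))
        for c in captions:
            f.write(f"Dialogue: 0,{ts(c['start'])},{ts(c['end'])},Default,{c['text']}\n")
```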
As can be seen from Fig. 3, compared with the embodiment corresponding to Fig. 1, the flow of the subtitle generation method in this embodiment highlights the steps of receiving the subtitle setting parameters input by the user and combining the caption data to which the subtitle setting parameters have been applied with the video data of the target video. The above method can generate personalized caption data, thereby diversifying the caption data.
In some optional implementations of each embodiment of the subtitle generation method of the present application, before step 101 of the embodiment shown in Fig. 1, step 201 of the embodiment shown in Fig. 2, or step 301 of the embodiment shown in Fig. 3, the subtitle generation method may further include: receiving a subtitle generation instruction input by the user.
In these optional implementations, the user can input the subtitle generation instruction in the video editing interface in which the target video is edited.
Specifically, when the target video is opened for editing in the video editing interface, the user can be prompted as to whether to generate subtitles. Alternatively, the prompt can be shown during the recording of the target video, or before the target video is sent to another electronic device.
The user can input the subtitle generation instruction according to the prompt. The subtitle generation instruction here can, for example, be generated by the user's selection of the option to generate subtitles in an on-screen prompt asking whether to generate subtitles.
In addition, the user can also choose not to generate subtitles in response to the prompt.
When the user has input an instruction to generate subtitles for the target video, extracting the audio data from the video data of the target video in step 101 of the embodiment shown in Fig. 1, step 201 of the embodiment shown in Fig. 2, or step 301 of the embodiment shown in Fig. 3 may include: extracting the audio data from the video data of the target video according to the user's subtitle generation instruction.
Referring to Fig. 4, which illustrates the flow of one embodiment of the subtitle-based video retrieval method according to the present invention. As shown in Fig. 4, the subtitle-based video retrieval method comprises the following steps:
Step 401: receive the video search keyword input by the user.
In this embodiment, the user can input a video search keyword, word, or sentence on the terminal device he or she uses, for example in the search interface of an application. The video search keyword can be input in text form or in speech form. When the user inputs the video keyword as speech, the terminal device can perform speech recognition locally to identify the video search keyword corresponding to the user's speech; alternatively, the terminal device can send the user's speech input over the network to the server, and the server performs speech recognition on the user's speech to identify the video search keyword input by the user.
The terminal device can send the video search keyword to the server, and the server receives the video search keyword input by the user.
Step 402: match the video search keyword against a preset database, and determine, according to the matching result, the search target video corresponding to the video search keyword.
In this embodiment, the caption data corresponding to any video in the preset database can be generated by the subtitle generation methods illustrated in the embodiments shown in Fig. 1, Fig. 2, or Fig. 3.
In this embodiment, after receiving in step 401 the video search keyword sent by the terminal device, the server can match the video search keyword against the preset database. Specifically, the video keyword can be compared with the caption data of the multiple videos pre-stored in the database, and a video whose subtitles contain the video keyword is taken as a search target video. Further, videos whose titles contain the video keyword can be taken as preferred search target videos.
It can be understood that the search target video may include at least one video.
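A minimal sketch of this matching step follows, assuming the caption database is an in-memory mapping from video ID to title and caption text; a production system would use a full-text index, but the preference order mirrors the description above.

```python
def search_videos(keyword: str, video_db: dict) -> list:
    """video_db maps video_id -> {"title": str, "captions": str}.

    Videos whose title contains the keyword are preferred search targets;
    videos whose caption data contains it follow.
    """
    preferred, others = [], []
    for vid, meta in video_db.items():
        if keyword in meta["title"]:
            preferred.append(vid)
        elif keyword in meta["captions"]:
            others.append(vid)
    return preferred + others
```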
Step 403: send the search target video to the user's terminal device.
In this embodiment, at least one search target video can be sent to the user's terminal device. Specifically, the summary information and link information corresponding to each search target video can be sent to the user's terminal device, so that the terminal device displays the search target videos. The summary information can also include image information of the search target video, and so on.
The subtitle-based video retrieval method provided in this embodiment determines the search target video by matching the keyword input by the user against the caption data of the stored videos. Compared with the existing method of matching the keyword input by the user against information corresponding to video key frame images in order to determine the search target video, this video retrieval method can reduce the cost of video search and improve the efficiency of video search.
With further reference to Fig. 5, as an implementation of the methods shown in the above figures, the present invention provides one embodiment of a subtitle generation apparatus. This apparatus embodiment corresponds to the method embodiment shown in Fig. 1, and the apparatus can be applied to various electronic devices.
As shown in Fig. 5, the subtitle generation apparatus of this embodiment includes an extraction unit 501, a first generation unit 502, and a second generation unit 503. The extraction unit 501 is configured to extract audio data from the video data of the target video; the first generation unit 502 is configured to perform speech recognition on the audio data to generate caption data; and the second generation unit 503 is configured to combine the caption data with the video data of the target video to generate the video data of the target video including subtitles.
In this embodiment, for the specific processing of the extraction unit 501, the first generation unit 502, and the second generation unit 503 of the subtitle generation apparatus and the technical effects they bring, reference can be made to the descriptions of step 101, step 102, and step 103 in the embodiment corresponding to Fig. 1, which are not repeated here.
In some optional implementations of this embodiment, the subtitle generation apparatus further includes a synchronization information acquiring unit (not shown in the figure). The synchronization information acquiring unit is configured to acquire the time synchronization information generated during the shooting of the target video; and the second generation unit 503 is further configured to combine, based on the time synchronization information, the caption data with the video data of the target video to generate the target video including subtitles.
In some optional implementations of this embodiment, the second generation unit 503 is further configured to: determine at least one audio data frame included in the audio data; for each audio data frame, determine the start time point and end time point of the audio data frame, and determine the video image key frames corresponding to the start time point and end time point of the audio data frame; and combine, based on the start time point and end time point, the caption data corresponding to the audio data frame with the video image key frames.
In some optional implementations of this embodiment, the first generation unit 502 is further configured to: generate, according to the speech recognition result, first-language caption data consistent with the language category of the speech; and generate, according to a preset method, at least one piece of second-language caption data corresponding to the first-language caption data, the language category of the second language being different from that of the first language; and the second generation unit 503 is further configured to combine the first-language caption data and the at least one piece of second-language caption data with the video data of the target video to generate the video data of the target video including subtitles.
In some optional implementations of this embodiment, the subtitle generation apparatus further includes a first receiving unit (not shown in the figure). The first receiving unit is configured to receive the subtitle setting parameters input by the user; and the second generation unit 503 is further configured to combine the caption data to which the subtitle setting parameters have been applied with the video data of the target video to generate the video data of the target video including subtitles.
In some optional implementations of this embodiment, the second generation unit is further configured to: generate a subtitle file from the caption data corresponding to the audio data and the time synchronization information, and package the subtitle file with the video data file based on the time synchronization information to generate the video data of the target video including subtitles.
In some optional implementations of this embodiment, the subtitle generation apparatus further includes a second receiving unit. The second receiving unit is configured to receive the subtitle generation instruction input by the user before the extraction unit extracts the audio data of the target video; and the extraction unit is further configured to extract the audio data corresponding to the target video according to the subtitle generation instruction.
With further reference to Fig. 6, as an implementation of the method shown in the above figure, the present invention provides one embodiment of a subtitle-based video retrieval apparatus. This apparatus embodiment corresponds to the method embodiment shown in Fig. 4, and the apparatus can be applied to various electronic devices.
As shown in Fig. 6, the subtitle-based video retrieval apparatus of this embodiment includes a receiving unit 601, a determination unit 602, and a transmission unit 603. The receiving unit 601 is configured to receive the video search keyword input by the user; the determination unit 602 is configured to match the video search keyword against a preset database, in which multiple videos and the caption data corresponding to each video are pre-stored, and to determine, according to the matching result, the search target video corresponding to the video search keyword; and the transmission unit 603 is configured to send the search target video to the user's terminal device. The caption data corresponding to any video in the preset database is generated by the subtitle generation apparatus shown in Fig. 5.
Referring to Fig. 7, which shows an exemplary system architecture to which the subtitle generation method of one embodiment of the present invention can be applied.
As shown in Fig. 7, the system architecture may include terminal devices 701, 702, and 703, a network 704, and a server 705. The network 704 is the medium providing communication links between the terminal devices 701, 702, and 703 and the server 705, and may include various connection types, such as wired links, wireless communication links, or fiber-optic cables.
The terminal devices 701, 702, and 703 can interact with the server 705 through the network 704 to receive or send messages and the like. Various client applications, such as video editing applications and video playback applications, can be installed on the terminal devices 701, 702, and 703.
The terminal devices 701, 702, and 703 can be hardware or software. When they are hardware, they can be various electronic devices that have a display screen and support video browsing, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers. When they are software, they can be installed in the electronic devices listed above and can be implemented as multiple software programs or software modules (for example, to provide distributed services) or as a single software program or software module; no specific limitation is made here. The target video can be a video shot by the terminal device, or a video shot by another video capture device and sent to the terminal device over a communication connection.
The server 705 can provide various services, for example receiving the video data of the target video including subtitles sent by the terminal devices 701, 702, and 703, and pushing target videos to user terminal devices according to the search keyword input by the user.
It should be noted that the subtitle generation method provided by the embodiments of the present invention is generally executed by the terminal devices 701, 702, and 703; accordingly, the subtitle generation apparatus is generally provided in the terminal devices 701, 702, and 703.
It should also be noted that the subtitle generation method provided by the embodiments of the present invention can be executed by a video editing application installed in the terminal devices 701, 702, and 703.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 7 are merely illustrative; any number of terminal devices, networks, and servers can be provided according to implementation needs.
Referring next to Fig. 8, which shows a schematic diagram of the basic structure of an electronic device suitable for implementing the embodiments of the present invention. The electronic device shown in Fig. 8 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in Fig. 8, the electronic device may include one or more processors 801 and a storage device 802. The storage device 802 is configured to store one or more programs. The one or more programs in the storage device 802 can be executed by the one or more processors 801; when the one or more programs are executed by the one or more processors, the one or more processors implement the functions defined in the methods of the present invention described above.
The modules involved in the embodiments of the present invention can be implemented in software or in hardware. The described modules can also be provided in a processor, which can, for example, be described as: a processor including an extraction unit, a first generation unit, and a second generation unit. The names of these modules do not in some cases constitute a limitation on the modules themselves; for example, the extraction unit can also be described as "a unit that extracts audio data from the video data of the target video".
In another aspect, the present invention also provides a computer-readable medium. The computer-readable medium can be included in the device described in the above embodiments, or can exist separately without being assembled into the device. The computer-readable medium of the present invention can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium can be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium can include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
The computer-readable medium carries one or more programs. When the one or more programs are executed by the device, the device is caused to: extract audio data from the video data of the target video; perform speech recognition on the audio data to generate caption data; and combine the caption data with the video data of the target video to generate the video data of the target video including subtitles.
The above specific embodiments do not constitute a limitation on the protection scope of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (18)

1. A subtitle generation method, characterized by comprising:
extracting audio data from video data of a target video;
performing speech recognition on the audio data to generate caption data; and
combining the caption data with the video data of the target video to generate video data of the target video including subtitles.
2. The method according to claim 1, characterized in that before the combining the caption data with the video data of the target video to generate the video data of the target video including subtitles, the method further comprises:
acquiring time synchronization information generated during shooting of the target video; and
the combining the caption data with the video data of the target video to generate the video data of the target video including subtitles comprises:
combining, based on the time synchronization information, the caption data corresponding to the audio data with the video data of the target video to generate the target video including subtitles.
3. The method according to claim 2, characterized in that the combining, based on the time synchronization information, the caption data with the video data of the target video comprises:
determining at least one audio data frame included in the audio data;
for each audio data frame, determining a start time point and an end time point of the audio data frame, and determining video image key frames corresponding to the start time point and the end time point of the audio data frame; and
combining, based on the start time point and the end time point, the caption data corresponding to the audio data frame with the video image key frames.
4. The method according to any one of claims 1-3, characterized in that the performing speech recognition on the audio data to generate caption data comprises:
generating, according to a speech recognition result, first-language caption data consistent with the language category of the speech; and
generating at least one piece of second-language caption data corresponding to the first-language caption data, the language category of the second language being different from the language category of the first language; and
the combining the caption data with the video data of the target video to generate the video data of the target video including subtitles comprises:
combining the first-language caption data and the at least one piece of second-language caption data with the video data of the target video to generate the video data of the target video including subtitles.
5. The method according to any one of claims 1-3, characterized in that the method further comprises:
receiving subtitle setting parameters input by a user; and
the combining the caption data with the video data of the target video to generate the video data of the target video including subtitles comprises:
combining the caption data to which the subtitle setting parameters have been applied with the video data of the target video to generate the video data of the target video including subtitles.
6. The method according to any one of claims 1-3, characterized in that before the extracting audio data from the video data of the target video, the method further comprises:
receiving a subtitle generation instruction input by a user; and
the extracting the audio data corresponding to the target video comprises:
extracting the audio data corresponding to the video data of the target video according to the subtitle generation instruction.
7. A subtitle-based video retrieval method, characterized by comprising:
receiving a video search keyword input by a user;
matching the video search keyword against a preset database, and determining, according to a matching result, a search target video corresponding to the video search keyword, wherein multiple videos and caption data corresponding to each of the videos are stored in association in the preset database in advance; and
sending the search target video to a terminal device of the user; wherein the caption data corresponding to any video in the preset database is generated by the method provided by any one of claims 1-6.
8. a kind of caption generation device characterized by comprising
Extraction unit, for extracting audio data from the video data of target video;
First generation unit generates caption data for carrying out speech recognition to the audio data;
Second generation unit, for caption data in conjunction with the video data of the target video, to be generated the mesh comprising subtitle Mark the video data of video.
9. device according to claim 8, which is characterized in that described device further includes synchronizing information acquiring unit,
The synchronizing information acquiring unit is used for:
Obtain the time synchronization information generated in the shooting process of target video;And
Second generation unit is further used for:
Based on the time synchronization information by caption data in conjunction with the video data of the target video, generate comprising subtitle Target video.
10. device according to claim 9, which is characterized in that second generation unit is further used for:
Determine at least one audio data frame included by the audio data;
For each audio data frame, point and end time point at the beginning of the audio data frame, and the determining and sound are determined Video image key frame corresponding with end time point is put at the beginning of frequency data frame;
It will caption data corresponding with the audio data frame and video image pass based on the sart point in time and end time point Key frame combines.
11. The apparatus according to any one of claims 8-10, characterized in that the first generation unit is further configured to:
generate, according to a speech recognition result, first-language caption data consistent with the language category corresponding to the speech; and
generate, according to a preset method, at least one piece of second-language caption data corresponding to the first-language caption data, the language category of the second language being different from the language category of the first language; and
the second generation unit is further configured to:
combine the first-language caption data, the at least one piece of second-language caption data and the video data of the target video, to generate the video data of the target video containing subtitles.
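By way of illustration only, deriving second-language caption data from the first-language caption data might look like the sketch below; the translate() placeholder stands in for whatever preset method (for example, a machine translation service) an implementation would actually use.

def translate(text, target_lang):
    # Hypothetical placeholder for the preset translation method.
    toy = {("hello world", "fr"): "bonjour le monde"}
    return toy.get((text, target_lang), text)

def multilingual_captions(first_lang_segments, target_langs=("fr",)):
    # first_lang_segments: (start, end, text) triples in the recognised language.
    return {lang: [(s, e, translate(t, lang)) for s, e, t in first_lang_segments]
            for lang in target_langs}

print(multilingual_captions([(0.0, 1.5, "hello world")]))
# -> {'fr': [(0.0, 1.5, 'bonjour le monde')]}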
12. The apparatus according to any one of claims 8-10, characterized in that the apparatus further comprises a first receiving unit, the first receiving unit being configured to:
receive a subtitle setting parameter input by a user; and
the second generation unit is further configured to:
combine the caption data to which the subtitle setting parameter is applied with the video data of the target video, to generate the video data of the target video containing subtitles.
13. The apparatus according to any one of claims 8-10, characterized in that the apparatus further comprises a second receiving unit, the second receiving unit being configured to:
receive a subtitle generation instruction input by the user before the extraction unit extracts the audio data of the target video; and
the extraction unit is further configured to:
extract the audio data from the video data of the target video according to the subtitle generation instruction.
14. A subtitle-based video retrieval apparatus, characterized by comprising:
a receiving unit configured to receive a video search keyword input by a user;
a determination unit configured to match the video search keyword against a preset database and to determine, according to a matching result, a search target video corresponding to the video search keyword, the preset database storing in advance a plurality of videos and the caption data corresponding to each of the videos; and
a sending unit configured to send the search target video to a terminal device of the user, wherein the caption data corresponding to any video in the preset database is generated by the apparatus according to any one of claims 8-13.
15. An electronic device, characterized by comprising:
one or more processors; and
a storage apparatus configured to store one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-6.
16. An electronic device, characterized by comprising:
one or more processors; and
a storage apparatus configured to store one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to claim 7.
17. A computer-readable medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-6.
18. A computer-readable medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the method according to claim 7.
CN201910272387.9A 2019-04-04 2019-04-04 Subtitle generation, the video retrieval method based on subtitle, device and electronic equipment Pending CN110035326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910272387.9A CN110035326A (en) 2019-04-04 2019-04-04 Subtitle generation, the video retrieval method based on subtitle, device and electronic equipment

Publications (1)

Publication Number Publication Date
CN110035326A true CN110035326A (en) 2019-07-19

Family

ID=67237520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910272387.9A Pending CN110035326A (en) 2019-04-04 2019-04-04 Subtitle generation, the video retrieval method based on subtitle, device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110035326A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005115607A (en) * 2003-10-07 2005-04-28 Matsushita Electric Ind Co Ltd Video retrieving device
CN103984772A (en) * 2014-06-04 2014-08-13 百度在线网络技术(北京)有限公司 Method and device for generating text retrieval subtitle library and video retrieval method and device
CN106162293A (en) * 2015-04-22 2016-11-23 无锡天脉聚源传媒科技有限公司 A kind of video sound and the method and device of image synchronization
CN106792071A (en) * 2016-12-19 2017-05-31 北京小米移动软件有限公司 Method for processing caption and device
CN108401192A (en) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
CN108600773A (en) * 2018-04-25 2018-09-28 腾讯科技(深圳)有限公司 Caption data method for pushing, subtitle methods of exhibiting, device, equipment and medium
CN109246472A (en) * 2018-08-01 2019-01-18 平安科技(深圳)有限公司 Video broadcasting method, device, terminal device and storage medium
CN109213974A (en) * 2018-08-22 2019-01-15 北京慕华信息科技有限公司 A kind of electronic document conversion method and device

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3817395A1 (en) * 2019-10-30 2021-05-05 Beijing Xiaomi Mobile Software Co., Ltd. Video recording method and apparatus, device, and readable storage medium
CN110740275A (en) * 2019-10-30 2020-01-31 中央电视台 nonlinear editing systems
WO2021120190A1 (en) * 2019-12-20 2021-06-24 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN112309391A (en) * 2020-03-06 2021-02-02 北京字节跳动网络技术有限公司 Method and apparatus for outputting information
CN112309391B (en) * 2020-03-06 2024-07-12 北京字节跳动网络技术有限公司 Method and device for outputting information
CN111683266A (en) * 2020-05-06 2020-09-18 厦门盈趣科技股份有限公司 Method and terminal for configuring subtitles through simultaneous translation of videos
CN112055261A (en) * 2020-07-14 2020-12-08 北京百度网讯科技有限公司 Subtitle display method and device, electronic equipment and storage medium
CN111986656A (en) * 2020-08-31 2020-11-24 上海松鼠课堂人工智能科技有限公司 Teaching video automatic caption processing method and system
CN111949805B (en) * 2020-09-23 2024-09-20 深圳前海知行科技有限公司 Subtitle generation method, device, equipment and storage medium based on artificial intelligence
CN111949805A (en) * 2020-09-23 2020-11-17 深圳前海知行科技有限公司 Subtitle generating method, device and equipment based on artificial intelligence and storage medium
CN112163102A (en) * 2020-09-29 2021-01-01 北京字跳网络技术有限公司 Search content matching method and device, electronic equipment and storage medium
CN112163102B (en) * 2020-09-29 2023-03-17 北京字跳网络技术有限公司 Search content matching method and device, electronic equipment and storage medium
CN112511910A (en) * 2020-11-23 2021-03-16 浪潮天元通信信息系统有限公司 Real-time subtitle processing method and device
CN112929758A (en) * 2020-12-31 2021-06-08 广州朗国电子科技有限公司 Multimedia content subtitle generating method, equipment and storage medium
CN112995749B (en) * 2021-02-07 2023-05-26 北京字节跳动网络技术有限公司 Video subtitle processing method, device, equipment and storage medium
CN112995749A (en) * 2021-02-07 2021-06-18 北京字节跳动网络技术有限公司 Method, device and equipment for processing video subtitles and storage medium
CN112684967A (en) * 2021-03-11 2021-04-20 荣耀终端有限公司 Method for displaying subtitles and electronic equipment
CN113345439A (en) * 2021-05-28 2021-09-03 北京达佳互联信息技术有限公司 Subtitle generating method, device, electronic equipment and storage medium
CN113345439B (en) * 2021-05-28 2024-04-30 北京达佳互联信息技术有限公司 Subtitle generation method, subtitle generation device, electronic equipment and storage medium
CN113490057A (en) * 2021-06-30 2021-10-08 海信电子科技(武汉)有限公司 Display device and media asset recommendation method
CN113490057B (en) * 2021-06-30 2023-03-24 海信电子科技(武汉)有限公司 Display device and media asset recommendation method
CN113490058A (en) * 2021-08-20 2021-10-08 云知声(上海)智能科技有限公司 Intelligent subtitle matching system applied to later stage of movie and television
CN115034233A (en) * 2022-06-16 2022-09-09 安徽听见科技有限公司 Translation method, translation device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110035326A (en) Subtitle generation, the video retrieval method based on subtitle, device and electronic equipment
CN108401192B (en) Video stream processing method and device, computer equipment and storage medium
CN109246472A (en) Video broadcasting method, device, terminal device and storage medium
US11252444B2 (en) Video stream processing method, computer device, and storage medium
US11917344B2 (en) Interactive information processing method, device and medium
US10034028B2 (en) Caption and/or metadata synchronization for replay of previously or simultaneously recorded live programs
US8966360B2 (en) Transcript editor
CN111955013B (en) Method and system for facilitating interactions during real-time streaming events
CN110781328A (en) Video generation method, system, device and storage medium based on voice recognition
BR112016006860B1 (en) APPARATUS AND METHOD FOR CREATING A SINGLE DATA FLOW OF COMBINED INFORMATION FOR RENDERING ON A CUSTOMER COMPUTING DEVICE
CN103491429A (en) Audio processing method and audio processing equipment
CN103414948A (en) Method and device for playing video
CN110691271A (en) News video generation method, system, device and storage medium
JP2022160519A (en) Media environment-driven content distribution platform
CN114040255A (en) Live caption generating method, system, equipment and storage medium
US8913869B2 (en) Video playback apparatus and video playback method
JP2021090172A (en) Caption data generation device, content distribution system, video reproduction device, program, and caption data generation method
US20230300429A1 (en) Multimedia content sharing method and apparatus, device, and medium
KR20130023461A (en) Caption management method and caption search method
US8896708B2 (en) Systems and methods for determining, storing, and using metadata for video media content
CN114341866A (en) Simultaneous interpretation method, device, server and storage medium
CN113709521B (en) System for automatically matching background according to video content
KR101749420B1 (en) Apparatus and method for extracting representation image of video contents using closed caption
CN113891108A (en) Subtitle optimization method and device, electronic equipment and storage medium
CN114501160A (en) Method for generating subtitles and intelligent subtitle system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190719)