CN112911373A - Method, device and equipment for generating video subtitles and storage medium - Google Patents

Method, device and equipment for generating video subtitles and storage medium

Info

Publication number
CN112911373A
CN112911373A
Authority
CN
China
Prior art keywords
subtitle
video
generating
style
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110132044.XA
Other languages
Chinese (zh)
Other versions
CN112911373B (en)
Inventor
张晋
刘青松
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202110132044.XA priority Critical patent/CN112911373B/en
Publication of CN112911373A publication Critical patent/CN112911373A/en
Application granted granted Critical
Publication of CN112911373B publication Critical patent/CN112911373B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431: Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312: Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N21/435: Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H04N21/47: End-user applications
    • H04N21/488: Data services, e.g. news ticker
    • H04N21/4884: Data services, e.g. news ticker, for displaying subtitles
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Studio Circuits (AREA)

Abstract

The invention relates to a method, a device, equipment and a storage medium for generating video subtitles. The method comprises the following steps: in response to a monitored subtitle regeneration instruction, capturing a subtitle picture according to the subtitle position in a video; extracting the subtitle background from the subtitle picture; inputting the subtitle content of the video into a pre-trained multi-style subtitle generation model to obtain subtitles in a target style; and superimposing the target-style subtitles on the subtitle background and splicing the result into the video for display. Subtitles can thus be displayed in real time and dynamically in the style a user requires, so that the video suits different users and its adaptability is improved.

Description

Method, device and equipment for generating video subtitles and storage medium
Technical Field
The invention relates to the technical field of video playback, and in particular to a method, a device, equipment and a storage medium for generating video subtitles.
Background
As an important medium for conveying information, video plays a significant role in everyday life. Most videos are configured with subtitles, which are displayed in the picture while the video plays.
In the prior art, subtitles are usually displayed in a video in a fixed form. Some users may stop watching a video because its fixed subtitles hold no interest for them, or may rate the video poorly, which in turn lowers its play rate. How to personalize video subtitles and thereby improve the adaptability of videos is therefore a technical problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
The invention provides a method, a device, equipment and a storage medium for generating video subtitles, which solve the technical problem of low video adaptability caused by the inability to personalize video subtitles.
The technical solution adopted by the invention is as follows:
a method for generating video subtitles comprises the following steps:
in response to a monitored subtitle regeneration instruction, intercepting a subtitle picture according to a subtitle position in a video;
extracting a subtitle background from the subtitle picture;
inputting subtitle content in a video into a pre-trained multi-style subtitle generation model for processing to obtain subtitles in a target style;
and overlapping the caption with the target style and the caption background, and splicing the overlapped caption and the caption background into the video for displaying.
Further, in the above method, inputting the subtitle content of the video into the pre-trained multi-style subtitle generation model to obtain subtitles in a target style comprises:
encoding the subtitle content with the encoder of the multi-style subtitle generation model to obtain a subtitle vector, and recombining the subtitle vector with a preset topic word feature vector to obtain a recombined vector;
and inputting the recombined vector into the generative adversarial network corresponding to the multi-style subtitle generation model to obtain the target-style subtitles.
Further, in the above method, the topic word feature vector is set in one of the following ways:
extracting the topic word feature vector from preset topic words and setting it;
or extracting the topic word feature vector from user-defined topic words and setting it, where the user-defined topic words are obtained either by re-editing the preset topic words or by the user creating them in a self-creation mode.
Further, in the above method, the subtitle position in the video is obtained as follows:
if the video is a plug-in (external) subtitle video, extracting the subtitle file from it and parsing the file to obtain the subtitle position;
and if the video is an embedded subtitle video, taking the preset position of the embedded subtitles as the subtitle position, or obtaining the subtitle position with a pre-trained text detection model.
Further, in the above method, the subtitle content of the video is obtained as follows:
if the video is a plug-in subtitle video, extracting the subtitle file from it and parsing the file to obtain the subtitle content;
and if the video is an embedded subtitle video, obtaining the subtitle content with a pre-trained text detection model.
The present invention also provides a device for generating video subtitles, comprising:
a capture module, configured to respond to a monitored subtitle regeneration instruction and capture a subtitle picture according to the subtitle position in a video;
an extraction module, configured to extract the subtitle background from the subtitle picture;
a subtitle regeneration module, configured to input the subtitle content of the video into a pre-trained multi-style subtitle generation model to obtain subtitles in a target style;
and a splicing module, configured to superimpose the target-style subtitles on the subtitle background and splice the result into the video for display.
Further, in the above device, the subtitle regeneration module is specifically configured to:
encode the subtitle content with the encoder of the multi-style subtitle generation model to obtain a subtitle vector, and recombine the subtitle vector with a preset topic word feature vector to obtain a recombined vector;
and input the recombined vector into the generative adversarial network corresponding to the multi-style subtitle generation model to obtain the target-style subtitles.
Further, in the above device, the topic word feature vector is set in one of the following ways:
extracting the topic word feature vector from preset topic words and setting it;
or extracting the topic word feature vector from user-defined topic words and setting it, where the user-defined topic words are obtained either by re-editing the preset topic words or by the user creating them in a self-creation mode.
The present invention also provides equipment for generating video subtitles, comprising a processor and a memory;
the processor is configured to execute a video subtitle generation program stored in the memory, so as to implement any of the methods for generating video subtitles described above.
The present invention also provides a storage medium storing one or more programs which, when executed, implement any of the methods for generating video subtitles described above.
The invention has the following beneficial effects:
by responding to a monitored subtitle regeneration instruction, capturing a subtitle picture according to the subtitle position in the video, extracting the subtitle background from the subtitle picture, inputting the subtitle content of the video into a pre-trained multi-style subtitle generation model to obtain subtitles in a target style, and superimposing the target-style subtitles on the subtitle background and splicing the result into the video for display, subtitles are displayed in real time and dynamically in the style the user requires, so that the video suits different users and its adaptability is improved.
Drawings
Fig. 1 is a flowchart of a method for generating video subtitles according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a device for generating video subtitles according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of equipment for generating video subtitles according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a flowchart of a method for generating video subtitles according to an embodiment of the present invention. As shown in Fig. 1, the method of this embodiment may specifically include the following steps:
100. In response to a monitored subtitle regeneration instruction, capture a subtitle picture according to the subtitle position in the video.
In a specific implementation, if the subtitles of a video a user is watching do not meet the user's requirements, the user can issue a subtitle regeneration instruction; upon receiving this instruction, the player responds by capturing a subtitle picture according to the subtitle position in the video.
In practice, a video may be a plug-in (external) subtitle video or an embedded subtitle video, so in this embodiment the subtitle position is obtained as follows:
if the video is a plug-in subtitle video, the subtitle file is extracted from it and parsed to obtain the subtitle position;
if the video is an embedded subtitle video, the subtitles are burned into the picture and no position can be extracted from a separate file; however, embedded subtitles usually occupy a fixed position, so the preset position of the embedded subtitles can be used as the subtitle position. If the embedded subtitles are not at a fixed position, a pre-trained text detection model can be used to detect text in the video and thereby obtain the subtitle position.
Once the subtitle position is known, the subtitle picture can be cropped from the frame at that position using an image-capture technique.
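For illustration, the following Python sketch crops the subtitle picture from a decoded frame. It is a minimal sketch rather than the patented implementation: the use of OpenCV, the input file name, and the fixed bottom-strip region are all assumptions made for the example.

```python
import cv2  # OpenCV; assumed here as the decoding/cropping library

def crop_subtitle_picture(frame, region):
    """Crop the subtitle picture from a video frame.

    frame:  a BGR image (numpy array) decoded from the video
    region: (x, y, w, h) subtitle position, e.g. parsed from a
            subtitle file or predicted by a text detection model
    """
    x, y, w, h = region
    return frame[y:y + h, x:x + w].copy()

# Example: read one frame and crop a hypothetical bottom-strip region.
cap = cv2.VideoCapture("input.mp4")  # hypothetical file name
ok, frame = cap.read()
if ok:
    height, width = frame.shape[:2]
    # Assumed fixed position: the bottom 15% of the frame.
    region = (0, int(height * 0.85), width, int(height * 0.15))
    subtitle_picture = crop_subtitle_picture(frame, region)
    cv2.imwrite("subtitle_picture.png", subtitle_picture)
cap.release()
```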
101. Extract the subtitle background from the captured subtitle picture.
In this embodiment, the subtitle text and the background within the picture can be separated, yielding a background image from which the subtitle text has been removed.
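The embodiment does not prescribe a particular separation algorithm. One plausible way to obtain the background with the text removed is mask-based inpainting, sketched below; the bright-text threshold used to build the mask is a simplifying assumption, and a real system would more likely derive the mask from a text segmentation model.

```python
import cv2
import numpy as np

def extract_subtitle_background(subtitle_picture):
    """Separate subtitle text from its background by inpainting.

    Assumes light subtitle strokes on a darker background, which is
    a simplification made for this sketch.
    """
    gray = cv2.cvtColor(subtitle_picture, cv2.COLOR_BGR2GRAY)
    # Rough text mask: bright pixels are treated as subtitle strokes.
    _, mask = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)
    # Dilate so the mask also covers anti-aliased stroke edges.
    mask = cv2.dilate(mask, np.ones((3, 3), np.uint8), iterations=2)
    # Fill the masked strokes from surrounding background pixels.
    return cv2.inpaint(subtitle_picture, mask, 3, cv2.INPAINT_TELEA)
```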
102. Input the subtitle content of the video into a pre-trained multi-style subtitle generation model to obtain subtitles in a target style.
In a specific implementation, the subtitle content of the video is obtained as follows:
if the video is a plug-in subtitle video, the subtitle file is extracted from it and parsed to obtain the subtitle content;
and if the video is an embedded subtitle video, the subtitle content is obtained with a pre-trained text detection model; for example, it may be recognised with Optical Character Recognition (OCR) techniques. A sketch of both branches follows.
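A hedged sketch of the two acquisition branches: parsing an external SRT file for the plug-in case, and Tesseract OCR (via pytesseract) for the embedded case. The SRT format and the OCR backend are assumptions; the text only specifies that a subtitle file is parsed and that OCR techniques may be used.

```python
import pytesseract  # assumed OCR backend; requires a Tesseract install
from PIL import Image

def parse_srt_content(srt_path):
    """Plug-in branch: parse subtitle text out of an external .srt file."""
    with open(srt_path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")
    lines = []
    for block in blocks:
        parts = block.splitlines()
        # parts[0] is the cue index, parts[1] the timing line.
        if len(parts) >= 3 and "-->" in parts[1]:
            lines.append(" ".join(parts[2:]))
    return lines

def ocr_subtitle_content(subtitle_picture_path):
    """Embedded branch: recognise text in the cropped subtitle picture."""
    image = Image.open(subtitle_picture_path)
    # 'chi_sim+eng' assumes mixed Chinese/English subtitles.
    return pytesseract.image_to_string(image, lang="chi_sim+eng").strip()
```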
In a specific implementation, the multi-style subtitle generation model of this embodiment can be trained in advance on the basis of a generative adversarial network. After the subtitle content is obtained, the encoder of the model encodes it into a subtitle vector; the subtitle vector is recombined with a preset topic word feature vector into a recombined vector; and the recombined vector is fed into the generative adversarial network corresponding to the model to produce the target-style subtitles. Subtitles in the video are thus presented to the user in a more personalized way, giving viewers a more distinctive experience. For example, cartoon-style subtitles can be generated for a children's video, enhancing the effect of the video.
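The description fixes only the data flow: encoder output recombined with a topic word feature vector, then fed to the generative adversarial network. The PyTorch sketch below follows that flow under stated assumptions: recombination is modelled as concatenation, all layer sizes and the 64x256 output image are invented for illustration, and the discriminator used during training is omitted.

```python
import torch
import torch.nn as nn

class MultiStyleSubtitleGenerator(nn.Module):
    """Sketch of the generation path described above: encode the subtitle
    content, recombine it with a topic word feature vector, and decode a
    target-style subtitle image with a GAN generator. All dimensions and
    layers are assumptions for illustration."""

    def __init__(self, vocab_size=8000, text_dim=256, topic_dim=64):
        super().__init__()
        # Encoder: turns subtitle content into a subtitle vector.
        self.embed = nn.Embedding(vocab_size, text_dim)
        self.encoder = nn.GRU(text_dim, text_dim, batch_first=True)
        # Generator of the adversarial network: maps the recombined
        # vector to a small grayscale subtitle image (64x256 here).
        self.generator = nn.Sequential(
            nn.Linear(text_dim + topic_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 64 * 256),
            nn.Tanh(),
        )

    def forward(self, token_ids, topic_vector):
        _, hidden = self.encoder(self.embed(token_ids))
        subtitle_vector = hidden[-1]                 # (batch, text_dim)
        # "Recombination" modelled as concatenation (an assumption).
        recombined = torch.cat([subtitle_vector, topic_vector], dim=1)
        return self.generator(recombined).view(-1, 1, 64, 256)

# Usage: one batch of token ids plus a cartoon-style topic vector.
model = MultiStyleSubtitleGenerator()
tokens = torch.randint(0, 8000, (1, 12))  # a 12-token subtitle line
topic = torch.randn(1, 64)                # preset topic word feature vector
styled = model(tokens, topic)             # (1, 1, 64, 256) styled image
```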
In some embodiments, the topic word feature vector may be extracted from preset topic words and then set.
In some embodiments, to further satisfy different users, the topic word feature vector can be extracted from user-defined topic words and then set. In particular, a preset topic word may fail to meet a user's requirements in only some respects, so the user need only make small adjustments to it.
In some embodiments, the user may also create custom topic words: the user triggers a self-creation instruction and composes the custom topic words in a self-creation mode. For example, a user who wants their own drawing used as the subtitle style uploads the drawing in self-creation mode as a user-defined topic word; the topic word feature vector is then extracted from it and set.
103. Superimpose the target-style subtitles on the subtitle background, and splice the result into the video for display.
After the target-style subtitles are obtained, they can be superimposed on the subtitle background to form an image containing the personalized subtitles, and this image is spliced into the video for display. The subtitles are thus displayed in real time and dynamically in the style the user requires, which reduces the chance that users stop watching because the subtitles do not interest them, or rate the video poorly and so lower its play rate.
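One possible form of the superposition and splicing in step 103, again as an OpenCV sketch under assumptions: the generator output is treated as a grayscale stroke-intensity map and alpha-blended onto the extracted background, and the composite is written back into the frame at the original subtitle position.

```python
import cv2
import numpy as np

def splice_styled_subtitle(frame, styled_gray, background, region):
    """Superimpose a target-style subtitle on the subtitle background
    and splice the composite back into the video frame.

    styled_gray: grayscale styled-subtitle image from the generator
    background:  inpainted subtitle background of the same region
    region:      (x, y, w, h) original subtitle position
    """
    x, y, w, h = region
    styled = cv2.resize(styled_gray, (w, h))
    alpha = (styled.astype(np.float32) / 255.0)[..., None]  # per-pixel mask
    white_text = np.full_like(background, 255)
    # Blend: strokes come from the styled image, the rest from background.
    composite = alpha * white_text + (1.0 - alpha) * background
    frame[y:y + h, x:x + w] = composite.astype(np.uint8)
    return frame
```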
It should be noted that in this embodiment the target-style subtitles are preferably generated at the video playback end. This reduces network transmission overhead, avoids to a certain extent the influence of network bandwidth and latency between the playback end and a remote end, and improves the stability of target-style subtitle generation.
In summary, the method of this embodiment responds to a monitored subtitle regeneration instruction by capturing a subtitle picture according to the subtitle position in the video, extracts the subtitle background from the picture, inputs the subtitle content into a pre-trained multi-style subtitle generation model to obtain target-style subtitles, and superimposes and splices the target-style subtitles with the subtitle background into the video for display. Subtitles are displayed in real time and dynamically in the style the user requires, so the video suits different users and its adaptability is improved.
In a specific implementation, while a user watches a video, the user's identity can be recorded through a camera, a fingerprint recognition component or the like, and the target-style subtitles the user applies can be associated with that identity. Subtitle libraries are thereby built up for different users, so that the next time a user watches a video, their frequently used subtitles can be retrieved directly from their own library.
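A per-user subtitle library could be as simple as a persisted mapping from a user identity to the styles that user has applied. The sketch below is hypothetical and not part of the disclosure; the identity string and file name are invented.

```python
import json
from collections import defaultdict

class SubtitleStyleLibrary:
    """Minimal per-user library: remembers which target styles each
    identified user has applied, so they can be recalled next time."""

    def __init__(self, path="style_library.json"):
        self.path = path
        self.store = defaultdict(list)

    def record(self, user_id, style_name, topic_vector):
        self.store[user_id].append(
            {"style": style_name, "topic_vector": list(topic_vector)})

    def styles_for(self, user_id):
        return self.store.get(user_id, [])

    def save(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.store, f)

library = SubtitleStyleLibrary()
library.record("user-001", "cartoon", [0.1] * 64)  # hypothetical identity
print(library.styles_for("user-001"))
```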
Fig. 2 is a schematic structural diagram of a device for generating video subtitles according to an embodiment of the present invention. As shown in Fig. 2, the device of this embodiment may specifically include a capture module 20, an extraction module 21, a subtitle regeneration module 22 and a splicing module 23.
The capture module 20 is configured to respond to a monitored subtitle regeneration instruction and capture a subtitle picture according to the subtitle position in the video.
In this embodiment, the subtitle position in the video is obtained as follows:
if the video is a plug-in subtitle video, the subtitle file is extracted from it and parsed to obtain the subtitle position;
and if the video is an embedded subtitle video, the preset position of the embedded subtitles is taken as the subtitle position, or the subtitle position is obtained with a pre-trained text detection model.
The extraction module 21 is configured to extract the subtitle background from the subtitle picture.
The subtitle regeneration module 22 is configured to input the subtitle content of the video into a pre-trained multi-style subtitle generation model to obtain subtitles in a target style.
In this embodiment, the subtitle content of the video is obtained as follows:
if the video is a plug-in subtitle video, the subtitle file is extracted from it and parsed to obtain the subtitle content;
and if the video is an embedded subtitle video, the subtitle content is obtained with a pre-trained text detection model.
In a specific implementation, the encoder of the multi-style subtitle generation model encodes the subtitle content into a subtitle vector, which is recombined with a preset topic word feature vector into a recombined vector; the recombined vector is then input into the generative adversarial network corresponding to the model to obtain the target-style subtitles.
In this embodiment, the topic word feature vector is set in one of the following ways:
extracting the topic word feature vector from preset topic words and setting it;
or extracting the topic word feature vector from user-defined topic words and setting it, where the user-defined topic words are obtained either by re-editing the preset topic words or by the user creating them in a self-creation mode.
The splicing module 23 is configured to superimpose the target-style subtitles on the subtitle background and splice the result into the video for display.
The device of this embodiment responds to a monitored subtitle regeneration instruction by capturing a subtitle picture according to the subtitle position in the video, extracts the subtitle background from the picture, inputs the subtitle content into a pre-trained multi-style subtitle generation model to obtain target-style subtitles, and superimposes and splices the target-style subtitles with the subtitle background into the video for display, so that subtitles are displayed in real time and dynamically in the style the user requires, the video suits different users, and its adaptability is improved.
Fig. 3 is a schematic structural diagram of equipment for generating video subtitles according to an embodiment of the present invention. As shown in Fig. 3, the equipment of this embodiment may include a processor 1010 and a memory 1020. Those skilled in the art will appreciate that it may also include an input/output interface 1030, a communication interface 1040 and a bus 1050, through which the processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040 communicate with one another within the equipment.
The processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute the relevant programs so as to implement the technical solutions provided in the embodiments of this specification.
The memory 1020 may be implemented as ROM (Read-Only Memory), RAM (Random-Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs; when the technical solutions provided in the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 1020 and called and executed by the processor 1010.
The input/output interface 1030 connects an input/output module for inputting and outputting information. The input/output module may be built into the equipment (not shown in the figure) or attached externally to provide the corresponding functions. Input devices may include a keyboard, mouse, touch screen, microphone and various sensors; output devices may include a display, loudspeaker, vibrator and indicator lights.
The communication interface 1040 connects a communication module (not shown in the figure) to enable the equipment to interact with other devices. The communication module may communicate in a wired manner (e.g. USB or network cable) or wirelessly (e.g. mobile network, WiFi or Bluetooth).
The bus 1050 provides a path for transferring information between the components of the equipment, such as the processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040.
It should be noted that although only the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050 are shown, in a specific implementation the equipment may also include other components necessary for normal operation. Moreover, those skilled in the art will appreciate that the equipment may include only the components necessary to implement the embodiments of this specification, rather than all of the components shown in the figure.
The present invention also provides a storage medium storing one or more programs which, when executed, implement the method for generating video subtitles of the above embodiments.
Computer-readable media of the present embodiments include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples. Within the idea of the invention, features of the above embodiments or of different embodiments may be combined, steps may be implemented in any order, and many other variations of the different aspects of the invention exist that are not described in detail for the sake of brevity.
In addition, well-known power and ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided figures, for simplicity of illustration and discussion and so as not to obscure the invention. Furthermore, devices may be shown in block-diagram form to avoid obscuring the invention, which also reflects the fact that the details of implementing such block-diagram devices depend heavily on the platform on which the invention is to be implemented (i.e. such details should be well within the purview of one skilled in the art). Where specific details (e.g. circuits) are set forth to describe example embodiments, it should be apparent to one skilled in the art that the invention can be practised without, or with variations of, these specific details. Accordingly, the description is to be regarded as illustrative rather than restrictive.
While the present invention has been described in conjunction with specific embodiments, many alternatives, modifications and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g. dynamic RAM (DRAM)) may use the embodiments discussed.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for generating video subtitles, comprising:
in response to a monitored subtitle regeneration instruction, capturing a subtitle picture according to the subtitle position in a video;
extracting the subtitle background from the subtitle picture;
inputting the subtitle content of the video into a pre-trained multi-style subtitle generation model to obtain subtitles in a target style;
and superimposing the target-style subtitles on the subtitle background, and splicing the result into the video for display.
2. The method for generating video subtitles according to claim 1, wherein inputting the subtitle content of the video into the pre-trained multi-style subtitle generation model to obtain subtitles in a target style comprises:
encoding the subtitle content with the encoder of the multi-style subtitle generation model to obtain a subtitle vector, and recombining the subtitle vector with a preset topic word feature vector to obtain a recombined vector;
and inputting the recombined vector into the generative adversarial network corresponding to the multi-style subtitle generation model to obtain the target-style subtitles.
3. The method according to claim 2, wherein the topic word feature vector is set in one of the following ways:
extracting the topic word feature vector from preset topic words and setting it;
or extracting the topic word feature vector from user-defined topic words and setting it, wherein the user-defined topic words are obtained either by re-editing the preset topic words or by the user creating them in a self-creation mode.
4. The method for generating video subtitles according to claim 1, wherein the subtitle position in the video is obtained as follows:
if the video is a plug-in subtitle video, extracting the subtitle file from the plug-in subtitle video and parsing it to obtain the subtitle position;
and if the video is an embedded subtitle video, taking the preset position of the embedded subtitles as the subtitle position, or obtaining the subtitle position with a pre-trained text detection model.
5. The method for generating video subtitles according to claim 1, wherein the subtitle content of the video is obtained as follows:
if the video is a plug-in subtitle video, extracting the subtitle file from the plug-in subtitle video and parsing it to obtain the subtitle content;
and if the video is an embedded subtitle video, obtaining the subtitle content with a pre-trained text detection model.
6. A device for generating video subtitles, comprising:
a capture module, configured to respond to a monitored subtitle regeneration instruction and capture a subtitle picture according to the subtitle position in a video;
an extraction module, configured to extract the subtitle background from the subtitle picture;
a subtitle regeneration module, configured to input the subtitle content of the video into a pre-trained multi-style subtitle generation model to obtain subtitles in a target style;
and a splicing module, configured to superimpose the target-style subtitles on the subtitle background and splice the result into the video for display.
7. The device for generating video subtitles according to claim 6, wherein the subtitle regeneration module is specifically configured to:
encode the subtitle content with the encoder of the multi-style subtitle generation model to obtain a subtitle vector, and recombine the subtitle vector with a preset topic word feature vector to obtain a recombined vector;
and input the recombined vector into the generative adversarial network corresponding to the multi-style subtitle generation model to obtain the target-style subtitles.
8. The device for generating video subtitles according to claim 7, wherein the topic word feature vector is set in one of the following ways:
extracting the topic word feature vector from preset topic words and setting it;
or extracting the topic word feature vector from user-defined topic words and setting it, wherein the user-defined topic words are obtained either by re-editing the preset topic words or by the user creating them in a self-creation mode.
9. Equipment for generating video subtitles, comprising a processor and a memory;
wherein the processor is configured to execute a video subtitle generation program stored in the memory to implement the method for generating video subtitles according to any one of claims 1-5.
10. A storage medium storing one or more programs which, when executed, implement the method for generating video subtitles according to any one of claims 1-5.
CN202110132044.XA 2021-01-31 2021-01-31 Video subtitle generating method, device, equipment and storage medium Active CN112911373B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110132044.XA CN112911373B (en) 2021-01-31 2021-01-31 Video subtitle generating method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110132044.XA CN112911373B (en) 2021-01-31 2021-01-31 Video subtitle generating method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112911373A true CN112911373A (en) 2021-06-04
CN112911373B CN112911373B (en) 2023-05-26

Family

ID=76121994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110132044.XA Active CN112911373B (en) 2021-01-31 2021-01-31 Video subtitle generating method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112911373B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104715497A (en) * 2014-12-30 2015-06-17 上海孩子国科教设备有限公司 Data replacement method and system
CN105871681A (en) * 2015-12-14 2016-08-17 乐视网信息技术(北京)股份有限公司 Subtitle adding method and device
US20190340469A1 (en) * 2017-03-20 2019-11-07 Intel Corporation Topic-guided model for image captioning system
CN110866377A (en) * 2018-08-08 2020-03-06 北京优酷科技有限公司 Text content conversion method and device
CN110458918A (en) * 2019-08-16 2019-11-15 北京百度网讯科技有限公司 Method and apparatus for output information
CN111402367A (en) * 2020-03-27 2020-07-10 维沃移动通信有限公司 Image processing method and electronic equipment
CN111639474A (en) * 2020-05-26 2020-09-08 维沃移动通信有限公司 Document style reconstruction method and device and electronic equipment
CN112084841A (en) * 2020-07-27 2020-12-15 齐鲁工业大学 Cross-modal image multi-style subtitle generation method and system
CN112055245A (en) * 2020-09-11 2020-12-08 海信视像科技股份有限公司 Color subtitle realization method and display device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952255A (en) * 2022-11-21 2023-04-11 北京邮电大学 Multi-modal signal content analysis method and device, electronic equipment and storage medium
CN115952255B (en) * 2022-11-21 2023-12-05 北京邮电大学 Multi-mode signal content analysis method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112911373B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN109803180B (en) Video preview generation method and device, computer equipment and storage medium
CN110557678B (en) Video processing method, device and equipment
CN107801096B (en) Video playing control method and device, terminal equipment and storage medium
CN110177295B (en) Subtitle out-of-range processing method and device and electronic equipment
KR101916874B1 (en) Apparatus, method for auto generating a title of video contents, and computer readable recording medium
CN111343496A (en) Video processing method and device
CN108521612B (en) Video abstract generation method, device, server and storage medium
CN110876079B (en) Video processing method, device and equipment
US20230291978A1 (en) Subtitle processing method and apparatus of multimedia file, electronic device, and computer-readable storage medium
CN106303303A (en) Method and device for translating subtitles of media file and electronic equipment
US10897658B1 (en) Techniques for annotating media content
US20170147170A1 (en) Method for generating a user interface presenting a plurality of videos
CN106408623A (en) Character presentation method, device and terminal
CN114598893A (en) Text video implementation method and system, electronic equipment and storage medium
CN112422844A (en) Method, device and equipment for adding special effect in video and readable storage medium
CN112911373A (en) Method, device and equipment for generating video subtitles and storage medium
US20170193668A1 (en) Intelligent Equipment-Based Motion Sensing Control Method, Electronic Device and Intelligent Equipment
CN107197339B (en) Display control method and device of film bullet screen and head-mounted display equipment
US20200057890A1 (en) Method and device for determining inter-cut time range in media item
CN112511897A (en) Video cover setting method, device, equipment and storage medium
CN111918074A (en) Live video fault early warning method and related equipment
US20160142456A1 (en) Method and Device for Acquiring Media File
CN112908337B (en) Method, device, equipment and storage medium for displaying voice recognition text
CN114268847A (en) Video playing method and device, electronic equipment and storage medium
KR101403159B1 (en) Apparatus and method for providing additional information about object

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant