CN112911373B - Video subtitle generating method, device, equipment and storage medium

Video subtitle generating method, device, equipment and storage medium

Info

Publication number
CN112911373B
CN112911373B (application CN202110132044.XA)
Authority
CN
China
Prior art keywords
subtitle
video
caption
user
style
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110132044.XA
Other languages
Chinese (zh)
Other versions
CN112911373A (en)
Inventor
张晋
刘青松
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202110132044.XA priority Critical patent/CN112911373B/en
Publication of CN112911373A publication Critical patent/CN112911373A/en
Application granted granted Critical
Publication of CN112911373B publication Critical patent/CN112911373B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Studio Circuits (AREA)

Abstract

The invention relates to a video subtitle generation method, apparatus, device, and storage medium, wherein the method comprises: in response to a detected subtitle regeneration instruction, capturing a subtitle picture according to the subtitle position in the video; extracting the subtitle background from the subtitle picture; inputting the subtitle content of the video into a pre-trained multi-style subtitle generation model to obtain a subtitle in a target style; and overlaying the target-style subtitle on the subtitle background and splicing the result into the video for display. Subtitles can thus be displayed dynamically, in real time, in the style a user wants, so that the video suits different users and its adaptability is improved.

Description

Video subtitle generating method, device, equipment and storage medium
Technical Field
The present invention relates to the field of video playing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating video subtitles.
Background
Video is an important medium for conveying information and plays a significant role in daily life. Most videos carry subtitles, which are displayed on the video while it plays.
In the prior art, subtitles are usually displayed in a fixed form. Some users, finding a video's subtitles unappealing, may stop watching it or rate it poorly, which lowers the video's play rate. How to personalize video subtitles and thereby improve video adaptability is therefore an urgent technical problem for those skilled in the art.
Disclosure of Invention
The invention provides a video subtitle generation method, apparatus, device, and storage medium, which address the technical problem that video subtitles cannot be personalized, leaving video adaptability low.
The technical solution is as follows:
a method of generating video subtitles, comprising:
in response to a detected subtitle regeneration instruction, capturing a subtitle picture according to the subtitle position in the video;
extracting the subtitle background from the subtitle picture;
inputting the subtitle content of the video into a pre-trained multi-style subtitle generation model to obtain a subtitle in a target style;
and overlaying the target-style subtitle on the subtitle background and splicing the result into the video for display.
Further, in the above method, inputting the subtitle content of the video into the pre-trained multi-style subtitle generation model to obtain a target-style subtitle comprises:
encoding the subtitle content with the encoder of the multi-style subtitle generation model to obtain a subtitle vector, and recombining the subtitle vector with a preset topic-word feature vector to obtain a recombined vector;
inputting the recombined vector into the generative adversarial network corresponding to the multi-style subtitle generation model to obtain the target-style subtitle.
Further, in the above method, the topic-word feature vector is set in one of the following ways:
extracting the topic-word feature vector from preset topic words and setting it; or
extracting the topic-word feature vector from user-defined topic words and setting it, where the user-defined topic words are obtained by re-editing the preset topic words or are created by the user in a self-creation mode.
Further, in the above method, the subtitle position in the video is obtained as follows:
if the video is an external-subtitle video, extracting the subtitle file from it and parsing the file to obtain the subtitle position;
if the video is an embedded-subtitle video, taking a preset position of the video as the subtitle position, or obtaining the subtitle position with a pre-trained text detection model.
Further, in the above method, the subtitle content of the video is obtained as follows:
if the video is an external-subtitle video, extracting the subtitle file from it and parsing the file to obtain the subtitle content;
if the video is an embedded-subtitle video, obtaining the subtitle content with a pre-trained text detection model.
The invention also provides a video subtitle generation apparatus, comprising:
an interception module, configured to capture a subtitle picture according to the subtitle position in the video in response to a detected subtitle regeneration instruction;
an extraction module, configured to extract the subtitle background from the subtitle picture;
a subtitle regeneration module, configured to input the subtitle content of the video into a pre-trained multi-style subtitle generation model to obtain a subtitle in a target style;
and a splicing module, configured to overlay the target-style subtitle on the subtitle background and splice the result into the video for display.
Further, in the above apparatus, the subtitle regeneration module is specifically configured to:
encode the subtitle content with the encoder of the multi-style subtitle generation model to obtain a subtitle vector, and recombine the subtitle vector with a preset topic-word feature vector to obtain a recombined vector;
input the recombined vector into the generative adversarial network corresponding to the multi-style subtitle generation model to obtain the target-style subtitle.
Further, in the above apparatus, the topic-word feature vector is set in one of the following ways:
extracting the topic-word feature vector from preset topic words and setting it; or
extracting the topic-word feature vector from user-defined topic words and setting it, where the user-defined topic words are obtained by re-editing the preset topic words or are created by the user in a self-creation mode.
The invention also provides a video subtitle generating device, which comprises: a processor and a memory;
the processor is configured to execute a video subtitle generation program stored in the memory, so as to implement any of the video subtitle generation methods described above.
The present invention also provides a storage medium storing one or more programs which, when executed, implement any of the video subtitle generation methods described above.
The beneficial effects of the invention are as follows:
in response to a detected subtitle regeneration instruction, a subtitle picture is captured according to the subtitle position in the video; the subtitle background is extracted from the subtitle picture; the subtitle content of the video is input into a pre-trained multi-style subtitle generation model to obtain a subtitle in a target style; and the target-style subtitle and the subtitle background are overlaid and spliced into the video for display. Subtitles are thus displayed dynamically, in real time, in the style a user wants, so the video suits different users and its adaptability is improved.
Drawings
Fig. 1 is a flowchart of a method for generating video subtitles according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a video subtitle generating apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a video subtitle generating apparatus according to an embodiment of the present invention.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings. The examples are provided to illustrate the invention and are not to be construed as limiting its scope.
Fig. 1 is a flowchart of a method for generating video subtitles according to an embodiment of the present invention. As shown in Fig. 1, the method may comprise the following steps:
100. In response to a detected subtitle regeneration instruction, capture a subtitle picture according to the subtitle position in the video.
In a specific implementation, when a user watching a video feels that its subtitles do not meet his or her needs, the user can issue a subtitle regeneration instruction; upon receiving the instruction, the system responds by capturing a subtitle picture at the subtitle position in the video.
In practical applications, the video may be an external-subtitle video or an embedded-subtitle video. In this embodiment, the subtitle position in the video is therefore obtained as follows:
if the video is an external-subtitle video, extract the subtitle file from it and parse the file to obtain the subtitle position;
if the video is an embedded-subtitle video, the subtitles are fused into the frames and no subtitle file can be extracted; however, the subtitle position in such a video is usually fixed, so a preset position of the video can be used as the subtitle position. If the position is not fixed, a pre-trained text detection model can detect the text in the video and so obtain the subtitle position.
Once the subtitle position is known, the subtitle picture can be captured from the frame at that position using an image-cropping technique.
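The following is a minimal sketch of this step, assuming OpenCV for frame access and cropping; since the patent does not name a specific text detection model, pytesseract's word bounding boxes stand in for it here, and the video path and bottom-third heuristic are illustrative assumptions.

```python
import cv2
import pytesseract
from pytesseract import Output

def find_subtitle_box(frame):
    """Stand-in for the pre-trained text detection model:
    union of OCR word boxes found in the bottom third of the frame."""
    h, _ = frame.shape[:2]
    data = pytesseract.image_to_data(frame, output_type=Output.DICT)
    boxes = [(data["left"][i], data["top"][i],
              data["width"][i], data["height"][i])
             for i in range(len(data["text"]))
             if data["text"][i].strip() and data["top"][i] > 2 * h // 3]
    if not boxes:
        return None
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[0] + b[2] for b in boxes)
    y1 = max(b[1] + b[3] for b in boxes)
    return x0, y0, x1, y1

cap = cv2.VideoCapture("movie.mp4")          # hypothetical video path
ok, frame = cap.read()
box = find_subtitle_box(frame)
if box:
    x0, y0, x1, y1 = box
    subtitle_picture = frame[y0:y1, x0:x1]   # the captured subtitle picture
```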
101. Extract the subtitle background from the captured subtitle picture.
In this embodiment, the subtitle text and its background can be separated, yielding the background picture with the subtitle text removed.
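The patent does not specify how the separation is performed; one plausible approach, sketched below under the assumption of light subtitle text on a darker background, is to threshold a text mask and remove the text by OpenCV inpainting. The threshold value is illustrative only.

```python
import cv2
import numpy as np

def extract_background(subtitle_picture):
    """Remove the subtitle text from the cropped picture, keeping the background.
    Assumes near-white text; the threshold of 200 is an illustrative choice."""
    gray = cv2.cvtColor(subtitle_picture, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, np.ones((3, 3), np.uint8), iterations=2)
    # Fill the masked text pixels from the surrounding background
    return cv2.inpaint(subtitle_picture, mask, 3, cv2.INPAINT_TELEA)
```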
102. Input the subtitle content of the video into a pre-trained multi-style subtitle generation model to obtain a subtitle in a target style.
In a specific implementation, the subtitle content of the video is obtained as follows:
if the video is an external-subtitle video, extract the subtitle file from it and parse the file to obtain the subtitle content;
if the video is an embedded-subtitle video, obtain the subtitle content with a pre-trained text detection model; for example, the subtitle content may be read with optical character recognition (OCR) technology.
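A minimal sketch of both branches follows: a plain-text SRT parser for the external-subtitle case and pytesseract OCR for the embedded case. The file name is hypothetical, and since the patent's text detection model is unspecified, pytesseract (with the standard "chi_sim" Chinese model, an assumption) stands in for it.

```python
import re
import pytesseract

def parse_srt(path):
    """External-subtitle branch: parse an .srt file into (start, end, text) cues."""
    pattern = re.compile(
        r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\n(.*?)(?:\n\n|\Z)",
        re.S)
    with open(path, encoding="utf-8") as f:
        return [(m.group(1), m.group(2), m.group(3).strip())
                for m in pattern.finditer(f.read())]

def ocr_subtitle(frame_region):
    """Embedded-subtitle branch: read the text from the cropped subtitle region."""
    return pytesseract.image_to_string(frame_region, lang="chi_sim").strip()

cues = parse_srt("movie.srt")            # hypothetical subtitle file
# text = ocr_subtitle(subtitle_picture)  # region captured in the earlier step
```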
In a specific implementation, the multi-style subtitle generation model of this embodiment can be trained in advance on the basis of a generative adversarial network. After the subtitle content of the video is obtained, the model's encoder encodes it into a subtitle vector; the subtitle vector is recombined with a preset topic-word feature vector into a recombined vector; and the recombined vector is input into the generative adversarial network corresponding to the model to obtain the target-style subtitle. Subtitles in the video are thus presented to users in a more personalized way, giving viewers a more distinctive experience. For example, a children's video can be given cartoon-style subtitles, enhancing the effect of the video.
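The patent gives only the data flow (encode, recombine with the topic-word feature vector, generate adversarially), so the PyTorch sketch below is purely schematic: every dimension, layer, and module name is an assumption, not the patent's architecture.

```python
import torch
import torch.nn as nn

EMB, TOPIC, NOISE = 256, 64, 32   # illustrative dimensions

class SubtitleEncoder(nn.Module):
    """Encodes subtitle token ids into a single subtitle vector."""
    def __init__(self, vocab_size=8000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB)
        self.rnn = nn.GRU(EMB, EMB, batch_first=True)

    def forward(self, token_ids):
        _, h = self.rnn(self.embed(token_ids))
        return h.squeeze(0)                       # (batch, EMB)

class StyleGenerator(nn.Module):
    """Generator of the adversarial network: recombined vector -> styled subtitle image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMB + TOPIC + NOISE, 1024), nn.ReLU(),
            nn.Linear(1024, 64 * 256 * 3), nn.Tanh())  # flat 64x256 RGB strip

    def forward(self, recombined):
        return self.net(recombined).view(-1, 3, 64, 256)

encoder, generator = SubtitleEncoder(), StyleGenerator()
token_ids = torch.randint(0, 8000, (1, 12))   # stand-in tokenized subtitle content
topic_vec = torch.randn(1, TOPIC)             # preset topic-word feature vector
z = torch.randn(1, NOISE)
recombined = torch.cat([encoder(token_ids), topic_vec, z], dim=1)
styled_subtitle = generator(recombined)       # (1, 3, 64, 256) subtitle image tensor
```

In a full system this generator would be trained against a discriminator on styled subtitle samples; only the inference path relevant to the patent's flow is shown.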
In some embodiments, the topic-word feature vector may be extracted from preset topic words and set.
In some embodiments, to further meet the needs of different users, the topic-word feature vector may be extracted from user-defined topic words and set. Specifically, only part of the preset styles may fail to meet a user's needs, so the user only needs to adjust a small portion of the preset topic words; in this embodiment the preset topic words can therefore be re-edited to obtain the user-defined topic words.
In some embodiments, the user may also create the user-defined topic words from scratch: the user triggers a self-creation instruction and, in the self-creation mode, composes the topic words personally. For example, the user may upload his or her own artwork as a user-defined topic word to serve as a subtitle style; the topic-word feature vector is then extracted from it and set.
103. Overlay the target-style subtitle on the subtitle background and splice the result into the video for display.
After the target-style subtitle is obtained, it can be overlaid on the subtitle background to form an image containing the personalized subtitle, which is spliced into the video for display. Subtitles in the video are thus displayed dynamically, in real time, in the style the user wants, reducing the chance that a user stops watching the video, or rates it poorly, because the subtitles are unappealing, which would lower the video's play rate.
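A minimal compositing sketch for this step, assuming the styled subtitle has been rendered as an image, that `background` is the cleaned patch from step 101, and that `box` is the position from step 100; the non-black-pixel mask is an illustrative assumption.

```python
import cv2
import numpy as np

def splice_subtitle(frame, background, styled, box):
    """Overlay the target-style subtitle on the extracted background, then
    paste the composite back into the frame at the original subtitle position."""
    x0, y0, x1, y1 = box
    styled = cv2.resize(styled, (x1 - x0, y1 - y0))
    gray = cv2.cvtColor(styled, cv2.COLOR_BGR2GRAY)
    text_mask = gray > 10                      # illustrative: non-black pixels are text
    composite = background.copy()
    composite[text_mask] = styled[text_mask]   # text pixels over the clean background
    frame[y0:y1, x0:x1] = composite
    return frame
```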
In this embodiment, the target-style subtitle is preferably generated at the video playing end. This reduces network transmission overhead, avoids the bandwidth and latency effects that arise when the playing end interacts with a remote control end, and improves the stability of target-style subtitle generation.
With the video subtitle generation method of this embodiment, a subtitle picture is captured according to the subtitle position in the video in response to a detected subtitle regeneration instruction; the subtitle background is extracted from the subtitle picture; the subtitle content of the video is input into a pre-trained multi-style subtitle generation model to obtain a subtitle in a target style; and the target-style subtitle and the subtitle background are overlaid and spliced into the video for display. Subtitles are displayed dynamically, in real time, in the style the user wants, so the video suits different users and its adaptability is improved.
In a specific implementation, when a user watches a video, the user's identity information can be recorded via a camera, a fingerprint recognition component, or the like, and the target-style subtitle used by the user can be associated with that identity information. Subtitle libraries are thus built for different users, and the next time a user watches a video, his or her frequently used subtitles can be fetched directly from the corresponding library.
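A minimal sketch of such a per-user subtitle library, assuming a simple JSON key-value store keyed by identity; the file name and field names are illustrative assumptions.

```python
import json
from pathlib import Path

LIB_PATH = Path("subtitle_library.json")   # hypothetical per-user style store

def _load():
    return json.loads(LIB_PATH.read_text()) if LIB_PATH.exists() else {}

def save_preference(user_id, style_name):
    """Associate the identity of the viewing user with the target style used."""
    lib = _load()
    lib.setdefault(user_id, []).append(style_name)
    LIB_PATH.write_text(json.dumps(lib))

def frequent_style(user_id):
    """Return the style this user has used most often, if any."""
    styles = _load().get(user_id, [])
    return max(set(styles), key=styles.count) if styles else None
```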
Fig. 2 is a schematic structural diagram of a video subtitle generating apparatus according to an embodiment of the present invention. As shown in Fig. 2, the apparatus may comprise an interception module 20, an extraction module 21, a subtitle regeneration module 22, and a splicing module 23.
The interception module 20 is configured to capture a subtitle picture according to the subtitle position in the video in response to a detected subtitle regeneration instruction.
In this embodiment, the subtitle position in the video is obtained as follows:
if the video is an external-subtitle video, extract the subtitle file from it and parse the file to obtain the subtitle position;
if the video is an embedded-subtitle video, take a preset position of the video as the subtitle position, or obtain the subtitle position with a pre-trained text detection model.
The extraction module 21 is configured to extract the subtitle background from the subtitle picture.
The subtitle regeneration module 22 is configured to input the subtitle content of the video into a pre-trained multi-style subtitle generation model to obtain a subtitle in a target style.
In this embodiment, the subtitle content of the video is obtained as follows:
if the video is an external-subtitle video, extract the subtitle file from it and parse the file to obtain the subtitle content;
if the video is an embedded-subtitle video, obtain the subtitle content with a pre-trained text detection model.
In a specific implementation, the encoder of the multi-style subtitle generation model encodes the subtitle content into a subtitle vector, which is recombined with a preset topic-word feature vector into a recombined vector; the recombined vector is input into the generative adversarial network corresponding to the model to obtain the target-style subtitle.
In this embodiment, the topic-word feature vector is set in one of the following ways:
extracting the topic-word feature vector from preset topic words and setting it; or
extracting the topic-word feature vector from user-defined topic words and setting it, where the user-defined topic words are obtained by re-editing the preset topic words or are created by the user in a self-creation mode.
The splicing module 23 is configured to overlay the target-style subtitle on the subtitle background and splice the result into the video for display.
The video subtitle generating apparatus of this embodiment captures a subtitle picture according to the subtitle position in the video in response to a detected subtitle regeneration instruction; extracts the subtitle background from the subtitle picture; inputs the subtitle content of the video into a pre-trained multi-style subtitle generation model to obtain a subtitle in a target style; and overlays the target-style subtitle on the subtitle background and splices the result into the video for display. Subtitles are displayed dynamically, in real time, in the style the user wants, so the video suits different users and its adaptability is improved.
Fig. 3 is a schematic structural diagram of a video subtitle generating device according to an embodiment of the present invention. As shown in Fig. 3, the device of this embodiment may include a processor 1010 and a memory 1020 and, as those skilled in the art will appreciate, may also include an input/output interface 1030, a communication interface 1040, and a bus 1050, through which the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 communicate with one another within the device.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes the relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.
The memory 1020 may be implemented in the form of ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs; when the embodiments of the present specification are implemented in software or firmware, the associated program code is stored in the memory 1020 and executed by the processor 1010.
The input/output interface 1030 is used to connect with an input/output module for inputting and outputting information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
Communication interface 1040 is used to connect communication modules (not shown) to enable communication interactions of the present device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 1050 includes a path for transferring information between components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
It should be noted that although the above-described device only shows processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in an implementation, the device may include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The present invention also provides a storage medium storing one or more programs which, when executed, implement the video subtitle generating method of the above embodiments.
The computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may be used to implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Those of ordinary skill in the art will appreciate that the discussion of any embodiment above is merely exemplary and is not intended to suggest that the scope of the disclosure, including the claims, is limited to these examples. Within the idea of the invention, the technical features of the above embodiments, or of different embodiments, may be combined, and the steps may be implemented in any order; many other variations of the different aspects of the invention exist, which, for brevity, are not described in detail.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the invention. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the present invention, and these modifications and substitutions are intended to be included in the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (6)

1. A method for generating video subtitles, comprising:
in response to a detected subtitle regeneration instruction, capturing a subtitle picture according to the subtitle position in the video;
extracting the subtitle background from the subtitle picture;
encoding the subtitle content of the video with the encoder of a multi-style subtitle generation model to obtain a subtitle vector, and recombining the subtitle vector with a preset topic-word feature vector to obtain a recombined vector;
inputting the recombined vector into the generative adversarial network corresponding to the multi-style subtitle generation model to obtain a subtitle in a target style; wherein the topic-word feature vector is set as follows: extracting the topic-word feature vector from user-defined topic words and setting it, the user-defined topic words being obtained by re-editing preset topic words or created by the user in a self-creation mode;
overlaying the target-style subtitle on the subtitle background and splicing the result into the video for display;
and recording the user's identity information when the user watches the video, and associating the target-style subtitle used by the user with the user's identity information.
2. The method for generating video subtitles according to claim 1, wherein the subtitle position in the video is obtained as follows:
if the video is an external-subtitle video, extracting the subtitle file from it and parsing the file to obtain the subtitle position;
if the video is an embedded-subtitle video, taking a preset position of the video as the subtitle position, or obtaining the subtitle position with a pre-trained text detection model.
3. The method for generating video subtitles according to claim 1, wherein the subtitle content of the video is obtained as follows:
if the video is an external-subtitle video, extracting the subtitle file from it and parsing the file to obtain the subtitle content;
if the video is an embedded-subtitle video, obtaining the subtitle content with a pre-trained text detection model.
4. A video subtitle generating apparatus, comprising:
an interception module, configured to capture a subtitle picture according to the subtitle position in the video in response to a detected subtitle regeneration instruction;
an extraction module, configured to extract the subtitle background from the subtitle picture;
a subtitle regeneration module, configured to encode the subtitle content of the video with the encoder of a multi-style subtitle generation model to obtain a subtitle vector, recombine the subtitle vector with a preset topic-word feature vector to obtain a recombined vector, and input the recombined vector into the generative adversarial network corresponding to the multi-style subtitle generation model to obtain a subtitle in a target style; wherein the topic-word feature vector is set as follows: extracting the topic-word feature vector from user-defined topic words and setting it, the user-defined topic words being obtained by re-editing preset topic words or created by the user in a self-creation mode;
a splicing module, configured to overlay the target-style subtitle on the subtitle background and splice the result into the video for display;
and a recording module, configured to record the user's identity information when the user watches the video and associate the target-style subtitle used by the user with the user's identity information.
5. A video subtitle generating apparatus, comprising: a processor and a memory;
the processor is configured to execute a video subtitle generating program stored in the memory, so as to implement the video subtitle generating method of any one of claims 1 to 3.
6. A storage medium storing one or more programs which when executed by a processor implement the method of generating video subtitles of any of claims 1-3.
CN202110132044.XA 2021-01-31 2021-01-31 Video subtitle generating method, device, equipment and storage medium Active CN112911373B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110132044.XA CN112911373B (en) 2021-01-31 2021-01-31 Video subtitle generating method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110132044.XA CN112911373B (en) 2021-01-31 2021-01-31 Video subtitle generating method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112911373A CN112911373A (en) 2021-06-04
CN112911373B true CN112911373B (en) 2023-05-26

Family

ID=76121994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110132044.XA Active CN112911373B (en) 2021-01-31 2021-01-31 Video subtitle generating method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112911373B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952255B (en) * 2022-11-21 2023-12-05 北京邮电大学 Multi-mode signal content analysis method and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104715497A (en) * 2014-12-30 2015-06-17 上海孩子国科教设备有限公司 Data replacement method and system
CN105871681A (en) * 2015-12-14 2016-08-17 乐视网信息技术(北京)股份有限公司 Subtitle adding method and device
CN110458918A (en) * 2019-08-16 2019-11-15 北京百度网讯科技有限公司 Method and apparatus for output information
CN110866377A (en) * 2018-08-08 2020-03-06 北京优酷科技有限公司 Text content conversion method and device
CN111402367A (en) * 2020-03-27 2020-07-10 维沃移动通信有限公司 Image processing method and electronic equipment
CN111639474A (en) * 2020-05-26 2020-09-08 维沃移动通信有限公司 Document style reconstruction method and device and electronic equipment
CN112055245A (en) * 2020-09-11 2020-12-08 海信视像科技股份有限公司 Color subtitle realization method and display device
CN112084841A (en) * 2020-07-27 2020-12-15 齐鲁工业大学 Cross-modal image multi-style subtitle generation method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104715497A (en) * 2014-12-30 2015-06-17 上海孩子国科教设备有限公司 Data replacement method and system
CN105871681A (en) * 2015-12-14 2016-08-17 乐视网信息技术(北京)股份有限公司 Subtitle adding method and device
CN110866377A (en) * 2018-08-08 2020-03-06 北京优酷科技有限公司 Text content conversion method and device
CN110458918A (en) * 2019-08-16 2019-11-15 北京百度网讯科技有限公司 Method and apparatus for output information
CN111402367A (en) * 2020-03-27 2020-07-10 维沃移动通信有限公司 Image processing method and electronic equipment
CN111639474A (en) * 2020-05-26 2020-09-08 维沃移动通信有限公司 Document style reconstruction method and device and electronic equipment
CN112084841A (en) * 2020-07-27 2020-12-15 齐鲁工业大学 Cross-modal image multi-style subtitle generation method and system
CN112055245A (en) * 2020-09-11 2020-12-08 海信视像科技股份有限公司 Color subtitle realization method and display device

Also Published As

Publication number Publication date
CN112911373A (en) 2021-06-04

Similar Documents

Publication Publication Date Title
US20220236787A1 (en) Augmentation modification based on user interaction with augmented reality scene
US9779775B2 (en) Automatic generation of compilation videos from an original video based on metadata associated with the original video
US10904617B1 (en) Synchronizing a client device with media content for scene-specific notifications
KR102131322B1 (en) Computing device, method, computer program for processing video
CN107291222B (en) Interactive processing method, device and system of virtual reality equipment and virtual reality equipment
US9558784B1 (en) Intelligent video navigation techniques
JP7209851B2 (en) Image deformation control method, device and hardware device
CN110177295B (en) Subtitle out-of-range processing method and device and electronic equipment
US20230291978A1 (en) Subtitle processing method and apparatus of multimedia file, electronic device, and computer-readable storage medium
CN110505498A (en) Processing, playback method, device and the computer-readable medium of video
CN112399249A (en) Multimedia file generation method and device, electronic equipment and storage medium
CN108124170A (en) A kind of video broadcasting method, device and terminal device
KR20160013649A (en) Video display method and user terminal for creating subtitles based on ambient noise
CN112911373B (en) Video subtitle generating method, device, equipment and storage medium
CN112422844A (en) Method, device and equipment for adding special effect in video and readable storage medium
US10936878B2 (en) Method and device for determining inter-cut time range in media item
CN113965665A (en) Method and equipment for determining virtual live broadcast image
CN107197339B (en) Display control method and device of film bullet screen and head-mounted display equipment
CN105049910A (en) Video processing method and device
CN111918074A (en) Live video fault early warning method and related equipment
US20220070501A1 (en) Social video platform for generating and experiencing content
US20220279234A1 (en) Live stream display method and apparatus, electronic device, and readable storage medium
CN113411532A (en) Method, device, terminal and storage medium for recording content
KR20140033667A (en) Apparatus and method for video edit based on object
CN112908337B (en) Method, device, equipment and storage medium for displaying voice recognition text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant