CN112818984A - Title generation method and device, electronic equipment and storage medium


Info

Publication number
CN112818984A
Authority
CN
China
Prior art keywords
title
candidate
image frames
text
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110114237.2A
Other languages
Chinese (zh)
Other versions
CN112818984B (en)
Inventor
姚晓宇
李海
谭颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202110114237.2A
Publication of CN112818984A
Application granted
Publication of CN112818984B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/28Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
    • G06V30/287Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of Kanji, Hiragana or Katakana characters
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a title generation method, a title generation apparatus, an electronic device and a storage medium. The method comprises: acquiring a target video; detecting regions having a title characteristic in a plurality of image frames of the target video to obtain region positions; determining the position of a title candidate region in the image frames according to the region positions corresponding to the image frames; and performing text recognition on the title candidate region in the image frames to obtain a target title of the target video. Because the title candidate region where the video's title may appear is obtained by analyzing a plurality of image frames of the target video, much of the confusing information in the video is excluded and the accuracy of title determination is improved. Text recognition is then performed on the title candidate region, so the target title of the target video is generated automatically, which reduces the degree of manual intervention in title generation and improves the efficiency of generating titles for videos.

Description

Title generation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of video technologies, and in particular, to a title generation method, a title generation apparatus, an electronic device, and a computer-readable storage medium.
Background
At present, local television channels in many places broadcast news programs. To suit today's fast-paced lifestyle, a news broadcast is often split into individual news videos, which meets users' demand for watching videos online.
So that users can quickly find the content that interests them among the many news videos, an important task when splitting a news broadcast into segments is to assign each news video an appropriate title.
Because a large volume of news is reported every day, generating a suitable title for every news video involves a heavy workload. A video also contains much confusing information, such as subtitles and text embedded in the picture content, and the title has to be picked out from all of it, so finding the title in a video is time-consuming and labor-intensive.
Disclosure of Invention
An object of embodiments of the present invention is to provide a title generation method, a title generation apparatus, an electronic device, and a computer-readable storage medium, so as to solve the technical problem that finding a title in a video is time-consuming and labor-intensive, because much confusing information, such as subtitles and text in the picture content, appears in the video and the title has to be picked out from it.
In order to solve the above problem, in a first aspect of the present invention, there is provided a title generating method, including:
acquiring a target video;
respectively detecting areas with title characteristics from a plurality of image frames of the target video to obtain area positions;
determining the position of a title candidate region in the image frames according to the region positions corresponding to the image frames;
and performing text recognition on the title candidate areas in the plurality of image frames to obtain a target title of the target video.
Optionally, the detecting, from a plurality of image frames of the target video, regions having a title characteristic respectively, and obtaining the region position includes at least one of:
obtaining the area position of which the difference value of pixel values in different image frames is smaller than a preset threshold value by comparing different image frames;
respectively detecting areas containing texts in the image frames to obtain the positions of the areas containing the texts in the image frames;
and respectively carrying out edge detection on the plurality of image frames to obtain the area positions of the areas surrounded by the edges in the plurality of image frames.
Optionally, the determining, according to the region positions corresponding to the plurality of image frames, the position of the title candidate region in the image frame includes:
counting the area positions corresponding to the plurality of image frames to generate frequency distribution data of the area positions;
performing gradient operation on the frequency distribution data to obtain a gradient operation result;
and determining the position of the title candidate region in the image frame according to the gradient operation result.
Optionally, before the text recognition is performed on the title candidate regions in the image frames to obtain the target title of the target video, the method further includes:
respectively detecting whether the image change rate of the title candidate area is smaller than a preset threshold value or not for each image frame;
and eliminating the title candidate area with the image change rate larger than a preset threshold value.
Optionally, the performing text recognition on the title candidate regions in the image frames to obtain a target title of the target video includes:
respectively carrying out text recognition on the title candidate regions in the plurality of image frames to obtain candidate texts;
and selecting a target title of the target video according to the candidate text.
Optionally, the performing text recognition on the title candidate regions in the plurality of image frames respectively to obtain candidate texts includes:
when the title candidate area is subjected to text recognition, generating the appearance duration and/or time distribution of the candidate text in the target video according to the image frame of the candidate text;
before the selecting a target title of the target video according to the candidate text, the method further includes:
and eliminating the candidate texts of which the occurrence durations and/or time distributions do not accord with the preset time condition.
Optionally, before the selecting the target title of the target video according to the candidate text, the method further includes:
detecting at least one of attribute information of text elements in the candidate text, text length of the candidate text and character categories contained in the candidate text;
and eliminating at least one candidate text which does not accord with a preset rule in the attribute information, the text length and the character category.
Optionally, the selecting a target title of the target video according to the candidate text includes:
inputting the candidate texts into a title detection network; the title detection network is used for detecting whether the text can be used as a title or not, and is obtained by adopting a title text sample and a non-title text sample for training;
detecting whether the candidate text can be used as a title or not by the title detection network, and outputting a title confidence coefficient;
and selecting the candidate text with the highest title confidence as a target title.
According to a second aspect of the present invention, there is also provided a title generating apparatus, including:
the video acquisition module is used for acquiring a target video;
the position detection module is used for respectively detecting areas with title characteristics from a plurality of image frames of the target video to obtain area positions;
a region determining module, configured to determine, according to the region positions corresponding to the plurality of image frames, positions of the title candidate regions in the image frames;
and the title generation module is used for performing text recognition on the title candidate areas in the image frames to obtain the target title of the target video.
Optionally, the position detection module comprises at least one of:
the comparison submodule is used for obtaining the area position of which the difference value of the pixel values between different image frames is smaller than a preset threshold value by comparing different image frames;
the text detection submodule is used for respectively detecting the areas containing the texts in the image frames to obtain the positions of the areas containing the texts in the image frames;
and the edge detection submodule is used for respectively carrying out edge detection on the plurality of image frames to obtain the area positions of the areas surrounded by the edges in the plurality of image frames.
Optionally, the region determining module includes:
the data generation submodule is used for counting the area positions corresponding to the image frames and generating frequency distribution data of the area positions;
the gradient operation submodule is used for carrying out gradient operation on the frequency distribution data to obtain a gradient operation result;
and the area determining submodule is used for determining the position of the title candidate area in the image frame according to the gradient operation result.
Optionally, the apparatus further comprises:
a detection module, configured to detect whether an image change rate of the title candidate region is smaller than a preset threshold for each image frame before performing text recognition on the title candidate region in the image frames to obtain a target title of the target video;
and the area removing module is used for removing the title candidate area with the image change rate larger than a preset threshold value.
Optionally, the title generating module includes:
the text recognition submodule is used for respectively carrying out text recognition on the title candidate areas in the plurality of image frames to obtain candidate texts;
and the title selection submodule is used for selecting a target title of the target video according to the candidate text.
Optionally, the text recognition sub-module comprises:
the time generation unit is used for generating the appearance duration and/or time distribution of the candidate text in the target video according to the image frame of the candidate text identified when the title candidate area is subjected to text identification;
the title selection submodule comprises:
and the first eliminating unit is used for eliminating the candidate texts of which the occurrence durations and/or time distributions do not accord with the preset time condition before the target titles of the target videos are selected according to the candidate texts.
Optionally, the title selecting sub-module includes:
the attribute detection unit is used for detecting at least one of attribute information of text elements in the candidate text, text length of the candidate text and character types contained in the candidate text before the target title of the target video is selected according to the candidate text;
and the second eliminating unit is used for eliminating at least one candidate text which does not accord with a preset rule in the attribute information, the text length and the character category.
Optionally, the title selecting sub-module includes:
an input unit configured to input the candidate text into a title detection network; the title detection network is used for detecting whether the text can be used as a title or not, and is obtained by adopting a title text sample and a non-title text sample for training;
an output unit, configured to detect, by the title detection network, whether the candidate text can be used as a title, and output a title confidence;
and the selecting unit is used for selecting the candidate text with the highest title confidence coefficient as the target title.
In another aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing any of the above method steps when executing a program stored in the memory.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform any of the methods described above.
In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods described above.
In summary, according to the embodiments of the present invention, a target video is acquired; regions having a title characteristic are detected in a plurality of image frames of the target video to obtain region positions; the position of a title candidate region in the image frames is determined according to the region positions corresponding to the plurality of image frames; and text recognition is performed on the title candidate region in the plurality of image frames to obtain a target title of the target video. The title candidate region where the video's title may appear is thus obtained by analyzing a plurality of image frames of the target video, which excludes much of the confusing information in the video and improves the accuracy of title determination. Text recognition is then performed on the title candidate region, so the target title of the target video is generated automatically, reducing the degree of manual intervention in title generation and improving the efficiency of generating titles for videos.
Drawings
FIG. 1 is a flow chart illustrating the steps of one embodiment of a title generation method of the present invention;
FIG. 2 is a flow chart illustrating the steps of another title generation method embodiment of the present invention;
FIG. 3 illustrates a title hot zone feature diagram;
FIG. 4 illustrates a cut-off point hot-zone signature diagram;
FIG. 5 is a block diagram illustrating an embodiment of a title generation apparatus according to the present invention;
fig. 6 shows a schematic view of an electronic device of the invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a title generating method according to the present invention is shown, which may specifically include the following steps:
Step 101, acquiring a target video.
In the embodiment of the present invention, the target video includes a video submitted by a user, one of a plurality of video segments split from a single video, and the like. For example, a video of one news broadcast may cover several news events; the video may be split into video segments corresponding to those events, and each video segment serves as a target video.
Step 102, detecting areas with title characteristics from a plurality of image frames of the target video respectively to obtain area positions.
In the embodiment of the present invention, the areas where titles appear in videos are relatively concentrated, but the specific area differs between videos. For example, in many news videos the title usually appears in the lower half of the picture, but the distance between the title area and the lower edge of the video image varies from video to video.
In the embodiment of the present invention, among the various areas of a video there are areas having a title characteristic. That an area has a title characteristic does not necessarily mean a title exists there; it means the area has certain properties that make it a possible title area. For example, because a title does not change while the video plays, an area whose image content never changes has a title characteristic; an area in which text is detected has a title characteristic; and because some videos use a specially designed title box, an area in which a straight line or a neat edge is detected has a title characteristic. Any other applicable title characteristic may also be used.
In the embodiment of the invention, a video is composed of image frames; if a certain area in the image frames has a title characteristic, the position of that area in the image frames is obtained. When detecting areas with a title characteristic to obtain region positions, the detection may be performed within a single image frame of the target video or across different image frames of the target video. The specific implementation may take several forms: for example, comparing different image frames to obtain the positions of areas where the difference between pixel values is smaller than a preset threshold; detecting the areas containing text in each image frame to obtain their positions; performing edge detection on each image frame to obtain the positions of areas at edges; or any other suitable manner, which is not limited in this embodiment of the present invention.
For example, in a news video the title usually appears in the lower half of the video image, so to reduce the workload of detecting region positions with a title characteristic, only the lower half of the image frames of the target video is examined. Specifically, a stable area whose content does not change across image frames can be detected by a frame difference method and its region position recorded; text areas can be detected in the image frames and the positions of areas containing text recorded; edge detection can be performed on the image frames with the Canny edge detection algorithm and the positions of edge areas recorded; and so on.
Step 103, determining the position of the title candidate area in the image frame according to the area positions corresponding to the plurality of image frames.
In the embodiment of the invention, the region positions detected in different image frames of the target video may be the same or may differ. The more image frames in which a given region position is detected, the more likely that region is where the title is located. Therefore, according to the region positions corresponding to a plurality of image frames, the positions of one or more regions in the image frames can be determined by statistical analysis, and those regions are recorded as title candidate regions.
In this embodiment of the present invention, the implementation manner for determining the position of the title candidate region in the image frame according to the region positions corresponding to the plurality of image frames may include multiple implementations, for example, counting the region positions corresponding to the plurality of image frames, generating frequency distribution data of the region positions, performing a gradient operation on the frequency distribution data, and determining the position of the title candidate region in the image frame according to a result of the gradient operation, or any other suitable implementation manner, which is not limited in this embodiment of the present invention.
In this embodiment of the present invention, optionally, before performing text recognition on the title candidate regions in the plurality of image frames to obtain a target title of the target video, the method may further include: respectively detecting whether the image change rate of the title candidate area is smaller than a preset threshold value or not for each image frame; and eliminating the title candidate area with the image change rate larger than a preset threshold value.
Generally, in a video the region where the title is located contains only text and a background color, so the image change rate of that region should be low relative to other parts of the video. Before text recognition, it is detected for each image frame whether the image change rate of the title candidate region is smaller than a preset threshold; for example, the variance of the image within the title candidate region can be calculated, and the variance can represent the image change rate. If the image change rate is smaller than the preset threshold, the title candidate region conforms to the typical background features of a title and can be retained.
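As an illustration, a minimal sketch of this variance check follows, assuming OpenCV and NumPy; the threshold value is a placeholder, not a value specified by this embodiment.

```python
import cv2
import numpy as np

def region_is_stable(frame, box, var_threshold=800.0):
    """Keep a title candidate region only if its pixel variance is low,
    i.e. the region looks like text on a near-uniform background."""
    x, y, w, h = box  # title candidate region as (x, y, width, height)
    patch = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    return float(np.var(patch)) < var_threshold  # assumed threshold
```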
Step 104, performing text recognition on the title candidate regions in the plurality of image frames to obtain a target title of the target video.
In the embodiment of the invention, text recognition is performed on the title candidate region in each image frame to obtain the text in that region. The recognized text may then be used directly as the title of the target video; or it may first be judged whether the recognized text can serve as the title; or the recognized text may be processed and the processed text used as the title. The title finally obtained for the target video is recorded as the target title.
In one case, the same text is identified in the title candidate regions in the multiple image frames, and the identified text may be directly used as the target title of the target video, or whether the identified text can be used as the target title of the target video is determined according to a preset rule, and finally the target title of the target video is obtained, or any other suitable manner, which is not limited in the embodiment of the present invention.
In another case, different texts are recognized in the title candidate regions of the plurality of image frames, and the text that can serve as the target title is selected from these candidate texts. The selection can be implemented in several ways. For example, candidate texts that do not meet a preset rule are removed first, and the remaining candidate texts are input into a title detection network; the title detection network is used to detect whether a text can serve as a title and is trained on title text samples and non-title text samples; the network detects whether each candidate text can serve as a title and outputs a title confidence, and the candidate text with the highest title confidence is selected as the target title. Any other suitable manner may also be used.
In summary, according to the embodiments of the present invention, a target video is acquired; regions having a title characteristic are detected in a plurality of image frames of the target video to obtain region positions; the position of the title candidate region in the image frames is determined according to the region positions corresponding to the plurality of image frames; and text recognition is performed on the title candidate region in the plurality of image frames to obtain a target title of the target video. The title candidate region where the video's title may appear is thus obtained by analyzing a plurality of image frames of the target video, which excludes much of the confusing information in the video and improves the accuracy of title determination. Text recognition is then performed on the title candidate region, so the target title of the target video is generated automatically, reducing the degree of manual intervention in title generation and improving the efficiency of generating titles for videos.
Referring to fig. 2, a flowchart illustrating steps of an embodiment of a title generating method according to the present invention is shown, which may specifically include the following steps:
Step 201, acquiring a target video.
Step 202, comparing different image frames to obtain the area position where the difference value of the pixel values between the different image frames is smaller than a preset threshold value.
In the embodiment of the invention, the image of the area where the title is located does not change over a period of time, so one way of detecting region positions with a title characteristic is to compare different image frames and obtain the positions of areas where the difference between pixel values is smaller than a preset threshold. For the target video, one image frame can be sampled every set time length; each sampled image frame is compared with the previously sampled one, and the region positions where the corresponding pixel-value differences are smaller than the preset threshold are obtained.
For example, a frame difference method performs a difference operation on two frames: corresponding pixel points of the two image frames are subtracted and the absolute value of the grey-level difference is taken; where this absolute value is smaller than a certain threshold, a static target whose image does not change can be identified, thereby detecting the areas whose image is unchanged. In practice, different image frames may be compared in any suitable manner, which is not limited in this embodiment of the present invention.
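A minimal frame-difference sketch under these assumptions follows (OpenCV/NumPy for illustration; the grey-level threshold is a placeholder):

```python
import cv2

def static_mask(prev_frame, cur_frame, diff_threshold=10):
    """Return a boolean mask that is True where the image is unchanged
    between two sampled frames, i.e. candidate static (title-like) pixels."""
    prev = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev, cur)       # |grey-level difference| per pixel
    return diff < diff_threshold        # assumed threshold
```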
Step 203, detecting the regions containing the text in the plurality of image frames respectively, and obtaining the positions of the regions containing the text in the plurality of image frames.
In the embodiment of the present invention, another method for detecting the positions of areas with a title characteristic is to detect the areas containing text in the image frames and obtain their positions. For the target video, one image frame can be sampled every set time length, the area containing text in each image frame is detected, and its position is obtained. For example, the positions of areas containing text can be found roughly with the MSER (Maximally Stable Extremal Regions) algorithm. Any suitable text detection method may be used, which is not limited in this embodiment of the present invention.
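One way to realize this step, sketched with OpenCV's MSER implementation (parameters left at their defaults; the grouping of regions into boxes is simplified):

```python
import cv2

def text_region_boxes(frame):
    """Roughly locate text-like areas in a frame via MSER; returns
    (x, y, w, h) bounding boxes, one per detected stable region."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()
    _, boxes = mser.detectRegions(gray)  # second return value: bounding boxes
    return boxes
```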
Step 204, respectively performing edge detection on the plurality of image frames to obtain the area positions of the areas surrounded by the edges in the plurality of image frames.
In the embodiment of the present invention, the area where the title is located often has straight lines or neat edges, so yet another method for detecting the positions of areas with a title characteristic is to perform edge detection on the image frames to obtain the edges, and then obtain the positions of the areas enclosed by those edges. For the target video, one image frame can be sampled every set time length, and edge detection is performed on each image frame to obtain the positions of the edge-enclosed areas. Any suitable edge detection method may be used, such as the Canny or Sobel edge detection algorithm, which is not limited in this embodiment of the present invention.
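An edge-detection sketch along these lines, assuming OpenCV 4 and treating the bounding boxes of external contours as the "areas enclosed by edges" (the Canny thresholds are common illustrative values, not specified here):

```python
import cv2

def edge_enclosed_boxes(frame, lo=50, hi=150):
    """Run Canny edge detection, then return bounding boxes of the
    external contours as candidate edge-enclosed areas."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, lo, hi)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]
```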
Step 205, counting the area positions corresponding to the plurality of image frames, and generating frequency distribution data of the area positions.
In an embodiment of the present invention, the detected region positions may differ between image frames. After the region positions have been detected in each of the plurality of image frames, they are counted to generate frequency distribution data of the region positions. The frequency distribution data characterize how often each region position is detected. For example, as shown in the title hot zone feature map of FIG. 3, the frequency distribution data may be presented as a hot zone map in which the brighter a region, the higher its count, that is, the higher the probability that it is the title region. As another example, in the cut-off point hot zone feature map of FIG. 4, the brighter a vertical line, the higher its count, that is, the higher the probability that the position is an edge of the title box.
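A sketch of this accumulation step (NumPy; each detected box simply votes into a per-pixel counter):

```python
import numpy as np

def hot_zone(frame_shape, boxes_per_frame):
    """Accumulate detected region positions over all sampled frames into a
    frequency map; bright (high-count) pixels mark likely title areas."""
    heat = np.zeros(frame_shape[:2], dtype=np.int32)
    for boxes in boxes_per_frame:          # one list of (x, y, w, h) per frame
        for x, y, w, h in boxes:
            heat[y:y + h, x:x + w] += 1    # vote for every covered pixel
    return heat
```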
Step 206, performing a gradient operation on the frequency distribution data to obtain a gradient operation result.
In the embodiment of the invention, the gradient at a pixel point is a vector with magnitude and direction: for the frequency distribution data, the direction of the gradient is the direction in which the frequency changes fastest at that pixel point, and the magnitude of the gradient is the rate of frequency change. A gradient operation is performed on the frequency distribution data, and the resulting gradient operation result contains the gradient at every position corresponding to a pixel point in the image frame.
Step 207, determining the position of the title candidate region in the image frame according to the gradient operation result.
In the embodiment of the present invention, the manner of determining the position of the title candidate region in the image frame according to the gradient operation result may include multiple manners, for example, dividing the title candidate region by taking the pixel point with the maximum gradient as the edge; or selecting a pixel point with a gradient between a preset first gradient threshold and a preset second gradient threshold as an edge, and dividing a title candidate region, or any other suitable manner.
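A sketch of the first option above, taking the rows and columns of maximum gradient as the candidate region's edges, using 1-D profiles of the frequency map for simplicity (a simplification assumed by this sketch):

```python
import numpy as np

def candidate_bounds(heat):
    """Find likely title-box boundaries where the detection frequency
    changes fastest: strongest rise = left/top edge, strongest fall =
    right/bottom edge."""
    col_profile = heat.sum(axis=0).astype(float)  # per-column frequency
    row_profile = heat.sum(axis=1).astype(float)  # per-row frequency
    dx = np.gradient(col_profile)
    dy = np.gradient(row_profile)
    left, right = int(np.argmax(dx)), int(np.argmin(dx))
    top, bottom = int(np.argmax(dy)), int(np.argmin(dy))
    return left, top, right, bottom
```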
Step 208, respectively performing text recognition on the title candidate regions in the plurality of image frames to obtain candidate texts.
In the embodiment of the invention, the title candidate areas in each image frame are respectively subjected to text recognition to obtain the candidate texts corresponding to each image frame. For example, the text in the candidate region of the title is recognized by using an OCR (Optical Character Recognition) technique to obtain a candidate text.
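As an illustration of this step, a sketch using pytesseract as a stand-in OCR engine (the text only says "OCR technology"; the engine choice and lang='chi_sim' for simplified-Chinese titles are assumptions):

```python
import cv2
import pytesseract

def recognize_candidate(frame, box):
    """OCR the title candidate region of one frame and return the text."""
    x, y, w, h = box
    patch = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_string(patch, lang='chi_sim').strip()
```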
Step 209, selecting a target title of the target video according to the candidate text.
In the embodiment of the present invention, if the candidate texts obtained from the plurality of image frames are all the same, the candidate text may be directly used as the target title, or whether the candidate text can be used as the target title may be determined first. If the candidate texts obtained from the plurality of image frames are not all the same, one of the candidate texts needs to be selected as a target title.
In this embodiment of the present invention, optionally, performing text recognition on the title candidate regions in the plurality of image frames respectively, and an implementation manner of obtaining candidate texts may include: when the title candidate area is subjected to text recognition, generating the appearance duration and/or time distribution of the candidate text in the target video according to the image frame of the candidate text; correspondingly, before selecting the target title of the target video according to the candidate text, the method may further include: and eliminating the candidate texts of which the occurrence durations and/or time distributions do not accord with the preset time condition.
When text recognition is performed on the image frames, the timestamps of the image frames are extracted, and the appearance duration and/or time distribution of a candidate text in the target video can be generated from the image frames in which that candidate text was recognized. For example, suppose candidate text A is recognized in the 1st to xth image frames, candidate text B in the (x+1)th to yth image frames, and candidate text A again in the (y+1)th to zth image frames. Then the appearance duration and time distribution of candidate text A can be generated from the timestamps of the 1st to xth and (y+1)th to zth image frames, and those of candidate text B from the timestamps of the (x+1)th to yth image frames.
The occurrence duration and/or time distribution of the title in the video need to meet a preset time condition, where the preset time condition may be set according to an actual situation, and the embodiment of the present invention is not limited to this. And eliminating the candidate texts which do not meet the preset time condition, and taking the remaining candidate texts as target titles or further selecting the target titles from the remaining candidate texts. According to the occurrence duration and/or the time distribution, many candidate texts which are not the titles can be filtered, and the accuracy of title generation is improved.
For example, title selection is performed over a plurality of candidate texts: each candidate text is tracked, its appearance duration and time distribution are recorded, and a time-series analysis is performed. The preset time condition may require that a candidate title's appearance duration exceed a preset duration, so candidate texts not exceeding it are removed; it may also require that the interruptions in a candidate title's time distribution not exceed a preset interruption duration, so candidate texts whose interruptions exceed it are removed.
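A sketch of such a time-series filter, with both thresholds as assumed placeholders for the preset time condition:

```python
def passes_time_condition(timestamps, min_duration=3.0, max_gap=1.0):
    """Keep a candidate text only if it stays on screen long enough and
    its appearances are not interrupted for too long (units: seconds)."""
    if not timestamps:                      # timestamps of frames showing it
        return False
    ts = sorted(timestamps)
    duration = ts[-1] - ts[0]               # total appearance span
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return duration >= min_duration and (not gaps or max(gaps) <= max_gap)
```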
In this embodiment of the present invention, optionally, before selecting the target title of the target video according to the candidate text, the method may further include: detecting at least one of attribute information of text elements in the candidate text, text length of the candidate text and character categories contained in the candidate text; and eliminating at least one candidate text which does not accord with a preset rule in the attribute information, the text length and the character category.
The text elements include words, phrases, sentences, and the like, and the attribute information of the text elements includes parts of speech, semantics, and the like, or any other applicable attribute information, which is not limited in this embodiment of the present invention. The character category includes a number category, a language category, a punctuation category, and the like, or any other suitable character category, which is not limited in the embodiment of the present invention.
Attribute information, text length, character type and the like of text elements of a title in a video need to meet preset rules, wherein the preset rules can be set according to actual conditions, and the embodiment of the invention does not limit the preset rules. And eliminating the candidate texts which do not accord with the preset rule, and taking the remaining candidate texts as target titles or further selecting the target titles from the remaining candidate texts. According to the preset rule, many candidate texts which are not the titles can be filtered, and the accuracy of title generation is improved.
For example, after the candidate texts are obtained by text recognition, those with a high probability of not being the title may be filtered out initially according to text length and character category, and the remaining candidate texts kept as alternatives. The preset rules may include that the number of characters lies within a preset range, that certain character categories must not appear, and so on. Title selection is then performed on the candidate texts, which are further filtered according to appearance duration and time distribution. Finally, semantic analysis is performed on the alternative titles: their parts of speech, semantics and the like are detected, and candidate texts that do not meet the preset rules are filtered out. Such rules may include that the text must not contain a verb, must not contain a person's name, and so on.
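A minimal sketch of the initial length/character-category filter (the bounds and the banned-character set are illustrative stand-ins for the preset rules; the part-of-speech and name checks would need an NLP toolkit and are omitted):

```python
import re

BANNED_CHARS = re.compile(r"[?？…#@]")  # assumed banned character categories

def passes_rules(text, min_len=6, max_len=30):
    """Initial rule-based filter: length within range, no banned characters."""
    return min_len <= len(text) <= max_len and not BANNED_CHARS.search(text)
```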
In this embodiment of the present invention, optionally, an implementation manner of selecting the target title of the target video according to the candidate text may include: inputting the candidate texts into a title detection network; detecting whether the candidate text can be used as a title or not by the title detection network, and outputting a title confidence coefficient; and selecting the candidate text with the highest title confidence as a target title.
The title detection network is used to detect whether a text can serve as a title, and it is obtained by training with title text samples and non-title text samples. For example, a binary classification network model, i.e. the title detection network, is trained with supervised learning on a large corpus of news titles and non-title text. The trained title detection network can detect whether a candidate text can serve as a title: the candidate texts are input into the network, which outputs a title confidence representing the probability that each candidate text is a title; the candidate texts are ranked by title confidence, and the one with the highest confidence is selected as the target title.
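A selection sketch, with `title_confidence` as a hypothetical stand-in for the trained title detection network (not a real API from the patent):

```python
def select_target_title(candidates, title_confidence):
    """Score each surviving candidate text with the title classifier and
    return the highest-scoring one together with its confidence.

    title_confidence: callable mapping text -> probability it is a title
    (placeholder for the trained binary classification network).
    """
    scored = [(title_confidence(text), text) for text in candidates]
    best_conf, best_text = max(scored, key=lambda pair: pair[0])
    return best_text, best_conf
```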
To sum up, according to this embodiment of the present invention, a target video is acquired; region positions where the difference between pixel values in different image frames is smaller than a preset threshold are obtained by comparing different image frames; regions containing text in the image frames are detected to obtain their positions; edge detection is performed on the image frames to obtain the positions of areas enclosed by edges; the region positions corresponding to the plurality of image frames are counted to generate frequency distribution data of the region positions; a gradient operation is performed on the frequency distribution data to obtain a gradient operation result; the position of the title candidate region in the image frames is determined according to the gradient operation result; text recognition is performed on the title candidate region in the image frames to obtain candidate texts; and the target title of the target video is selected from the candidate texts. The title candidate region where the video's title may appear is thus obtained by analyzing a plurality of image frames of the target video, which excludes much of the confusing information in the video and improves the accuracy of title determination. Text recognition on the title candidate region then generates the target title of the target video automatically, reducing the degree of manual intervention in title generation and improving the efficiency of generating titles for videos.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 5, a block diagram of a title generation apparatus according to an embodiment of the present invention is shown, which may specifically include the following modules:
a video obtaining module 301, configured to obtain a target video;
a position detection module 302, configured to detect regions with a title characteristic from a plurality of image frames of the target video, respectively, to obtain region positions;
a region determining module 303, configured to determine, according to the region positions corresponding to the plurality of image frames, positions of title candidate regions in the image frames;
a title generating module 304, configured to perform text recognition on the title candidate regions in the multiple image frames to obtain a target title of the target video.
Optionally, the position detection module comprises at least one of:
the comparison submodule is used for obtaining the area position of which the difference value of the pixel values between different image frames is smaller than a preset threshold value by comparing different image frames;
the text detection submodule is used for respectively detecting the areas containing the texts in the image frames to obtain the positions of the areas containing the texts in the image frames;
and the edge detection submodule is used for respectively carrying out edge detection on the plurality of image frames to obtain the area positions of the areas surrounded by the edges in the plurality of image frames.
Optionally, the region determining module includes:
the data generation submodule is used for counting the area positions corresponding to the image frames and generating frequency distribution data of the area positions;
the gradient operation submodule is used for carrying out gradient operation on the frequency distribution data to obtain a gradient operation result;
and the area determining submodule is used for determining the position of the title candidate area in the image frame according to the gradient operation result.
Optionally, the apparatus further comprises:
a detection module, configured to detect whether an image change rate of the title candidate region is smaller than a preset threshold for each image frame before performing text recognition on the title candidate region in the image frames to obtain a target title of the target video;
and the area removing module is used for removing the title candidate area with the image change rate larger than a preset threshold value.
Optionally, the title generating module includes:
the text recognition submodule is used for respectively carrying out text recognition on the title candidate areas in the plurality of image frames to obtain candidate texts;
and the title selection submodule is used for selecting a target title of the target video according to the candidate text.
Optionally, the text recognition sub-module comprises:
the time generation unit is used for generating the appearance duration and/or time distribution of the candidate text in the target video according to the image frame of the candidate text identified when the title candidate area is subjected to text identification;
the title selection submodule comprises:
and the first eliminating unit is used for eliminating the candidate texts of which the occurrence durations and/or time distributions do not accord with the preset time condition before the target titles of the target videos are selected according to the candidate texts.
Optionally, the title selecting sub-module includes:
the attribute detection unit is used for detecting at least one of attribute information of text elements in the candidate text, text length of the candidate text and character types contained in the candidate text before the target title of the target video is selected according to the candidate text;
and the second eliminating unit is used for eliminating at least one candidate text which does not accord with a preset rule in the attribute information, the text length and the character category.
Optionally, the title selecting sub-module includes:
an input unit configured to input the candidate text into a title detection network; the title detection network is used for detecting whether the text can be used as a title or not, and is obtained by adopting a title text sample and a non-title text sample for training;
an output unit, configured to detect, by the title detection network, whether the candidate text can be used as a title, and output a title confidence;
and the selecting unit is used for selecting the candidate text with the highest title confidence coefficient as the target title.
In summary, according to the embodiments of the present invention, a target video is acquired; regions having a title characteristic are detected in a plurality of image frames of the target video to obtain region positions; the position of a title candidate region in the image frames is determined according to the region positions corresponding to the plurality of image frames; and text recognition is performed on the title candidate region in the plurality of image frames to obtain a target title of the target video. The title candidate region where the video's title may appear is thus obtained by analyzing a plurality of image frames of the target video, which excludes much of the confusing information in the video and improves the accuracy of title determination. Text recognition is then performed on the title candidate region, so the target title of the target video is generated automatically, reducing the degree of manual intervention in title generation and improving the efficiency of generating titles for videos.
An embodiment of the present invention further provides an electronic device, as shown in fig. 6, including a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 complete mutual communication through the communication bus 604,
a memory 603 for storing a computer program;
the processor 601 is configured to implement the following steps when executing the program stored in the memory 603:
acquiring a target video;
respectively detecting areas with title characteristics from a plurality of image frames of the target video to obtain area positions;
determining the position of a title candidate region in the image frames according to the region positions corresponding to the image frames;
and performing text recognition on the title candidate areas in the plurality of image frames to obtain a target title of the target video.
Optionally, the detecting, from a plurality of image frames of the target video, a region having a title characteristic, and obtaining a region position includes at least one of:
obtaining the area position of which the difference value of pixel values between different image frames is smaller than a preset threshold value by comparing different image frames;
respectively detecting areas containing texts in the image frames to obtain the positions of the areas containing the texts in the image frames;
and respectively carrying out edge detection on the plurality of image frames to obtain the area positions of the areas surrounded by the edges in the plurality of image frames.
Optionally, the generating the title candidate region according to the region positions corresponding to the plurality of image frames includes:
counting the area positions corresponding to the plurality of image frames to generate frequency distribution data of the area positions;
performing gradient operation on the frequency distribution data to obtain a gradient operation result;
and determining the position of the title candidate region in the image frame according to the gradient operation result.
Optionally, before the text recognition is performed on the title candidate regions in the image frames to obtain the target title of the target video, the method further includes:
respectively detecting whether the image change rate of the title candidate area is smaller than a preset threshold value or not for each image frame;
and eliminating the title candidate area with the image change rate larger than a preset threshold value.
Optionally, the performing text recognition on the title candidate regions in the image frames to obtain a target title of the target video includes:
respectively carrying out text recognition on the title candidate regions in the plurality of image frames to obtain candidate texts;
and selecting a target title of the target video according to the candidate text.
Optionally, the performing text recognition on the title candidate regions in the plurality of image frames respectively to obtain candidate texts includes:
when the title candidate area is subjected to text recognition, generating the appearance duration and/or time distribution of the candidate text in the target video according to the image frame of the candidate text;
before the selecting a target title of the target video according to the candidate text, the method further includes:
and eliminating the candidate texts of which the occurrence durations and/or time distributions do not accord with the preset time condition.
Optionally, before the selecting the target title of the target video according to the candidate text, the method further includes:
detecting at least one of attribute information of text elements in the candidate text, text length of the candidate text and character categories contained in the candidate text;
and eliminating at least one candidate text which does not accord with a preset rule in the attribute information, the text length and the character category.
Optionally, the selecting a target title of the target video according to the candidate text includes:
inputting the candidate texts into a title detection network; the title detection network is used for detecting whether the text can be used as a title or not, and is obtained by adopting a title text sample and a non-title text sample for training;
detecting whether the candidate text can be used as a title or not by the title detection network, and outputting a title confidence coefficient;
and selecting the candidate text with the highest title confidence as a target title.
The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment, a computer-readable storage medium is provided, having stored thereon instructions which, when executed on a computer, cause the computer to perform the method of any of the above embodiments.
In a further embodiment, a computer program product is provided, comprising instructions which, when run on a computer, cause the computer to perform the method of any of the above embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)).
It is noted that, herein, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described briefly because it is substantially similar to the method embodiment; for the relevant points, refer to the corresponding description of the method embodiment.
The above description covers only preferred embodiments of the present invention and is not intended to limit its scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (11)

1. A title generation method, comprising:
acquiring a target video;
detecting, in each of a plurality of image frames of the target video, regions having title characteristics to obtain region positions;
determining the position of a title candidate region in the image frames according to the region positions corresponding to the plurality of image frames;
and performing text recognition on the title candidate regions in the plurality of image frames to obtain a target title of the target video.
2. The method according to claim 1, wherein detecting, in each of the plurality of image frames of the target video, regions having title characteristics to obtain region positions comprises at least one of:
comparing different image frames to obtain positions of regions in which the difference in pixel values between the frames is smaller than a preset threshold;
detecting regions containing text in each of the plurality of image frames to obtain the positions of the text-containing regions;
and performing edge detection on each of the plurality of image frames to obtain the positions of regions enclosed by edges.
3. The method according to claim 1 or 2, wherein determining the position of the title candidate region in the image frames according to the region positions corresponding to the plurality of image frames comprises:
counting the region positions corresponding to the plurality of image frames to generate frequency distribution data of the region positions;
performing a gradient operation on the frequency distribution data to obtain a gradient operation result;
and determining the position of the title candidate region in the image frames according to the gradient operation result.
4. The method of claim 1, wherein before performing text recognition on the title candidate regions in the plurality of image frames to obtain the target title of the target video, the method further comprises:
for each image frame, detecting whether the image change rate of the title candidate region is smaller than a preset threshold;
and eliminating title candidate regions whose image change rate is larger than the preset threshold.
5. The method of claim 1, wherein performing text recognition on the title candidate regions in the plurality of image frames to obtain the target title of the target video comprises:
performing text recognition on the title candidate regions in the plurality of image frames to obtain candidate texts;
and selecting the target title of the target video from the candidate texts.
6. The method of claim 4, wherein performing text recognition on the title candidate regions in the plurality of image frames to obtain candidate texts comprises:
when performing text recognition on a title candidate region, generating the appearance duration and/or time distribution of each candidate text in the target video from the image frames in which that candidate text appears;
and before selecting the target title of the target video from the candidate texts, the method further comprises:
eliminating candidate texts whose appearance duration and/or time distribution does not meet a preset time condition.
7. The method of claim 4, wherein before selecting the target title of the target video from the candidate texts, the method further comprises:
detecting at least one of: attribute information of text elements in a candidate text, the text length of the candidate text, and the character categories contained in the candidate text;
and eliminating candidate texts for which at least one of the attribute information, the text length, and the character categories does not conform to a preset rule.
8. The method of claim 4, wherein selecting the target title of the target video from the candidate texts comprises:
inputting the candidate texts into a title detection network, wherein the title detection network detects whether a text can serve as a title and is trained on title text samples and non-title text samples;
detecting, by the title detection network, whether each candidate text can serve as a title, and outputting a title confidence;
and selecting the candidate text with the highest title confidence as the target title.
9. A title generation apparatus, comprising:
a video acquisition module, configured to acquire a target video;
a position detection module, configured to detect, in each of a plurality of image frames of the target video, regions having title characteristics to obtain region positions;
a region determining module, configured to determine the position of a title candidate region in the image frames according to the region positions corresponding to the plurality of image frames;
and a title generation module, configured to perform text recognition on the title candidate regions in the plurality of image frames to obtain a target title of the target video.
10. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
the memory is configured to store a computer program;
and the processor is configured to implement the method steps of any one of claims 1 to 8 when executing the program stored in the memory.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
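The detection and aggregation steps recited in claims 1 to 3 above can be pictured with the following illustrative sketch (OpenCV 4 and NumPy assumed; all thresholds, together with the helpers keep_stable_regions, recognize_candidates, filter_by_time, passes_text_rules, and select_title sketched in the description, are assumptions rather than disclosed implementations):

    import cv2
    import numpy as np

    def static_region_boxes(frames, diff_threshold=10):
        # First detection mode of claim 2: pixels whose values barely change
        # across frames are likely to belong to a static title overlay.
        stack = np.stack([f.astype(np.int16) for f in frames])
        max_diff = np.abs(stack - stack[0]).max(axis=0)
        static_mask = ((max_diff < diff_threshold) * 255).astype(np.uint8)
        contours, _ = cv2.findContours(static_mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        return [cv2.boundingRect(c) for c in contours]  # (x, y, w, h) boxes

    def candidate_region_rows(boxes, frame_height):
        # Claim 3: count how often each image row falls inside a detected
        # box, then take the gradient of that frequency curve; the sharpest
        # rise and fall mark the title region's upper and lower boundaries.
        freq = np.zeros(frame_height)
        for x, y, w, h in boxes:
            freq[y:y + h] += 1
        grad = np.gradient(freq)
        return int(np.argmax(grad)), int(np.argmin(grad))

    def generate_title(frames, fps, title_net):
        # Claim 1 end to end, chaining the helpers above and those sketched
        # in the description section.
        top, bottom = candidate_region_rows(static_region_boxes(frames),
                                            frames[0].shape[0])
        region = (0, top, frames[0].shape[1], max(bottom - top, 1))
        if not keep_stable_regions(frames, [region]):
            return None  # the candidate region is too dynamic to be a title
        candidates = filter_by_time(recognize_candidates(frames, region), fps)
        texts = [t for t in candidates if passes_text_rules(t)]
        return select_title(texts, title_net)[0] if texts else None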
CN202110114237.2A 2021-01-27 2021-01-27 Title generation method, device, electronic equipment and storage medium Active CN112818984B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110114237.2A CN112818984B (en) 2021-01-27 2021-01-27 Title generation method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112818984A (en) 2021-05-18
CN112818984B CN112818984B (en) 2023-10-24

Family

ID=75860043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110114237.2A Active CN112818984B (en) 2021-01-27 2021-01-27 Title generation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112818984B (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101467145A * 2006-07-24 2009-06-24 Google Inc. Method and apparatus for automatically annotating images
KR20160027651A * 2014-09-02 2016-03-10 SK Telecom Co., Ltd. Method and apparatus for indexing moving picture
US20190095529A1 * 2017-09-28 2019-03-28 Electronics and Telecommunications Research Institute Method and apparatus for generating title and keyframe of video
CN108171235A * 2018-01-08 2018-06-15 Beijing QIYI Century Science and Technology Co., Ltd. Title area detection method and system
CN108229476A * 2018-01-08 2018-06-29 Beijing QIYI Century Science and Technology Co., Ltd. Title area detection method and system
CN108288060A * 2018-02-23 2018-07-17 Beijing QIYI Century Science and Technology Co., Ltd. Method, device, and electronic device for detecting titles in video
CN108446603A * 2018-02-28 2018-08-24 Beijing QIYI Century Science and Technology Co., Ltd. Headline detection method and device
CN108495185A * 2018-03-14 2018-09-04 Beijing QIYI Century Science and Technology Co., Ltd. Video title generation method and device
CN108769776A * 2018-05-31 2018-11-06 Beijing QIYI Century Science and Technology Co., Ltd. Main title detection method, device, and electronic device
EP3499900A2 * 2018-05-31 2019-06-19 Beijing Baidu Netcom Science and Technology Co., Ltd. Video processing method, apparatus and device
US20200320307A1 * 2019-04-08 2020-10-08 Baidu USA LLC Method and apparatus for generating video
WO2020253657A1 * 2019-06-17 2020-12-24 Tencent Technology (Shenzhen) Co., Ltd. Video clip positioning method and apparatus, computer device, and storage medium
CN110263214A * 2019-06-21 2019-09-20 Beijing Baidu Netcom Science and Technology Co., Ltd. Video title generation method, device, server, and storage medium
CN110399526A * 2019-07-26 2019-11-01 Tencent Technology (Shenzhen) Co., Ltd. Video title generation method, device, and computer-readable storage medium
BE1027349A1 * 2020-04-01 2021-01-12 Yu Jian Trading Company Ltd A method, an apparatus, a storage medium and a terminal for generating a video title picture
CN111626049A * 2020-05-27 2020-09-04 Tencent Technology (Shenzhen) Co., Ltd. Title correction method and device for multimedia information, electronic device, and storage medium
CN111984824A * 2020-07-31 2020-11-24 Hohai University Multi-mode-based video recommendation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TANG Pengjie et al., "From Video to Language: A Survey of Video Captioning and Description", Acta Automatica Sinica *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807085A * 2021-11-19 2021-12-17 Chengdu Sobey Digital Technology Co., Ltd. Method for extracting titles and subtitles for news scenes

Also Published As

Publication number Publication date
CN112818984B (en) 2023-10-24

Similar Documents

Publication Title
CN106649316B (en) Video pushing method and device
CN111274442B (en) Method for determining video tag, server and storage medium
EP2471025B1 (en) A method and system for preprocessing the region of video containing text
CN111062259A (en) Form recognition method and device
WO2020155750A1 (en) Artificial intelligence-based corpus collecting method, apparatus, device, and storage medium
CN111708909B (en) Video tag adding method and device, electronic equipment and computer readable storage medium
CN111767713A (en) Keyword extraction method and device, electronic equipment and storage medium
US20190258629A1 (en) Data mining method based on mixed-type data
CN111460355A (en) Page parsing method and device
EP4273737A1 (en) Language labeling method and apparatus, and computer device and storage medium
CN111191591A (en) Watermark detection method, video processing method and related equipment
US9355099B2 (en) System and method for detecting explicit multimedia content
CN113076961B (en) Image feature library updating method, image detection method and device
CN112925905B (en) Method, device, electronic equipment and storage medium for extracting video subtitles
CN112818984B (en) Title generation method, device, electronic equipment and storage medium
CN113435438A (en) Video screen board extraction and video segmentation method for image and subtitle fusion
US11728914B2 (en) Detection device, detection method, and program
CN112163415A (en) User intention identification method and device for feedback content and electronic equipment
US20140307968A1 (en) Method and apparatus for automatic genre identification and classification
KR102028356B1 (en) Advertisement recommendation apparatus and method based on comments
CN115879002A (en) Training sample generation method, model training method and device
CN114220057A (en) Video trailer identification method and device, electronic equipment and readable storage medium
JP2010026923A (en) Method, device and program for document classification, and computer-readable recording medium
CN114140782A (en) Text recognition method and device, electronic equipment and storage medium
CN114048740A (en) Sensitive word detection method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant