CN112818984B - Title generation method, device, electronic equipment and storage medium

Info

Publication number
CN112818984B
Authority
CN
China
Prior art keywords
title
candidate
image frames
text
region
Prior art date
Legal status
Active
Application number
CN202110114237.2A
Other languages
Chinese (zh)
Other versions
CN112818984A (en)
Inventor
姚晓宇
李海
谭颖
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202110114237.2A
Publication of CN112818984A
Application granted
Publication of CN112818984B
Status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/62 - Text, e.g. of license plates, overlay texts or captions on TV images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/253 - Grammatical analysis; Style critique
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/28 - Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
    • G06V30/287 - Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet, of Kanji, Hiragana or Katakana characters
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Character Input (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a title generation method, a title generation apparatus, an electronic device and a storage medium. The title generation method comprises: acquiring a target video; detecting areas with title characteristics from a plurality of image frames of the target video respectively to obtain area positions; determining the position of a title candidate area in the image frames according to the area positions corresponding to the plurality of image frames; and performing text recognition on the title candidate areas in the plurality of image frames to obtain a target title of the target video. In this way, the title candidate area in which the title of the video may appear is obtained by analyzing the plurality of image frames of the target video, so that the numerous pieces of confusing information in the video are eliminated and the accuracy of title determination is improved. Text recognition is then performed on the title candidate area, so that the target title of the target video is generated automatically, the degree of manual intervention in title generation is reduced, and the efficiency of generating titles for videos is improved.

Description

Title generation method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of video technology, and in particular, to a title generation method, a title generation apparatus, an electronic device, and a computer-readable storage medium.
Background
Currently, local channels in many regions broadcast news reports. Given today's fast-paced lifestyle, splitting a news report into multiple separate news videos has become a major requirement of users who watch videos online.
In order to enable users to quickly find content of interest among numerous news videos, an important task in splitting a news report into multiple segments is to match each news video segment with a proper title.
Because there are many news reports every day, generating a proper title for each news video is a heavy workload. Moreover, a video contains numerous pieces of confusing information, such as subtitles and characters appearing in the picture content, and the title must be picked out from this confusing information, so finding the title in a video is time-consuming and labor-intensive.
Disclosure of Invention
An object of the embodiments of the present invention is to provide a title generation method, a title generation apparatus, an electronic device, and a computer-readable storage medium, so as to solve the technical problem that numerous pieces of confusing information, such as subtitles and characters in the picture content, appear in a video, and picking the title out of this confusing information makes finding the title in a video time-consuming and labor-intensive.
In order to solve the above-mentioned problems, in a first aspect of the present invention, there is provided a title generation method, including:
acquiring a target video;
detecting areas with title characteristics from a plurality of image frames of the target video respectively to obtain area positions;
determining the positions of the title candidate areas in the image frames according to the area positions corresponding to the image frames;
and carrying out text recognition on the title candidate areas in the plurality of image frames to obtain a target title of the target video.
Optionally, detecting the regions with the title characteristics from the plurality of image frames of the target video, respectively, and obtaining the region positions includes at least one of the following:
obtaining the region positions of which the difference value of pixel values in different image frames is smaller than a preset threshold value by comparing the different image frames;
detecting areas containing texts in the plurality of image frames respectively to obtain the positions of the areas containing the texts in the plurality of image frames;
and respectively carrying out edge detection on the plurality of image frames to obtain the region positions of the regions surrounded by the edges in the plurality of image frames.
Optionally, the determining the positions of the title candidate regions in the image frames according to the region positions corresponding to the image frames includes:
counting the region positions corresponding to the plurality of image frames to generate frequency distribution data of the region positions;
performing gradient operation on the frequency distribution data to obtain a gradient operation result;
and determining the position of the title candidate region in the image frame according to the gradient operation result.
Optionally, before the text recognition is performed on the title candidate regions in the plurality of image frames to obtain a target title of the target video, the method further includes:
for each image frame, respectively detecting whether the image change rate of the title candidate region is smaller than a preset threshold value;
and eliminating the title candidate region with the image change rate larger than a preset threshold value.
Optionally, the text identifying the title candidate region in the plurality of image frames, and obtaining the target title of the target video includes:
respectively carrying out text recognition on the title candidate areas in the plurality of image frames to obtain candidate texts;
and selecting a target title of the target video according to the candidate text.
Optionally, the text recognition is performed on the title candidate areas in the plurality of image frames, and obtaining candidate texts includes:
when the title candidate region is subjected to text recognition, generating the occurrence time length and/or time distribution of the candidate text in the target video according to the image frames of the recognized candidate text;
before the selecting the target title of the target video according to the candidate text, the method further includes:
and eliminating candidate texts of which the appearance duration and/or time distribution do not meet preset time conditions.
Optionally, before the selecting the target title of the target video according to the candidate text, the method further includes:
detecting at least one of attribute information of text elements in the candidate text, text length of the candidate text and character category contained in the candidate text;
and eliminating candidate texts which do not accord with preset rules from at least one of the attribute information, the text length and the character category.
Optionally, selecting the target title of the target video according to the candidate text includes:
inputting the candidate text into a title detection network; the title detection network is used for detecting whether the text can be used as a title, and is obtained by training a title text sample and a non-title text sample;
detecting whether the candidate text can be used as a title by the title detection network, and outputting title confidence;
and selecting the candidate text with the highest title confidence as a target title.
According to a second aspect of the present invention, there is also provided a title generation apparatus, including:
the video acquisition module is used for acquiring a target video;
the position detection module is used for respectively detecting areas with title characteristics from a plurality of image frames of the target video to obtain area positions;
the region determining module is used for determining the positions of the title candidate regions in the image frames according to the region positions corresponding to the plurality of image frames;
and the title generation module is used for carrying out text recognition on the title candidate areas in the plurality of image frames to obtain a target title of the target video.
Optionally, the position detection module includes at least one of:
the comparison sub-module is used for obtaining the region position of which the difference value of the pixel values between different image frames is smaller than a preset threshold value by comparing the different image frames;
the text detection sub-module is used for respectively detecting areas containing texts in the plurality of image frames to obtain the positions of the areas containing the texts in the plurality of image frames;
and the edge detection sub-module is used for respectively carrying out edge detection on the plurality of image frames to obtain the region positions of the regions surrounded by the edges in the plurality of image frames.
Optionally, the area determining module includes:
the data generation sub-module is used for counting the region positions corresponding to the plurality of image frames and generating the frequency distribution data of the region positions;
the gradient operation sub-module is used for carrying out gradient operation on the frequency distribution data to obtain a gradient operation result;
and the region determining submodule is used for determining the position of the title candidate region in the image frame according to the gradient operation result.
Optionally, the apparatus further comprises:
the detection module is used for respectively detecting whether the image change rate of the title candidate region is smaller than a preset threshold value for each image frame before the title candidate region in the plurality of image frames is subjected to text recognition to obtain a target title of the target video;
and the region eliminating module is used for eliminating the title candidate region with the image change rate larger than a preset threshold value.
Optionally, the title generation module includes:
the text recognition sub-module is used for respectively carrying out text recognition on the title candidate areas in the plurality of image frames to obtain candidate texts;
and the title selecting sub-module is used for selecting the target title of the target video according to the candidate text.
Optionally, the text recognition submodule includes:
the time generation unit is used for generating the occurrence time length and/or time distribution of the candidate text in the target video according to the image frames of the identified candidate text when the text identification is carried out on the title candidate region;
the title selecting submodule comprises:
the first eliminating unit is used for eliminating the candidate texts of which the appearance duration and/or time distribution do not meet the preset time condition before the target title of the target video is selected according to the candidate texts.
Optionally, the title selecting submodule includes:
the attribute detection unit is used for detecting at least one of attribute information of text elements in the candidate text, text length of the candidate text and character types contained in the candidate text before the target title of the target video is selected according to the candidate text;
and the second eliminating unit is used for eliminating the candidate text which does not accord with the preset rule from at least one of the attribute information, the text length and the character category.
Optionally, the title selecting submodule includes:
an input unit for inputting the candidate text into a title detection network; the title detection network is used for detecting whether the text can be used as a title, and is obtained by training a title text sample and a non-title text sample;
an output unit for detecting whether the candidate text can be used as a title by the title detection network and outputting a title confidence;
and the selecting unit is used for selecting the candidate text with the highest title confidence as the target title.
In yet another aspect of the present invention, there is also provided an electronic device including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory perform communication with each other through the communication bus;
a memory for storing a computer program;
and a processor for implementing any of the above-described method steps when executing a program stored on the memory.
In yet another aspect of the present invention, there is also provided a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform any of the methods described above.
In yet another aspect of the invention there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods described above.
In summary, according to the embodiment of the present invention, a target video is acquired; regions having the title characteristic are detected from a plurality of image frames of the target video to obtain region positions; the position of the title candidate region in the image frames is determined according to the region positions corresponding to the plurality of image frames; and text recognition is performed on the title candidate regions in the plurality of image frames to obtain the target title of the target video. In this way, the title candidate regions in which the title of the video may appear are obtained by analyzing the plurality of image frames of the target video, the numerous pieces of confusing information in the video are eliminated, and the accuracy of title determination is improved. Text recognition is then performed on the title candidate regions, so that the target title of the target video is generated automatically, the degree of manual intervention in title generation is reduced, and the efficiency of generating titles for videos is improved.
Drawings
FIG. 1 is a flowchart illustrating steps of an embodiment of a title generation method of the present invention;
FIG. 2 is a flowchart showing the steps of another title generation method embodiment of the present invention;
FIG. 3 illustrates a title hot zone feature map;
FIG. 4 shows a cut-off point hot zone feature map;
FIG. 5 is a block diagram showing the construction of an embodiment of a title generating apparatus of the present invention;
FIG. 6 shows a schematic diagram of an electronic device of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Referring to FIG. 1, a flowchart illustrating the steps of an embodiment of a title generation method of the present invention is shown; the method may specifically include the following steps:
and step 101, acquiring a target video.
In the embodiment of the invention, the target video includes a video submitted by a user, one of a plurality of video segments split from a single video, and the like. For example, one news report video may cover a plurality of news events; the video can be split into video segments corresponding to the plurality of news events, and each video segment serves as a target video.
Step 102, detecting areas with title characteristics from a plurality of image frames of the target video respectively to obtain the area positions.
In the embodiment of the invention, the regions where the titles appear in the video are relatively concentrated, but specific regions in different videos are different, for example, in many news videos, the titles generally appear in the lower half region of the video, but the distances between the regions where the titles appear in different videos and the lower edge of the video image are different.
In the embodiment of the present invention, among the various regions of the video there are regions having the title characteristic. That a region in a video has the title characteristic does not necessarily mean that the region contains a title; rather, it means that the region has a specific property that a title-bearing region may have. For example, because a title generally does not change while the video plays, a region whose image content is unchanged has the title characteristic; or, if the presence of text is detected in a region, that region has the title characteristic; or, because some videos design a special title frame, a region in which a straight line or a neat edge is detected has the title characteristic. Any other applicable title characteristic may also be used, and the embodiment of the present invention is not limited in this regard.
In the embodiment of the invention, the video is composed of image frames, and if a certain area in the image frames has the title characteristic, the area position of the area in the image frames is acquired. When detecting the region with the title characteristic to obtain the region position, the region with the title characteristic can be detected through the same image frame in the target video to obtain the region position, or the region with the title characteristic can be detected through different image frames in the target video to obtain the region position. Specific implementations may include, for example, comparing different image frames to obtain a region position where a difference between pixel values of the different image frames is smaller than a preset threshold, or detecting regions containing text in the multiple image frames respectively to obtain a region position containing text in the multiple image frames, or performing edge detection on the multiple image frames respectively to obtain a region position of an edge in the multiple image frames, or any other applicable way, which is not limited in the embodiments of the present invention.
For example, in news videos a title usually appears in the lower half of the video image. Therefore, to reduce the workload of detecting the positions of regions having the title characteristic, only the lower halves of the plurality of image frames in the target video are detected. Specifically, stable regions whose content does not change between different image frames can be detected by a frame difference method to obtain their positions; regions containing characters can be detected in the image frames to obtain their positions; edge detection can be performed on the image frames by a Canny edge detection algorithm to obtain the positions of regions bounded by edges; and so on.
Step 103, determining the positions of the title candidate areas in the image frames according to the area positions corresponding to the image frames.
In the embodiment of the invention, the region positions detected from different image frames of the target video may be partly the same and partly different. The more image frames in which a certain region position is detected, the more likely that region position is where the title is located. Thus, based on the region positions corresponding to the plurality of image frames, the positions of one or more regions in the image frames may be determined using statistical analysis, and the one or more regions may be identified as title candidate regions.
In the embodiment of the present invention, the implementation manner of determining the position of the title candidate region in the image frame according to the region positions corresponding to the plurality of image frames may include various implementation manners, for example, counting the region positions corresponding to the plurality of image frames, generating the frequency distribution data of the region positions, performing gradient operation on the frequency distribution data, determining the position of the title candidate region in the image frame according to the gradient operation result, or any other applicable implementation manner, which is not limited in the embodiment of the present invention.
In an embodiment of the present invention, optionally, before performing text recognition on the title candidate areas in the plurality of image frames to obtain a target title of the target video, the method further includes: for each image frame, respectively detecting whether the image change rate of the title candidate region is smaller than a preset threshold value; and eliminating the title candidate region with the image change rate larger than a preset threshold value.
Typically, in a video, the region where the title is located contains only text and background colors, so the image change rate of that region should be low relative to other regions of the video. Therefore, before text recognition is performed, it is detected, for each image frame, whether the image change rate of the title candidate region is smaller than a preset threshold; for example, the variance of the image within the title candidate region is calculated, and this variance can represent the image change rate. If the image change rate is smaller than the preset threshold, the title candidate region matches the typical background characteristics of a title and can be retained; if it is not smaller than the preset threshold, the region does not match those characteristics and can be eliminated. In this way, candidate regions that are unlikely to contain a title are filtered out, unnecessary text recognition work is avoided, the probability of recognizing candidate texts that are not titles is reduced, and title generation efficiency is improved.
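As an illustration of this filtering step, the sketch below uses the spatial variance of the candidate region as the image change rate, as suggested above. The threshold value and the (x, y, w, h) region format are assumptions made for the example, not values from the patent.

```python
import numpy as np

def region_change_rate(frame, region):
    # `region` is assumed to be an (x, y, w, h) tuple; the variance of the
    # pixel values inside it stands in for the "image change rate".
    x, y, w, h = region
    patch = frame[y:y + h, x:x + w]
    return float(np.var(patch))

def keep_stable_candidates(frame, candidates, threshold=2000.0):
    # Keep only regions whose change rate is below the preset threshold;
    # the threshold here is illustrative.
    return [r for r in candidates if region_change_rate(frame, r) < threshold]
```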
Step 104, performing text recognition on the title candidate areas in the plurality of image frames to obtain a target title of the target video.
In the embodiment of the invention, text recognition is performed on the title candidate region in each image frame to obtain the text within the title candidate region. The recognized text may then be used directly as the title of the target video; or it may first be judged whether the recognized text can serve as the title of the target video; or the recognized text may be processed and the processed text used as the title of the target video. The title finally obtained for the target video is recorded as the target title.
In one case, the same text is identified in the title candidate regions in the plurality of image frames, the identified text may be directly used as the target title of the target video, or whether the identified text may be used as the target title of the target video may be first determined according to a preset rule, and finally the target title of the target video is obtained, or any other applicable manner may be used.
In another case, different texts are recognized in the title candidate regions of the plurality of image frames, and a text that can serve as the target title needs to be selected from the plurality of candidate texts. The specific implementation of selecting the target title may take various forms: for example, candidate texts that do not meet a preset rule are first eliminated, and the remaining candidate texts are input into a title detection network, where the title detection network is used to detect whether a text can serve as a title and is trained with title text samples and non-title text samples; the title detection network detects whether each candidate text can serve as a title and outputs a title confidence, and the candidate text with the highest title confidence is selected as the target title. Any other applicable way may also be used, which is not limited in the embodiment of the present invention.
In summary, according to the embodiment of the present invention, a target video is acquired; regions having the title characteristic are detected from a plurality of image frames of the target video to obtain region positions; the position of the title candidate region in the image frames is determined according to the region positions corresponding to the plurality of image frames; and text recognition is performed on the title candidate regions in the plurality of image frames to obtain the target title of the target video. In this way, the title candidate regions in which the title of the video may appear are obtained by analyzing the plurality of image frames of the target video, the numerous pieces of confusing information in the video are eliminated, and the accuracy of title determination is improved. Text recognition is then performed on the title candidate regions, so that the target title of the target video is generated automatically, the degree of manual intervention in title generation is reduced, and the efficiency of generating titles for videos is improved.
Referring to FIG. 2, a flowchart illustrating the steps of another embodiment of a title generation method of the present invention is shown; the method may specifically include the following steps:
Step 201, acquiring a target video.
Step 202, comparing different image frames to obtain the region position where the difference value of the pixel values between the different image frames is smaller than the preset threshold value.
In the embodiment of the invention, the image of the region where the title is located in the video does not change in a period of time, so one method for detecting the position of the region with the title characteristic is to obtain the position of the region with the difference value of the pixel values smaller than the preset threshold value by comparing different image frames. For the target video, an image frame can be taken from the target video at intervals of a set time length, each taken image frame is respectively compared with the last taken image frame, and the region position of which the difference value of the pixel value corresponding to each image frame is smaller than a preset threshold value is obtained.
For example, a frame difference method is used to perform a differential operation on two frames of images: corresponding pixel points of the different image frames are subtracted, and the absolute value of the gray-level difference is examined; when the absolute value is smaller than a certain threshold, the pixel is judged to belong to a stationary target whose image is unchanged, thereby detecting the regions whose image does not change. In particular, any suitable manner may be used to compare different image frames, which is not limited in the embodiment of the present invention.
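A minimal sketch of this frame-difference detection, assuming OpenCV frames in BGR format; the difference threshold and minimum area are illustrative, not taken from the patent.

```python
import cv2
import numpy as np

def stable_region_boxes(frame_a, frame_b, diff_threshold=10, min_area=500):
    # Pixels whose gray-level difference between two sampled frames is
    # below the threshold are treated as unchanged; connected unchanged
    # areas are returned as candidate region positions.
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    stable = np.uint8(cv2.absdiff(gray_a, gray_b) < diff_threshold) * 255
    contours, _ = cv2.findContours(stable, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= min_area]
```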
Step 203, respectively detecting the areas containing the texts in the plurality of image frames to obtain the positions of the areas containing the texts in the plurality of image frames.
In an embodiment of the present invention, another method for detecting the location of the region with the title characteristic is to detect the region containing the text in the image frame, and obtain the location of the region containing the text in the image frame. For the target video, an image frame can be taken from the target video at intervals of a set time length, and the area containing the text in each image frame is detected respectively to obtain the position of the area containing the text in each image frame. For example, the location of the region in the image containing text may be roughly found by the MSER (Maximally Stable Extremal Regions, maximum stable extremum region) algorithm. Any suitable text detection mode may be specifically adopted, and embodiments of the present invention are not limited thereto.
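A sketch of this MSER-based variant, assuming OpenCV's default MSER parameters:

```python
import cv2

def text_region_boxes(frame):
    # Rough text localisation: MSER returns maximally stable extremal
    # regions, whose bounding boxes approximate text-containing areas.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()
    _, boxes = mser.detectRegions(gray)  # boxes come back as (x, y, w, h)
    return list(boxes)
```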
Step 204, respectively performing edge detection on the plurality of image frames to obtain the region positions of the regions surrounded by the edges in the plurality of image frames.
In the embodiment of the invention, the region where the title is located in the video has straight or regular edges, so another method for detecting the position of the region with the title characteristic is to perform edge detection on the image frame to obtain the edges in the image frame, and then the region position of the region surrounded by the edges can be obtained. For the target video, an image frame can be taken from the target video at intervals of a set time length, and edge detection is respectively carried out on each image frame to obtain the region position of the region surrounded by the edge in each image frame. Any suitable edge detection method may be specifically adopted, for example, a Canny edge detection algorithm, a Sobel edge detection algorithm, and the like, which is not limited in the embodiment of the present invention.
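The Canny-based variant might look like the following sketch; the hysteresis thresholds and minimum area are illustrative assumptions.

```python
import cv2

def edge_enclosed_boxes(frame, low=50, high=150, min_area=500):
    # Detect edges, then return the bounding boxes of the areas the
    # edge contours enclose (e.g. a rectangular title frame).
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= min_area]
```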
Step 205, counting the region positions corresponding to the plurality of image frames, and generating the frequency distribution data of the region positions.
In the embodiment of the invention, the detected region positions may differ from one image frame to another. After the region positions are detected from each of the plurality of image frames, the region positions are counted to generate the frequency distribution data of the region positions. The frequency distribution data characterize how often each region position is detected. For example, as shown in fig. 3, the frequency distribution data may be presented in the form of a title hot zone feature map, where the higher the brightness of a region in the map, the higher its count, that is, the greater the likelihood that the region is the title region. As another example, as shown in fig. 4, the frequency distribution data may be displayed in the form of a cut-off point hot zone feature map, where the higher the brightness of a vertical line, the higher its count, that is, the greater the likelihood that the line is an edge of the title frame.
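A minimal sketch of how such frequency distribution data could be accumulated into a hot zone map, assuming the region positions are (x, y, w, h) boxes:

```python
import numpy as np

def region_heatmap(frame_height, frame_width, boxes_per_frame):
    # Every detected region votes for the pixels it covers, so brighter
    # positions in the resulting map were detected in more frames.
    heat = np.zeros((frame_height, frame_width), dtype=np.int32)
    for boxes in boxes_per_frame:  # one list of (x, y, w, h) boxes per frame
        for x, y, w, h in boxes:
            heat[y:y + h, x:x + w] += 1
    return heat
```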
Step 206, performing gradient operation on the frequency distribution data to obtain a gradient operation result.
In the embodiment of the invention, the gradient at the pixel point is a vector with a magnitude and a direction, and for the frequency distribution data, the direction of the gradient is the direction in which the frequency of the pixel point changes the fastest, and the magnitude of the gradient is the rate of change of the frequency of the pixel point. And carrying out gradient operation on the frequency distribution data, wherein the obtained gradient operation result comprises gradients at various positions corresponding to the pixel points on the image frame.
Step 207, determining the position of the title candidate region in the image frame according to the gradient operation result.
In the embodiment of the present invention, the manner of determining the position of the title candidate region in the image frame according to the gradient operation result may include various manners, for example, dividing the title candidate region by using the pixel point with the largest gradient as the edge; or selecting pixels with gradients between a preset first gradient threshold and a preset second gradient threshold as edges, and dividing a title candidate region, or any other applicable manner, which is not limited in the embodiment of the present invention.
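As one possible reading of the gradient step, the sketch below projects the frequency map onto rows and columns and takes the strongest gradient transitions as the edges of the title candidate region; this 1-D interpretation is an assumption for illustration, and other readings are possible.

```python
import numpy as np

def title_candidate_box(heat):
    # The rows/columns where the count distribution changes fastest are
    # taken as the edges of the candidate region.
    row_grad = np.abs(np.gradient(heat.sum(axis=1).astype(float)))
    col_grad = np.abs(np.gradient(heat.sum(axis=0).astype(float)))
    y0, y1 = sorted(int(i) for i in np.argsort(row_grad)[-2:])
    x0, x1 = sorted(int(i) for i in np.argsort(col_grad)[-2:])
    return x0, y0, x1 - x0, y1 - y0  # (x, y, w, h) of the candidate region
```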
Step 208, respectively performing text recognition on the title candidate areas in the plurality of image frames to obtain candidate texts.
In the embodiment of the invention, text recognition is carried out on the title candidate areas in each image frame respectively to obtain candidate texts corresponding to each image frame. For example, text within the title candidate region is identified using OCR (Optical Character Recognition ) techniques to obtain candidate text.
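A sketch of this recognition step; Tesseract and the simplified-Chinese language pack are assumptions, since the patent only specifies OCR in general.

```python
import cv2
import pytesseract  # assumed OCR backend

def recognize_candidate_text(frame, box, lang="chi_sim"):
    # Crop the title candidate region and run OCR on it; any OCR engine
    # could be substituted here.
    x, y, w, h = box
    patch = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(patch, lang=lang).strip()
```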
Step 209, selecting a target title of the target video according to the candidate text.
In the embodiment of the invention, if the candidate texts obtained in the plurality of image frames are the same, the candidate texts can be directly used as the target titles, or whether the candidate texts can be used as the target titles can be judged first. If the candidate texts obtained in the plurality of image frames are not identical, one candidate text needs to be selected as a target title.
In an embodiment of the present invention, optionally, performing text recognition on the title candidate regions in the plurality of image frames, to obtain a candidate text may include: when the title candidate region is subjected to text recognition, generating the occurrence time length and/or time distribution of the candidate text in the target video according to the image frames of the recognized candidate text; correspondingly, before selecting the target title of the target video according to the candidate text, the method further comprises the following steps: and eliminating candidate texts of which the appearance duration and/or time distribution do not meet preset time conditions.
When text recognition is performed on the image frames, the timestamps of the image frames are extracted. According to the image frames in which a candidate text is recognized, the occurrence duration and/or time distribution of that candidate text in the target video can be generated. For example, suppose candidate text A is first recognized in the 1st to x-th image frames, candidate text B is then recognized in the (x+1)-th to y-th image frames, and candidate text A is recognized again in the (y+1)-th to z-th image frames. The occurrence duration and time distribution of candidate text A can then be generated from the timestamps of the 1st to x-th image frames and of the (y+1)-th to z-th image frames, and the occurrence duration and time distribution of candidate text B from the timestamps of the (x+1)-th to y-th image frames.
The occurrence duration and/or time distribution of the titles in the video need to meet preset time conditions, wherein the preset time conditions can be set according to actual conditions, and the embodiment of the invention is not limited to the preset time conditions. And eliminating the candidate texts which do not meet the preset time condition, taking the rest candidate texts as target titles, or further selecting the target titles from the rest candidate texts. According to the occurrence duration and/or time distribution, a plurality of candidate texts which are not titles can be filtered, and the accuracy of title generation is improved.
For example, a plurality of candidate texts are selected and tracked, and their occurrence durations and time distributions are recorded for time-series analysis. The preset time condition may include that the occurrence duration of a candidate text must exceed a preset duration, in which case candidate texts whose occurrence duration does not exceed the preset duration are eliminated; the preset time condition may also require that no interruption in a candidate text's time distribution last longer than a preset interruption duration, in which case candidate texts whose interruptions exceed the preset interruption duration are eliminated.
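A sketch of such a timing filter, assuming each candidate text is mapped to the sorted timestamps (in seconds) of the frames it was recognized in; both thresholds are illustrative preset time conditions.

```python
def filter_by_timing(candidates, min_duration=2.0, max_gap=1.0):
    # Keep a candidate when its total on-screen span is long enough and
    # no interruption between consecutive sightings exceeds `max_gap`.
    kept = {}
    for text, stamps in candidates.items():
        duration = stamps[-1] - stamps[0]
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        if duration >= min_duration and all(g <= max_gap for g in gaps):
            kept[text] = stamps
    return kept
```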
In the embodiment of the present invention, optionally, before selecting the target title of the target video according to the candidate text, the method may further include: detecting at least one of attribute information of text elements in the candidate text, text length of the candidate text and character category contained in the candidate text; and eliminating candidate texts which do not accord with preset rules from at least one of the attribute information, the text length and the character category.
The text elements include words, phrases, sentences, etc., and the attribute information of the text elements includes parts of speech, semantics, etc., or any other suitable attribute information, which the embodiments of the present invention do not limit. The character categories include numeric categories, chinese categories, punctuation categories, etc., or any other suitable character category, as embodiments of the invention are not limited in this regard.
Attribute information, text length, character category and the like of text elements of a title in the video need to conform to preset rules, wherein the preset rules can be set according to actual conditions, and the embodiment of the invention is not limited to the preset rules. And eliminating the candidate texts which do not accord with the preset rule, taking the rest candidate texts as target titles, or further selecting the target titles from the rest candidate texts. According to the preset rule, a plurality of candidate texts which are not titles can be filtered, and the accuracy of title generation is improved.
For example, after the candidate texts are obtained by text recognition, candidate texts that are unlikely to be titles are first filtered out according to text length and character category, and the rest are kept as candidates. The preset rules here may include that the number of characters must fall within a preset range, that the text must not contain characters of preset categories, and the like. Candidate titles are then selected from the candidate texts that remain after the filtering by occurrence duration and time distribution. Semantic analysis is performed on these candidate titles, detecting their parts of speech, semantics and the like, and candidate texts that do not meet the preset rules are filtered out; such rules may include that the text must not contain verbs, must not contain person names, and the like.
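The sketch below shows what such rule-based filtering could look like; the length window and banned character patterns are illustrative stand-ins for the patent's preset rules.

```python
import re

def passes_preset_rules(text, min_len=5, max_len=30):
    # Length window: titles are rarely very short or very long.
    if not (min_len <= len(text) <= max_len):
        return False
    # Banned character categories (illustrative): long digit runs and
    # repeated punctuation rarely occur in titles.
    if re.search(r"[0-9]{5,}", text) or re.search(r"[!?;:,.]{2,}", text):
        return False
    return True

print(passes_preset_rules("City opens new rail line"))    # True
print(passes_preset_rules("20210127 archive id 000001"))  # False: digit run
```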
In the embodiment of the present invention, optionally, according to the candidate text, an implementation manner of selecting the target title of the target video may include: inputting the candidate text into a title detection network; detecting whether the candidate text can be used as a title by the title detection network, and outputting title confidence; and selecting the candidate text with the highest title confidence as a target title.
The title detection network is used for detecting whether a text can serve as a title, and it is trained with title text samples and non-title text samples. For example, a classification network model, namely the title detection network, is trained in a supervised-learning manner on a large corpus of news titles and non-title texts. The trained title detection network can detect whether a candidate text can serve as a title: the candidate text is input into the title detection network, and the network outputs a title confidence, which characterizes the probability that the candidate text is a title. The candidate texts are then ranked by title confidence, and the candidate text with the highest title confidence is selected as the target title.
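Since the patent does not disclose the network architecture, the sketch below assumes a minimal bag-of-characters classifier in PyTorch purely for illustration; training on title/non-title samples would use a standard binary cross-entropy loss.

```python
import torch
import torch.nn as nn

class TitleDetector(nn.Module):
    # Stand-in title detection network: a bag-of-characters embedding
    # followed by a linear head that outputs the title confidence.
    def __init__(self, vocab_size=8192, embed_dim=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim)
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, token_ids, offsets):
        return torch.sigmoid(self.head(self.embed(token_ids, offsets)))

def encode(text, vocab_size=8192):
    # Hash each character into the vocabulary; a real system would use a
    # trained tokenizer instead.
    ids = torch.tensor([hash(c) % vocab_size for c in text], dtype=torch.long)
    return ids, torch.tensor([0], dtype=torch.long)

def pick_title(candidate_texts, model):
    # Score every candidate and return the one with the highest confidence.
    scored = [(model(*encode(t)).item(), t) for t in candidate_texts]
    return max(scored)[1]
```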
In summary, according to the embodiment of the present invention, a target video is acquired; by comparing different image frames, the region positions where the difference of pixel values between different image frames is smaller than a preset threshold are obtained; regions containing text are detected in the plurality of image frames to obtain the positions of the text-containing regions; edge detection is performed on the plurality of image frames to obtain the positions of the regions surrounded by edges; the region positions corresponding to the plurality of image frames are counted to generate frequency distribution data of the region positions; gradient operation is performed on the frequency distribution data to obtain a gradient operation result; the position of the title candidate region in the image frames is determined according to the gradient operation result; text recognition is performed on the title candidate regions in the plurality of image frames to obtain candidate texts; and the target title of the target video is selected according to the candidate texts. In this way, the title candidate regions in which the title may appear are obtained by analyzing the plurality of image frames, the numerous pieces of confusing information in the video are eliminated, and the accuracy of title determination is improved; the target title is then generated automatically, which reduces the degree of manual intervention in title generation and improves the efficiency of generating titles for videos.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Referring to FIG. 5, a block diagram of an embodiment of a title generating apparatus according to the present invention is shown, which may specifically include the following modules:
a video acquisition module 301, configured to acquire a target video;
a position detection module 302, configured to detect regions with title characteristics from a plurality of image frames of the target video, respectively, to obtain region positions;
a region determining module 303, configured to determine a position of a title candidate region in the image frames according to the region positions corresponding to the plurality of image frames;
and the title generation module 304 is configured to perform text recognition on the title candidate regions in the plurality of image frames to obtain a target title of the target video.
Optionally, the position detection module includes at least one of:
the comparison sub-module is used for obtaining the region position of which the difference value of the pixel values between different image frames is smaller than a preset threshold value by comparing the different image frames;
the text detection sub-module is used for respectively detecting areas containing texts in the plurality of image frames to obtain the positions of the areas containing the texts in the plurality of image frames;
and the edge detection sub-module is used for respectively carrying out edge detection on the plurality of image frames to obtain the region positions of the regions surrounded by the edges in the plurality of image frames.
Optionally, the area determining module includes:
the data generation sub-module is used for counting the region positions corresponding to the plurality of image frames and generating the frequency distribution data of the region positions;
the gradient operation sub-module is used for carrying out gradient operation on the frequency distribution data to obtain a gradient operation result;
and the region determining submodule is used for determining the position of the title candidate region in the image frame according to the gradient operation result.
Optionally, the apparatus further comprises:
the detection module is used for respectively detecting whether the image change rate of the title candidate region is smaller than a preset threshold value for each image frame before the title candidate region in the plurality of image frames is subjected to text recognition to obtain a target title of the target video;
and the region eliminating module is used for eliminating the title candidate region with the image change rate larger than a preset threshold value.
Optionally, the title generation module includes:
the text recognition sub-module is used for respectively carrying out text recognition on the title candidate areas in the plurality of image frames to obtain candidate texts;
and the title selecting sub-module is used for selecting the target title of the target video according to the candidate text.
Optionally, the text recognition submodule includes:
the time generation unit is used for generating the occurrence time length and/or time distribution of the candidate text in the target video according to the image frames of the identified candidate text when the text identification is carried out on the title candidate region;
the title selecting submodule comprises:
the first eliminating unit is used for eliminating the candidate texts of which the appearance duration and/or time distribution do not meet the preset time condition before the target title of the target video is selected according to the candidate texts.
Optionally, the title selecting submodule includes:
the attribute detection unit is used for detecting at least one of attribute information of text elements in the candidate text, text length of the candidate text and character types contained in the candidate text before the target title of the target video is selected according to the candidate text;
and the second eliminating unit is used for eliminating the candidate text which does not accord with the preset rule from at least one of the attribute information, the text length and the character category.
Optionally, the title selecting submodule includes:
an input unit for inputting the candidate text into a title detection network; the title detection network is used for detecting whether the text can be used as a title, and is obtained by training a title text sample and a non-title text sample;
an output unit for detecting whether the candidate text can be used as a title by the title detection network and outputting a title confidence;
and the selecting unit is used for selecting the candidate text with the highest title confidence as the target title.
In summary, according to the embodiment of the present invention, a target video is acquired; regions having the title characteristic are detected from a plurality of image frames of the target video to obtain region positions; the position of the title candidate region in the image frames is determined according to the region positions corresponding to the plurality of image frames; and text recognition is performed on the title candidate regions in the plurality of image frames to obtain the target title of the target video. In this way, the title candidate regions in which the title of the video may appear are obtained by analyzing the plurality of image frames of the target video, the numerous pieces of confusing information in the video are eliminated, and the accuracy of title determination is improved. Text recognition is then performed on the title candidate regions, so that the target title of the target video is generated automatically, the degree of manual intervention in title generation is reduced, and the efficiency of generating titles for videos is improved.
The embodiment of the invention also provides an electronic device, as shown in fig. 6, which comprises a processor 601, a communication interface 602, a memory 603 and a communication bus 604, wherein the processor 601, the communication interface 602 and the memory 603 complete communication with each other through the communication bus 604,
a memory 603 for storing a computer program;
the processor 601 is configured to execute the program stored in the memory 603, and implement the following steps:
acquiring a target video;
detecting areas with title characteristics from a plurality of image frames of the target video respectively to obtain area positions;
determining the positions of the title candidate areas in the image frames according to the area positions corresponding to the image frames;
and carrying out text recognition on the title candidate areas in the plurality of image frames to obtain a target title of the target video.
Optionally, detecting a region having a title characteristic from a plurality of image frames of the target video, and obtaining a region position includes at least one of:
obtaining the region position of which the difference value of pixel values between different image frames is smaller than a preset threshold value by comparing the different image frames;
detecting areas containing texts in the plurality of image frames respectively to obtain the positions of the areas containing the texts in the plurality of image frames;
and respectively carrying out edge detection on the plurality of image frames to obtain the region positions of the regions surrounded by the edges in the plurality of image frames.
Optionally, the determining the position of the title candidate region in the image frame according to the region positions corresponding to the plurality of image frames includes:
counting the region positions corresponding to the plurality of image frames to generate frequency distribution data of the region positions;
performing gradient operation on the frequency distribution data to obtain a gradient operation result;
and determining the position of the title candidate region in the image frame according to the gradient operation result.
Optionally, before the text recognition is performed on the title candidate regions in the plurality of image frames to obtain a target title of the target video, the method further includes:
for each image frame, respectively detecting whether the image change rate of the title candidate region is smaller than a preset threshold value;
and eliminating the title candidate region with the image change rate larger than a preset threshold value.
Optionally, the text identifying the title candidate region in the plurality of image frames, and obtaining the target title of the target video includes:
respectively carrying out text recognition on the title candidate areas in the plurality of image frames to obtain candidate texts;
and selecting a target title of the target video according to the candidate text.
Optionally, the text recognition is performed on the title candidate areas in the plurality of image frames, and obtaining candidate texts includes:
when the title candidate region is subjected to text recognition, generating the occurrence time length and/or time distribution of the candidate text in the target video according to the image frames of the recognized candidate text;
before the selecting the target title of the target video according to the candidate text, the method further includes:
and eliminating candidate texts of which the appearance duration and/or time distribution do not meet preset time conditions.
Optionally, before the selecting the target title of the target video according to the candidate text, the method further includes:
detecting at least one of attribute information of text elements in the candidate text, text length of the candidate text and character category contained in the candidate text;
and eliminating candidate texts which do not accord with preset rules from at least one of the attribute information, the text length and the character category.
Optionally, selecting the target title of the target video according to the candidate text includes:
inputting the candidate text into a title detection network; the title detection network is used for detecting whether the text can be used as a title, and is obtained by training a title text sample and a non-title text sample;
detecting whether the candidate text can be used as a title by the title detection network, and outputting title confidence;
and selecting the candidate text with the highest title confidence as a target title.
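A minimal sketch of such a title detection network as a binary text classifier, here in PyTorch, is given below. The bag-of-embeddings architecture and the encode helper are assumptions: the disclosure only states that the network is trained on title and non-title text samples and outputs a title confidence.

```python
import torch
import torch.nn as nn

class TitleDetector(nn.Module):
    def __init__(self, vocab_size, embed_dim=128):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)  # averaged token embeddings
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, token_ids, offsets):
        pooled = self.embedding(token_ids, offsets)
        return torch.sigmoid(self.head(pooled))  # title confidence in [0, 1]

def pick_title(model, encode, candidates):
    # encode: hypothetical tokenizer mapping text -> (token_ids, offsets).
    scores = {}
    with torch.no_grad():
        for text in candidates:
            token_ids, offsets = encode(text)
            scores[text] = model(token_ids, offsets).item()
    return max(scores, key=scores.get)  # candidate with the highest confidence
```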
The communication bus mentioned for the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bold line is drawn in the figures, but this does not mean there is only one bus or only one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include random access memory (RAM) or non-volatile memory, such as at least one disk storage. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is provided, having instructions stored therein which, when run on a computer, cause the computer to perform the method of any of the above embodiments.
In a further embodiment of the present invention, a computer program product containing instructions is provided which, when run on a computer, causes the computer to perform the method of any of the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," and any variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus comprising a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a(n) …" does not exclude the presence of other identical elements in the process, method, article, or apparatus comprising that element.
In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant details, see the corresponding parts of the method embodiment descriptions.
The foregoing description covers only preferred embodiments of the present invention and is not intended to limit its scope. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention falls within its protection scope.

Claims (10)

1. A title generation method, comprising:
acquiring a target video;
detecting regions having title characteristics in a plurality of image frames of the target video, respectively, to obtain region positions;
determining the positions of title candidate regions in the image frames, using the principle of statistical analysis, from the region positions corresponding to the plurality of image frames;
performing text recognition on the title candidate regions in the plurality of image frames to obtain a target title of the target video;
wherein determining the positions of the title candidate regions in the image frames, using the principle of statistical analysis, from the region positions corresponding to the plurality of image frames comprises:
accumulating the region positions corresponding to the plurality of image frames into frequency distribution data of the region positions;
performing a gradient operation on the frequency distribution data to obtain a gradient result;
and determining the position of the title candidate region in the image frames from the gradient result.
2. The method of claim 1, wherein detecting regions having title characteristics in a plurality of image frames of the target video, respectively, to obtain region positions comprises at least one of:
comparing different image frames to obtain the positions of regions where the pixel-value difference between frames is smaller than a preset threshold;
detecting text-containing regions in each of the plurality of image frames to obtain the positions of those regions in the plurality of image frames;
and performing edge detection on each of the plurality of image frames to obtain the positions of regions enclosed by edges in the plurality of image frames.
3. The method of claim 1, wherein before performing text recognition on the title candidate regions in the plurality of image frames to obtain a target title of the target video, the method further comprises:
detecting, for each image frame, whether the image change rate of the title candidate region is smaller than a preset threshold;
and eliminating title candidate regions whose image change rate exceeds the preset threshold.
4. The method of claim 1, wherein performing text recognition on the title candidate regions in the plurality of image frames to obtain the target title of the target video comprises:
performing text recognition on the title candidate regions in each of the plurality of image frames to obtain candidate texts;
and selecting the target title of the target video from the candidate texts.
5. The method of claim 4, wherein performing text recognition on the title candidate regions in the plurality of image frames to obtain candidate texts comprises:
when performing text recognition on the title candidate regions, generating the on-screen duration and/or time distribution of each candidate text within the target video from the image frames in which that candidate text was recognized;
and before selecting the target title of the target video from the candidate texts, the method further comprises:
eliminating candidate texts whose on-screen duration and/or time distribution does not meet a preset time condition.
6. The method of claim 4, wherein before selecting the target title of the target video from the candidate texts, the method further comprises:
detecting at least one of: attribute information of text elements in the candidate text, the text length of the candidate text, and the character categories contained in the candidate text;
and eliminating candidate texts whose attribute information, text length, or character category does not comply with preset rules.
7. The method of claim 4, wherein selecting the target title of the target video from the candidate texts comprises:
inputting the candidate texts into a title detection network, wherein the title detection network is configured to detect whether a text can serve as a title and is trained on title text samples and non-title text samples;
detecting, by the title detection network, whether each candidate text can serve as a title, and outputting a title confidence;
and selecting the candidate text with the highest title confidence as the target title.
8. A title generation apparatus, comprising:
a video acquisition module, configured to acquire a target video;
a position detection module, configured to detect regions having title characteristics in a plurality of image frames of the target video, respectively, to obtain region positions;
a region determination module, configured to determine the positions of title candidate regions in the image frames, using the principle of statistical analysis, from the region positions corresponding to the plurality of image frames;
a title generation module, configured to perform text recognition on the title candidate regions in the plurality of image frames to obtain a target title of the target video;
wherein the region determination module comprises:
a data generation sub-module, configured to accumulate the region positions corresponding to the plurality of image frames into frequency distribution data of the region positions;
a gradient operation sub-module, configured to perform a gradient operation on the frequency distribution data to obtain a gradient result;
and a region determination sub-module, configured to determine the position of the title candidate region in the image frames from the gradient result.
9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other via the communication bus;
the memory is configured to store a computer program;
and the processor is configured to carry out the method steps of any one of claims 1-7 when executing the program stored in the memory.
10. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1-7.
CN202110114237.2A 2021-01-27 2021-01-27 Title generation method, device, electronic equipment and storage medium Active CN112818984B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110114237.2A CN112818984B (en) 2021-01-27 2021-01-27 Title generation method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110114237.2A CN112818984B (en) 2021-01-27 2021-01-27 Title generation method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112818984A CN112818984A (en) 2021-05-18
CN112818984B true CN112818984B (en) 2023-10-24

Family

ID=75860043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110114237.2A Active CN112818984B (en) 2021-01-27 2021-01-27 Title generation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112818984B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807085B (en) * 2021-11-19 2022-03-04 成都索贝数码科技股份有限公司 Method for extracting title and subtitle aiming at news scene


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10795932B2 (en) * 2017-09-28 2020-10-06 Electronics And Telecommunications Research Institute Method and apparatus for generating title and keyframe of video
CN114666663A (en) * 2019-04-08 2022-06-24 百度(美国)有限责任公司 Method and apparatus for generating video

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101467145A (en) * 2006-07-24 2009-06-24 谷歌公司 Method and apparatus for automatically annotating images
KR20160027651A (en) * 2014-09-02 2016-03-10 에스케이텔레콤 주식회사 Method and apparatus for indexing moving picture
CN108171235A (en) * 2018-01-08 2018-06-15 北京奇艺世纪科技有限公司 Title area detection method and system
CN108229476A (en) * 2018-01-08 2018-06-29 北京奇艺世纪科技有限公司 Title area detection method and system
CN108288060A (en) * 2018-02-23 2018-07-17 北京奇艺世纪科技有限公司 Title detection method, device and electronic equipment in a kind of video
CN108446603A (en) * 2018-02-28 2018-08-24 北京奇艺世纪科技有限公司 A kind of headline detection method and device
CN108495185A (en) * 2018-03-14 2018-09-04 北京奇艺世纪科技有限公司 A kind of video title generation method and device
CN108769776A (en) * 2018-05-31 2018-11-06 北京奇艺世纪科技有限公司 Main title detection method, device and electronic equipment
EP3499900A2 (en) * 2018-05-31 2019-06-19 Beijing Baidu Netcom Science and Technology Co., Ltd. Video processing method, apparatus and device
WO2020253657A1 (en) * 2019-06-17 2020-12-24 腾讯科技(深圳)有限公司 Video clip positioning method and apparatus, computer device, and storage medium
CN110263214A (en) * 2019-06-21 2019-09-20 北京百度网讯科技有限公司 Generation method, device, server and the storage medium of video title
CN110399526A (en) * 2019-07-26 2019-11-01 腾讯科技(深圳)有限公司 Generation method, device and the computer readable storage medium of video title
BE1027349A1 (en) * 2020-04-01 2021-01-12 Yu Jian Trading Company Ltd A method, an apparatus, a storage medium and a terminal for generating a video title picture
CN111626049A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Title correction method and device for multimedia information, electronic equipment and storage medium
CN111984824A (en) * 2020-07-31 2020-11-24 河海大学 Multi-mode-based video recommendation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
From Video to Language: A Survey of Video Title Generation and Description; Tang Pengjie et al.; Acta Automatica Sinica; full text *

Also Published As

Publication number Publication date
CN112818984A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN109117777B (en) Method and device for generating information
CN110688526A (en) Short video recommendation method and system based on key frame identification and audio textualization
US10438050B2 (en) Image analysis device, image analysis system, and image analysis method
US8989491B2 (en) Method and system for preprocessing the region of video containing text
CN111274442B (en) Method for determining video tag, server and storage medium
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN111814770A (en) Content keyword extraction method of news video, terminal device and medium
CN111708909B (en) Video tag adding method and device, electronic equipment and computer readable storage medium
CN111314732A (en) Method for determining video label, server and storage medium
CN111541939B (en) Video splitting method and device, electronic equipment and storage medium
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
US11728914B2 (en) Detection device, detection method, and program
CN113205047A (en) Drug name identification method and device, computer equipment and storage medium
EP4273737A1 (en) Language labeling method and apparatus, and computer device and storage medium
CN112925905B (en) Method, device, electronic equipment and storage medium for extracting video subtitles
CN111191591A (en) Watermark detection method, video processing method and related equipment
CN112818984B (en) Title generation method, device, electronic equipment and storage medium
CN114220057A (en) Video trailer identification method and device, electronic equipment and readable storage medium
CN111783812A (en) Method and device for identifying forbidden images and computer readable storage medium
CN114443898B (en) Video big data pushing method for Internet intelligent education
CN115879002A (en) Training sample generation method, model training method and device
CN114140782A (en) Text recognition method and device, electronic equipment and storage medium
KR101911613B1 (en) Method and apparatus for person indexing based on the overlay text of the news interview video
CN114140728A (en) Text type classification method and device, electronic equipment and storage medium
CN113590804B (en) Video theme generation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant