CN113821677A - Method, device and equipment for generating cover image and storage medium

Method, device and equipment for generating cover image and storage medium

Info

Publication number
CN113821677A
Authority
CN
China
Prior art keywords
target
text
description information
video
cover image
Prior art date
Legal status
Pending
Application number
CN202110622132.8A
Other languages
Chinese (zh)
Inventor
陈小帅 (Chen Xiaoshuai)
Current Assignee
Tencent Technology Beijing Co Ltd
Original Assignee
Tencent Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Beijing Co Ltd filed Critical Tencent Technology Beijing Co Ltd
Priority to CN202110622132.8A
Publication of CN113821677A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval of video data
    • G06F16/73 Querying
    • G06F16/738 Presentation of query results
    • G06F16/739 Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data

Abstract

The embodiments of the application provide a method, an apparatus, a device and a storage medium for generating a cover image, which relate to the field of artificial intelligence. In the method, feature extraction is performed on each piece of text information separately to obtain a first text feature of each piece of text information, so that each piece of text information obtains a better feature representation. The first text features are fused to jointly model the contextual relations among the pieces of text information, so that the resulting comprehensive text feature characterizes the content of the target video more accurately, which effectively improves the accuracy of predicting at least one piece of target description information corresponding to the target video based on the comprehensive text feature. By combining the original cover image of the target video with the target description information, a target cover image that better represents the video highlights is obtained, so that a user can find the video highlights quickly and intuitively, improving the user's viewing experience.

Description

Method, device and equipment for generating cover image and storage medium
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a method, a device, equipment and a storage medium for generating a cover image.
Background
With the development of video technology, users have increasingly high expectations for the viewing experience. When a user browses a recommended video, if the user cannot quickly find the highlights of the video, the user may skip the current video.
To help users find video highlights quickly and intuitively, a video creator manually summarizes the highlights and composites them into the cover image before uploading the video to a video platform. However, this approach requires a high level of authoring skill from the creator and a long time spent distilling the highlight content of the video, which is inefficient.
Disclosure of Invention
The embodiments of the application provide a method, an apparatus, a device and a storage medium for generating a cover image, which are used to improve the accuracy and efficiency of extracting video highlights and compositing them into a cover image.
In one aspect, an embodiment of the present application provides a method for generating a cover image, where the method includes:
performing feature extraction on at least one piece of text information in a target video to obtain first text features respectively corresponding to the at least one piece of text information;
fusing the obtained first text features to obtain a comprehensive text feature corresponding to the target video;
predicting at least one piece of target description information corresponding to the target video based on the comprehensive text feature;
generating at least one target cover image based on an original cover image of the target video and the at least one piece of target description information, wherein each target cover image contains at least one piece of target description information.
In one aspect, an embodiment of the present application provides an apparatus for generating a cover image, where the apparatus includes:
a feature extraction module, used for performing feature extraction on at least one piece of text information in a target video to obtain first text features respectively corresponding to the at least one piece of text information;
a fusion module, used for fusing the obtained first text features to obtain a comprehensive text feature corresponding to the target video;
a prediction module, used for predicting at least one piece of target description information corresponding to the target video based on the comprehensive text feature;
a combining module, used for generating at least one target cover image based on an original cover image of the target video and the at least one piece of target description information, wherein each target cover image contains at least one piece of target description information.
Optionally, the prediction module is specifically configured to:
predicting, based on the comprehensive text feature, a first probability that each piece of the at least one piece of text information is target description information;
and selecting at least one piece of target description information from the at least one piece of text information based on the obtained first probabilities.
Optionally, the prediction module is specifically configured to:
encoding the comprehensive text feature to obtain second text features respectively corresponding to the at least one piece of text information;
decoding the second text features respectively corresponding to the at least one piece of text information to generate at least one piece of candidate description information corresponding to the target video and a second probability that each piece of candidate description information is target description information;
and selecting at least one piece of target description information from the at least one piece of candidate description information based on the obtained second probabilities.
Optionally, the at least one piece of target description information includes at least one piece of first-type description information and at least one piece of second-type description information;
the prediction module is specifically configured to:
encoding the comprehensive text feature to obtain second text features respectively corresponding to the at least one piece of text information;
predicting, based on the second text feature corresponding to each piece of the at least one piece of text information, a first probability that the piece of text information is first-type description information;
selecting at least one piece of first-type description information from the at least one piece of text information based on the obtained first probabilities;
decoding the second text features respectively corresponding to the at least one piece of text information to generate at least one piece of candidate description information corresponding to the target video and a second probability that each piece of candidate description information is second-type description information;
and selecting at least one piece of second-type description information from the at least one piece of candidate description information based on the obtained second probabilities.
Optionally, the combining module is specifically configured to:
detecting a target display position of a key element in the original cover image;
for each piece of target description information in the at least one piece of target description information, respectively performing the following operations:
for one piece of target description information, determining a target adding position for the piece of target description information from positions other than the target display position;
and fusing the piece of target description information to the target adding position in the original cover image to obtain a target cover image containing the piece of target description information.
Optionally, the apparatus further includes a recommendation module;
the recommendation module is specifically configured to:
after the piece of target description information is fused to the target adding position in the original cover image and a target cover image containing the piece of target description information is obtained, determining degrees of matching between an interest tag of a target user and each piece of the at least one piece of target description information;
determining recommendation description information from the at least one piece of target description information based on the obtained degrees of matching;
and when the target video is recommended to the target user, selecting a target cover image containing the recommendation description information as the cover image of the target video.
Optionally, the recommendation module is specifically configured to:
performing feature extraction on the interest tag of the target user to obtain an interest feature of the target user;
for each piece of target description information in the at least one piece of target description information, respectively performing the following operations:
performing feature extraction on one piece of target description information to obtain a description feature of the piece of target description information;
fusing the interest feature and the description feature to obtain a first associated feature;
and predicting the degree of matching between the interest tag of the target user and the piece of target description information based on the first associated feature.
Optionally, the prediction module is further configured to:
and determining that the original cover image of the target video does not contain the target description information before predicting at least one piece of target description information corresponding to the target video based on the comprehensive text characteristics.
Optionally, the prediction module is specifically configured to:
acquiring cover text information in the original cover image;
performing feature extraction on the cover text information to obtain a cover text feature corresponding to the cover text information;
fusing the comprehensive text feature and the cover text feature to obtain a second associated feature;
predicting a third probability that the cover text information is target description information based on the second associated feature;
and if the third probability is smaller than a preset threshold, determining that the original cover image does not contain target description information.
In one aspect, embodiments of the present application provide a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method for generating a cover image.
In one aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a computer device, which when run on the computer device, causes the computer device to perform the steps of the above-described method for generating a cover image.
In the embodiments of the application, feature extraction is first performed on each piece of text information separately to obtain a first text feature of each piece of text information, so that each piece of text information obtains a better feature representation. The first text features are fused to jointly model the contextual relations among the pieces of text information, so that the resulting comprehensive text feature characterizes the content of the target video more accurately, which effectively improves the accuracy and efficiency of predicting at least one piece of target description information corresponding to the target video based on the comprehensive text feature. Further, the original cover image of the target video is combined with the target description information to obtain a target cover image that better represents the video highlights, so that a user can find the video highlights quickly and intuitively, improving the user's viewing experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings based on these drawings without inventive effort.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for generating a cover image according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a method for extracting highlight content according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of a method for extracting highlight content according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart of a method for extracting highlight content according to an embodiment of the present disclosure;
fig. 6 is a schematic flowchart of a method for extracting highlight content according to an embodiment of the present disclosure;
fig. 7 is a flowchart illustrating a method for obtaining target description information according to an embodiment of the present application;
fig. 8 is a schematic flowchart of a method for extracting highlight content according to an embodiment of the present disclosure;
fig. 9 is a schematic flowchart of a method for determining a highlight content adding position according to an embodiment of the present disclosure;
FIG. 10 is a schematic flow chart diagram illustrating a method for obtaining a target cover image according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a highlight determination model according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a highlight determination model according to an embodiment of the present application;
fig. 13 is a schematic flowchart of a video recommendation method according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a recommendation interface for a video application according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a highlight matching model according to an embodiment of the present disclosure;
fig. 16 is a schematic structural diagram of a highlight matching model according to an embodiment of the present disclosure;
FIG. 17 is a flowchart illustrating a method for generating cover images and video recommendations according to an embodiment of the present disclosure;
FIG. 18 is a schematic structural diagram of an apparatus for generating a cover image according to an embodiment of the present disclosure;
fig. 19 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
For convenience of understanding, terms referred to in the embodiments of the present invention are explained below.
Artificial Intelligence (AI) is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods for effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like. For example, in the embodiments of the present application, NLP is used to obtain at least one piece of target description information corresponding to a target video, and then at least one target cover image is generated based on an original cover image of the target video and the at least one piece of target description information.
BERT (Bidirectional Encoder Representations from Transformers): an encoder based on a bidirectional Transformer that captures character-level, word-level, sentence-level and even inter-sentence relational features. ALBERT is a lightweight variant of BERT.
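As a concrete illustration (an assumption of this description rather than part of the patent), a sentence-level first text feature can be obtained from an off-the-shelf ALBERT checkpoint through the HuggingFace transformers library; the checkpoint name and the mean-pooling choice below are illustrative, and Python is used for all sketches in this description.

```python
# Minimal sketch: per-sentence "first text feature" extraction with an assumed ALBERT checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")   # illustrative checkpoint
encoder = AutoModel.from_pretrained("albert-base-v2")

def first_text_feature(sentence: str) -> torch.Tensor:
    """Return one fixed-size vector for a piece of text information."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state          # (1, seq_len, hidden)
    return hidden.mean(dim=1).squeeze(0)                      # mean-pool to a (hidden,) vector

features = [first_text_feature(s) for s in ["video title", "subtitle sentence 1"]]
```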
Seq2Seq (Sequence-to-Sequence) model: a model composed of an encoder and a decoder. Seq2Seq models can be applied to scenarios such as machine translation, human-machine dialogue and chatbots.
Video effect enhancement: the highlight content of a video is automatically extracted and intelligently composited onto the cover image, so that a user can find the highlight content of the video conveniently and intuitively, which increases the play count of the video. In addition, a cover image containing the corresponding highlight content is dynamically selected for display according to the personalized viewing profile of the user.
The following is a description of the design concept of the embodiments of the present application.
When a user browses a recommended video, if the user cannot quickly find the highlights of the video, the user may skip the current video. To help users find video highlights quickly and intuitively, a video creator manually summarizes the highlights and composites them into the cover image before uploading the video to a video platform. However, this approach requires a high level of authoring skill from the creator and a long time spent distilling the highlight content of the video, which is inefficient.
Analysis shows that, to help users understand the content of a video, a video generally carries text information corresponding to the video content, such as a title, subtitles, bullet comments and narration text. This text information often contains highlight information related to the video content. If natural language processing technology is used to perform multi-dimensional, deep understanding of the text information, the highlight content of the video can be obtained automatically, which improves the efficiency of obtaining video highlight information. Automatically combining the highlight content with the cover image enhances the video presentation, so that users can find video highlights quickly and intuitively, improving the viewing experience.
In view of this, an embodiment of the present application provides a method for generating a cover image, including: performing feature extraction on at least one piece of text information in a target video to obtain first text features respectively corresponding to the at least one piece of text information, and then fusing the obtained first text features to obtain a comprehensive text feature corresponding to the target video; predicting at least one piece of target description information corresponding to the target video based on the comprehensive text feature; and then generating at least one target cover image based on the original cover image of the target video and the at least one piece of target description information, wherein each target cover image contains at least one piece of target description information.
In the embodiments of the application, feature extraction is first performed on each piece of text information separately to obtain a first text feature of each piece of text information, so that each piece of text information obtains a better feature representation. The first text features are fused to jointly model the contextual relations among the pieces of text information, so that the resulting comprehensive text feature characterizes the content of the target video more accurately, which effectively improves the accuracy and efficiency of predicting the at least one piece of target description information based on the comprehensive text feature. Further, the original cover image of the target video is combined with the target description information to obtain a target cover image that better represents the video highlights, so that a user can find the video highlights quickly and intuitively, improving the user's viewing experience.
Referring to fig. 1, a system architecture diagram applicable to the embodiment of the present application is shown, where the system architecture includes at least a terminal device 101 and a server 102.
Video applications are pre-installed in the terminal device 101, where the video applications include video playing applications, short video applications, live streaming applications, and the like, and the types of video applications include client applications, web applications, mini-program applications, and the like. The terminal device 101 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart television, a vehicle-mounted device, and the like.
The server 102 is a backend server of the video application. The server 102 may be an independent physical server, a server cluster or distributed system composed of a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), big data and artificial intelligence platforms. The terminal device 101 and the server 102 may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
The method for generating the cover image in the embodiment of the present application may be executed by the terminal device 101, or may be executed by the server 102.
In the first case, the method for generating the cover image in the embodiment of the present application may be executed by the terminal device 101.
The terminal device 101 performs feature extraction on at least one piece of text information in the target video to obtain first text features respectively corresponding to the at least one piece of text information, and then fuses the obtained first text features to obtain a comprehensive text feature corresponding to the target video; predicts at least one piece of target description information corresponding to the target video based on the comprehensive text feature; and then generates at least one target cover image based on the original cover image of the target video and the at least one piece of target description information, wherein each target cover image contains at least one piece of target description information. When the terminal device 101 recommends the target video to a target user, it determines recommendation description information matching the interest tag of the target user from the at least one piece of target description information, then selects a target cover image containing the recommendation description information as the cover image of the target video, and displays the cover image of the target video in a recommendation interface of the video application.
In a second case, the method for generating the cover image in the embodiment of the present application may be executed by the server 102.
The server 102 performs feature extraction on at least one piece of text information in the target video to obtain first text features respectively corresponding to the at least one piece of text information, and then fuses the obtained first text features to obtain a comprehensive text feature corresponding to the target video; predicts at least one piece of target description information corresponding to the target video based on the comprehensive text feature; and then generates at least one target cover image based on the original cover image of the target video and the at least one piece of target description information, wherein each target cover image contains at least one piece of target description information. When the server 102 recommends the target video to a target user, it determines recommendation description information matching the interest tag of the target user from the at least one piece of target description information, and then selects a target cover image containing the recommendation description information as the cover image of the target video. The server 102 sends the cover image of the target video to the terminal device 101, and the terminal device 101 displays the cover image of the target video in a recommendation interface of the video application.
It should be noted that, in the embodiment of the present application, the method for generating a cover image may also be performed by the terminal device 101 and the server 102 interactively, which is not described herein again.
Based on the system architecture shown in fig. 1, an embodiment of the present application provides a method for generating a cover image. As shown in fig. 2, the method is executed by a computer device, which may be the terminal device 101 or the server 102 shown in fig. 1, and includes the following steps:
step S201, performing feature extraction on at least one text message in the target video to obtain first text features corresponding to the at least one text message respectively.
Specifically, the text information in the target video may be a title, a subtitle, a bullet screen, narration text, or the like. A text message may refer to a sentence, a word, a paragraph, etc. The method comprises the steps of obtaining text content in a target video in a voice recognition or image recognition mode, dividing the text content in the target video into a plurality of text messages in a word segmentation or sentence segmentation mode, and then performing feature extraction on each text message by using an Encoder to obtain a first text feature corresponding to each text message, wherein the Encoder can be an ALBERT, a BERT, a Transformer-Encoder and the like. It should be noted that neural network structures such as encoders and the like in the embodiments of the present application can be stored in a block chain, and are not described in detail later.
And S202, fusing the obtained first text features to obtain comprehensive text features corresponding to the target video.
Specifically, context relationship exists among text messages in a video, and when one text message is observed alone, whether the text message is the highlight content in the video cannot be seen in some cases, but in combination with the context of the text message, whether the text message is the highlight content in the video can be determined relatively accurately. Therefore, the obtained first text features are fused, context joint modeling is carried out on the relation among all text information, and the obtained comprehensive text features corresponding to the target video can more accurately represent the highlight content of the target video.
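A minimal sketch of the fusion step, assuming the per-sentence first text features are simply stacked and passed through a small Transformer encoder so that self-attention performs the context joint modeling; the layer sizes are illustrative and not taken from the patent.

```python
# Minimal sketch: fuse per-sentence features into a context-aware comprehensive text feature.
import torch
import torch.nn as nn

HIDDEN = 768                                     # assumed hidden size matching the sentence encoder
context_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, batch_first=True),
    num_layers=2,
)

def fuse(first_text_features: list[torch.Tensor]) -> torch.Tensor:
    """Stack the sentence features so self-attention can model their contextual relations."""
    seq = torch.stack(first_text_features).unsqueeze(0)       # (1, num_sentences, HIDDEN)
    return context_encoder(seq)                               # comprehensive text feature, same shape
```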
Step S203, predicting at least one piece of target description information corresponding to the target video based on the comprehensive text feature.
Specifically, the target description information is used to characterize the main content or highlight content of the target video. It may be a sentence, a word or multiple words, and it may be extracted from the at least one piece of text information or generated based on the at least one piece of text information.
Step S204, generating at least one target cover image based on the original cover image of the target video and the at least one piece of target description information.
Specifically, each target cover image contains at least one piece of target description information. When the original cover image is combined with the target description information, the target description information may be rendered in different styles, configured in terms of font, color, font size, display angle, display position, display mode, and the like.
In the embodiments of the application, feature extraction is first performed on each piece of text information separately to obtain the first text feature of each piece of text information, so that each piece of text information obtains a better feature representation. The first text features are fused to jointly model the contextual relations among the pieces of text information, so that the resulting comprehensive text feature characterizes the content of the target video more accurately, which effectively improves the accuracy and efficiency of predicting the at least one piece of target description information based on the comprehensive text feature. Further, the original cover image of the target video is combined with the target description information to obtain a target cover image that better represents the video highlights, so that a user can find the video highlights quickly and intuitively, improving the user's viewing experience.
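As a non-authoritative illustration of how steps S201 to S204 chain together, the following sketch strings the stages into one function. The helpers first_text_feature, fuse, predict_descriptions, key_element_boxes and compose_cover are illustrative placeholders corresponding to the sketches given near the individual steps in this description; they are not functions defined by the patent.

```python
# Minimal sketch of the overall flow of steps S201 to S204 (helper names are illustrative).
def generate_target_covers(text_infos, original_cover_path, top_k=3):
    first_feats = [first_text_feature(t) for t in text_infos]        # S201: per-sentence features
    comprehensive = fuse(first_feats)                                  # S202: context-aware fusion
    result = predict_descriptions(comprehensive, text_infos, top_k=top_k)   # S203: predict highlights
    descriptions = result["first_type"]        # composite the extracted sentences in this sketch
    boxes = key_element_boxes(original_cover_path)                     # S204: avoid occluding key elements
    return [compose_cover(original_cover_path, d, boxes) for d in descriptions]
```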
Optionally, for step S203 of predicting at least one piece of target description information corresponding to the target video based on the comprehensive text feature, the embodiments of the present application provide at least the following implementations:
the first embodiment is an extraction method.
Based on the comprehensive text feature, first probabilities that each piece of the at least one piece of text information is target description information are predicted respectively, and then at least one piece of target description information is selected from the at least one piece of text information based on the obtained first probabilities.
Specifically, the comprehensive text feature is input into an encoder to obtain the first probabilities that each piece of the at least one piece of text information is target description information, where the encoder may be ALBERT, BERT, a Transformer encoder, or the like. The obtained first probabilities are sorted in descending order, and the text information whose first probability is greater than a screening threshold is taken as target description information, or the text information corresponding to the top N first probabilities is taken as target description information, where N is a preset number.
Illustratively, as shown in fig. 3, the text content of the target video includes a video title, video text sentence 1, video text sentence 2, …, and video text sentence S, where S is a positive integer. ALBERT is used to perform feature extraction on the video title to obtain the title feature corresponding to the video title; ALBERT is used to perform feature extraction on video text sentence 1 to obtain the sentence 1 feature corresponding to video text sentence 1; ALBERT is used to perform feature extraction on video text sentence 2 to obtain the sentence 2 feature corresponding to video text sentence 2, and so on until the sentence S feature corresponding to video text sentence S is obtained. The title feature, the sentence 1 feature, the sentence 2 feature, the sentence 3 feature, …, and the sentence S feature are fused to obtain the comprehensive text feature of the target video. The comprehensive text feature is input into ALBERT to predict the probability that the video title is highlight content, the probability that sentence 1 is highlight content, the probability that sentence 2 is highlight content, the probability that sentence 3 is highlight content, …, and the probability that sentence S is highlight content. The video title or sentences whose probability is greater than the screening threshold are then kept as the highlight content of the target video.
Optionally, the video titles of some videos are too long, and a user cannot quickly find the highlights of the video from the title, so the video title is not suitable as highlight content of the video; it is, however, still helpful for understanding the video content. To improve the accuracy with which the comprehensive text feature characterizes the video content and to avoid redundantly judging whether the video title is highlight content, in the embodiment of the present application the first text feature corresponding to the video title and the first text features corresponding to the pieces of text information other than the video title are fused to obtain the comprehensive text feature corresponding to the target video. Then, based on the comprehensive text feature, the first probability that each piece of text information other than the video title is target description information is predicted.
Illustratively, as shown in fig. 4, the text content of the target video includes a video title, video text sentence 1, video text sentence 2, …, and video text sentence S. ALBERT is used to perform feature extraction on the video title to obtain the title feature corresponding to the video title; ALBERT is used to perform feature extraction on video text sentence 1 to obtain the sentence 1 feature corresponding to video text sentence 1; ALBERT is used to perform feature extraction on video text sentence 2 to obtain the sentence 2 feature corresponding to video text sentence 2, and so on until the sentence S feature corresponding to video text sentence S is obtained. The title feature, the sentence 1 feature, the sentence 2 feature, the sentence 3 feature, …, and the sentence S feature are fused to obtain the comprehensive text feature of the target video. The comprehensive text feature is input into ALBERT to predict the probability that sentence 1 is highlight content, the probability that sentence 2 is highlight content, the probability that sentence 3 is highlight content, …, and the probability that sentence S is highlight content. The sentences whose probability is greater than the screening threshold are then kept as the highlight content of the target video.
In this implementation of the application, feature extraction is first performed on each piece of text information separately to obtain the first text feature of each piece of text information, so that each piece of text information obtains a better feature representation. The first text features are fused to jointly model the contextual relations among the pieces of text information, so that the resulting comprehensive text feature corresponding to the target video characterizes the content of the target video more accurately, which effectively improves the accuracy of the obtained target description information when at least one piece of target description information is selected from the at least one piece of text information based on the comprehensive text feature.
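A minimal sketch of this extraction implementation, assuming a simple binary scoring head over the context-aware sentence features produced by the fusion sketch above; the head, threshold and top-N rule are illustrative assumptions rather than the patent's exact model.

```python
# Minimal sketch: score each sentence as highlight content and keep the best-scoring ones.
import torch
import torch.nn as nn

highlight_head = nn.Linear(768, 1)      # assumed scoring head over contextual sentence features

def extract_descriptions(comprehensive, sentences, threshold=0.5, top_n=3):
    """comprehensive: (1, num_sentences, hidden) tensor; sentences: list of the original strings."""
    probs = torch.sigmoid(highlight_head(comprehensive)).squeeze(-1).squeeze(0)   # (num_sentences,)
    ranked = sorted(zip(sentences, probs.tolist()), key=lambda x: x[1], reverse=True)
    picked = [s for s, p in ranked if p > threshold][:top_n]
    return picked or [ranked[0][0]]     # fall back to the single best sentence if none passes
```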
The second embodiment is a generation method.
The comprehensive text feature is encoded to obtain second text features respectively corresponding to the at least one piece of text information. The second text features respectively corresponding to the at least one piece of text information are decoded to generate at least one piece of candidate description information corresponding to the target video and a second probability that each piece of candidate description information is target description information. At least one piece of target description information is then selected from the at least one piece of candidate description information based on the obtained second probabilities.
Specifically, an encoder is used to encode the comprehensive text feature to obtain the second text features respectively corresponding to the at least one piece of text information, where the encoder may be ALBERT, BERT, a Transformer encoder, or the like. A decoder is used to decode the second text features respectively corresponding to the at least one piece of text information to generate at least one piece of candidate description information corresponding to the target video, where the decoder may be a Transformer decoder or the like. The candidate description information may be extracted from the at least one piece of text information of the target video or generated based on it, and may be a sentence, a word or multiple words.
The obtained second probabilities are sorted in descending order, and the candidate description information whose second probability is greater than a screening threshold is taken as target description information, or the candidate description information corresponding to the top M second probabilities is taken as target description information, where M is a preset number.
Illustratively, as shown in fig. 5, the text content of the target video includes a video title, video text sentence 1, video text sentence 2, …, and video text sentence S. ALBERT is used to perform feature extraction on the video title to obtain the first title feature corresponding to the video title; ALBERT is used to perform feature extraction on video text sentence 1 to obtain the first sentence 1 feature corresponding to video text sentence 1; ALBERT is used to perform feature extraction on video text sentence 2 to obtain the first sentence 2 feature corresponding to video text sentence 2, and so on until the first sentence S feature corresponding to video text sentence S is obtained.
After the first title feature, the first sentence 1 feature, the first sentence 2 feature, the first sentence 3 feature, …, and the first sentence S feature are fused, the comprehensive text feature of the target video is obtained. The comprehensive text feature is input into an encoder (Transformer encoder) to obtain the second title feature, the second sentence 1 feature, the second sentence 2 feature, the second sentence 3 feature, …, and the second sentence S feature. The second title feature, the second sentence 1 feature, the second sentence 2 feature, the second sentence 3 feature, …, and the second sentence S feature are input into a decoder (Transformer decoder), and the encoder-decoder attention mechanism is used to predict highlight words 1, 2, 3, …, W and the probability that each highlight word is highlight content of the target video. The highlight words whose probability is greater than the screening threshold are then kept as the highlight content of the target video.
Optionally, as described in the first embodiment, to improve the accuracy with which the comprehensive text feature characterizes the video content and to avoid redundantly judging whether the video title is highlight content of the video, in the embodiment of the present application the first text feature corresponding to the video title and the first text features corresponding to the pieces of text information other than the video title are fused to obtain the comprehensive text feature corresponding to the target video. The comprehensive text feature is encoded to obtain second text features respectively corresponding to the pieces of text information other than the video title. Each obtained second text feature is then decoded to generate at least one piece of candidate description information corresponding to the target video and a second probability that each piece of candidate description information is target description information, and at least one piece of target description information is selected from the at least one piece of candidate description information based on the obtained second probabilities.
Illustratively, as shown in fig. 6, the text content of the target video includes a video title, video text sentence 1, video text sentence 2, …, and video text sentence S. ALBERT is used to perform feature extraction on the video title to obtain the first title feature corresponding to the video title; ALBERT is used to perform feature extraction on video text sentence 1 to obtain the first sentence 1 feature corresponding to video text sentence 1; ALBERT is used to perform feature extraction on video text sentence 2 to obtain the first sentence 2 feature corresponding to video text sentence 2, and so on until the first sentence S feature corresponding to video text sentence S is obtained.
After the first title feature, the first sentence 1 feature, the first sentence 2 feature, the first sentence 3 feature, …, and the first sentence S feature are fused, the comprehensive text feature of the target video is obtained. The comprehensive text feature is input into an encoder (Transformer encoder) to predict the second sentence 1 feature, the second sentence 2 feature, the second sentence 3 feature, …, and the second sentence S feature. The second sentence 1 feature, the second sentence 2 feature, the second sentence 3 feature, …, and the second sentence S feature are input into a decoder (Transformer decoder), and the encoder-decoder attention mechanism is used to predict highlight words 1, 2, 3, …, W and the probability that each highlight word is highlight content of the target video. The highlight words whose probability is greater than the screening threshold are then kept as the highlight content of the target video.
In this implementation of the application, feature extraction is first performed on each piece of text information separately to obtain the first text feature of each piece of text information, so that each piece of text information obtains a better feature representation. The first text features are fused to jointly model the contextual relations among the pieces of text information, so that the resulting comprehensive text feature corresponding to the target video characterizes the content of the target video more accurately. Further, the second text feature of each piece of text information is predicted based on the comprehensive text feature, so that each second text feature contains both the content of the piece of text information and its context. Therefore, when the target description information is generated based on the obtained second text features, the accuracy of the target description information is improved; at the same time, the target description information is not limited to the text information already present in the target video, which improves its diversity.
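A minimal generation-side sketch follows, assuming a small PyTorch Transformer decoder with greedy decoding over an illustrative vocabulary; the vocabulary size, special token ids and greedy strategy are assumptions, and a trained model with beam search would be used in practice.

```python
# Minimal sketch: generate highlight words from the second text features with a Transformer decoder.
import torch
import torch.nn as nn

HIDDEN, VOCAB, BOS, EOS = 768, 30000, 1, 2      # assumed sizes and special token ids
token_emb = nn.Embedding(VOCAB, HIDDEN)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=HIDDEN, nhead=8, batch_first=True), num_layers=2)
output_proj = nn.Linear(HIDDEN, VOCAB)

def generate_candidates(second_text_features, max_len=8):
    """second_text_features: (1, num_sentences, HIDDEN); returns generated token ids and step probabilities."""
    memory = second_text_features                 # attended to via encoder-decoder attention
    tokens, probs = [BOS], []
    for _ in range(max_len):
        tgt = token_emb(torch.tensor([tokens]))   # (1, len(tokens), HIDDEN)
        out = decoder(tgt, memory)                # cross-attention over the second text features
        step = torch.softmax(output_proj(out[:, -1]), dim=-1)
        next_id = int(step.argmax(dim=-1))
        probs.append(float(step[0, next_id]))
        if next_id == EOS:
            break
        tokens.append(next_id)
    return tokens[1:], probs
```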
The third embodiment combines the extraction method with the generation method.
In the embodiment of the application, the at least one piece of target description information includes at least one piece of first-type description information and at least one piece of second-type description information, where the first-type description information is target description information obtained by extraction, and the second-type description information is target description information obtained by generation.
The process of predicting at least one piece of target description information corresponding to the target video based on the comprehensive text feature is shown in fig. 7 and includes the following steps:
Step S701, encoding the comprehensive text feature to obtain second text features respectively corresponding to the at least one piece of text information.
An encoder is used to encode the comprehensive text feature to obtain the second text features respectively corresponding to the at least one piece of text information, or to obtain the second text features respectively corresponding to the pieces of text information other than the video title, where the encoder may be ALBERT, BERT, a Transformer encoder, or the like.
Step S702, predicting, based on the second text feature corresponding to each piece of the at least one piece of text information, a first probability that the piece of text information is first-type description information.
Step S703, selecting at least one piece of first-type description information from the at least one piece of text information based on the obtained first probabilities.
The obtained first probabilities are sorted in descending order, and the text information whose first probability is greater than a screening threshold is taken as first-type description information, or the text information corresponding to the top N first probabilities is taken as first-type description information, where N is a preset number.
Step S704, decoding the second text features respectively corresponding to the at least one piece of text information to generate at least one piece of candidate description information corresponding to the target video and a second probability that each piece of candidate description information is second-type description information.
A decoder is used to decode the second text features respectively corresponding to the at least one piece of text information to generate at least one piece of candidate description information corresponding to the target video, where the decoder may be a Transformer decoder or the like.
Step S705, selecting at least one piece of second-type description information from the at least one piece of candidate description information based on the obtained second probabilities.
The obtained second probabilities are sorted in descending order, and the candidate description information whose second probability is greater than a screening threshold is taken as second-type description information, or the candidate description information corresponding to the top M second probabilities is taken as second-type description information, where M is a preset number.
Illustratively, as shown in fig. 8, the text content of the target video includes a video title, video text sentence 1, video text sentence 2, …, and video text sentence S, where S is a positive integer. ALBERT is used to perform feature extraction on the video title to obtain the first title feature corresponding to the video title; ALBERT is used to perform feature extraction on video text sentence 1 to obtain the first sentence 1 feature corresponding to video text sentence 1; ALBERT is used to perform feature extraction on video text sentence 2 to obtain the first sentence 2 feature corresponding to video text sentence 2, and so on until the first sentence S feature corresponding to video text sentence S is obtained.
After the first title feature, the first sentence 1 feature, the first sentence 2 feature, the first sentence 3 feature, …, and the first sentence S feature are fused, the comprehensive text feature of the target video is obtained. The comprehensive text feature is input into an encoder (Transformer encoder) to predict the second sentence 1 feature, the second sentence 2 feature, the second sentence 3 feature, …, and the second sentence S feature. Based on the second sentence 1 feature, the first probability that sentence 1 is highlight content is predicted; based on the second sentence 2 feature, the first probability that sentence 2 is highlight content is predicted; based on the second sentence 3 feature, the first probability that sentence 3 is highlight content is predicted, …, and based on the second sentence S feature, the first probability that sentence S is highlight content is predicted. The sentences whose first probability is greater than the screening threshold are then kept as the first-type highlight content of the target video.
The second sentence 1 feature, the second sentence 2 feature, the second sentence 3 feature, …, and the second sentence S feature are input into a decoder (Transformer decoder), and the encoder-decoder attention mechanism is used to predict highlight words 1, 2, 3, …, W and a second probability that each highlight word is highlight content of the target video, where W is a positive integer. The highlight words whose second probability is greater than the screening threshold are then kept as the second-type highlight content of the target video.
In this embodiment of the application, at least one piece of target description information is selected from the at least one piece of text information based on the comprehensive text feature, and at the same time the second text feature of each piece of text information in the target video is predicted based on the comprehensive text feature and target description information is then generated based on the obtained second text features. This improves the accuracy of the obtained target description information; meanwhile, the target description information is not limited to the text information already present in the target video but also includes generated description information, which makes the target description information more diverse and comprehensive.
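The following sketch, under the same illustrative assumptions as the sketches above, shows how the third embodiment can share one encoding of the comprehensive text feature between the extractive branch (steps S702 and S703) and the generative branch (steps S704 and S705). It also fills in the predict_descriptions placeholder used in the earlier pipeline sketch; in a real system the generated token ids would additionally be decoded back to text.

```python
# Minimal sketch: the third embodiment shares the second text features between the two branches.
import torch.nn as nn

second_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True), num_layers=2)

def predict_descriptions(comprehensive, sentences, top_k=3):
    second_feats = second_encoder(comprehensive)                              # step S701
    first_type = extract_descriptions(second_feats, sentences, top_n=top_k)   # steps S702/S703
    second_type_ids, _ = generate_candidates(second_feats)                    # steps S704/S705
    return {"first_type": first_type, "second_type_token_ids": second_type_ids}
```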
Optionally, in step S204, since key elements such as a face in the original cover image are also content that the user pays attention to, when adding the target description information to the original cover image, the target description information should be kept away from the target display positions of the key elements as much as possible to avoid occluding them. In view of this, in the embodiment of the present application, the target display positions of the key elements in the original cover image are detected. For each piece of target description information in the at least one piece of target description information, the following operations are respectively performed:
for one piece of target description information, a target adding position for the piece of target description information is determined from positions other than the target display positions, and the piece of target description information is then fused to the target adding position in the original cover image to obtain a target cover image containing the piece of target description information.
Specifically, the key elements may be faces, objects and the like in the original cover image, and the target display positions of the key elements in the original cover image may be detected using object detection networks such as R-CNN (Region-CNN), Fast R-CNN and Faster R-CNN.
Illustratively, as shown in fig. 9, the original cover image is detected using an R-CNN model, the target display position 901 of a face in the original cover image is determined, and a target adding position 902 for the highlight content is determined from positions other than the target display position 901.
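A minimal detection sketch, assuming torchvision's pre-trained Faster R-CNN (torchvision 0.13 or later) as the detector and an illustrative score threshold; a dedicated face detector could equally be substituted, and nothing here is prescribed by the patent.

```python
# Minimal sketch: detect key-element bounding boxes (target display positions) on the cover image.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def key_element_boxes(cover_path: str, score_threshold: float = 0.7):
    image = Image.open(cover_path).convert("RGB")
    with torch.no_grad():
        pred = detector([to_tensor(image)])[0]    # dict with 'boxes', 'labels', 'scores'
    keep = pred["scores"] > score_threshold
    return pred["boxes"][keep].tolist()           # [[x1, y1, x2, y2], ...]
```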
Alternatively, the object description information may be set in a different style when the object description information and the original cover image are merged. The method specifically can be set from the aspects of fonts, colors, word sizes, display angles, display modes and the like, wherein the fonts can adopt standard fonts-regular fonts of a platform and can also randomly select the fonts; the color can be selected from a color range with larger contrast with the background color difference of the target adding position, and the color can be randomly selected from the range with larger contrast. The word size can adopt the standard word size of the platform, and can also select the word size randomly; the display angle is generally transverse or longitudinal, and can also be a random angle; the display mode can be dynamic display, adding animation effect and the like.
Illustratively, as shown in fig. 10, the target video is a soccer match video, and the highlight content of the target video is "team A defeats team B". The original cover image of the target video is detected with an R-CNN model, the target display position 1001 of a human face in the original cover image is determined, and a target adding position 1002 for the highlight content is determined from positions other than the target display position 1001. The style of the highlight content is set to: regular script, black, word size 4, horizontal static display. The highlight content "team A defeats team B" is then fused to the target adding position 1002 to obtain the target cover image. It should be noted that, in the embodiment of the present application, the target cover image is not limited to containing only one piece of highlight content and may also contain a plurality of pieces of highlight content, which is not described again here.
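A minimal sketch of this fusion step is shown below using Pillow; the luminance-based contrast rule and the default font stand in for the platform styles described above and are assumptions, not the patented styling logic:

```python
# Minimal sketch of fusing highlight text into the cover at the chosen position:
# white text is drawn on dark regions, black text on light regions.
from PIL import Image, ImageDraw, ImageFont

def add_highlight(cover: Image.Image, text: str, pos: tuple) -> Image.Image:
    x1, y1, x2, y2 = pos
    region = cover.crop(pos).convert("L")
    # Mean luminance of the target adding position decides the text color.
    mean_lum = sum(region.getdata()) / max(region.size[0] * region.size[1], 1)
    color = (0, 0, 0) if mean_lum > 128 else (255, 255, 255)
    draw = ImageDraw.Draw(cover)
    draw.text((x1, y1), text, fill=color, font=ImageFont.load_default())
    return cover

cover = Image.new("RGB", (1280, 720), (30, 90, 30))          # stand-in original cover image
result = add_highlight(cover, "Team A defeats Team B", (40, 620, 440, 680))
result.save("target_cover.png")
```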
In the embodiment of the application, the target display position of the key element in the original cover image is detected, and the target adding position for displaying the target description information is selected from positions other than the target display position, so that the target description information does not block the key elements in the original cover image. In addition, when the target description information and the original cover image are fused, different styles are set for the target description information, which enriches its display form, avoids a dull and uniform presentation, and improves the user's viewing experience.
Optionally, the original cover image of some videos may already contain highlight content, that is, the highlight content has been added to the original cover image manually or in some other way; if such a video is pushed to the user directly, the user can still quickly and intuitively find the video's highlights. Re-extracting highlight content and re-synthesizing a target cover image for such a video would waste resources. Therefore, in the embodiment of the application, before at least one piece of target description information corresponding to the target video is predicted based on the comprehensive text feature, it is first determined that the original cover image of the target video does not contain target description information.
Specifically, a highlight judgment model is used to determine whether the original cover image of the target video contains target description information; when the highlight judgment model is trained, the training samples are video data labeled in advance as containing or not containing highlights. When the highlight judgment model determines that the original cover image of the target video does not contain target description information, at least one piece of target description information corresponding to the target video is predicted based on the comprehensive text feature, and at least one target cover image is then generated based on the original cover image of the target video and the at least one piece of target description information. When the highlight judgment model determines that the original cover image of the target video already contains target description information, the original cover image is used directly as the target cover image, which avoids the resource waste of repeatedly predicting the target description information and synthesizing the target cover image, and improves efficiency.
Optionally, when determining whether the original cover image of the target video contains the target description information, embodiments of the present application provide at least the following two implementation manners:
in the first implementation, cover text information in the original cover image is obtained, and feature extraction is then performed on the cover text information to obtain the cover text feature corresponding to the cover text information. The comprehensive text feature and the cover text feature are fused to obtain a second associated feature, and a third probability that the cover text information is target description information is predicted based on the second associated feature. If the third probability is smaller than a preset threshold, it is determined that the original cover image does not contain target description information; if the third probability is not smaller than the preset threshold, it is determined that the original cover image contains target description information.
Specifically, the cover text information in the original cover image is obtained through Optical Character Recognition (OCR), and feature extraction is performed on the cover text information using an encoder in the highlight judgment model to obtain the cover text feature corresponding to the cover text information, where the encoder may be ALBERT, BERT, Transformer-Encoder, or the like.
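For illustration, the cover text could be extracted with an off-the-shelf OCR backend such as pytesseract; this is only one possible choice (it requires a local Tesseract installation), and the language packs used here are assumptions:

```python
# Minimal sketch of obtaining the cover text information via OCR.
from PIL import Image
import pytesseract

def extract_cover_text(cover_path: str) -> str:
    image = Image.open(cover_path)
    # The lang setting is an assumption; use the pack matching the cover text.
    return pytesseract.image_to_string(image, lang="chi_sim+eng").strip()

print(extract_cover_text("original_cover.png"))
```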
Illustratively, the structure of the highlight judgment model is shown in fig. 11. The target description information is set as highlight content, the video text of the target video is acquired through speech/image recognition, and the cover text information in the original cover image is acquired with OCR. The cover text information is input into an encoder (Transformer-Encoder) to obtain the cover text feature, and the video text is input into an encoder (Transformer-Encoder) to obtain the comprehensive text feature. The comprehensive text feature and the cover text feature are then fused to obtain the second associated feature, and a third probability that the cover text information is highlight content is predicted based on the second associated feature. If the third probability is smaller than the preset threshold, it is determined that the original cover image does not contain highlight content; if the third probability is not smaller than the preset threshold, it is determined that the original cover image contains highlight content.
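A minimal sketch of this judgment step, assuming the cover text feature and the comprehensive text feature have already been produced by an encoder such as ALBERT, might look as follows; the concatenation-plus-MLP head and the 0.5 threshold are assumptions rather than the trained model:

```python
# Minimal sketch: fuse the comprehensive text feature with the cover text
# feature and predict the third probability that the cover already carries
# highlight content (untrained head, random stand-in features).
import torch
import torch.nn as nn

D = 256  # hypothetical feature size

class HighlightJudge(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, video_feat: torch.Tensor, cover_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([video_feat, cover_feat], dim=-1)   # second associated feature
        return torch.sigmoid(self.head(fused)).squeeze(-1)    # third probability

judge = HighlightJudge(D)
video_feat, cover_feat = torch.randn(1, D), torch.randn(1, D)
third_prob = judge(video_feat, cover_feat).item()
has_highlight = third_prob >= 0.5                             # 0.5 is an assumed preset threshold
print(third_prob, has_highlight)
```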
In the embodiment of the application, the comprehensive text features and the cover text features are fused, so that the obtained second associated features comprise both the content features of the cover text information and the context features of the cover text information in the video text, and therefore, when the probability that the cover text information is the target description information is predicted based on the second associated features, the prediction accuracy is effectively improved, and the accuracy of judging whether the original cover image contains the target description information is improved.
In the second implementation, the cover text information in the original cover image is obtained, and feature extraction is performed on it to obtain the cover text feature corresponding to the cover text information. A third probability that the cover text information is target description information is predicted based on the cover text feature alone. If the third probability is smaller than a preset threshold, it is determined that the original cover image does not contain target description information; if the third probability is not smaller than the preset threshold, it is determined that the original cover image contains target description information.
Illustratively, the structure of the highlight judgment model is shown in fig. 12. The target description information is set as highlight content, and OCR is used to obtain the cover text information in the original cover image. The cover text information is input into an encoder (Transformer-Encoder) to obtain the cover text feature, and a third probability that the cover text information is highlight content is predicted based on the cover text feature. If the third probability is smaller than the preset threshold, it is determined that the original cover image does not contain highlight content; if the third probability is not smaller than the preset threshold, it is determined that the original cover image contains highlight content.
Optionally, since different users are interested in different highlight content, when the target video is recommended to different users, different target cover images can be displayed according to each user's interests.
In view of this, in the embodiment of the present application, after one piece of target description information is fused to the target adding position in the original cover image and a target cover image containing that target description information is obtained, the matching degree between the interest tag of the target user and each of the at least one piece of target description information is determined. Recommendation description information is then determined from the at least one piece of target description information based on the obtained matching degrees, and when the target video is recommended to the target user, the target cover image containing the recommendation description information is selected as the cover image of the target video.
Specifically, the interest tag of the target user is obtained from the target user's historical video play records. For example, each video in the video application has one or more video tags; when the number of times a user clicks a video exceeds a preset threshold, or the duration for which the user watches the video meets a preset condition, the video is treated as an effective play and its video tags are taken as interest tags of the user. The user's interest tag set is obtained by aggregating the user's historical video play records.
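A minimal sketch of deriving the interest tag set from play records is shown below; the click-count threshold, watch-ratio condition, and top-k cutoff are assumed stand-ins for the "effective play" rules described above:

```python
# Minimal sketch: aggregate tags from records that count as effective plays.
from collections import Counter

def interest_tags(play_records, min_clicks=3, min_watch_ratio=0.6, top_k=10):
    """play_records: list of dicts with 'tags', 'clicks', 'watched_s', 'duration_s'."""
    counter = Counter()
    for rec in play_records:
        effective = (rec["clicks"] > min_clicks or
                     rec["watched_s"] / max(rec["duration_s"], 1) >= min_watch_ratio)
        if effective:
            counter.update(rec["tags"])
    return [tag for tag, _ in counter.most_common(top_k)]

records = [
    {"tags": ["football", "world cup"], "clicks": 5, "watched_s": 600, "duration_s": 900},
    {"tags": ["cooking"], "clicks": 1, "watched_s": 30, "duration_s": 1200},
]
print(interest_tags(records))   # e.g. ['football', 'world cup']
```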
The matching degree between the interest tag of the target user and the target description information refers to the interest degree of the target user in the target description information. The higher the matching degree, the higher the degree of interest of the target user in the target description information is, and the higher the probability that the target user clicks the target cover image containing the target description information is. The lower the matching degree, the lower the degree of interest of the target user in the target description information is, and the smaller the probability that the target user clicks the target cover image containing the target description information is.
At least one target cover image corresponding to the target video is generated in advance, and the target description information in each target cover image is different. Before the target video is recommended to the target user, the recommendation description information with the highest matching degree with the target user's interest tag is selected from the at least one piece of target description information, and the target cover image containing that recommendation description information is recommended to the target user.
Optionally, the target description information whose matching degree is greater than a matching-degree threshold may first be screened from the at least one piece of target description information, and the recommendation description information with the highest matching degree with the target user's interest tag is then selected from the screened target description information. If all matching degrees corresponding to the at least one piece of target description information are smaller than the matching-degree threshold, the original cover image is recommended to the target user.
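This selection rule can be sketched as follows; the data shapes and the matching-degree threshold are illustrative:

```python
# Minimal sketch of the cover-selection rule: keep target description
# information above the matching-degree threshold, pick the best one, and
# fall back to the original cover image when nothing qualifies.
def pick_cover(covers, original_cover, match_threshold=0.3):
    """covers: list of (target_cover_image, description, matching_degree)."""
    qualified = [c for c in covers if c[2] > match_threshold]
    if not qualified:
        return original_cover
    return max(qualified, key=lambda c: c[2])[0]

covers = [
    ("cover_m1.png", "football match", 0.21),
    ("cover_m2.png", "football World Cup - team M defeats team N", 0.87),
]
print(pick_cover(covers, "original_cover.png"))   # -> cover_m2.png
```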
Illustratively, as shown in fig. 13, the process of recommending a target video to a target user includes the following steps:
step S1301, the terminal device installs the video application in advance.
Step S1302, the terminal device sends a video request carrying the identity information of the target user to the server in response to the start operation of the target user for the video application.
Step S1303, the server obtains an interest tag set of the target user based on the identity information of the target user.
Specifically, the interest tag set of the target user is set as { football, match, world cup }, and the server may count the interest tag set of the target user in advance based on the historical video play record of the target user.
In step S1304, the server determines videos recommended to the target user as video M and video N based on the interest tag set of the target user.
Specifically, the video M is a World Cup match video and corresponds to a target cover image m1 and a target cover image m2, where the highlight content contained in the target cover image m1 is "football match" and the highlight content contained in the target cover image m2 is "football World Cup - team M defeats team N".
The video N is a football match highlights compilation and corresponds to a target cover image n1 and a target cover image n2, where the highlight content contained in the target cover image n1 is "football match highlights" and the highlight content contained in the target cover image n2 is "football World Cup match highlights".
In step S1305, the server determines the matching degrees between the target user's interest tag set and the highlight content of the video M and the video N, respectively.
In step S1306, the server determines "football World Cup - team M defeats team N" and "football World Cup match highlights" as the recommended highlight content based on the obtained matching degrees.
In step S1307, the server transmits the target cover image m2 and the target cover image n2 to the terminal device.
In step S1308, the terminal device displays the target cover image m2 of the video M and the target cover image n2 of the video N in the recommendation interface of the video application.
Specifically, as shown in fig. 14, the recommendation interface of the video application includes a search box, the target cover image m2 of the video M, and the target cover image n2 of the video N, where the target cover image m2 contains the highlight content "football World Cup - team M defeats team N" and the target cover image n2 contains the highlight content "football World Cup match highlights".
In the embodiment of the application, highlight content reflecting the main content of the video is constructed by deeply understanding the current video content, and the highlight content is then added at a suitable position of the original cover image in a suitable presentation form, so that the user can intuitively and quickly find the video highlights. When the video is recommended and displayed to the user, the target cover image containing the highlight content matching the user's interests is dynamically selected, which increases the user's interest in the video and further improves the overall play performance of the video platform.
Optionally, when determining the matching degree between the interest tag of the target user and each piece of target description information, the embodiments of the present application provide at least the following two implementations:
in the first implementation, feature extraction is performed on the interest tag of the target user to obtain the interest feature of the target user. For each piece of target description information in the at least one piece of target description information, the following operations are respectively executed:
for one piece of target description information, feature extraction is performed on the target description information to obtain its description feature, and the interest feature and the description feature are then fused to obtain a first associated feature. The matching degree between the interest tag of the target user and that target description information is predicted based on the first associated feature.
Specifically, a highlight matching model is used to predict the matching degree between the interest tag of the target user and a piece of target description information. The training samples for the highlight matching model consist of highlight exposure and click data: a sample that the user clicked after the highlight was exposed is a positive sample, and an unclicked sample is a negative sample. The highlight matching model includes an encoder, which may be ALBERT, BERT, Transformer-Encoder, or the like.
Illustratively, the structure of the highlight matching model is shown in fig. 15. The interest tag set of the target user is input into ALBERT to obtain the interest feature of the target user, and the highlight content W is input into ALBERT to obtain the description feature of the highlight content W. The interest feature of the target user and the description feature of the highlight content W are fused to obtain the first associated feature, and based on the first associated feature, the matching degree between the interest tag of the target user and the highlight content W, that is, the probability that the target user clicks a target cover image containing the highlight content W, is predicted.
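A minimal sketch of this first matching approach is given below, using a public ALBERT checkpoint from the transformers library as the encoder; the checkpoint name, the untrained prediction head, and the fusion by concatenation are assumptions rather than the patent's trained highlight matching model:

```python
# Minimal sketch: ALBERT encodes the interest tags and one piece of highlight
# content, the two features are fused, and a small (untrained) head predicts
# the matching degree.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
encoder = AutoModel.from_pretrained("albert-base-v2")
head = nn.Sequential(nn.Linear(2 * encoder.config.hidden_size, 256),
                     nn.ReLU(), nn.Linear(256, 1))

def encode(text: str) -> torch.Tensor:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]            # first-token sentence feature

def matching_degree(interest_tags: str, highlight: str) -> float:
    interest_feat = encode(interest_tags)         # interest feature
    desc_feat = encode(highlight)                 # description feature
    fused = torch.cat([interest_feat, desc_feat], dim=-1)   # first associated feature
    return torch.sigmoid(head(fused)).item()

print(matching_degree("football, match, world cup",
                      "football World Cup - team M defeats team N"))
```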
In the second implementation, feature extraction is performed on the interest tag of the target user to obtain the interest feature of the target user. For each piece of target description information in the at least one piece of target description information, the following operations are respectively executed:
for one piece of target description information, feature extraction is performed on the target description information to obtain its description feature, and the matching degree between the interest tag of the target user and that target description information is predicted based on the similarity between the interest feature and the description feature.
Illustratively, the structure of the highlight matching model is shown in fig. 16. The interest tag set of the target user is input into ALBERT to obtain the interest feature of the target user, and the highlight content W is input into ALBERT to obtain the description feature of the highlight content W. Based on the similarity between the interest feature of the target user and the description feature of the highlight content W, the matching degree between the interest tag of the target user and the highlight content W, that is, the probability that the target user clicks a target cover image containing the highlight content W, is predicted.
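The second approach reduces to a similarity computation; the sketch below uses cosine similarity on stand-in vectors, which in practice would be the ALBERT features from the previous sketch:

```python
# Minimal sketch: the matching degree is taken directly from the similarity
# between the interest feature and the description feature.
import torch
import torch.nn.functional as F

interest_feat = torch.randn(1, 768)     # stand-in interest feature
desc_feat = torch.randn(1, 768)         # stand-in description feature of highlight content W

matching_degree = F.cosine_similarity(interest_feat, desc_feat).item()
print(matching_degree)                  # in [-1, 1]; higher means a better match
```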
In the embodiment of the application, before the target video is recommended and displayed to the user, the target highlight content matching the user's interests is selected from all highlight content of the target video based on the user's interest tag set, and the target cover image containing that target highlight content is then recommended to the user, so that the user can more intuitively find the highlights of interest in the video, improving the video-watching experience.
To better explain the embodiment of the present application, the method for generating a cover image and recommending a video provided by the embodiment of the present application is described below by taking a video recommendation scenario as an example. The method is executed by a server and, as shown in fig. 17, includes the following steps:
in step S1701, it is recognized whether the original cover image of the current video includes highlight content.
The video text of the current video is acquired through speech/image recognition, and OCR is used to acquire the cover text information in the original cover image of the current video. A highlight judgment model is used to identify whether the original cover image of the current video includes highlight content. Specifically, the cover text information is input into an encoder (Transformer-Encoder) to obtain a cover text representation; the video text is input into an encoder (Transformer-Encoder) to obtain a video text representation; the video text representation and the cover text representation are then fused to obtain an interactive representation of the video text and the cover text, and the probability that the original cover image contains highlight content is predicted based on this interactive representation. If the obtained probability is not smaller than the preset threshold, it is determined that the original cover image contains highlight content.
Step S1702: when it is determined that the original cover image does not include highlight content, at least one piece of highlight content is extracted from the current video, and cover enhancement is performed on the original cover image based on the highlight content to obtain at least one highlight cover image.
Specifically, the highlight content of the current video may be obtained in an extraction manner, in a generation manner, or by combining the two. In the extraction manner, a sequence labeling model extracts highlight content from the text content of the current video; in the generation manner, a Seq2Seq generation model generates highlight content based on the text content of the current video. Both the sequence labeling model and the Seq2Seq generation model are trained with videos labeled in advance with highlight content as training samples.
Illustratively, a video marked with highlight content in advance is taken as a training sample, a sequence marking model is trained, and the format of a training sample set is as follows:
video 1 title video 1 text video 1 highlight content;
video 2 title video 2 text video 2 highlight content;
video v title video v text video v highlight content.
After training is finished, the sequence labeling model is used to extract highlight content from the text content of the current video. Specifically, suppose the text content of the current video includes a video title and video text sentences 1 through S. ALBERT is used to perform feature extraction on the video title to obtain the title representation, on video text sentence 1 to obtain the sentence 1 representation, on video text sentence 2 to obtain the sentence 2 representation, and so on, until the sentence S representation is obtained. The title representation and the sentence 1 through sentence S representations are fused to obtain the comprehensive text feature of the current video. The comprehensive text feature is input into ALBERT to predict the probability that each of sentence 1 through sentence S is highlight content, and the sentences whose probability is greater than the screening threshold are retained as the highlight content of the current video.
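A minimal sketch of this extractive path is shown below; mean pooling for the fusion, the untrained scoring head, the public ALBERT checkpoint, and the 0.5 screening threshold are all assumptions used only for illustration:

```python
# Minimal sketch: encode title and sentences with ALBERT, fuse them into a
# comprehensive text feature, score each sentence, and keep those above the
# screening threshold.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
encoder = AutoModel.from_pretrained("albert-base-v2")
scorer = nn.Sequential(nn.Linear(2 * encoder.config.hidden_size, 256),
                       nn.ReLU(), nn.Linear(256, 1))

def encode(text: str) -> torch.Tensor:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return encoder(**batch).last_hidden_state[:, 0]

def extract_highlights(title: str, sentences: list, threshold: float = 0.5) -> list:
    feats = [encode(title)] + [encode(s) for s in sentences]
    comprehensive = torch.stack(feats).mean(dim=0)            # fused comprehensive text feature
    kept = []
    for sent, feat in zip(sentences, feats[1:]):
        prob = torch.sigmoid(scorer(torch.cat([feat, comprehensive], dim=-1))).item()
        if prob > threshold:
            kept.append((sent, prob))
    return kept

print(extract_highlights("World Cup final recap",
                         ["Team M defeats team N on penalties.",
                          "The stadium holds 80,000 people."]))
```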
The original cover image of the current video is detected with an R-CNN model, and the target display positions of the key elements in the original cover image are determined. For each piece of highlight content, its target adding position is determined from positions other than the target display positions, and the highlight content is then fused to the target adding position in the original cover image in a preset style to obtain a highlight cover image.
Step S1703: a highlight cover image is selected based on the user's interests for video recommendation.
A highlight matching model is used to predict the matching degree between the interest tag of the target user and each piece of highlight content. Specifically, the interest tag set of the target user is input into ALBERT to obtain the interest representation of the target user, and the highlight content W is input into ALBERT to obtain the highlight content W representation. The interest representation of the target user and the highlight content W representation are fused to obtain an interactive representation of the user's interests and the highlight content W, and based on this interactive representation, the matching degree between the interest tag of the target user and the highlight content W, that is, the probability that the target user clicks a highlight cover image containing the highlight content W, is predicted.
The recommended highlight content with the highest matching degree with the target user's interest tag is selected from the highlight content, and when the current video is recommended to the target user, the highlight cover image containing that recommended highlight content is used as the cover image of the current video.
In the embodiment of the application, highlight content reflecting the main content of the video is constructed by deeply understanding the current video content, and the highlight content is then added at a suitable position of the original cover image in a suitable presentation form, so that the user can intuitively and quickly find the video highlights. When the video is recommended and displayed to the user, the highlight cover image containing the highlight content matching the user's interests is dynamically selected, which increases the user's interest in the video and further improves the overall play performance of the video platform.
Based on the same technical concept, an embodiment of the present application provides an apparatus for generating a cover image, as shown in fig. 18, the apparatus 1800 includes:
a feature extraction module 1801, configured to perform feature extraction on at least one piece of text information in a target video, to obtain first text features corresponding to the at least one piece of text information respectively;
the fusion module 1802 is configured to fuse the obtained first text features to obtain comprehensive text features corresponding to the target video;
a predicting module 1803, configured to predict, based on the comprehensive text feature, at least one piece of target description information corresponding to the target video;
a combining module 1804, configured to generate at least one target cover image based on the original cover image of the target video and the at least one piece of target description information, where each target cover image contains at least one piece of target description information.
Optionally, the prediction module 1803 is specifically configured to:
respectively predicting first probabilities that at least one piece of text information is target description information based on the comprehensive text characteristics;
based on the obtained first probability, at least one target description information is selected from at least one text information.
Optionally, the prediction module 1803 is specifically configured to:
coding the comprehensive text characteristics to respectively obtain second text characteristics corresponding to the at least one piece of text information;
decoding second text features corresponding to the at least one piece of text information respectively to generate at least one piece of candidate description information corresponding to the target video and a second probability that each piece of candidate description information is target description information;
and selecting at least one target description information from at least one candidate description information based on the obtained second probability.
Optionally, the at least one object description information includes at least one first type description information and at least one second type description information;
the prediction module 1803 is specifically configured to:
coding the comprehensive text characteristics to respectively obtain second text characteristics corresponding to the at least one piece of text information;
predicting a first probability that each of the at least one piece of text information is the first type of description information based on a second text feature corresponding to each of the at least one piece of text information;
selecting at least one first type of description information from at least one text information based on the obtained first probability;
decoding second text characteristics corresponding to the at least one piece of text information respectively to generate at least one piece of candidate description information corresponding to the target video and a second probability that each piece of candidate description information is a second type of description information;
and selecting at least one second type of description information from the at least one candidate description information based on the obtained second probability.
Optionally, the combining module 1804 is specifically configured to:
detecting target display positions of key elements in the original cover image;
for each target description information in the at least one target description information, the following operations are respectively executed:
aiming at one target description information, determining a target adding position of the target description information from other positions except the target display position;
and fusing the target description information to the target adding position in the original cover image to obtain a target cover image containing the target description information.
Optionally, a recommendation module 1805 is further included;
the recommending module 1805 is specifically configured to:
after one piece of target description information is fused to the target adding position in the original cover image and a target cover image containing that target description information is obtained, determining the matching degree between the interest tag of the target user and each of the at least one piece of target description information;
determining recommendation description information from at least one target description information based on the obtained matching degree;
and when the target video is recommended to the target user, selecting a target cover image containing the recommendation description information as a cover image of the target video.
Optionally, the recommending module 1805 is specifically configured to:
extracting features of interest tags of target users to obtain interest features of the target users;
for each target description information in the at least one target description information, the following operations are respectively executed:
for one target description information, performing feature extraction on the target description information to obtain the description feature of the target description information;
fusing the interest features and the description features to obtain first associated features;
and predicting the matching degree between the interest tag of the target user and the one target description information based on the first associated feature.
Optionally, the prediction module 1803 is further configured to:
and determining that the original cover image of the target video does not contain the target description information before predicting at least one piece of target description information corresponding to the target video based on the comprehensive text characteristics.
Optionally, the prediction module 1803 is specifically configured to:
acquiring cover text information in an original cover image;
performing feature extraction on the cover text information to obtain cover text features corresponding to the cover text information;
fusing the comprehensive text features and the cover text features to obtain second associated features;
predicting a third probability that the cover text information is the target description information based on the second associated features;
and if the third probability is smaller than the preset threshold value, determining that the original cover image does not contain the target description information.
In the embodiment of the application, feature extraction is firstly performed on each text message in the target video respectively to obtain a first text feature of each text message, so that each text message can obtain better feature representation. The first text features are fused to realize context joint modeling of the relation between the text information, so that the obtained comprehensive text features can more accurately represent the content of the target video, and the accuracy and efficiency of predicting the target description information can be effectively improved when at least one target description information corresponding to the target video is predicted based on the comprehensive text features. Furthermore, the original cover image of the target video is combined with the target description information to obtain the target cover image capable of representing the video bright spots better, so that a user can find the video bright spots quickly and intuitively, and the video watching experience of the user is improved. When the target video is recommended and displayed to the user, the target cover image containing the target description information conforming to the interest of the user is dynamically selected, so that the interest degree of the user in the video is improved, and the overall playing condition of the video platform is further improved.
Based on the same technical concept, the embodiment of the present application provides a computer device, which may be a terminal or a server, as shown in fig. 19, including at least one processor 1901 and a memory 1902 connected to the at least one processor, where the specific connection medium between the processor 1901 and the memory 1902 is not limited in this embodiment, and the processor 1901 and the memory 1902 are connected through a bus in fig. 19 as an example. The bus may be divided into an address bus, a data bus, a control bus, etc.
In the embodiment of the present application, the memory 1902 stores instructions executable by the at least one processor 1901, and the at least one processor 1901 may execute the steps included in the method for generating a cover image by executing the instructions stored in the memory 1902.
The processor 1901 is the control center of the computer device; it may be connected to various parts of the computer device through various interfaces and lines, and generates a target cover image and performs video recommendation by running or executing the instructions stored in the memory 1902 and calling the data stored in the memory 1902. Alternatively, the processor 1901 may include one or more processing units, and the processor 1901 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communications. It will be appreciated that the modem processor may also not be integrated into the processor 1901. In some embodiments, the processor 1901 and the memory 1902 may be implemented on the same chip; in other embodiments, they may be implemented separately on separate chips.
The processor 1901 may be a general-purpose processor, such as a Central Processing Unit (CPU), a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
The memory 1902, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 1902 may include at least one type of storage medium, for example, a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and so on. The memory 1902 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1902 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a computer device, which when run on the computer device, causes the computer device to perform the above-described steps of the method of generating a cover image.
It should be apparent to those skilled in the art that embodiments of the present invention may be provided as a method or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (12)

1. A method of generating a cover image, comprising:
performing feature extraction on at least one piece of text information in a target video to obtain first text features respectively corresponding to the at least one piece of text information;
fusing the obtained first text features to obtain comprehensive text features corresponding to the target video;
predicting at least one target description information corresponding to the target video based on the comprehensive text characteristics;
generating at least one target cover image based on the original cover image of the target video and the at least one target description information, wherein each target cover image contains at least one target description information.
2. The method of claim 1, wherein predicting the at least one target description information corresponding to the target video based on the comprehensive text feature comprises:
respectively predicting first probabilities that the at least one piece of text information is target description information based on the comprehensive text features;
and selecting at least one target description information from the at least one text information based on the obtained first probability.
3. The method of claim 1, wherein predicting the at least one target description information corresponding to the target video based on the comprehensive text feature comprises:
coding the comprehensive text features to respectively obtain second text features corresponding to the at least one piece of text information;
decoding the second text features corresponding to the at least one piece of text information respectively to generate at least one piece of candidate description information corresponding to the target video and a second probability that each piece of candidate description information is target description information;
and selecting at least one target description information from the at least one candidate description information based on the obtained second probability.
4. The method of claim 1, wherein the at least one target description information comprises at least one first type description information and at least one second type description information;
the predicting at least one target description information corresponding to the target video based on the comprehensive text feature comprises:
coding the comprehensive text features to respectively obtain second text features corresponding to the at least one piece of text information;
predicting a first probability that each of the at least one piece of text information is first type description information based on a second text feature corresponding to each of the at least one piece of text information;
selecting at least one first type of description information from the at least one text information based on the obtained first probability;
decoding the second text features corresponding to the at least one piece of text information respectively to generate at least one piece of candidate description information corresponding to the target video and a second probability that each piece of candidate description information is second type description information;
and selecting at least one second type of description information from the at least one candidate description information based on the obtained second probability.
5. The method of any of claims 1 to 4, wherein generating at least one target cover image based on the original cover image of the target video and the at least one target description information comprises:
detecting target display positions of key elements in the original cover image;
for each target description information in the at least one target description information, respectively performing the following operations:
aiming at one target description information, determining a target adding position of the target description information from other positions except the target display position;
and fusing the target description information to the target adding position in the original cover image to obtain a target cover image containing the target description information.
6. The method of claim 5, wherein after fusing the one target description information to the target adding position in the original cover image to obtain the target cover image containing the one target description information, the method further comprises:
determining the matching degree between the interest tag of the target user and the at least one piece of target description information;
determining recommendation description information from the at least one target description information based on the obtained matching degree;
and when the target video is recommended to the target user, selecting a target cover image containing the recommendation description information as a cover image of the target video.
7. The method of claim 6, wherein the determining the matching degree between the interest tags of the target users and the at least one target description information respectively comprises:
performing feature extraction on the interest tag of the target user to obtain the interest feature of the target user;
for each target description information in the at least one target description information, respectively performing the following operations:
performing feature extraction on one target description information to obtain a description feature of the target description information;
fusing the interest characteristic and the description characteristic to obtain a first associated characteristic;
and predicting the matching degree between the interest tag of the target user and the target description information based on the first associated characteristic.
8. The method of claim 1, wherein before predicting at least one target description information corresponding to the target video based on the comprehensive text feature, further comprising:
determining that the original cover image of the target video does not contain target description information.
9. The method of claim 8, wherein the determining that the original cover image of the target video does not contain target description information comprises:
acquiring cover text information in the original cover image;
performing feature extraction on the cover text information to obtain cover text features corresponding to the cover text information;
fusing the comprehensive text features and the cover text features to obtain second associated features;
predicting a third probability that the cover text information is target description information based on the second associated features;
and if the third probability is smaller than a preset threshold value, determining that the original cover image does not contain the target description information.
10. An apparatus for generating a cover image, comprising:
the feature extraction module is used for extracting features of at least one piece of text information in a target video to obtain first text features corresponding to the at least one piece of text information respectively;
the fusion module is used for fusing the obtained first text features to obtain comprehensive text features corresponding to the target video;
the prediction module is used for predicting at least one target description information corresponding to the target video based on the comprehensive text characteristics;
a combining module to generate at least one target cover image based on an original cover image of the target video and the at least one target description information, wherein each target cover image includes at least one target description information.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any one of claims 1 to 9 are performed by the processor when the program is executed.
12. A computer-readable storage medium, having stored thereon a computer program executable by a computer device, for causing the computer device to perform the steps of the method of any one of claims 1 to 9, when the program is run on the computer device.
CN202110622132.8A 2021-06-04 2021-06-04 Method, device and equipment for generating cover image and storage medium Pending CN113821677A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110622132.8A CN113821677A (en) 2021-06-04 2021-06-04 Method, device and equipment for generating cover image and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110622132.8A CN113821677A (en) 2021-06-04 2021-06-04 Method, device and equipment for generating cover image and storage medium

Publications (1)

Publication Number Publication Date
CN113821677A true CN113821677A (en) 2021-12-21

Family

ID=78923799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110622132.8A Pending CN113821677A (en) 2021-06-04 2021-06-04 Method, device and equipment for generating cover image and storage medium

Country Status (1)

Country Link
CN (1) CN113821677A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363714A (en) * 2021-12-31 2022-04-15 阿里巴巴(中国)有限公司 Title generation method, title generation device and storage medium
CN114363714B (en) * 2021-12-31 2024-01-05 阿里巴巴(中国)有限公司 Title generation method, title generation device and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination