CN116266193A - Method, device, equipment, storage medium and program product for generating video cover - Google Patents

Method, device, equipment, storage medium and program product for generating video cover

Info

Publication number
CN116266193A
Authority
CN
China
Prior art keywords
text, video, cover, image, original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111546389.6A
Other languages
Chinese (zh)
Inventor
陈小帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111546389.6A
Publication of CN116266193A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/73: Querying
    • G06F 16/738: Presentation of query results
    • G06F 16/739: Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/7867: Retrieval characterised by using metadata, using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a method, a device, equipment, a storage medium and a program product for generating a video cover, and belongs to the technical field of multimedia. The method comprises the following steps: acquiring an original cover image and associated texts of a video; determining a cover text based on the correlation between the original cover image and each associated text, where the correlation between the cover text and the original cover image is higher than the correlation between the other associated texts and the original cover image; and fusing the original cover image and the cover text to generate a target video cover. With this method, device, equipment, storage medium and program product, the associated text does not need to be selected manually, which improves cover generation efficiency; and because the associated text most correlated with the original cover image is selected as the cover text, the expressive effect of the video cover is further optimized and the video click-through rate is improved.

Description

Method, device, equipment, storage medium and program product for generating video cover
Technical Field
The embodiment of the application relates to the technical field of multimedia, in particular to a method, a device, equipment, a storage medium and a program product for generating a video cover.
Background
With the popularity of barrage (bullet-screen comment) culture, video publishers and multimedia network platforms have gradually begun to make video covers in a barrage style. A barrage-style cover can intuitively show how actively viewers interact with the video as well as its key content, and compared with an ordinary cover it can improve the click-through rate of the video and the interaction rate of users.
In the related art, a barrage-style cover has to be constructed manually by the video creator, that is, the creator manually selects barrage contents and then places each piece of barrage content at a suitable position on the cover.
However, this way of generating barrage-style covers is inefficient: the video publisher has to produce the cover manually and needs a high level of expertise, and because the barrages in the cover are chosen according to personal preference they cannot suit the tastes of the different types of users on the multimedia platform, so the improvement in video click-through rate is limited.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment, a storage medium and a program product for generating a video cover, which can improve the generation efficiency of the video cover, optimize the expression effect of the video cover and improve the video click rate. The technical scheme is as follows:
In one aspect, an embodiment of the present application provides a method for generating a video cover, where the method includes:
acquiring an original cover image of a video and an associated text, wherein the associated text comprises at least one of a video barrage and a video comment;
determining a cover text based on the correlation between the original cover image and each piece of associated text, wherein the correlation between the cover text and the original cover image is higher than the correlation between other associated texts and the original cover image;
and carrying out fusion processing on the original cover image and the cover text to generate a target video cover.
In another aspect, an embodiment of the present application provides a device for generating a video cover, where the device includes:
an acquisition module, which is used for acquiring an original cover image of a video and an associated text, wherein the associated text comprises at least one of a video barrage and a video comment;
the first determining module is used for determining cover texts based on the correlation between the original cover image and each piece of associated text, wherein the correlation between the cover texts and the original cover image is higher than the correlation between other associated texts and the original cover image;
And the image processing module is used for carrying out fusion processing on the original cover image and the cover text to generate a target video cover.
In another aspect, embodiments of the present application provide a computer device comprising a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the method for generating a video cover according to the above aspect.
In another aspect, embodiments of the present application provide a computer readable storage medium having at least one computer program stored therein, the computer program being loaded and executed by a processor to implement a method for generating a video cover as described in the above aspects.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of generating a video cover provided in various alternative implementations of the above aspects.
The technical scheme provided by the embodiment of the application at least comprises the following beneficial effects:
in the embodiment of the application, the cover text is determined based on the correlation between the original cover image of the video and the associated texts of the video, so that a video cover fused with associated text is generated automatically and the video publisher does not need to select the associated text manually, which improves cover generation efficiency. Moreover, by mining the associated texts and selecting the ones most correlated with the original cover image as the cover text, the target video cover reflects the key content of the video more closely, the situation where text manually chosen by the publisher according to personal preference fails to match the tastes of the objects to whom the video is recommended is avoided, the expressive effect of the video cover is further optimized, and the video click-through rate is improved.
Drawings
FIG. 1 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 is a flowchart of a method for generating a video cover according to an exemplary embodiment of the present application;
FIG. 3 is a schematic illustration of two video covers provided in accordance with one exemplary embodiment of the present application;
FIG. 4 is a flowchart of a method for generating a video cover according to another exemplary embodiment of the present application;
FIG. 5 is a schematic illustration of an image evaluation model provided in an exemplary embodiment of the present application;
FIG. 6 is a flowchart of a method for generating a video cover according to another exemplary embodiment of the present application;
FIG. 7 is a schematic illustration of a first text evaluation model provided in an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of a second text evaluation model provided in an exemplary embodiment of the present application;
FIG. 9 is a schematic diagram of a process for determining the vertical position of cover text provided in an exemplary embodiment of the present application;
FIG. 10 is a flowchart of a method for generating a video cover according to another exemplary embodiment of the present application;
FIG. 11 is a logical framework diagram of a method for generating a video cover according to an exemplary embodiment of the present application;
FIG. 12 is a block diagram of an apparatus for generating a video cover according to an exemplary embodiment of the present application;
FIG. 13 is a block diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
References herein to "a plurality" mean two or more. The term "and/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, that A and B exist together, or that B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
Fig. 1 is a schematic diagram of an implementation environment provided in an embodiment of the present application, where the implementation environment is described by taking a background server in which a method for generating a video cover is applied to a video application program as an example. The implementation environment may include: a first terminal 110, a background server 120, and a second terminal 130.
The first terminal 110 has a video-type application installed and running therein. When the first terminal 110 receives a text posting operation (e.g., a barrage posting operation, a comment posting operation, etc.) of the target video by the user, the first terminal 110 sends text posting data including the associated text, the video identification, and the text generation time to the background server 120. Only one first terminal 110 is shown in fig. 1, but in different embodiments there are a plurality of other terminals that can access the server 120 and send text distribution data.
The second terminal 130 has a video-type application installed and running therein. When the second terminal 130 receives the display instruction of the video push page, the second terminal 130 sends a video push request to the background server 120. Only one second terminal 130 is shown in fig. 1, but in different embodiments there are a plurality of other terminals that can access the server 120 and send video push requests.
Alternatively, the applications installed on the first terminal 110 and the second terminal 130 are the same, or the applications installed on the two terminals are the same type of application on different operating system platforms, or the applications installed on the two terminals are different. The first terminal 110 may refer broadly to one of the plurality of terminals, the second terminal 130 may refer broadly to another of the plurality of terminals, or the first terminal 110 and the second terminal 130 are the same device, and the embodiment is exemplified by only the first terminal 110 and the second terminal 130. The device types of the first terminal 110 and the second terminal 130 may be the same or different, and include: smart phones, tablet computers, electronic book readers, MP3 players, MP4 players, smart televisions, car terminals, laptop portable computers, desktop computers, and the like. Alternatively, the first terminal 110 may be further configured to send a video push request to the background server 120, and the second terminal 130 may be further configured to send text distribution data to the background server 120.
When the background server 120 receives the video push request of the second terminal 130 and determines that the video recommendation page sent to the second terminal 130 includes the video cover of the target video, the background server 120 obtains the original cover image of the target video and the associated text sent by the first terminal 110, determines the cover text based on the correlation between each associated text and the original cover image, fuses the cover text and the original cover image to obtain the target video cover, and sends the target video cover to the second terminal 130.
The background server 120 comprises a server, a server cluster formed by a plurality of servers, a cloud computing platform, a virtualization center and the like. The background server 120 is used to provide background services for video-type applications.
The method for generating the video cover in the application can be independently executed by the terminal provided with the video application program. The terminal receives the original cover image and the associated text sent by the background server, generates a target video cover, and displays the target video cover through a video recommendation page in the video application program. The method may also be performed by a background server of the video class application alone. The background server generates a target video cover based on the original cover image and the associated text, and sends the target video cover or a video recommendation page containing the target video cover to the terminal, and the terminal displays the target video cover or the video recommendation page containing the target video cover. In addition, the method can be cooperatively executed by the terminal and the background server. For example, the background server determines the cover text based on the correlation between the original video image and the associated text, and sends the original video image and the cover text to the terminal, the terminal performs fusion processing on the original cover image and the cover text to generate a target video cover, and the target video cover is displayed through a video recommendation page in the video application program. The following method embodiments are described in terms of the method being performed solely by the background server.
Fig. 2 is a flowchart illustrating a method for generating a video cover according to an exemplary embodiment of the present application. The present embodiment describes the method as an example of being executed solely by the background server, and the method includes the following steps.
In step 201, an original cover image of a video and associated text are acquired.
Wherein the associated text includes at least one of a video bullet screen and a video comment.
The original cover image refers to a cover image that has not yet been fused with associated text by the background server, namely either the cover image contained in the video file received by the background server and produced by the video publisher through a terminal, or a cover image automatically generated by the background server from the video file sent by the terminal (for example, the video picture corresponding to a certain video frame in the video file).
In one possible implementation, the background server receives a text release request sent by a terminal, where the terminal includes a terminal of a video publisher and a terminal of a viewer, and the text release request includes an associated text, a video identifier, and a text generation time. And the background server forwards the associated text to other terminals according to the text release request, so that other audiences can view the associated text corresponding to the video in the process of watching the video. Meanwhile, the background server generates a video cover fused with the associated text based on the associated text and sends the video cover to the pushing terminal corresponding to the video, so that an object to be pushed can view the video cover fused with the associated text on the video recommendation page, and the effects of attracting audience to view the video and improving the click rate of the video are achieved.
Step 202, determining the cover text based on the correlation between the original cover image and each associated text.
The relevance of the cover text to the original cover image is higher than the relevance of other associated text to the original cover image.
Associated texts such as barrages and comments are generated from subjective information such as viewers' personal preferences and their interpretation of the video content, so the associated texts usually include some texts that are highly correlated with the video topic and others that are only weakly correlated with it. If the video cover were generated directly from all of the associated texts, then, on the one hand, the cover could not carry them all when their number is large; on the other hand, texts that are only weakly correlated with the video theme contribute little to the expressive effect of the cover and may even reduce the target object's interest in the video. In one possible implementation, since the original cover image usually reflects the video theme or a hotspot, the background server determines the more relevant associated texts as the cover text based on the relevance between each associated text and the original cover image.
For example, the background server performs image recognition and text recognition, determines comment content and/or barrage content for describing the content related to the original cover image as the cover text, or determines a video playing time corresponding to the video content related to the original cover image, and then determines the cover text based on the barrage corresponding to the video playing time.
And 203, fusing the original cover image and the cover text to generate the target video cover.
After the background server determines the cover texts, each piece of cover text is placed at a proper position of an original cover image, and fusion processing is carried out on the original cover image and the cover text to generate the target video cover. And further sending the target video cover or the video recommendation page containing the target video cover to the terminal corresponding to the target object.
Schematically, a comparison between an ordinary video cover and a target video cover is shown in FIG. 3. The ordinary video cover 301 contains the video picture at a certain moment, the play count, the barrage count and the video name, while the target video cover 302 additionally contains associated text. As can be seen from the figure, the target video cover 302 expresses the video content more directly and plainly and reflects how other viewers felt about and evaluated the video, thereby attracting the target object to watch the video.
In summary, in the embodiment of the present application, the cover text is determined based on the correlation between the original cover image of the video and the associated texts of the video, so that a video cover fused with associated text is generated automatically without the video publisher having to select the associated text manually, which improves cover generation efficiency. Moreover, by mining the associated texts and selecting the ones most correlated with the original cover image as the cover text, the target video cover reflects the key content of the video more closely, the situation where text manually chosen by the publisher according to personal preference fails to match the tastes of the objects to whom the video is recommended is avoided, the expressive effect of the video cover is further optimized, and the video click-through rate is improved.
In one possible implementation manner, the background server divides the video into a plurality of video clips, and calculates the relevance between the original cover image and each video clip, and the associated text corresponding to the video clip with high relevance is used as the selection range of the cover text. Fig. 4 is a flowchart illustrating a method for generating a video cover according to another exemplary embodiment of the present application. The present embodiment describes the method as an example of being executed solely by the background server, and the method includes the following steps.
In step 401, an original cover image of a video and associated text are acquired.
For specific embodiments of step 401, reference may be made to step 201 described above, and the embodiments of the present application are not repeated here.
A target video clip is determined from the video based on the original cover image, step 402.
The correlation between the video picture corresponding to the target video clip and the original cover image is higher than the correlation between the video picture corresponding to other video clips and the original cover image.
The video cover image is typically a video frame that the video publisher picks from the video according to its subject or main point, or a video picture corresponding to a video frame that the background server extracts based on video play data (such as peak periods of interaction). The original cover image therefore reflects the video theme or main point fairly accurately, so the background server divides the video into a plurality of video clips and determines the clip most correlated with the original cover image, namely the target video clip. Because the target video clip is highly correlated with the original cover image, the associated texts corresponding to the target video clip are also more correlated with the original cover image, that is, they fit the video theme and main point better than the associated texts of other clips. Selecting the cover text from the associated texts of the target video segment improves the quality of the cover text, quickly filters out a large number of irrelevant associated texts, and improves the generation efficiency of the target video cover.
In one possible implementation, step 402 includes the steps of:
step 402a, segmenting the video according to the number of target segments or the target duration to obtain at least two candidate video segments.
Optionally, the background server segments the video either according to a target number of segments or according to a target segment duration, dividing it into at least two candidate video segments; the embodiment of the present application does not limit which is used. For example, for a video whose duration is less than a first duration threshold, the background server segments the video according to the target number of segments; for a video whose duration is greater than a second duration threshold, the background server segments the video according to the target segment duration, where the second duration threshold is greater than or equal to the first duration threshold.
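As an illustration of this segmentation step, the following sketch splits a video timeline into candidate segments either by a target segment count or by a target segment duration. The function name, the default values and the use of a single threshold (instead of the two thresholds above) are simplifying assumptions for illustration, not details given in the application.

```python
# Minimal sketch of the segmentation rule described above; names, defaults and
# the single-threshold simplification are illustrative assumptions.

def split_video(duration_s: float,
                target_segments: int = 8,
                target_segment_s: float = 30.0,
                short_video_threshold_s: float = 300.0):
    """Return a list of (start, end) times covering [0, duration_s]."""
    if duration_s <= short_video_threshold_s:
        # Short video: split into a fixed number of equal segments.
        seg_len = duration_s / target_segments
    else:
        # Long video: split into fixed-length segments.
        seg_len = target_segment_s
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + seg_len, duration_s)
        bounds.append((start, end))
        start = end
    return bounds

# Example: a 95-second clip split by segment count.
print(split_video(95.0, target_segments=4))
```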
Step 402b, inputting the video frames corresponding to the original cover image and the candidate video clips into an image evaluation model to obtain the image correlation scores corresponding to the candidate video clips.
The image evaluation model is trained based on positive and negative sample pairs, wherein the positive sample pair consists of a sample cover image and related video clips corresponding to the sample cover image, and the negative sample pair consists of sample cover images and non-related video clips corresponding to the sample cover image.
In one possible implementation, the background server needs to train the image evaluation model before the actual application phase, i.e., before the original cover image and the associated texts of the video are acquired. The background server performs model training based on positive and negative sample pairs. The sample data comprises sample cover images and video segments, where a sample cover image and a relevant video segment form a positive sample pair, and a sample cover image and a non-relevant video segment form a negative sample pair. Further, in order to improve the model's learning ability, the non-relevant and relevant video segments of the same sample cover image belong to the same video, and the time difference between them is kept small.
After model training is completed, in an application stage, a background server inputs an original cover image of a video and video frames of candidate video clips into an image evaluation model to obtain an image correlation score. Specifically, step 402b includes the steps of:
step one, inputting an original cover image into a first feature extraction network in an image evaluation model to obtain a cover feature vector corresponding to the original cover image.
The first feature extraction network is a neural network for identifying and extracting image features, such as a convolutional neural network (Convolutional Neural Network, CNN), a recurrent neural network (Recurrent Neural Network, RNN), a generative adversarial network (Generative Adversarial Network, GAN), and the like. The background server inputs the original cover image into the first feature extraction network in the image evaluation model to obtain the depth representation of the original cover image, namely the cover feature vector.
Inputting the video frames corresponding to the candidate video clips into a second feature extraction network in the image evaluation model to obtain video frame feature vectors corresponding to the video frames.
On the other hand, the background server inputs the video frames corresponding to the candidate video clips into a second feature extraction network in the image evaluation model to obtain depth representations corresponding to the video frames, namely video frame feature vectors. Optionally, the second feature extraction network is the same as the first feature extraction network, or the second feature extraction network is different from the first feature extraction network.
In general, a candidate video segment contains a large number of video frames; inputting all of them into the model for feature extraction would be computationally expensive and inefficient. Moreover, consecutive video frames are strongly correlated and their image content is similar. Thus, in one possible implementation, the background server samples a number of video frames from each candidate video clip according to a target frame count or a target frame interval, and inputs only these frames into the model.
And thirdly, carrying out feature fusion on the video frame feature vectors corresponding to each video frame through a Self-Attention mechanism (Self-Attention) in the image evaluation model to obtain segment feature vectors of candidate video segments.
Because the video frame feature vectors are used for representing the features of the respective corresponding single video frames, the background server needs to judge the correlation between the video segments and the original cover image, and therefore, the background server also needs to fuse the feature vectors of each video frame through a self-attention mechanism to obtain segment feature vectors of candidate video segments.
And step four, inputting the cover feature vector and the fragment feature vector into a full connection layer in the image evaluation model to obtain an image correlation score.
Through the fully connected layer of the model, the background server generates an interactive cover-image/video-clip representation from the matrix formed by the model parameters, the cover feature vector and the clip feature vector, and from this interactive representation obtains the probability that the original cover image and the candidate video clip are correlated, namely the image correlation score.
FIG. 5 shows a schematic diagram of the architecture of the image evaluation model. The first feature extraction network and the second feature extraction network in the model both employ EfficientNet. The background server inputs the original cover image into the first EfficientNet to obtain a depth representation of the cover image (the cover feature vector), inputs the video frames of a candidate video clip into the second EfficientNet followed by Self-Attention to obtain a depth representation of the video clip (the clip feature vector), and then feeds the two depth representations into the fully connected layer to obtain the probability that the candidate video clip and the cover image are correlated (the image correlation score).
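The following is a minimal PyTorch-style sketch of the two-branch architecture of FIG. 5, assuming torchvision's efficientnet_b0 as the EfficientNet backbone and a standard multi-head self-attention layer for frame fusion; the embedding size, head count and pooling choices are illustrative assumptions rather than details from the application.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0  # assumed EfficientNet backbone

class ImageEvaluationModel(nn.Module):
    """Scores the correlation between a cover image and a candidate video clip."""

    def __init__(self, dim: int = 1280):
        super().__init__()
        # First feature extraction network: encodes the original cover image.
        self.cover_encoder = efficientnet_b0(weights=None).features
        # Second feature extraction network: encodes the sampled video frames.
        self.frame_encoder = efficientnet_b0(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Self-attention fuses per-frame vectors into one segment feature vector.
        self.self_attention = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Fully connected layers output the image correlation score.
        self.fc = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def encode(self, encoder, images):                       # images: (N, 3, H, W)
        return self.pool(encoder(images)).flatten(1)          # (N, dim)

    def forward(self, cover, frames):
        # cover: (1, 3, H, W); frames: (T, 3, H, W) sampled from one candidate clip.
        cover_vec = self.encode(self.cover_encoder, cover)            # (1, dim)
        frame_vecs = self.encode(self.frame_encoder, frames)          # (T, dim)
        attended, _ = self.self_attention(frame_vecs.unsqueeze(0),
                                          frame_vecs.unsqueeze(0),
                                          frame_vecs.unsqueeze(0))    # (1, T, dim)
        segment_vec = attended.mean(dim=1)                            # (1, dim)
        logit = self.fc(torch.cat([cover_vec, segment_vec], dim=-1))
        return torch.sigmoid(logit)  # probability-like image correlation score
```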
At step 402c, at least one candidate video segment with the highest image relevance score is determined as the target video segment.
And the background server sorts the candidate video clips according to the sequence of the relevance scores from high to low, and determines one or more candidate video clips with the highest image relevance scores as target video clips.
Step 403, determining the cover text from the associated text corresponding to the target video clip based on the correlation between the associated text and the original cover image.
After the background server determines the target video clip, the video period corresponding to the target video clip is obtained, and then the associated text corresponding to the video period is determined. For example, for a non-live video, if the background server determines that the video segment from 1 minute 15 to 1 minute 45 seconds is the target video segment, the background server determines the cover text from the video bullet screen played between 1 minute 15 to 1 minute 45 seconds; for live video, the background server determines that the video clip from 1 minute 15 seconds to 1 minute 45 seconds of the video is the target video clip, and then the background server determines cover text from video comments received from 1 minute 15 seconds to 1 minute 45 seconds during live video.
In one possible implementation, step 403 includes the steps of:
in step 403a, in response to the number of texts corresponding to the target video segment being less than or equal to the text number threshold, all associated texts corresponding to the target video segment are determined to be cover text.
When the text quantity of all the associated texts corresponding to the target video segment is lower than or equal to a text quantity threshold (for example, the text quantity threshold is 10 pieces, and the text quantity of all the associated texts is less than or equal to 10), the background server directly determines all the associated texts corresponding to the target video segment as cover texts.
In step 403b, in response to the number of texts corresponding to the target video segment being above the text number threshold, cover text is determined from the associated texts corresponding to the target video segment based on the correlation of the associated text with the original cover image.
When the number of associated texts corresponding to the target video clip is above the text number threshold (for example, 10), the video cover cannot carry them all, so the background server further selects, from all the associated texts corresponding to the target video clip, the texts that are more relevant to the original cover image as the cover text.
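The branch between steps 403a and 403b can be summarized as in the sketch below; the threshold value of 10 follows the example above, while the helper score_against_cover and the max_texts parameter are hypothetical placeholders for the relevance scoring described in later steps.

```python
# Sketch of the text-count threshold branch; score_against_cover is a
# hypothetical helper standing in for the relevance model described later.
TEXT_COUNT_THRESHOLD = 10  # illustrative value from the example above

def select_cover_texts(candidate_texts, cover_image, max_texts, score_against_cover):
    if len(candidate_texts) <= TEXT_COUNT_THRESHOLD:
        # Few enough texts: use all associated texts of the target segment.
        return list(candidate_texts)
    # Otherwise keep only the texts most correlated with the original cover image.
    ranked = sorted(candidate_texts,
                    key=lambda t: score_against_cover(cover_image, t),
                    reverse=True)
    return ranked[:max_texts]
```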
Step 404, determining the cover text from the associated text corresponding to the target video segment based on the correlation between the associated text and the target object label.
The target object tag indicates the target object's preference for video types, where the target object is an object to which the background server will push the video; after generating the target video cover, the background server sends data comprising the target video cover to the terminal of the target object.
Each object has its own object tags. An object tag is either selected by the object through a tag setting operation and transmitted to the background server by the terminal, or derived by the background server from the object's historical video viewing data with the object's permission. For example, when the total time object A has spent watching a certain type of video reaches a duration threshold, or the total number of views reaches a count threshold, the tag corresponding to that video type is added to object A's object tags.
In a possible implementation manner, the video class application program in the embodiment of the application is provided with a video recommendation page, and when the terminal receives a display instruction for the video recommendation page, the video recommendation page (or an element for forming the video recommendation page) is acquired from a background server, wherein the video recommendation page contains video covers of all recommended videos. Since there are a large number of different types of users in the network platform, the associated text that is attractive to the different types of users is also different. Therefore, the background server respectively determines different cover texts based on the object labels of the objects to generate the personalized target video cover so as to achieve the effect of attracting multiple types of objects to watch the video.
Likewise, in response to the number of texts corresponding to the target video segment being below the text number threshold, the background server determines all associated texts corresponding to the target video segment as cover texts; and determining cover text from the associated text corresponding to the target video segment based on the correlation of the associated text and the target object label in response to the number of texts corresponding to the target video segment being above a text number threshold.
Step 405, performing fusion processing on the original cover image and the cover text, and generating a target video cover.
For a specific embodiment of step 405, reference may be made to step 203 described above, and the embodiments of the present application are not repeated here.
In the embodiment of the application, the characteristic that the original cover image has strong correlation with the video theme and the viewpoint is utilized, the target video fragment which is relatively correlated with the original cover image is selected from the video, and the cover text is determined from the correlated text corresponding to the target video fragment, so that the accuracy of the cover text can be improved, a large number of irrelevant texts can be removed rapidly, and the efficiency of the cover text determination and the target video cover generation is improved.
The background server determines the correlation between the associated text and the original cover image and the target object label by using a neural network model, and performs model training by adopting network big data in advance so as to improve the accuracy of the cover text. Fig. 6 is a flowchart illustrating a method for generating a video cover according to another exemplary embodiment of the present application. The present embodiment describes the method as an example of being executed solely by the background server, and the method includes the following steps.
In step 601, an original cover image of a video and associated text are acquired.
A target video clip is determined from the video based on the original cover image, step 602.
For the specific embodiments of steps 601 to 602, reference may be made to the above steps 401 to 402, and the embodiments of the present application will not be repeated here.
And 603, determining the associated text corresponding to the target video segment as a candidate associated text.
The background server acquires the target time period corresponding to the target video clip and determines the associated texts played in that time period as candidate associated texts. Because the number of associated texts corresponding to a video clip is usually large, placing all of the candidate associated texts on the original cover image would crowd the cover with stacked text and harm its expressive effect; therefore, the associated texts related to the main content of the video and to the user's interests need to be selected from the candidates.
Step 604, determining a first text score for the candidate associated text based on the relevance of the candidate associated text to the original cover image.
In one possible implementation, the background server integrates the relevance of each candidate associated text to the original cover image and the relevance of the candidate associated text to the target object label to determine the cover text.
Step 604 includes the steps of:
and inputting the original cover image and the candidate associated text into a first text evaluation model to obtain a first text score.
The first text evaluation model is trained based on positive and negative sample pairs. The positive sample pair consists of a sample video frame and a positive sample text, the playing time of the sample video frame is consistent with that of the positive sample text, and the negative sample pair consists of a sample video frame and a negative sample text, and the playing time of the sample video frame is inconsistent with that of the negative sample text.
In one possible implementation, the background server model-trains the first text evaluation model based on the sample data prior to the application phase, i.e., prior to acquiring the original cover image of the video and the associated text. Because relevant texts such as barrages and the like generally have stronger correlation with video frames corresponding to the release time of the relevant texts, sample texts in the same sample video, the release time of which is consistent with the play time of the sample video frames, are determined to be positive sample texts, and sample texts, the release time of which is inconsistent with the sample video frames, are determined to be negative sample texts.
The sample texts are obtained by first removing, through manual screening, the associated texts that have little correlation with the corresponding video content. Illustratively, a sample video comprises 100 sample video frames; the positive sample texts corresponding to each sample video frame are obtained through manual screening, and the negative sample texts of each sample video frame are then determined based on the time correspondence. For example, for the 50th sample video frame, in order to improve the model's capability, the background server uses the positive sample texts corresponding to the temporally nearby 35th to 45th and 65th to 75th sample video frames as the negative sample texts of the 50th sample video frame.
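A sketch of how such positive and negative pairs might be assembled from one sample video, following the 50th-frame example above; the data layout (a list of per-frame text lists) and the neighbor-offset window are illustrative assumptions.

```python
# Hypothetical sketch of positive/negative pair construction for the first
# text evaluation model. frame_texts[i] holds the manually screened positive
# sample texts whose posting time matches sample video frame i.

def build_pairs(frame_texts, neighbor_offsets=range(5, 16)):
    pairs = []  # (frame_index, text, label)
    for i, texts in enumerate(frame_texts):
        for text in texts:
            pairs.append((i, text, 1))               # positive: same play time
        # Negatives: texts from nearby (but not identical) frames, which makes
        # the discrimination task harder and improves model capability.
        for off in neighbor_offsets:
            for j in (i - off, i + off):
                if 0 <= j < len(frame_texts):
                    for text in frame_texts[j]:
                        pairs.append((i, text, 0))   # negative: mismatched time
    return pairs
```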
Specifically, the process of determining the first text score through the first text evaluation model includes the following steps:
step 604a, inputting the original cover image into a third feature extraction network in the first text evaluation model to obtain a cover feature vector corresponding to the original cover image.
The third feature extraction network is a neural network for identifying and extracting image features, for example, CNN, RNN, GAN and the like. The background server inputs the original cover image into a third feature extraction network in the image evaluation model to obtain the depth representation of the original cover image, namely the cover feature vector. Optionally, the first feature extraction network, the second feature extraction network, and the third feature extraction network are the same or different.
Step 604b, inputting the candidate associated text into the text feature extraction network in the first text evaluation model to obtain text feature vectors corresponding to the candidate associated text.
The text feature extraction network is a neural network for identifying and extracting text features, such as a pre-trained language characterization model (Bidirectional Encoder Representation from Transformers, BERT), CNN, RNN, etc. And the background server inputs the candidate associated text into a first text evaluation model to obtain the depth representation of the candidate associated text, namely the text feature vector corresponding to the candidate associated text.
Step 604c, inputting the cover feature vector and the text feature vector into the full connection layer in the first text evaluation model to obtain a first text score.
Through the fully connected layer of the model, the background server generates a fused representation of the cover image and the candidate associated text from the matrix formed by the model parameters, the cover feature vector and the text feature vector, and from this fused representation obtains the probability that the original cover image and the candidate associated text are correlated, namely the first text score.
FIG. 7 shows a schematic diagram of the architecture of the first text evaluation model. The third feature extraction network in the model employs EfficientNet and the text feature extraction network employs BERT. The background server inputs the original cover image into EfficientNet to obtain a depth representation of the cover image (the cover feature vector), inputs a candidate associated text into BERT to obtain a depth representation of the candidate associated text (the text feature vector), and then feeds the two depth representations into the fully connected layer to obtain the probability that the candidate associated text is related to the cover image (the first text score).
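A minimal sketch of the FIG. 7 architecture under the same assumptions as before: torchvision's efficientnet_b0 for the image branch and a Hugging Face BERT checkpoint such as bert-base-chinese for the text branch. The checkpoint name and layer sizes are illustrative assumptions, not details from the application.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0   # assumed image backbone
from transformers import BertModel               # assumed text backbone

class FirstTextEvaluationModel(nn.Module):
    """FIG. 7 sketch: EfficientNet cover branch, BERT text branch, FC scoring head."""

    def __init__(self, bert_name: str = "bert-base-chinese"):  # assumed checkpoint
        super().__init__()
        self.image_encoder = efficientnet_b0(weights=None).features  # third feature network
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.text_encoder = BertModel.from_pretrained(bert_name)     # text feature network
        self.fc = nn.Linear(1280 + self.text_encoder.config.hidden_size, 1)

    def forward(self, cover, text_inputs):
        # cover: (1, 3, H, W); text_inputs: tokenized candidate associated text.
        cover_vec = self.pool(self.image_encoder(cover)).flatten(1)   # (1, 1280)
        text_vec = self.text_encoder(**text_inputs).pooler_output     # (1, hidden)
        fused = torch.cat([cover_vec, text_vec], dim=-1)              # cover-text fusion
        return torch.sigmoid(self.fc(fused))   # first text score in [0, 1]
```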
Step 605, determining a second text score for the candidate associated text based on the relevance of the candidate associated text to the target object tag.
Wherein the target object tag is used to indicate the orientation of the target object to the video type.
The object tag reflects the object's video interests, so the cover text can also be determined based on the correlation between the object tag and the associated text. The background server generates a personalized target video cover for each object to which the video is to be pushed. For example, if the background server determines that the target video needs to be pushed to the video recommendation pages of object A and object B, where object A's tags are "home, food, movie" and object B's tags are "office, music, pet, book", the target video covers generated for object A and object B may differ.
In one possible implementation, step 605 includes the steps of:
and inputting the target object label and the candidate associated text into a second text evaluation model to obtain a second text score.
The second text evaluation model is trained based on positive and negative sample pairs, wherein a positive sample pair is composed of the sample tags corresponding to a sample object and a positive sample text, the positive sample text being an associated text that received a positive feedback interaction operation from the sample object, and a negative sample pair is composed of the sample tags and a negative sample text, the negative sample text being an associated text that received a negative feedback interaction operation from the sample object.
In one possible implementation, the background server model-trains the second text evaluation model based on the sample data prior to the application phase, i.e., prior to acquiring the original cover image of the video and the associated text. Users in the platform can interact with the associated text, such as clicking and approving the barrage and comments which are interested or approved, namely, performing positive feedback interaction operation, and shielding the barrage and comments which are not interested or approved, namely, performing negative feedback interaction operation. The background server may construct sample data based on historical feedback operations. And collecting sample texts receiving feedback interaction operation, wherein the sample texts receiving positive feedback interaction operation are used as positive sample texts of sample labels corresponding to the positive feedback interaction operation, and the sample texts receiving negative feedback interaction operation are used as negative sample texts of the sample labels corresponding to the negative feedback interaction operation. The sample label corresponding to the feedback interaction operation refers to an object label corresponding to the object triggering the feedback interaction operation.
Specifically, the process of determining the second text score through the second text evaluation model includes the following steps:
step 605a, inputting the target object label into the first text feature extraction network in the second text evaluation model to obtain a label feature vector corresponding to the target object label.
The first text feature extraction network is a neural network for identifying and extracting text features, for example BERT, CNN, RNN and the like. And the background server inputs the target object label into a first text feature extraction network to obtain the depth representation of the target object label, namely a label feature vector corresponding to the target object label.
In one possible implementation, the target object corresponds to a plurality of target object tags; the background server generates a target tag sequence for the target object based on the weight of each tag (for example, tags with larger weights are placed earlier in the sequence) and inputs the target tag sequence into the first text feature extraction network to obtain the tag feature vector.
Step 605b, inputting the candidate associated text into a second text feature extraction network in a second text evaluation model to obtain text feature vectors corresponding to the candidate associated text.
Likewise, the second text feature extraction network is a neural network for identifying and extracting text features, for example BERT, CNN, RNN and the like. And the background server inputs the candidate associated text into a second text feature extraction network to obtain the depth representation of the candidate associated text, namely the text feature vector corresponding to the candidate associated text.
Optionally, the first text feature extraction network is the same as the second text feature extraction network, or the first text feature extraction network is different from the second text feature extraction network.
And step 605c, inputting the tag feature vector and the text feature vector into a full connection layer in the second text evaluation model to obtain a second text score.
The background server generates a fusion representation of the target object label-candidate associated text based on a matrix formed by model parameters, label feature vectors and text feature vectors through a full connection layer of the model, and further obtains the probability of correlation between the target object label and the candidate associated text based on the fusion representation, namely a second text score.
Fig. 8 shows a schematic diagram of a model architecture of a second text evaluation model. The first text feature extraction network in the model employs BERT and the second text feature extraction network employs BERT. The background server inputs the target object label into the first BERT to obtain the depth representation (label feature vector) of the target object label, inputs the candidate associated text into the second BERT to obtain the depth representation (text feature vector) of the candidate associated text, and further inputs the depth representation of the target object label and the depth representation of the candidate associated text into the full connection layer to obtain the probability (second text score) that the candidate associated text is related to the target object label.
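A sketch of the FIG. 8 architecture together with the weighted tag-sequence construction of step 605a; the BERT checkpoint, the example tag weights and the candidate text are all assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer  # assumed backbones

class SecondTextEvaluationModel(nn.Module):
    """FIG. 8 sketch: object-tag branch and candidate-text branch, both BERT."""

    def __init__(self, name: str = "bert-base-chinese"):   # assumed checkpoint
        super().__init__()
        self.tag_encoder = BertModel.from_pretrained(name)    # first text feature network
        self.text_encoder = BertModel.from_pretrained(name)   # second text feature network
        hidden = self.tag_encoder.config.hidden_size
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, tag_inputs, text_inputs):
        tag_vec = self.tag_encoder(**tag_inputs).pooler_output
        text_vec = self.text_encoder(**text_inputs).pooler_output
        return torch.sigmoid(self.fc(torch.cat([tag_vec, text_vec], dim=-1)))

# Build the target tag sequence with higher-weight tags first (step 605a).
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
object_tags = {"food": 0.9, "movie": 0.7, "home": 0.4}          # hypothetical weights
tag_sequence = " ".join(t for t, _ in sorted(object_tags.items(),
                                             key=lambda kv: kv[1], reverse=True))
model = SecondTextEvaluationModel().eval()
with torch.no_grad():
    score = model(tokenizer(tag_sequence, return_tensors="pt"),
                  tokenizer("This scene looks delicious!", return_tensors="pt"))
print(float(score))  # second text score for one candidate associated text
```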
Step 606, determining a text relevance score based on the first text score, the first weight corresponding to the first text score, the second text score, and the second weight corresponding to the second text score.
The background server synthesizes the first text score and the second text score to obtain a text relevance score, and further determines the cover content which can reflect the related information of the original cover image and can fit the personal interests of the video recommendation object.
Illustratively, the text relevance score gra[i] for candidate associated text i is calculated as follows:
gra[i] = x1 * rel[i] + x2 * int[i]
where x1 is the first weight, rel[i] is the first text score, x2 is the second weight, and int[i] is the second text score. The first weight and the second weight may be set by the developer as needed.
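Expressed in code, the weighted combination is simply the following; the example weight values are placeholders, not values given in the application.

```python
def text_relevance_score(rel_i: float, int_i: float,
                         x1: float = 0.6, x2: float = 0.4) -> float:
    """gra[i] = x1 * rel[i] + x2 * int[i]; the weights are set per product needs."""
    return x1 * rel_i + x2 * int_i

# Example: a text scoring 0.8 against the cover image and 0.5 against the object tags.
print(text_relevance_score(0.8, 0.5))  # 0.68
```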
In step 607, n candidate associated texts with the highest text relevance scores are determined as cover texts.
n is a positive integer.
In one possible implementation manner, the background server sorts the candidate associated texts according to the sequence of the text relevance scores from high to low, determines n candidate associated texts with the highest text relevance scores as cover texts, and is used for fusing with the original cover images to construct the target video cover.
In step 608, the original cover image and the cover text are fused to generate the target video cover.
The background server fuses the cover texts determined in the above steps into the original cover image to construct the target video cover. The text relevance scores of the individual cover texts may differ, i.e., different cover texts are relevant to the target object's personal interests and to the video content to different degrees. Since the target object usually looks at the middle part of each video cover first when browsing the video recommendation page, the background server places the cover texts with high scores, based on their text relevance scores, at the positions the target object will notice first, so as to further optimize the target video cover.
In one possible implementation, step 608 includes the steps of:
in step 608a, a horizontal position of each cover text in the original cover image is determined based on the text generation time.
Since a user's visual focus is mainly affected by the vertical position and much less by the horizontal position, the horizontal positions simply follow the order in which the cover texts appear, and the positions of the cover texts in the same row are determined by the order of their text generation times.
For example, let the horizontal length of the original cover image be L, and, among the text generation times of the cover barrages in the same row (i.e., with the same vertical position), let da be the earliest text generation time and db the latest, with dt = db - da. The horizontal start position of each cover barrage in the row is then: L * (text generation time - da) / dt.
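Applying the formula above, the horizontal start positions of the cover texts in one row could be computed as in this sketch; grouping texts into rows is assumed to happen elsewhere.

```python
def horizontal_start_positions(gen_times, image_width):
    """Map each text generation time in one row to a horizontal start position."""
    da, db = min(gen_times), max(gen_times)
    dt = db - da
    if dt == 0:                     # single text or identical timestamps
        return [0.0 for _ in gen_times]
    return [image_width * (t - da) / dt for t in gen_times]

# Example row: three barrages posted at 75 s, 85 s and 105 s on an 800-px-wide cover.
print(horizontal_start_positions([75, 85, 105], 800))  # [0.0, 266.66..., 800.0]
```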
Step 608b, determining the vertical position of each cover text in the original cover image based on the text relevance score.
The vertical distance between the cover text with high text relevance score and the horizontal center line is smaller than the vertical distance between the cover text with low text relevance score and the horizontal center line.
In the vertical direction, the background server places the cover texts outward from the middle toward the upper and lower sides in descending order of their text relevance scores. As shown in FIG. 9, cover texts with high text relevance scores are closer to the center line, and those with low scores are farther from it.
Illustratively, with the overall height of the original cover image being h, the vertical offset of a cover text from the horizontal center line is determined as follows:
h/2 - h/2 * (text relevance score - lowest text relevance score) / (highest text relevance score - lowest text relevance score)
Each cover text is displayed either above or below the center line, with the side chosen at random.
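A sketch of the vertical placement rule, interpreting the formula above as the offset of a cover text from the horizontal center line and choosing the side at random; the conversion to an absolute y coordinate is an assumption made for illustration.

```python
import random

def vertical_position(score, min_score, max_score, image_height):
    """Place high-scoring text near the horizontal center line of the cover."""
    if max_score == min_score:
        offset = 0.0
    else:
        offset = image_height / 2 * (1 - (score - min_score) / (max_score - min_score))
    side = random.choice((-1, 1))            # randomly above or below the center line
    return image_height / 2 + side * offset  # absolute y coordinate on the cover

# Example: the highest-scoring text (0.9) on a 450-px-tall cover lands on the center line.
print(vertical_position(0.9, 0.3, 0.9, 450))  # 225.0
```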
And 608c, drawing the cover text on the original cover image according to the horizontal position and the vertical position to obtain the target video cover.
The background server draws the cover texts on the original cover image at the calculated horizontal and vertical positions to obtain the target video cover. If cover texts overlap, the background server randomly keeps one of the overlapping texts, or the front end applies special handling to the overlapping part (such as replacing it with an ellipsis).
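Putting the position rules together, the fusion step could look like the following Pillow-based sketch; the font, the colors and the simple keep-one overlap handling are illustrative assumptions.

```python
from PIL import Image, ImageDraw, ImageFont

def draw_cover_texts(cover_path, placed_texts, out_path):
    """placed_texts: list of (text, x, y) computed from the position rules above."""
    image = Image.open(cover_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()          # a real system would pick a styled font
    occupied = []                            # bounding boxes of texts already drawn
    for text, x, y in placed_texts:
        box = draw.textbbox((x, y), text, font=font)
        if any(not (box[2] < b[0] or box[0] > b[2] or box[3] < b[1] or box[1] > b[3])
               for b in occupied):
            continue                         # keep only one of two overlapping texts
        draw.text((x, y), text, fill="white", font=font,
                  stroke_width=1, stroke_fill="black")
        occupied.append(box)
    image.save(out_path)
    return out_path
```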
Correspondingly, if the background server only determines the cover text based on the correlation between the candidate associated text and the original cover image, the background server determines the vertical position based on the first text score; if the background server determines the cover text based only on the correlation between the candidate associated text and the object tag, the background server determines the vertical position based on the second text score.
In the embodiment of the present application, on the one hand, the background server uses neural network models, trained in advance on large-scale data, to determine the comprehensive correlation between the candidate associated text and both the original cover image and the target object label, which improves the accuracy of cover text selection; on the other hand, the background server performs the fusion processing based on the scores of the cover texts and places the cover texts with high scores at positions the target object is likely to focus on, which further optimizes the video cover.
The foregoing embodiments illustrate the process in which a computer device generates a target video cover based on the correlation between the associated text and the original cover image. In one possible implementation, before generating the target video cover, the computer device first determines whether it is necessary to generate, for the video, a cover fused with the associated text, and after generating the target video cover, it updates the video cover as the text information and the interests of the target object change. Fig. 10 is a flowchart illustrating a method for generating a video cover according to another exemplary embodiment of the present application. The present embodiment is described by taking the method being executed solely by the background server as an example, and the method includes the following steps.
In step 1001, an original cover image of a video and associated text is acquired.
For specific embodiments of step 1001, reference may be made to step 601, which is not described herein.
Step 1002, performing optical character recognition on the original cover image, and determining cover text in the original cover image.
The original cover image is a cover image produced by the video publisher through a terminal, or a cover image automatically generated by the background server from the video file sent by the terminal. The original cover image may itself contain significant text content. For example, the video publisher adds text to the cover image during post-production of the video, or the terminal or the background server automatically intercepts, as the cover image, a video picture that contains important text content. In this case, if a cover fused with associated text were generated, the cover text could block the text in the original video picture, and there would be little point in constructing the target video cover. Therefore, after acquiring the original cover image of the video, the background server performs optical character recognition (Optical Character Recognition, OCR) on the original cover image, determines the cover characters in the original cover image, and judges whether the target video cover needs to be generated.
In step 1003, in response to the number of words of the cover characters being less than the word number threshold and/or the ratio of the display area of the cover characters to the area of the original cover image being less than the ratio threshold, the cover text is determined based on the correlation between the original cover image and each associated text.
The background server uses the number of words in the original cover image, or the proportion of the picture occupied by those words, as the basis for judging whether the target video cover needs to be constructed. If the original cover image already contains a large amount of text (for example, more than 32 words), or the recognized characters occupy a large proportion of the picture, the cover text is no longer determined and no associated text is fused into the original cover image.
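A hedged sketch of this gating check, assuming an OCR step has already produced the recognized words together with their bounding boxes; the data layout and the area-ratio threshold are assumptions, and only the 32-word figure comes from the example above:

    def should_generate_cover(ocr_words, image_width, image_height,
                              word_count_threshold=32, area_ratio_threshold=0.2):
        """ocr_words: list of (word, (left, top, width, height)) from an OCR engine."""
        word_count = len(ocr_words)
        text_area = sum(w * h for _, (_, _, w, h) in ocr_words)
        area_ratio = text_area / (image_width * image_height)
        # Only fuse associated text when the existing cover characters are sparse enough;
        # the disclosure's "and/or" also allows triggering on either condition alone.
        return word_count < word_count_threshold and area_ratio < area_ratio_threshold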
The process of determining the cover text may refer to steps 602 to 607, which are not described herein.
Step 1004, performing fusion processing on the original cover image and the cover text to generate the target video cover.
For a specific embodiment of step 1004, reference may be made to step 608 described above, and the embodiments of the present application are not repeated here.
In step 1005, in response to the video meeting the cover update condition, the video cover is updated based on the original cover image and the associated text.
Wherein the cover update condition includes at least one of: the increment of the associated text reaching a text increment threshold, and the increment of the target object tag reaching a tag increment threshold.
Two major variables affecting the video cover are changes in the associated text and changes in the interests of the target object. Thus, as the associated text and the interests of the target object change, the video cover needs to be updated dynamically to generate a better video cover. When the increment of the associated text corresponding to the video reaches a text increment threshold (e.g., 500 pieces), the background server updates the video cover based on the original cover image and the associated text; and/or, when the increment of the target object label reaches a label increment threshold (e.g., 3 labels), the background server updates the video cover based on the original cover image and the associated text. As a result, the video covers of the same video viewed by the same target object at different times may be different.
It should be noted that, when the video cover is updated, the original cover image is unchanged and the video content is unchanged, so that the target video clip does not need to be redetermined.
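As a minimal illustration of the update trigger (the function and its counters are hypothetical), using the example thresholds of 500 new associated texts and 3 new labels mentioned above:

    def needs_cover_update(new_text_count, new_label_count,
                           text_increment_threshold=500, label_increment_threshold=3):
        # The cover is regenerated when either increment reaches its threshold;
        # the disclosure allows triggering on either condition or on both.
        return (new_text_count >= text_increment_threshold
                or new_label_count >= label_increment_threshold)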
In this embodiment, when the original cover image already contains a large amount of text content, the background server does not perform the steps of determining the cover text and generating the target video cover, so as to avoid generating a video cover that is meaningless or adversely affects the original cover image. In addition, as the associated text and the interests of the target object change, the background server dynamically updates the video cover, which can further improve the attraction of the video cover to the target object.
In connection with the above embodiments, fig. 11 is a block diagram illustrating a method for generating a video cover according to the present application.
When receiving the related text publishing operation on the target video, the first terminal 110 transmits the related text corresponding to the related text publishing operation to the background server 120.
The background server 120 stores the associated text, the target video, and the text publishing time in association with each other in the database 121. The background server 120 obtains the target video from the database 121, and divides the target video by the video processing module 122 to obtain a plurality of candidate video clips. The background server 120 sends the candidate video clips and the original cover image obtained from the database 121 to the image evaluation model 123 to obtain the image correlation score of each candidate video clip, and determines the target video clip based on the image correlation scores. The background server 120 obtains the associated text corresponding to the target video segment from the associated text of the target video as candidate associated text. The background server 120 inputs the candidate associated text and the original cover image of the target video into the first text evaluation model 124 to obtain a first text score indicating the relevance of the candidate associated text to the original cover image, and simultaneously sends the candidate associated text and the target object label to the second text evaluation model 125 to obtain a second text score indicating the relevance of the candidate associated text to the target object label. The background server 120 inputs the first text score and the second text score into the text scoring module 126, and combines the two scores to determine cover text from the candidate associated text. The background server 120 inputs the cover text and the original cover image into the cover generation module 127 to obtain a target video cover, and transmits the target video cover to the second terminal 130 corresponding to the target object.
When the video recommendation page is displayed and the recommended video contains the target video, the second terminal 130 displays the target video cover generated in the above steps through the video recommendation page.
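The data flow of fig. 11 can be summarized in the following sketch; every callable passed in stands for the corresponding module or model in the figure and is a hypothetical placeholder rather than an interface defined by the disclosure:

    def generate_target_video_cover(candidate_clips, original_cover, associated_texts,
                                    object_tags, image_model, text_model_1, text_model_2,
                                    fuse, w1=0.5, w2=0.5, n=3):
        """Sketch of the fig. 11 pipeline with the models injected as callables.

        candidate_clips: clips produced by the video processing module 122.
        associated_texts: mapping clip -> list of texts posted during that clip.
        """
        # Image evaluation model 123: pick the clip most relevant to the original cover.
        target_clip = max(candidate_clips,
                          key=lambda clip: image_model(original_cover, clip))
        # Text evaluation models 124/125 plus scoring module 126: combined relevance score.
        scored = [(text,
                   w1 * text_model_1(original_cover, text)
                   + w2 * text_model_2(object_tags, text))
                  for text in associated_texts[target_clip]]
        cover_texts = [t for t, _ in
                       sorted(scored, key=lambda item: item[1], reverse=True)[:n]]
        # Cover generation module 127: fuse the selected cover texts into the cover image.
        return fuse(original_cover, cover_texts)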
Fig. 12 is a block diagram of a video cover generating apparatus according to an exemplary embodiment of the present application, where the apparatus includes the following components:
an obtaining module 1201, configured to obtain an original cover image of a video and an associated text, where the associated text includes at least one of a video bullet screen and a video comment;
a first determining module 1202, configured to determine a cover text based on a correlation between the original cover image and each piece of associated text, where the correlation between the cover text and the original cover image is higher than the correlation between other associated texts and the original cover image;
the image processing module 1203 is configured to perform fusion processing on the original cover image and the cover text, and generate a target video cover.
Optionally, the first determining module 1202 includes:
a first determining unit, configured to determine a target video clip from the video based on the original cover image, where a correlation between a video picture corresponding to the target video clip and the original cover image is higher than a correlation between video pictures corresponding to other video clips and the original cover image;
And the second determining unit is used for determining the cover text from the associated text corresponding to the target video segment based on the correlation between the associated text and the original cover image.
Optionally, the first determining unit is further configured to:
segmenting the video according to the number of target segments or the target duration to obtain at least two candidate video segments;
inputting the video frames corresponding to the original cover images and the candidate video clips into an image evaluation model to obtain image correlation scores corresponding to the candidate video clips, wherein the image evaluation model is trained based on positive and negative sample pairs, the positive sample pairs consist of sample cover images and relevant video clips corresponding to the sample cover images, and the negative sample pairs consist of sample cover images and irrelevant video clips corresponding to the sample cover images;
and determining at least one candidate video segment with the highest image relevance score as the target video segment.
Optionally, the first determining unit is further configured to:
inputting the original cover image into a first feature extraction network in the image evaluation model to obtain a cover feature vector corresponding to the original cover image;
Inputting the video frames corresponding to the candidate video segments into a second feature extraction network in the image evaluation model to obtain video frame feature vectors corresponding to the video frames;
feature fusion is carried out on the video frame feature vectors corresponding to each video frame through a self-attention mechanism in the image evaluation model, so that segment feature vectors of the candidate video segments are obtained;
and inputting the cover characteristic vector and the fragment characteristic vector into a full connection layer in the image evaluation model to obtain the image correlation score.
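As a hedged sketch of how such an image evaluation model could be wired up (here in PyTorch); the convolutional backbones, the embedding size, the number of attention heads and the mean pooling are assumptions, since the disclosure only specifies two feature extraction networks, self-attention fusion of the frame features, and a fully connected layer:

    import torch
    import torch.nn as nn

    class ImageEvaluationModel(nn.Module):
        def __init__(self, feature_dim=256):
            super().__init__()
            # First feature extraction network: encodes the original cover image.
            self.cover_encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feature_dim))
            # Second feature extraction network: encodes each sampled video frame.
            self.frame_encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feature_dim))
            # Self-attention fuses the per-frame vectors into a clip-level vector.
            self.self_attention = nn.MultiheadAttention(feature_dim, num_heads=4,
                                                        batch_first=True)
            # Fully connected layer maps the concatenated vectors to a relevance score.
            self.fc = nn.Linear(feature_dim * 2, 1)

        def forward(self, cover, frames):
            # cover: (B, 3, H, W); frames: (B, T, 3, H, W)
            b, t = frames.shape[:2]
            cover_vec = self.cover_encoder(cover)                     # (B, D)
            frame_vecs = self.frame_encoder(frames.flatten(0, 1))     # (B*T, D)
            frame_vecs = frame_vecs.view(b, t, -1)                    # (B, T, D)
            fused, _ = self.self_attention(frame_vecs, frame_vecs, frame_vecs)
            clip_vec = fused.mean(dim=1)                              # (B, D)
            return self.fc(torch.cat([cover_vec, clip_vec], dim=-1))  # (B, 1)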
Optionally, the second determining unit is further configured to:
determining all associated texts corresponding to the target video clips as the cover text in response to the number of texts corresponding to the target video clips being lower than or equal to a text number threshold;
and determining the cover text from the associated text corresponding to the target video segment based on the correlation of the associated text and the original cover image in response to the number of texts corresponding to the target video segment being higher than the text number threshold.
Optionally, the first determining module 1202 further includes:
a third determining unit, configured to determine a target video clip from the video based on the original cover image, where a correlation between a video picture corresponding to the target video clip and the original cover image is higher than a correlation between video pictures corresponding to other video clips and the original cover image;
And the fourth determining unit is used for determining the cover text from the associated text corresponding to the target video segment based on the correlation between the associated text and a target object label, wherein the target object label is used for indicating the orientation of the target object to the video type.
Optionally, the second determining unit is further configured to:
determining the associated text corresponding to the target video segment as a candidate associated text;
determining a first text score for the candidate associated text based on a relevance of the candidate associated text to the original cover image;
determining a second text score for the candidate associated text based on a relevance of the candidate associated text to a target object tag, the target object tag being used to indicate an orientation of the target object to the video type;
determining a text relevance score based on the first text score, a first weight corresponding to the first text score, the second text score, and a second weight corresponding to the second text score;
and determining n candidate associated texts with highest text relevance scores as the cover text, wherein n is a positive integer.
Optionally, the second determining unit is further configured to:
And inputting the original cover image and the candidate associated text into a first text evaluation model to obtain the first text score, wherein the first text evaluation model is trained based on positive and negative sample pairs, the positive sample pair consists of a sample video frame and a positive sample text, the playing time of the sample video frame is consistent with that of the positive sample text, the negative sample pair consists of the sample video frame and a negative sample text, and the playing time of the sample video frame is inconsistent with that of the negative sample text.
Optionally, the second determining unit is further configured to:
inputting the original cover image into a third feature extraction network in the first text evaluation model to obtain a cover feature vector corresponding to the original cover image;
inputting the candidate associated text into a text feature extraction network in the first text evaluation model to obtain a text feature vector corresponding to the candidate associated text;
and inputting the cover characteristic vector and the text characteristic vector into a full connection layer in the first text evaluation model to obtain the first text score.
Optionally, the second determining unit is further configured to:
And inputting the target object label and the candidate associated text into a second text evaluation model to obtain the second text score, wherein the second text evaluation model is trained based on positive and negative sample pairs, the positive sample pairs are composed of sample labels corresponding to sample objects and positive sample texts, the positive sample texts are associated texts that have received positive feedback interaction operations from the sample objects, the negative sample pairs are composed of the sample labels and negative sample texts, and the negative sample texts are associated texts that have received negative feedback interaction operations from the sample objects.
Optionally, the second determining unit is further configured to:
inputting the target object label into a first text feature extraction network in the second text evaluation model to obtain a label feature vector corresponding to the target object label;
inputting the candidate associated text into a second text feature extraction network in the second text evaluation model to obtain a text feature vector corresponding to the candidate associated text;
and inputting the tag feature vector and the text feature vector into a full connection layer in the second text evaluation model to obtain the second text score.
Optionally, the image processing module 1203 includes:
a fifth determining unit configured to determine a horizontal position of each of the cover texts in the original cover image based on a text generation time;
a sixth determining unit, configured to determine, based on the text relevance scores, vertical positions of the respective cover texts in the original cover image, where a vertical distance between the cover text with a high text relevance score and a horizontal center line is smaller than a vertical distance between the cover text with a low text relevance score and the horizontal center line;
and the image processing unit is used for drawing the cover text on the original cover image according to the horizontal position and the vertical position to obtain the target video cover.
Optionally, the apparatus further includes:
and a cover update module, configured to update the video cover based on the original cover image and the associated text in response to the video meeting a cover update condition, the cover update condition including at least one of: the increment of the associated text reaching a text increment threshold, and the increment of the target object label reaching a label increment threshold.
Optionally, the apparatus further includes:
the character recognition module is used for carrying out optical character recognition on the original cover image and determining cover characters in the original cover image;
the first determining module 1202 includes:
and a seventh determining module, configured to determine the cover text based on a correlation between the original cover image and each of the associated texts, in response to the number of words of the cover text being less than a word number threshold, and/or the ratio of the display area of the cover text to the area of the original cover image being less than a ratio threshold.
In summary, in the embodiments of the present application, the cover text is determined based on the correlation between the original cover image of the video and the associated text of the video, so that a video cover fused with associated text is generated automatically without requiring the video publisher to manually select the associated text, which improves the efficiency of cover generation. Moreover, by mining the associated text and selecting the associated text with a higher correlation to the original cover image as the cover text, the target video cover can reflect the key content of the video more closely, and the situation in which associated text manually selected by the video publisher based on personal preference fails to match the preferences of other objects to whom the video is recommended can be avoided, thereby further optimizing the expression effect of the video cover and improving the click rate of the video.
Referring to fig. 13, a schematic structural diagram of a computer device provided in an embodiment of the present application is shown. The computer device may be a terminal with a video application program installed, or a background server of the video application program. Specifically:
The computer device 1300 includes a central processing unit (Central Processing Unit, CPU) 1301, a system memory 1304 including a random access memory (Random Access Memory, RAM) 1302 and a read only memory (Read Only Memory, ROM) 1303, and a system bus 1305 connecting the system memory 1304 and the central processing unit 1301. The computer device 1300 may also include a basic input/output system (I/O system) 1306 to facilitate the transfer of information between various devices within the computer, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.
In some embodiments, the basic input/output system 1306 includes a display 1308 for displaying information, and an input device 1309, such as a mouse, keyboard, or the like, for a user to input information. Wherein the display 1308 and the input device 1309 are connected to the central processing unit 1301 through an input output controller 1310 connected to the system bus 1305. The basic input/output system 1306 may also include an input/output controller 1310 for receiving and processing input from a keyboard, mouse, or electronic stylus, among a plurality of other devices. Similarly, the input/output controller 1310 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable media provide non-volatile storage for the computer device 1300. That is, the mass storage device 1307 may include a computer-readable medium (not shown) such as a hard disk or a compact disk-Only (CD-ROM) drive.
The computer readable medium may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), flash memory, or other solid state memory technology, CD-ROM, digital video disk (Digital Video Disc, DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the one described above. The system memory 1304 and mass storage device 1307 described above may be referred to collectively as memory.
According to various embodiments of the present application, the computer device 1300 may also operate by being connected to a remote computer on a network, such as the Internet. I.e., the computer device 1300 may be connected to the network 1312 via a network interface unit 1311 coupled to the system bus 1305, or alternatively, the network interface unit 1311 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also includes at least one instruction, at least one program, code set, or instruction set stored in the memory and configured to be executed by one or more processors to implement the method of generating a video cover described above.
Embodiments of the present application also provide a computer readable storage medium storing at least one instruction that is loaded and executed by a processor to implement the method for generating a video cover according to the above embodiments.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of generating a video cover provided in various alternative implementations of the above aspects.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable storage medium. Computer-readable storage media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
It will be appreciated that in the specific embodiments of the present application, personal data of the user, i.e. user tags, feedback interaction operations, etc., are involved, and when the above embodiments of the present application are applied to specific products or technologies, user permissions or consents need to be obtained, and the collection, use and processing of relevant data need to comply with relevant laws and regulations and standards of relevant countries and regions.
The foregoing descriptions are merely preferred embodiments of the present application and are not intended to limit the present application; any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present application shall fall within the protection scope of the present application.

Claims (18)

1. A method for generating a video cover, the method comprising:
acquiring an original cover image of a video and an associated text, wherein the associated text comprises at least one of a video barrage and a video comment;
determining a cover text based on the correlation between the original cover image and each piece of associated text, wherein the correlation between the cover text and the original cover image is higher than the correlation between other associated texts and the original cover image;
and carrying out fusion processing on the original cover image and the cover text to generate a target video cover.
2. The method of claim 1, wherein the determining cover text based on a correlation between the original cover image and each of the associated text comprises:
determining a target video clip from the video based on the original cover image, wherein the correlation between a video picture corresponding to the target video clip and the original cover image is higher than the correlation between video pictures corresponding to other video clips and the original cover image;
and determining the cover text from the associated text corresponding to the target video clip based on the correlation between the associated text and the original cover image.
3. The method of claim 2, wherein the determining a target video clip from the video based on the original cover image comprises:
segmenting the video according to the number of target segments or the target duration to obtain at least two candidate video segments;
inputting the video frames corresponding to the original cover images and the candidate video clips into an image evaluation model to obtain image correlation scores corresponding to the candidate video clips, wherein the image evaluation model is trained based on positive and negative sample pairs, the positive sample pairs consist of sample cover images and relevant video clips corresponding to the sample cover images, and the negative sample pairs consist of sample cover images and irrelevant video clips corresponding to the sample cover images;
and determining at least one candidate video segment with the highest image relevance score as the target video segment.
4. The method of claim 3, wherein inputting the video frames corresponding to the original cover image and the candidate video clips into an image evaluation model to obtain the image relevance score corresponding to the candidate video clips comprises:
Inputting the original cover image into a first feature extraction network in the image evaluation model to obtain a cover feature vector corresponding to the original cover image;
inputting the video frames corresponding to the candidate video segments into a second feature extraction network in the image evaluation model to obtain video frame feature vectors corresponding to the video frames;
feature fusion is carried out on the video frame feature vectors corresponding to each video frame through a self-attention mechanism in the image evaluation model, so that segment feature vectors of the candidate video segments are obtained;
and inputting the cover characteristic vector and the fragment characteristic vector into a full connection layer in the image evaluation model to obtain the image correlation score.
5. The method of claim 2, wherein the determining the cover text from the associated text corresponding to the target video segment based on the correlation of the associated text with the original cover image comprises:
determining all associated texts corresponding to the target video clips as the cover text in response to the number of texts corresponding to the target video clips being lower than or equal to a text number threshold;
And determining the cover text from the associated text corresponding to the target video segment based on the correlation of the associated text and the original cover image in response to the number of texts corresponding to the target video segment being higher than the text number threshold.
6. The method of claim 1, wherein the determining cover text based on a correlation between the original cover image and each of the associated text comprises:
determining a target video clip from the video based on the original cover image, wherein the correlation between a video picture corresponding to the target video clip and the original cover image is higher than the correlation between video pictures corresponding to other video clips and the original cover image;
and determining the cover text from the associated text corresponding to the target video segment based on the correlation between the associated text and a target object label, wherein the target object label is used for indicating the orientation of the target object to the video type.
7. The method of claim 2, wherein the determining the cover text from the associated text corresponding to the target video segment based on the correlation of the associated text with the original cover image comprises:
Determining the associated text corresponding to the target video segment as a candidate associated text;
determining a first text score for the candidate associated text based on a relevance of the candidate associated text to the original cover image;
determining a second text score for the candidate associated text based on a relevance of the candidate associated text to a target object tag, the target object tag being used to indicate an orientation of the target object to the video type;
determining a text relevance score based on the first text score, a first weight corresponding to the first text score, the second text score, and a second weight corresponding to the second text score;
and determining n candidate associated texts with highest text relevance scores as the cover text, wherein n is a positive integer.
8. The method of claim 7, wherein the determining a first text score for the candidate associated text based on a relevance of the candidate associated text to the original cover image comprises:
and inputting the original cover image and the candidate associated text into a first text evaluation model to obtain the first text score, wherein the first text evaluation model is trained based on positive and negative sample pairs, the positive sample pair consists of a sample video frame and a positive sample text, the playing time of the sample video frame is consistent with that of the positive sample text, the negative sample pair consists of the sample video frame and a negative sample text, and the playing time of the sample video frame is inconsistent with that of the negative sample text.
9. The method of claim 8, wherein the entering the original cover image and the candidate associated text into a first text scoring model results in the first text score, comprising:
inputting the original cover image into a third feature extraction network in the first text evaluation model to obtain a cover feature vector corresponding to the original cover image;
inputting the candidate associated text into a text feature extraction network in the first text evaluation model to obtain a text feature vector corresponding to the candidate associated text;
and inputting the cover characteristic vector and the text characteristic vector into a full connection layer in the first text evaluation model to obtain the first text score.
10. The method of claim 7, wherein the determining a second text score for the candidate associated text based on the relevance of the candidate associated text to the target object tag comprises:
and inputting the target object label and the candidate associated text into a second text evaluation model to obtain the second text score, wherein the second text evaluation model is trained based on positive and negative sample pairs, the positive sample pairs are composed of sample labels corresponding to sample objects and positive sample texts, the positive sample texts are associated texts that have received positive feedback interaction operations from the sample objects, the negative sample pairs are composed of the sample labels and negative sample texts, and the negative sample texts are associated texts that have received negative feedback interaction operations from the sample objects.
11. The method of claim 10, wherein said entering the target object tag and the candidate associated text into a second text scoring model results in the second text score, comprising:
inputting the target object label into a first text feature extraction network in the second text evaluation model to obtain a label feature vector corresponding to the target object label;
inputting the candidate associated text into a second text feature extraction network in the second text evaluation model to obtain a text feature vector corresponding to the candidate associated text;
and inputting the tag feature vector and the text feature vector into a full connection layer in the second text evaluation model to obtain the second text score.
12. The method according to any one of claims 7 to 11, wherein the fusing the original cover image and the cover text to generate the target video cover includes:
determining the horizontal position of each piece of cover text in the original cover image based on text generation time;
determining the vertical position of each cover text in the original cover image based on the text relevance scores, wherein the vertical distance between the cover text with high text relevance scores and a horizontal center line is smaller than the vertical distance between the cover text with low text relevance scores and the horizontal center line;
And drawing the cover text on the original cover image according to the horizontal position and the vertical position to obtain the target video cover.
13. The method according to any one of claims 7 to 11, wherein after the fusing of the original cover image and the cover text to generate the target video cover, the method further comprises:
and updating the video cover based on the original cover image and the associated text in response to the video meeting a cover update condition, the cover update condition including at least one of an increment of associated text reaching a text increment threshold, and an increment of the target object tag reaching a tag increment threshold.
14. The method of any of claims 1 to 11, wherein prior to determining cover text based on a correlation between the original cover image and each of the associated text, the method further comprises:
performing optical character recognition on the original cover image to determine cover characters in the original cover image;
the determining the cover text based on the correlation between the original cover image and each piece of associated text comprises the following steps:
And determining the cover text based on the correlation between the original cover image and each associated text in response to the number of words of the cover characters being less than a word number threshold and/or the proportion of the display area of the cover characters to the area of the original cover image being less than a proportion threshold.
15. A video cover generation apparatus, the apparatus comprising:
the system comprises an acquisition module, a video comment processing module and a video comment processing module, wherein the acquisition module is used for acquiring an original cover image of a video and an associated text, and the associated text comprises at least one of a video barrage and a video comment;
the first determining module is used for determining cover texts based on the correlation between the original cover image and each piece of associated text, wherein the correlation between the cover texts and the original cover image is higher than the correlation between other associated texts and the original cover image;
and the image processing module is used for carrying out fusion processing on the original cover image and the cover text to generate a target video cover.
16. A computer device, the computer device comprising a processor and a memory; the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method of generating a video cover as recited in any one of claims 1 to 14.
17. A computer readable storage medium having stored therein at least one computer program loaded and executed by a processor to implement the method of generating a video cover as claimed in any one of claims 1 to 14.
18. A computer program product or computer program, characterized in that the computer program product or computer program comprises computer instructions stored in a computer readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium, the processor executing the computer instructions to cause the computer device to perform the method of generating a video cover as claimed in any one of claims 1 to 14.
CN202111546389.6A 2021-12-16 2021-12-16 Method, device, equipment, storage medium and program product for generating video cover Pending CN116266193A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111546389.6A CN116266193A (en) 2021-12-16 2021-12-16 Method, device, equipment, storage medium and program product for generating video cover

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111546389.6A CN116266193A (en) 2021-12-16 2021-12-16 Method, device, equipment, storage medium and program product for generating video cover

Publications (1)

Publication Number Publication Date
CN116266193A (en) 2023-06-20

Family

ID=86743386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111546389.6A Pending CN116266193A (en) 2021-12-16 2021-12-16 Method, device, equipment, storage medium and program product for generating video cover

Country Status (1)

Country Link
CN (1) CN116266193A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code: Ref country code: HK; Ref legal event code: DE; Ref document number: 40087980; Country of ref document: HK