CN112800276B - Video cover determining method, device, medium and equipment

Video cover determining method, device, medium and equipment

Info

Publication number
CN112800276B
Authority
CN
China
Prior art keywords
image
information
determining
feature information
click rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110075978.4A
Other languages
Chinese (zh)
Other versions
CN112800276A (en)
Inventor
张帆
刘畅
李亚
周杰
余俊
徐佳燕
王长虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110075978.4A
Publication of CN112800276A
Application granted
Publication of CN112800276B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/73 Querying
    • G06F 16/738 Presentation of query results
    • G06F 16/739 Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The disclosure relates to a video cover determining method, apparatus, medium, and device. The method includes: acquiring a plurality of image frames in a target video; for each image frame, determining salient object feature information of the image frame, and determining, according to the salient object feature information, predicted click rate information of users on the target video if the image frame were used as a cover image of the target video; and determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information. Through this technical solution, image frames containing salient objects are more likely to be selected as the cover image of the target video, so the determined target cover image better matches users' browsing interests, the accuracy of cover image selection is improved, and the click rate and user attention of the target video can be increased after the video is published with the target cover image as its cover.

Description

Video cover determining method, device, medium and equipment
Technical Field
The present disclosure relates to the technical field of the internet, and in particular to a video cover determining method, apparatus, medium, and device.
Background
A video cover is the first information about a video that a user sees and forms the user's first impression of it; it often directly determines whether the user clicks on the video to watch it. Selecting an appropriate image frame as the video cover is therefore important.
In the related art, the cover image is generally selected according to factors such as the color, texture, sharpness, and compositional integrity of the image frames in the video; however, a cover image selected in this way may not increase users' attention to the video. Alternatively, a suitable image frame is selected manually as the cover image of the video, but this approach consumes considerable manpower and is inefficient, which affects the publication of the video.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a video cover determination method, the method comprising: acquiring a plurality of image frames in a target video; for each image frame, determining salient object feature information of the image frame, and determining predicted click rate information of a user on the target video if the image frame is used as a cover image of the target video according to the salient object feature information; and determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information.
In a second aspect, the present disclosure provides a video cover determination apparatus, the apparatus comprising: the first acquisition module is used for acquiring a plurality of image frames in the target video; the first determining module is used for determining the salient object feature information of each image frame, and determining the predicted click rate information of a user on the target video if the image frame is used as the cover image of the target video according to the salient object feature information; and the second determining module is used for determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which when executed by a processing device implements the steps of the method provided by the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising: a storage device having a computer program stored thereon; processing means for executing said computer program in said storage means to carry out the steps of the method provided by the first aspect of the present disclosure.
According to the above technical solution, the predicted click rate information of users on the target video, if an image frame were used as the cover image of the target video, is determined from the salient object feature information of that image frame. Users generally pay more attention to salient objects in a video, so determining the predicted click rate information from the salient object feature information makes the predicted click rate relatively higher for image frames in which salient objects appear. Such image frames are therefore more likely to be selected as the cover image of the target video during cover selection, the determined target cover image better matches users' browsing interests, the accuracy of cover image selection is improved, and the click rate and user attention of the target video can be increased after the video is published with the target cover image as its cover.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart illustrating a method of video cover determination, according to an exemplary embodiment.
FIG. 2 is a schematic diagram illustrating a video cover determination model according to an example embodiment.
FIG. 3 is a flowchart illustrating a method of determining salient object feature information for an image frame, according to one illustrative embodiment.
Fig. 4 is a schematic diagram illustrating a process for processing an image frame according to an exemplary embodiment.
FIG. 5 is a flowchart illustrating a method of determining predicted click rate information based on salient object feature information for an image frame, according to one illustrative embodiment.
FIG. 6 is a flowchart illustrating a training method for a video cover determination model, according to an exemplary embodiment.
FIG. 7 is a schematic diagram illustrating a model training process, according to an example embodiment.
Fig. 8 is a flowchart illustrating a method of determining a loss function value corresponding to a set of training data, according to an example embodiment.
FIG. 9 is a block diagram illustrating a video cover determination device according to an exemplary embodiment.
Fig. 10 is a schematic diagram showing a structure of an electronic device according to an exemplary embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that the modifiers "one" and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Fig. 1 is a flowchart illustrating a video cover determination method according to an exemplary embodiment. The method may be applied to an electronic device having processing capability, such as a terminal or a server. As shown in fig. 1, the method may include S101 to S103.
In S101, a plurality of image frames in a target video are acquired.
The target video is a video for which a cover image needs to be determined, for example a video being shot by a user in real time, a video stored in advance, or a video already published to the network whose cover needs to be replaced. The plurality of image frames may be any plurality of image frames in the target video, and the present disclosure does not specifically limit the number of image frames acquired from the target video.
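The disclosure does not prescribe how the plurality of image frames are obtained; the following Python sketch (using OpenCV, which is only an assumed tooling choice) shows one possible implementation of S101 with uniform sampling. The function name, frame count, and sampling strategy are all hypothetical.

```python
import cv2

def sample_frames(video_path, num_frames=8):
    """Uniformly sample a fixed number of candidate frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices; any other sampling strategy could be used here.
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```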
In S102, for each image frame, salient object feature information of the image frame is determined, and according to the salient object feature information, predicted click rate information of the user on the target video if the image frame is taken as a cover image of the target video is determined.
A salient object may be, for example, a person or an object. Taking short videos as an example, a short video usually has a main shooting subject, so a relatively salient main object usually exists in the video; for example, if a person is the shooting subject, the person in the video can be taken as the salient object. Users generally pay more attention to salient objects in a video, so the predicted click rate information corresponding to each image frame is determined from its salient object feature information, and the predicted click rate information corresponding to image frames in which salient objects appear can be relatively higher.
The predicted click rate information is the click rate of users on the target video, predicted before the video is published, if the image frame were used as its cover image. The higher the predicted click rate information, the more attention the image frame would attract as a cover image, and the higher the actual click rate of users on the target video may be. In an alternative embodiment, the predicted click rate information may be represented as a numerical value, for example a value in (0, 1).
In S103, a target cover image of the target video is determined from the plurality of image frames according to the predicted click rate information.
In one embodiment, the image frame with the highest predicted click rate information may be used as the target cover image of the target video. In another embodiment, several image frames with the highest predicted click rate information may be presented to the user, and the image frame the user selects may be taken as the target cover image of the target video according to the user's selection operation.
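As a rough illustration of S103, the sketch below ranks candidate frames by their predicted click rate and returns either the single best frame or several candidates for the user to choose from; the helper name and parameters are hypothetical.

```python
def choose_cover(frames, predicted_ctrs, top_k=1):
    """Rank candidate frames by predicted click rate and return the best ones."""
    ranked = sorted(zip(frames, predicted_ctrs), key=lambda pair: pair[1], reverse=True)
    return [frame for frame, _ in ranked[:top_k]]

# top_k=1 picks the cover automatically; a larger top_k could be shown to the user instead.
```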
According to the above technical solution, the predicted click rate information of users on the target video, if an image frame were used as the cover image of the target video, is determined from the salient object feature information of that image frame. Users generally pay more attention to salient objects in a video, so determining the predicted click rate information from the salient object feature information makes the predicted click rate relatively higher for image frames in which salient objects appear. Such image frames are therefore more likely to be selected as the cover image of the target video during cover selection, the determined target cover image better matches users' browsing interests, the accuracy of cover image selection is improved, and the click rate and user attention of the target video can be increased after the video is published with the target cover image as its cover.
In an optional embodiment, the predicted click rate information may be obtained by processing the image frame with a video cover determination model. The electronic device executing the video cover determination method provided by the disclosure may be configured with a pre-trained video cover determination model; the electronic device may input the image frame into the model, and the model determines the salient object feature information of the image frame and then determines the predicted click rate information according to it. Fig. 2 is a schematic diagram of a video cover determination model according to an exemplary embodiment. It should be noted that the model shown in fig. 2 is merely exemplary and does not limit the embodiments of the present disclosure; in practical applications, the form and structure of the model are not limited to it.
Alternatively, an exemplary embodiment of determining salient object feature information of an image frame in S102 may include S301 and S302 as shown in fig. 3.
In S301, image overall characteristic information of an image frame is extracted.
Referring to fig. 2, the image overall feature information of the image frame may be extracted by an image feature extraction module in the model, which may be, for example, a convolution module. Each element in the image overall feature information corresponds to the feature information of a preset number of pixels at a designated position in the image frame, and the feature information of these pixels includes, for example, their color feature information, brightness feature information, edge feature information, and the like.
The resolution of an image frame is, for example, M×N, where M represents the number of pixels in the length direction of the image frame and N represents the number of pixels in the width direction. The image overall feature information may be represented in the form of a matrix, for example a matrix of size H×W×C, where H is the number of columns of the matrix, W is the number of rows of the matrix, and C is the dimension of the feature vectors.
Fig. 4 is a schematic diagram illustrating the processing of an image frame according to an exemplary embodiment. It should be noted that fig. 4 is merely an example to help those skilled in the art better understand the processing procedure and should not be construed as limiting the embodiments of the present disclosure. Fig. 4 takes as an example performing convolution processing on the image frame with a convolution kernel size of 3×3; the above-mentioned preset number relates to the convolution kernel size and the network depth, and the convolution calculation process may refer to the related art in this field. As shown in fig. 4, taking the element Y11 in the image overall feature information as an example, the element Y11 corresponds to the comprehensive feature information of the 9 pixels X11, X12, X13, X21, X22, X23, X31, X32, X33 in the image frame. Other elements are not described in detail here.
In S302, salient object feature information of the image frame is determined from the image overall feature information.
Referring to fig. 2, salient object detection may be performed by a salient object detection module in the model based on the image overall feature information. Each element in the salient object feature information represents the confidence that the position of a preset number of pixels at a designated position in the image frame belongs to a salient object; the larger the element value, the higher the probability that the position of those pixels is a salient object.
The salient object feature information may also be represented in the form of a matrix, and the numbers of rows and columns of this matrix may be the same as those of the matrix of the image overall feature information, i.e. a matrix of size H×W. As shown in fig. 4, the element Z11 in the salient object feature information can represent the likelihood that the 9 pixels X11, X12, X13, X21, X22, X23, X31, X32, X33 in the image frame belong to a salient object.
In this way, the salient object feature information of the image frame, i.e. the confidence that each pixel position in the image frame belongs to a salient object, can be accurately determined from the image overall feature information of the image frame.
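A minimal PyTorch sketch of S301 and S302 is given below, assuming a toy convolutional backbone for the image feature extraction module and a 1×1 convolution with a sigmoid as the salient object detection module; the layer sizes and module names are illustrative and not taken from the patent.

```python
import torch
import torch.nn as nn

class FeatureAndSaliency(nn.Module):
    """Toy backbone producing an H x W x C feature map plus an H x W saliency confidence map."""
    def __init__(self, channels=64):
        super().__init__()
        # Image feature extraction module: a small convolutional stack stands in for the real backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Salient object detection module: 1x1 conv + sigmoid gives a per-location confidence in [0, 1].
        self.saliency_head = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, image):                    # image: (B, 3, M, N)
        features = self.backbone(image)          # (B, C, H, W) image overall feature information
        saliency = self.saliency_head(features)  # (B, 1, H, W) salient object feature information
        return features, saliency
```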
Fig. 5 is a flowchart illustrating a method for determining predicted click rate information according to the salient object feature information of an image frame, according to an exemplary embodiment. As shown in fig. 5, determining in S102, according to the salient object feature information, the predicted click rate information of users on the target video if the image frame is used as a cover image of the target video may include S501 to S503.
In S501, image salient feature information of an image frame is determined from image global feature information and salient object feature information of the image frame.
For example, the image overall feature information of the image frame may be extracted as in the embodiment of S301. Referring to fig. 2, a salient object feature enhancement module in the model may determine the image salient feature information based on the image overall feature information and the salient object feature information of the image frame. The image salient feature information may be obtained as the product of each element in the image overall feature information and the corresponding element in the salient object feature information, and may also be represented in the form of a matrix, where corresponding elements are elements with the same coordinate positions in the two matrices. As shown in fig. 4, the element M11 in the image salient feature information may be the product of Y11 and Z11, and M12 may be the product of Y12 and Z12.
In this way, in the obtained image salient feature information of the image frame, the features of regions belonging to the salient object are strengthened and the features of regions not belonging to the salient object are weakened, so the image salient feature information better represents the salient object in the image frame.
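The element-wise product described above can be sketched in a few lines, assuming the feature map and the saliency map are tensors shaped as in the previous sketch:

```python
def enhance(features, saliency):
    # features: (B, C, H, W) overall feature map; saliency: (B, 1, H, W) confidence map.
    # Broadcasting the product over channels strengthens salient regions and weakens the rest.
    return features * saliency
```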
In S502, feature fusion is performed according to the overall feature information and the salient feature information of the image, so as to obtain fusion feature information.
Referring to fig. 2, the feature fusion operation may be performed in a fusion module of the model. As shown in fig. 2, the image salient feature information output by the salient object feature enhancement module may be processed by a global average pooling (GAP) module and a fully connected (FC) module to obtain processed image salient feature information F_s, and the image overall feature information may be processed by a global average pooling module and a fully connected module to obtain processed image overall feature information F_b. The fusion module may fuse F_s, obtained from the image salient feature information, with F_b, obtained from the image overall feature information, to obtain the fusion feature information.
In an alternative embodiment, the sum of F_s and F_b may be used as the fusion feature information.
Preferably, in another embodiment of the present disclosure, the step S502 may include: determining a first weight of the image overall characteristic information and a second weight of the image salient characteristic information according to the image overall characteristic information and the image salient characteristic information; and obtaining fusion characteristic information according to the first weight, the second weight, the integral characteristic information of the image and the salient characteristic information of the image.
Illustratively, the first weight and the second weight may be determined by the following formula (1), and the fused feature information may be obtained by the following formula (2):
λ = sigmoid(Q(F_s) · K(F_b) / √d)    (1)
F_a = λ·F_s + (1-λ)·F_b    (2)
where λ represents the first weight, (1-λ) represents the second weight, F_a represents the fusion feature information, sigmoid is a well-known function, Q and K represent linear transformation operations, and d is a constant greater than 0; for example, the value of d may be the value of the feature vector dimension C described above.
In this way, considering that the relative importance of the salient object and the image background often differs between image frames, and compared with feature fusion using preset weight values, the present disclosure adaptively determines the weights according to the image salient feature information and the image overall feature information, so that the obtained fusion feature information better matches the characteristics of the image and is more accurate.
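The adaptive fusion of formulas (1) and (2) could look roughly like the following PyTorch sketch; the exact form of formula (1), the FC layer sizes, and the use of a dot product scaled by √d are assumptions, since the formula is reproduced only as an image in the source.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Fuse pooled salient features F_s and pooled overall features F_b with a learned weight."""
    def __init__(self, dim):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)        # global average pooling (GAP)
        self.fc_s = nn.Linear(dim, dim)           # FC for the salient branch
        self.fc_b = nn.Linear(dim, dim)           # FC for the overall branch
        self.q = nn.Linear(dim, dim, bias=False)  # linear transformation Q
        self.k = nn.Linear(dim, dim, bias=False)  # linear transformation K
        self.d = dim                               # d taken as the feature dimension (assumption)

    def forward(self, enhanced, features):
        f_s = self.fc_s(self.gap(enhanced).flatten(1))   # F_s
        f_b = self.fc_b(self.gap(features).flatten(1))   # F_b
        # Formula (1), assumed form: lambda = sigmoid(Q(F_s) . K(F_b) / sqrt(d))
        lam = torch.sigmoid((self.q(f_s) * self.k(f_b)).sum(dim=1, keepdim=True) / self.d ** 0.5)
        # Formula (2): F_a = lambda * F_s + (1 - lambda) * F_b
        return lam * f_s + (1 - lam) * f_b
```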
In S503, predicted click rate information is determined from the fusion feature information.
As shown in fig. 2, the information obtained after the fusion feature information is processed by the fully-connected module may be used as predicted click rate information corresponding to the image frame.
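A sketch of S503, assuming a 128-dimensional fused feature vector (an arbitrary size) and a sigmoid to keep the output in (0, 1):

```python
import torch.nn as nn

# Fully connected head mapping the fused feature vector to a single predicted click rate.
ctr_head = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())

# predicted_ctr = ctr_head(fused_features)  # shape (B, 1): one score per candidate frame
```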
According to the above technical solution, considering that a video often contains a relatively salient main object such as a human face, salient object detection is performed on the image frames of the target video, and the image salient feature information of each image frame is determined from its salient object feature information, so that the image salient feature information can represent the main object in the image frame. In addition, since users pay more attention to the salient objects in image frames, a salient object detection module is added to the model, so that when cover selection is performed with the trained video cover determination model, image frames containing salient objects are more likely to be selected as the cover image of the target video, and the determined cover image better matches users' browsing interests.
The video cover determination model can be obtained by training on users' historical actual click rate information for historical videos. The historical videos may include videos published over a historical period (e.g., the past week or month). For example, the historical videos may include videos whose number of presentations or exposures exceeds a certain threshold, and a user's historical actual click rate information for a historical video is the ratio of the number of times the historical video was clicked to the number of times it was presented.
The historical cover image of a historical video is the image used as the cover when the historical video was presented. Whether users click a video to watch it is directly related to the video's cover image; if users click a historical video to watch it, this indicates to some extent that the historical cover image attracted their attention, so users' actual click rate information on the historical video can represent how attractive the historical cover image is to users. The present disclosure therefore trains the video cover determination model according to users' historical actual click rate information on historical videos.
Fig. 6 is a flowchart illustrating a training method of a video cover determination model according to an exemplary embodiment, which may be applied to an electronic device having processing capability, such as a terminal or a server, and the electronic device performing the model training method may be the same as or different from the electronic device performing the video cover determination method. As shown in fig. 6, the method may include S601 to S607.
In S601, at least one set of training data is acquired from a training set. Each set of training data may include a first historical cover image of a first historical video and first historical actual click rate information of a user on the first historical video, and a second historical cover image of a second historical video and second historical actual click rate information of a user on the second historical video.
The historical cover images and the historical actual click rate information have been described above. The first historical video and the second historical video are two different historical videos. The historical cover images of the historical videos and the historical actual click rate information of the user on the historical videos can be used as training data and stored in a training set in advance, at least one group of training data can be randomly selected from the training set, and each group of training data can comprise the historical cover images and the historical actual click rate information of two different historical videos. The number of sets of training data obtained from the training set is not particularly limited in the present disclosure, and may be one or more sets.
In S602, for each set of training data, a first history cover image and a second history cover image included in the set of training data are respectively used as inputs of a model, and first target feature information of the first history cover image output after the model processes the first history cover image and second target feature information of the second history cover image output after the model processes the second history cover image are respectively acquired.
The first history cover image and the second history cover image may be input to the model in various ways, for example, the first history cover image and the second history cover image may be input to the model in sequence according to a preset sequence, or the first history cover image and the second history cover image may be input to two identical models at the same time.
FIG. 7 is a schematic diagram illustrating a model training process, according to an example embodiment. The two models shown in fig. 7 are identical; they form a twin (Siamese) network, and only the historical cover images they process differ. Training the model with two different historical cover images at the same time does not mean that two separate models are trained.
The first target characteristic information output after the model processes the first historical cover image can represent click rate information of the first historical video predicted by the model, and the second target characteristic information output after the model processes the second historical cover image can represent click rate information of the second historical video predicted by the model. The target feature information output by the model may be represented in the form of a numerical value, for example, a numerical value between (0, 1).
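In code, the twin-network arrangement amounts to scoring both historical cover images with the same shared-weight model; the sketch below is illustrative and the function name is hypothetical.

```python
def score_pair(model, first_cover, second_cover):
    # The same shared-weight model scores both historical cover images;
    # the "two models" in fig. 7 are two branches of one twin network, not two networks.
    y_i = model(first_cover)    # first target feature information
    y_j = model(second_cover)   # second target feature information
    return y_i, y_j
```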
In S603, a loss function value corresponding to the set of training data is determined according to the first target feature information, the second target feature information, the first historical actual click rate information, the second historical actual click rate information, and a preset loss function.
The loss function may be preset, and the corresponding loss function value can represent the difference between the click rate information predicted by the model and the actual click rate information.
In S604, a target loss function value is determined according to the loss function value corresponding to each of the at least one set of training data, and parameters of the model are updated according to the target loss function value.
If one set of training data is acquired in S601, the target loss function value is the loss function value corresponding to that set. A loss function value obtained from a single set of training data only reflects the model's prediction error on that set, so its reliability is relatively low. Preferably, multiple sets of training data are acquired in S601, and the target loss function value may be the mean or a weighted value of the loss function values corresponding to the multiple sets; in this embodiment the target loss function value reflects the model's overall prediction results on the multiple sets of training data, so the model can be trained more accurately. For example, the parameters of the model may be updated according to the target loss function value by gradient descent; as shown in fig. 7, the parameters of modules such as the image feature extraction module, the fully connected module, the salient object detection module, and the fusion module may be updated.
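A sketch of one training update over several sets of training data follows, averaging the per-set loss function values into the target loss function value and applying gradient descent. Here loss_fn stands in for the preset loss function of formula (5), whose exact form is not reproduced in this text, so its signature is an assumption.

```python
import torch

def train_step(model, optimizer, training_sets, loss_fn):
    """One parameter update: average the per-set loss values into the target loss and descend."""
    losses = []
    for first_cover, second_cover, first_ctr, second_ctr in training_sets:
        y_i, y_j = model(first_cover), model(second_cover)
        losses.append(loss_fn(y_i, y_j, first_ctr, second_ctr))  # formula (5), assumed signature
    target_loss = torch.stack(losses).mean()  # mean over the groups of training data
    optimizer.zero_grad()
    target_loss.backward()
    optimizer.step()
    return target_loss.item()
```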
In S605, it is determined whether the amount of training data in the training set that has been used reaches a preset threshold. If not, S601 to S604 are executed again; if so, S606 is performed.
The training of the model may be performed in multiple rounds, each round including a training phase and a verification phase: the training phase updates the parameters of the model, and the verification phase verifies the performance and prediction accuracy of the model. When the amount of used training data in the training set has not reached the preset threshold, the training phase of the current round can be considered incomplete, and S601 to S604 may be re-executed to continue training the model. The preset threshold may be calibrated in advance, and the number of sets of training data acquired from the training set each time may be the same or different.
In S606, it is determined whether the model is trained to be complete by the validation set. In the case of yes, S607 is performed; if not, S601 to S606 are executed again.
When the amount of used training data in the training set reaches the preset threshold, the training phase of the current round can be considered complete, and the verification phase of the model is performed.
There are multiple ways to determine through the verification set whether the model has been trained. For example, the verification set may include multiple sets of verification data, each including a historical cover image of a historical video and users' historical actual click rate information on that video; the training set and the verification set may have no intersection. The mean or a weighted value of the loss function values corresponding to the sets of verification data, obtained with the currently trained model and the loss function, is taken as a comprehensive loss function value. If, over several consecutive rounds of training, the decrease of the comprehensive loss function value obtained through the verification set is smaller than a preset value, the model can be considered to have converged, and model training is determined to be complete. As another example, the verification set may include historical videos with relatively high actual click rates; cover images of these historical videos are selected through the currently trained model, and the accuracy of the model's cover selection is used as the criterion: if the cover images selected by the model are close to the actual historical cover images of the historical videos with high click rates, the model's predictions are considered accurate and its performance good, and model training can be determined to be complete.
In S607, a video cover determination model is obtained in response to the model training being completed.
If model training is complete, the video cover determination model is obtained; if not, S601 to S606 may be re-executed to perform the next round of training until model training is complete.
According to the above technical solution, the video cover determination model is obtained by training on users' historical actual click rate information for historical videos, and the model can predict the click rate of users on a video if a given image frame is used as the video's cover. The predicted click rate information output by the video cover determination model provides an accurate basis for determining the cover image of the video, so that the determined cover image attracts more user attention, the accuracy of cover image selection is improved, and the click rate and user attention of the video are increased.
The process of processing the history cover image by the model is described below, where the history cover image may be the first history cover image or the second history cover image, the target feature information output by the model is the first target feature information when the history cover image is the first history cover image, and the target feature information output by the model is the second target feature information when the history cover image is the second history cover image.
As shown in fig. 7, the image feature extraction module in the model may extract the image overall feature information of the history cover image, the salient object detection module determines the salient object feature information of the history cover image according to the image overall feature information of the history cover image, and the salient object feature enhancement module determines the image salient feature information of the history cover image according to the image overall feature information and the salient object feature information of the history cover image. And then, carrying out feature fusion by a fusion module according to the overall feature information and the salient feature information of the image of the history cover image to obtain fusion feature information, and determining the target feature information of the history cover image according to the fusion feature information. The way in which the untrained model processes the historical cover image, such as the way in which features are extracted and the way in which features are fused, may be similar to the way in which the trained video cover determination model processes the image frames described above.
After the first target feature information and the second target feature information output by the model are obtained, a corresponding loss function value of the set of training data may be determined, and an exemplary manner of determining the corresponding loss function value is described below.
Fig. 8 is a flowchart illustrating a method for determining a loss function value corresponding to a set of training data according to an exemplary embodiment, and S603 may include S801 to S803 as shown in fig. 8.
In S801, model prediction difference information between the first target feature information and the second target feature information is determined.
For example, the model output may be a number between (0, 1), and the model prediction difference information may be obtained by the following formula (3):
D_p = y_i − y_j    (3)
where D_p represents the model prediction difference information, y_i represents the first target feature information, and y_j represents the second target feature information.
In S802, normalization processing is performed according to the first historical actual click rate information and the second historical actual click rate information, so as to determine target click rate difference information.
The value of a historical actual click rate is usually very small, for example 0.02. If the model is trained directly with the historical actual click rate information, training becomes difficult, the model struggles to converge, and the prediction effect is not accurate enough. To ensure normal convergence of the model and the accuracy of the trained model's predictions, in the present disclosure, when determining the click rate difference information, normalization is first performed according to the first historical actual click rate information and the second historical actual click rate information, such that the normalized value is positively correlated with the actual click rate information and larger than the actual click rate, and the target click rate difference information is then determined.
Illustratively, this step S802 may include: determining the target click rate difference information according to a preset function and the ratio of the larger value to the smaller value of the first historical actual click rate information and the second historical actual click rate information.
Under the condition that the first historical actual click rate information is larger than the second historical actual click rate information, determining target click rate difference information according to the ratio information between the first historical actual click rate information and the second historical actual click rate information and a preset function; and under the condition that the first historical actual click rate information is smaller than or equal to the second historical actual click rate information, determining target click rate difference information according to the ratio information between the second historical actual click rate information and the first historical actual click rate information and a preset function. The preset function may employ, for example, a tanh function, and the target click rate difference information may be determined by the following formula (4):
D_g = tanh(c_i / c_j), if c_i > c_j;  D_g = tanh(c_j / c_i), if c_i ≤ c_j    (4)
where D_g represents the target click rate difference information, c_i represents the first historical actual click rate information, and c_j represents the second historical actual click rate information.
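Formulas (3) and (4) can be sketched directly in Python; the numerical example in the comment assumes two illustrative click rates.

```python
import math

def prediction_difference(y_i, y_j):
    """Formula (3): model prediction difference D_p."""
    return y_i - y_j

def target_difference(c_i, c_j):
    """Formula (4): tanh of the larger-to-smaller ratio of the two historical click rates."""
    ratio = c_i / c_j if c_i > c_j else c_j / c_i
    return math.tanh(ratio)

# With illustrative click rates of 0.02 and 0.01, target_difference gives tanh(2.0) ~ 0.96,
# a value that is much easier for the model to fit than the raw click rates themselves.
```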
In S803, according to the model prediction difference information, the target click rate difference information, and the preset loss function, a loss function value corresponding to the set of training data is determined.
By way of example, equation (5) for the preset loss function is as follows:
[Formula (5) appears as an image in the original publication and is not reproduced here.]
where Loss represents the loss function value, max represents the well-known maximum-taking function, and m_s and m_d are preset thresholds with 0 < m_s < m_d.
According to the above technical solution, because the values of historical actual click rates are usually very small, training the model directly on them is difficult and the model struggles to converge. To ensure normal convergence of the model and the accuracy of the trained model's predictions, when determining the click rate difference information, normalization is first performed according to the first historical actual click rate information and the second historical actual click rate information to determine the target click rate difference information, and the loss function value corresponding to the set of training data is then determined from the model prediction difference information, the target click rate difference information, and the preset loss function.
Based on the same inventive concept, the present disclosure also provides a video cover determining apparatus, and fig. 9 is a block diagram of a video cover determining apparatus according to an exemplary embodiment, and as shown in fig. 9, the apparatus 900 may include:
a first acquiring module 901, configured to acquire a plurality of image frames in a target video;
a first determining module 902, configured to determine, for each of the image frames, salient object feature information of the image frame, and determine, according to the salient object feature information, predicted click rate information of a user on the target video if the image frame is used as a cover image of the target video;
A second determining module 903, configured to determine a target cover image of the target video from the plurality of image frames according to the predicted click rate information.
Optionally, the first determining module 902 may include: the extraction sub-module is used for extracting the image integral characteristic information of the image frame, wherein each element in the image integral characteristic information corresponds to the characteristic information of a preset number of pixel points at a designated position in the image frame; the first determining submodule is used for determining the salient object feature information of the image frame according to the integral feature information of the image, wherein each element in the salient object feature information is used for representing the credibility of the salient object at the position of the preset number of pixel points of the appointed position in the image frame.
Optionally, the first determining module 902 may include: the second determining submodule is used for determining image salient feature information of the image frame according to the image integral feature information and the salient object feature information of the image frame; the fusion sub-module is used for carrying out feature fusion according to the overall feature information of the image and the salient feature information of the image to obtain fusion feature information; and the third determining submodule is used for determining the predicted click rate information according to the fusion characteristic information.
Optionally, the fusion sub-module may include: a fourth determining submodule, configured to determine a first weight of the image global feature information and a second weight of the image salient feature information according to the image global feature information and the image salient feature information; and a fifth determining submodule, configured to obtain the fusion feature information according to the first weight, the second weight, the image overall feature information and the image saliency feature information.
Optionally, the predicted click rate information is obtained by processing the image frame through a video cover determination model, the video cover determination model determines salient object feature information of the image frame and determines the predicted click rate information according to the salient object feature information, and the video cover determination model is obtained by training through a training device of the video cover determination model, the training device including: a second acquisition module, configured to acquire at least one set of training data from a training set, where each set of training data includes a first historical cover image of a first historical video, first historical actual click rate information of users on the first historical video, a second historical cover image of a second historical video, and second historical actual click rate information of users on the second historical video; a third determining module, configured to, for each set of training data, respectively take the first historical cover image and the second historical cover image included in the set of training data as input of a model, respectively acquire first target feature information of the first historical cover image output after the model processes the first historical cover image and second target feature information of the second historical cover image output after the model processes the second historical cover image, and determine a loss function value corresponding to the set of training data according to the first target feature information, the second target feature information, the first historical actual click rate information, the second historical actual click rate information, and a preset loss function; a parameter updating module, configured to determine a target loss function value according to the loss function value corresponding to each of the at least one set of training data and update the parameters of the model according to the target loss function value; a triggering module, configured to, when the amount of used training data in the training set has not reached a preset threshold, trigger the second acquisition module to acquire at least one set of training data from the training set and trigger the parameter updating module to determine a target loss function value according to the loss function value corresponding to each set of training data and update the parameters of the model according to the target loss function value; a fourth determining module, configured to determine, through a verification set, whether the model has been trained, when the amount of used training data in the training set reaches the preset threshold; and a model acquisition module, configured to obtain the video cover determination model in response to completion of model training.
Optionally, the third determining module includes: a prediction difference determining submodule, configured to determine model prediction difference information between the first target feature information and the second target feature information; a target difference determining submodule, configured to perform normalization according to the first historical actual click rate information and the second historical actual click rate information to determine target click rate difference information; and a loss function value determining submodule, configured to determine the loss function value corresponding to the set of training data according to the model prediction difference information, the target click rate difference information, and the preset loss function.
Optionally, the target difference determining submodule is configured to determine the target click rate difference information according to a preset function and the ratio of the larger value to the smaller value of the first historical actual click rate information and the second historical actual click rate information.
Referring now to fig. 10, a schematic diagram of an electronic device 1000 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 10 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 10, the electronic device 1000 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 1001 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage means 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the electronic apparatus 1000 are also stored. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
In general, the following devices may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 1007 including, for example, a Liquid Crystal Display (LCD), speaker, vibrator, etc.; storage 1008 including, for example, magnetic tape, hard disk, etc.; and communication means 1009. The communication means 1009 may allow the electronic device 1000 to communicate wirelessly or by wire with other devices to exchange data. While fig. 10 shows an electronic device 1000 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 1009, or installed from the storage device 1008, or installed from the ROM 1002. The above-described functions defined in the method of the embodiment of the present disclosure are performed when the computer program is executed by the processing device 1001.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a plurality of image frames in a target video; for each image frame, determining salient object feature information of the image frame, and determining predicted click rate information of a user on the target video if the image frame is used as a cover image of the target video according to the salient object feature information; and determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a module does not constitute a limitation of the module itself; for example, the first acquisition module may also be described as an "image frame acquisition module".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In accordance with one or more embodiments of the present disclosure, example 1 provides a video cover determination method, the method comprising: acquiring a plurality of image frames in a target video; for each image frame, determining salient object feature information of the image frame, and determining predicted click rate information of a user on the target video if the image frame is used as a cover image of the target video according to the salient object feature information; and determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information.
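As a non-authoritative illustration of example 1, the sketch below samples frames from a video, scores each sampled frame with a click rate prediction model, and keeps the highest-scoring frame as the target cover image. The use of OpenCV for frame extraction, the sampling interval, and the predict_click_rate callable (assumed to return a numeric score) are assumptions introduced only for this sketch; the disclosure does not prescribe them.

    # Minimal sketch of the example 1 pipeline, assuming OpenCV for frame
    # extraction and a callable click rate model; both are illustrative choices.
    import cv2


    def select_cover(video_path, predict_click_rate, sample_every=30):
        """Return the sampled frame with the highest predicted click rate."""
        capture = cv2.VideoCapture(video_path)
        best_frame, best_score = None, float("-inf")
        index = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            if index % sample_every == 0:
                # predict_click_rate is assumed to wrap the model described in
                # examples 2-5 (salient object features -> fused features -> score).
                score = predict_click_rate(frame)
                if score > best_score:
                    best_frame, best_score = frame, score
            index += 1
        capture.release()
        return best_frame, best_score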
In accordance with one or more embodiments of the present disclosure, example 2 provides the method of example 1, wherein the determining salient object feature information of the image frame comprises: extracting image overall feature information of the image frame, wherein each element in the image overall feature information corresponds to feature information of a preset number of pixel points at a designated position in the image frame; and determining the salient object feature information of the image frame according to the image overall feature information, wherein each element in the salient object feature information is used for representing the credibility that a salient object is present at the position of the preset number of pixel points at the designated position in the image frame.
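A minimal sketch of one way example 2 could be realized, assuming PyTorch: each spatial element of a convolutional backbone's output corresponds to a fixed patch of input pixels (the preset number of pixel points at a designated position), and a 1x1 convolution followed by a sigmoid turns that feature map into per-element salient object confidences. The backbone layout and channel sizes are assumptions, not taken from the disclosure.

    import torch.nn as nn


    class SaliencyHead(nn.Module):
        """Derives salient object confidences from the image overall features."""

        def __init__(self, channels=256):
            super().__init__()
            # Backbone output: each (h, w) cell summarizes a patch of input pixels,
            # i.e. the image overall feature information.
            self.backbone = nn.Sequential(
                nn.Conv2d(3, channels, kernel_size=7, stride=4, padding=3),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True),
            )
            # 1x1 conv + sigmoid: credibility that a salient object covers the patch,
            # i.e. the salient object feature information.
            self.saliency = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1),
                                          nn.Sigmoid())

        def forward(self, image):
            overall = self.backbone(image)
            salient_map = self.saliency(overall)
            return overall, salient_map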
According to one or more embodiments of the present disclosure, example 3 provides the method of example 1, wherein the determining, according to the salient object feature information, predicted click rate information of a user on the target video if the image frame is taken as a cover image of the target video comprises: determining image salient feature information of the image frame according to the image overall feature information of the image frame and the salient object feature information; performing feature fusion according to the image overall feature information and the image salient feature information to obtain fused feature information; and determining the predicted click rate information according to the fused feature information.
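One reading of example 3, sketched under the assumption that the salient object map acts as a spatial mask: multiplying the overall feature map by the map enhances regions covered by salient objects and weakens the rest, the two maps are then fused (the fusion itself is elaborated in example 4), and a small head regresses the predicted click rate. The pooling and the linear head are illustrative assumptions.

    import torch.nn as nn


    class ClickRateHead(nn.Module):
        """Predicts a click rate score from overall and salient features."""

        def __init__(self, channels=256, fuse=None):
            super().__init__()
            # `fuse` is any callable combining the two feature maps (see example 4);
            # plain addition is used here only as a placeholder.
            self.fuse = fuse if fuse is not None else (lambda a, b: a + b)
            self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(channels, 1))

        def forward(self, overall, salient_map):
            # Weighting by the saliency map enhances salient regions and weakens
            # non-salient ones -> image salient feature information.
            salient_features = overall * salient_map
            fused = self.fuse(overall, salient_features)  # fused feature information
            return self.head(fused)                       # predicted click rate information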
According to one or more embodiments of the present disclosure, example 4 provides the method of example 3, wherein the performing feature fusion according to the image overall feature information and the image salient feature information to obtain fused feature information comprises: determining a first weight of the image overall feature information and a second weight of the image salient feature information according to the image overall feature information and the image salient feature information; and obtaining the fused feature information according to the first weight, the second weight, the image overall feature information and the image salient feature information.
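Example 4 derives the two fusion weights from the feature maps themselves; an attention-style gate is one plausible reading, sketched below. Producing the weights with a 1x1 convolution over the concatenated maps and normalizing them with a softmax is an assumption made for illustration only.

    import torch
    import torch.nn as nn


    class WeightedFusion(nn.Module):
        """Fuses overall and salient features with weights derived from both."""

        def __init__(self, channels=256):
            super().__init__()
            # Predict two per-location weights from the concatenated feature maps.
            self.weight_net = nn.Conv2d(2 * channels, 2, kernel_size=1)

        def forward(self, overall, salient_features):
            logits = self.weight_net(torch.cat([overall, salient_features], dim=1))
            weights = torch.softmax(logits, dim=1)        # first and second weights
            w_overall, w_salient = weights[:, :1], weights[:, 1:]
            return w_overall * overall + w_salient * salient_features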
According to one or more embodiments of the present disclosure, example 5 provides the method of example 1, wherein the predicted click rate information is obtained by processing the image frame through a video cover determination model, the video cover determination model determining the salient object feature information of the image frame and determining the predicted click rate information according to the salient object feature information, and wherein the video cover determination model is trained through the following steps: acquiring at least one group of training data from a training set, wherein each group of training data comprises a first historical cover image of a first historical video and first historical actual click rate information of a user on the first historical video, and a second historical cover image of a second historical video and second historical actual click rate information of the user on the second historical video; for each group of training data, respectively taking the first historical cover image and the second historical cover image included in the group of training data as inputs of a model, and respectively acquiring first target feature information of the first historical cover image output after the model processes the first historical cover image and second target feature information of the second historical cover image output after the model processes the second historical cover image; determining a loss function value corresponding to the group of training data according to the first target feature information, the second target feature information, the first historical actual click rate information, the second historical actual click rate information and a preset loss function; determining a target loss function value according to the loss function values respectively corresponding to the at least one group of training data, and updating parameters of the model according to the target loss function value; in a case where the number of the training data in the training set has not reached a preset threshold, re-executing the steps from acquiring at least one group of training data from the training set to determining the target loss function value and updating the parameters of the model according to the target loss function value; in a case where the number of the training data in the training set reaches the preset threshold, determining, through a validation set, whether training of the model is complete; and in response to completion of the training of the model, obtaining the video cover determination model.
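A compressed sketch of the training procedure of example 5, assuming PyTorch, a data loader that yields pairs of historical cover images with their actual click rates, and a scalar target feature per image. The Adam optimizer, the counter-based threshold check, and the pairwise_loss helper (sketched after example 6) are illustrative assumptions, not the disclosure's prescribed implementation.

    import torch


    def train_cover_model(model, train_loader, val_check, pairwise_loss,
                          threshold, lr=1e-4):
        """Pairwise training loop as read from example 5 (illustrative sketch)."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        seen, done = 0, False
        while not done:
            for img1, ctr1, img2, ctr2 in train_loader:  # one group of training data
                feat1, feat2 = model(img1), model(img2)  # first/second target features
                loss = pairwise_loss(feat1, feat2, ctr1, ctr2)  # target loss value
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()                         # update model parameters
                seen += img1.shape[0]
                # Only once the preset threshold is reached is completion
                # checked on the validation set.
                if seen >= threshold:
                    done = val_check(model)
                    seen = 0
                    if done:
                        break
        return model                                     # video cover determination model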
According to one or more embodiments of the present disclosure, example 6 provides the method of example 5, wherein the determining the loss function value corresponding to the group of training data according to the first target feature information, the second target feature information, the first historical actual click rate information, the second historical actual click rate information and a preset loss function comprises: determining model prediction difference information between the first target feature information and the second target feature information; performing normalization processing according to the first historical actual click rate information and the second historical actual click rate information to determine target click rate difference information; and determining the loss function value corresponding to the group of training data according to the model prediction difference information, the target click rate difference information and the preset loss function.
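Example 6 compares the model's predicted difference with a normalized difference of the actual click rates; a minimal sketch follows, assuming the target feature information is a scalar score per image and that the preset loss function is a squared error between the two differences. Both assumptions are made only for illustration.

    import torch


    def pairwise_loss(feat1, feat2, ctr1, ctr2, normalize=None):
        """Loss value for one group of training data (sketch of example 6)."""
        # Model prediction difference information between the two target features.
        predicted_diff = feat1 - feat2
        # Target click rate difference information from the actual click rates;
        # the normalization itself is sketched after example 7.
        target_diff = normalize(ctr1, ctr2) if normalize else ctr1 - ctr2
        # Preset loss function: a squared error is one illustrative choice.
        return torch.mean((predicted_diff - target_diff) ** 2)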
In accordance with one or more embodiments of the present disclosure, example 7 provides the method of example 6, wherein the performing normalization processing according to the first historical actual click rate information and the second historical actual click rate information to determine target click rate difference information comprises: determining the target click rate difference information according to ratio information of the larger value to the smaller value of the first historical actual click rate information and the second historical actual click rate information and a preset function.
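Example 7 fixes only that the ratio of the larger to the smaller actual click rate is passed through a preset function; the sketch below uses a logarithm as that function and restores the sign of the raw difference so the result stays comparable with the model's signed prediction. Both the logarithm and the sign handling are assumptions.

    import torch


    def normalize_click_rates(ctr1, ctr2, eps=1e-8):
        """Target click rate difference via the larger/smaller ratio (sketch)."""
        larger = torch.maximum(ctr1, ctr2)
        smaller = torch.minimum(ctr1, ctr2)
        ratio = larger / (smaller + eps)      # ratio of larger to smaller value
        # Preset function: log is one illustrative choice; the sign marks which
        # historical cover actually had the higher click rate.
        return torch.sign(ctr1 - ctr2) * torch.log(ratio)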
In accordance with one or more embodiments of the present disclosure, example 8 provides a video cover determination apparatus, the apparatus comprising: the first acquisition module is used for acquiring a plurality of image frames in the target video; the first determining module is used for determining the salient object feature information of each image frame, and determining the predicted click rate information of a user on the target video if the image frame is used as the cover image of the target video according to the salient object feature information; and the second determining module is used for determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information.
According to one or more embodiments of the present disclosure, example 9 provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method of any of examples 1-7.
In accordance with one or more embodiments of the present disclosure, example 10 provides an electronic device, comprising: a storage device having a computer program stored thereon; processing means for executing the computer program in the storage means to implement the steps of the method of any one of examples 1-7.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in this disclosure is not limited to the specific combinations of the features described above, but also covers other embodiments which may be formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, embodiments formed by mutually replacing the features described above with the technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method, and will not be described in detail herein.

Claims (9)

1. A method for determining a video cover, the method comprising:
acquiring a plurality of image frames in a target video;
determining salient object feature information of each image frame, and determining predicted click rate information of a user on the target video if the image frame is used as a cover image of the target video according to the salient object feature information, wherein each element in the salient object feature information is used for representing the credibility that a salient object is present at the position of a preset number of pixel points at a designated position in the image frame;
determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information;
wherein the determining predicted click rate information of a user on the target video if the image frame is used as a cover image of the target video according to the salient object feature information comprises:
determining image salient feature information of the image frame according to the image overall feature information of the image frame and the salient object feature information, wherein features of a region belonging to a salient object are enhanced and features of a region not belonging to the salient object are weakened in the image salient feature information;
performing feature fusion according to the image overall feature information and the image salient feature information to obtain fused feature information;
and determining the predicted click rate information according to the fused feature information.
2. The method of claim 1, wherein said determining salient object feature information for the image frame comprises:
extracting image overall feature information of the image frame, wherein each element in the image overall feature information corresponds to feature information of a preset number of pixel points at a designated position in the image frame;
and determining the salient object feature information of the image frame according to the image overall feature information.
3. The method according to claim 1, wherein the performing feature fusion according to the image overall feature information and the image salient feature information to obtain fused feature information comprises:
determining a first weight of the image overall feature information and a second weight of the image salient feature information according to the image overall feature information and the image salient feature information;
and obtaining the fused feature information according to the first weight, the second weight, the image overall feature information and the image salient feature information.
4. The method of claim 1, wherein the predicted click rate information is obtained by processing the image frames through a video cover determination model that determines salient object feature information for the image frames and determines the predicted click rate information based on the salient object feature information,
the video cover determination model is trained by the following steps:
acquiring at least one group of training data from a training set, wherein each group of training data comprises a first historical cover image of a first historical video and first historical actual click rate information of a user on the first historical video, and a second historical cover image of a second historical video and second historical actual click rate information of the user on the second historical video;
for each group of training data, respectively taking the first historical cover image and the second historical cover image included in the group of training data as inputs of a model, and respectively acquiring first target feature information of the first historical cover image output after the model processes the first historical cover image and second target feature information of the second historical cover image output after the model processes the second historical cover image; determining a loss function value corresponding to the group of training data according to the first target feature information, the second target feature information, the first historical actual click rate information, the second historical actual click rate information and a preset loss function;
determining a target loss function value according to the loss function value corresponding to each of the at least one group of training data, and updating parameters of the model according to the target loss function value;
in a case where the number of the training data in the training set has not reached a preset threshold, re-executing the steps from acquiring at least one group of training data from the training set to determining the target loss function value according to the loss function values respectively corresponding to the at least one group of training data and updating the parameters of the model according to the target loss function value;
in a case where the number of the training data in the training set reaches the preset threshold, determining, through a validation set, whether training of the model is complete;
and in response to completion of the training of the model, obtaining the video cover determination model.
5. The method of claim 4, wherein the determining the loss function value corresponding to the group of training data according to the first target feature information, the second target feature information, the first historical actual click rate information, the second historical actual click rate information, and a preset loss function comprises:
determining model prediction difference information between the first target feature information and the second target feature information;
performing normalization processing according to the first historical actual click rate information and the second historical actual click rate information to determine target click rate difference information;
and determining the loss function value corresponding to the group of training data according to the model prediction difference information, the target click rate difference information and the preset loss function.
6. The method of claim 5, wherein the performing normalization processing according to the first historical actual click rate information and the second historical actual click rate information to determine target click rate difference information comprises:
determining the target click rate difference information according to ratio information of the larger value to the smaller value of the first historical actual click rate information and the second historical actual click rate information and a preset function.
7. A video cover determining apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a plurality of image frames in the target video;
the first determining module is used for determining salient object feature information of each image frame, and determining predicted click rate information of a user on the target video if the image frame is used as a cover image of the target video according to the salient object feature information, wherein each element in the salient object feature information is used for representing the credibility that a salient object is present at the position of a preset number of pixel points at a designated position in the image frame;
the second determining module is used for determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information;
wherein the first determining module comprises:
a second determining sub-module, configured to determine image salient feature information of the image frame according to the image overall feature information of the image frame and the salient object feature information, wherein features of a region belonging to a salient object are enhanced and features of a region not belonging to the salient object are weakened in the image salient feature information;
a fusion sub-module, configured to perform feature fusion according to the image overall feature information and the image salient feature information to obtain fused feature information;
and a third determining sub-module, configured to determine the predicted click rate information according to the fused feature information.
8. A computer readable medium on which a computer program is stored, characterized in that the program, when executed by a processing device, carries out the steps of the method according to any one of claims 1-6.
9. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method according to any one of claims 1-6.
CN202110075978.4A 2021-01-20 2021-01-20 Video cover determining method, device, medium and equipment Active CN112800276B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110075978.4A CN112800276B (en) 2021-01-20 2021-01-20 Video cover determining method, device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110075978.4A CN112800276B (en) 2021-01-20 2021-01-20 Video cover determining method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN112800276A CN112800276A (en) 2021-05-14
CN112800276B true CN112800276B (en) 2023-06-20

Family

ID=75810795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110075978.4A Active CN112800276B (en) 2021-01-20 2021-01-20 Video cover determining method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN112800276B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343832B (en) * 2021-06-01 2024-02-02 北京奇艺世纪科技有限公司 Video cover distinguishing method, device, equipment and computer readable medium
CN113821678B (en) * 2021-07-21 2024-04-12 腾讯科技(深圳)有限公司 Method and device for determining video cover

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918656A (en) * 2017-11-17 2018-04-17 北京奇虎科技有限公司 Video front cover extracting method and device based on video title
CN109165301A (en) * 2018-09-13 2019-01-08 北京字节跳动网络技术有限公司 Video cover selection method, device and computer readable storage medium
CN109862432A (en) * 2019-01-31 2019-06-07 厦门美图之家科技有限公司 Clicking rate prediction technique and device
CN110909594A (en) * 2019-10-12 2020-03-24 杭州电子科技大学 Video significance detection method based on depth fusion
CN111984821A (en) * 2020-06-22 2020-11-24 汉海信息技术(上海)有限公司 Method and device for determining dynamic cover of video, storage medium and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108833942A (en) * 2018-06-28 2018-11-16 北京达佳互联信息技术有限公司 Video cover choosing method, device, computer equipment and storage medium
CN111491202B (en) * 2019-01-29 2021-06-15 广州市百果园信息技术有限公司 Video publishing method, device, equipment and storage medium
CN110765882B (en) * 2019-09-25 2023-04-07 腾讯科技(深圳)有限公司 Video tag determination method, device, server and storage medium
CN110796204B (en) * 2019-11-01 2023-05-02 腾讯科技(深圳)有限公司 Video tag determining method, device and server
CN111918130A (en) * 2020-08-11 2020-11-10 北京达佳互联信息技术有限公司 Video cover determining method and device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918656A (en) * 2017-11-17 2018-04-17 北京奇虎科技有限公司 Video front cover extracting method and device based on video title
CN109165301A (en) * 2018-09-13 2019-01-08 北京字节跳动网络技术有限公司 Video cover selection method, device and computer readable storage medium
CN109862432A (en) * 2019-01-31 2019-06-07 厦门美图之家科技有限公司 Clicking rate prediction technique and device
CN110909594A (en) * 2019-10-12 2020-03-24 杭州电子科技大学 Video significance detection method based on depth fusion
CN111984821A (en) * 2020-06-22 2020-11-24 汉海信息技术(上海)有限公司 Method and device for determining dynamic cover of video, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN112800276A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN111476309B (en) Image processing method, model training method, device, equipment and readable medium
KR102406354B1 (en) Video restoration method and apparatus, electronic device and storage medium
CN110516678B (en) Image processing method and device
US11443438B2 (en) Network module and distribution method and apparatus, electronic device, and storage medium
CN110796664B (en) Image processing method, device, electronic equipment and computer readable storage medium
CN112800276B (en) Video cover determining method, device, medium and equipment
CN111459364B (en) Icon updating method and device and electronic equipment
CN112381717A (en) Image processing method, model training method, device, medium, and apparatus
CN113222983A (en) Image processing method, image processing device, readable medium and electronic equipment
CN111325704A (en) Image restoration method and device, electronic equipment and computer-readable storage medium
CN114898177B (en) Defect image generation method, model training method, device, medium and product
CN114519667A (en) Image super-resolution reconstruction method and system
CN112330788A (en) Image processing method, image processing device, readable medium and electronic equipment
CN115346278A (en) Image detection method, device, readable medium and electronic equipment
CN113256339B (en) Resource release method and device, storage medium and electronic equipment
CN112907628A (en) Video target tracking method and device, storage medium and electronic equipment
CN115115836B (en) Image recognition method, device, storage medium and electronic equipment
US20220327663A1 (en) Video Super-Resolution using Deep Neural Networks
CN115546487A (en) Image model training method, device, medium and electronic equipment
CN115375656A (en) Training method, segmentation method, device, medium, and apparatus for polyp segmentation model
CN112561779B (en) Image stylization processing method, device, equipment and storage medium
CN111680754B (en) Image classification method, device, electronic equipment and computer readable storage medium
CN111737575A (en) Content distribution method and device, readable medium and electronic equipment
CN111612714A (en) Image restoration method and device and electronic equipment
CN115841151B (en) Model training method, device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant