CN112800276A - Video cover determination method, device, medium and equipment

Video cover determination method, device, medium and equipment

Info

Publication number
CN112800276A
CN112800276A (application CN202110075978.4A)
Authority
CN
China
Prior art keywords
image
information
target
click rate
video
Prior art date
Legal status
Granted
Application number
CN202110075978.4A
Other languages
Chinese (zh)
Other versions
CN112800276B (en)
Inventor
张帆
刘畅
李亚
周杰
余俊
徐佳燕
王长虎
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110075978.4A
Publication of CN112800276A
Application granted
Publication of CN112800276B
Legal status: Active
Anticipated expiration

Classifications

    • G06F16/739 Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G06F16/7847 Retrieval characterised by using metadata automatically derived from the content, using low-level visual features of the video content
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/253 Fusion techniques of extracted features
    • G06N3/045 Combinations of networks
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure relates to a method, apparatus, medium, and device for determining a video cover. The method includes: acquiring a plurality of image frames in a target video; for each image frame, determining salient object feature information of the image frame and, according to the salient object feature information, determining predicted click rate information of users on the target video if the image frame were used as a cover image of the target video; and determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information. With this technical solution, image frames containing salient objects are more likely to be selected as the cover image of the target video, so that the determined target cover image better matches users' browsing interests, the accuracy of cover image selection is improved, and publishing the target video with the target cover image as its cover can increase the click rate of the target video and the attention it receives from users.

Description

Video cover determination method, device, medium and equipment
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method, an apparatus, a medium, and a device for determining a video cover.
Background
The video cover is the first information about a video that a user sees and forms the user's first impression of the video; it often directly determines whether the user clicks on the video to watch it. Selecting a suitable image frame as the video cover is therefore particularly important.
In the related art, a cover image is generally selected according to factors such as the color, texture, definition, and compositional integrity of the image frames in a video; however, a cover image selected in this way may not increase users' attention to the video. Alternatively, a suitable image frame may be selected manually as the cover image of the video, but this consumes considerable manpower and is inefficient, which affects video distribution.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a video cover determination method, the method comprising: acquiring a plurality of image frames in a target video; for each image frame, determining salient object feature information of the image frame, and determining, according to the salient object feature information, predicted click rate information of users on the target video if the image frame is used as a cover image of the target video; and determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information.
In a second aspect, the present disclosure provides a video cover determination apparatus, the apparatus comprising: a first acquisition module configured to acquire a plurality of image frames in a target video; a first determining module configured to, for each image frame, determine salient object feature information of the image frame and determine, according to the salient object feature information, predicted click rate information of users on the target video if the image frame is used as a cover image of the target video; and a second determining module configured to determine a target cover image of the target video from the plurality of image frames according to the predicted click rate information.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method provided by the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising: a storage device having a computer program stored thereon; processing means for executing the computer program in the storage means to implement the steps of the method provided by the first aspect of the present disclosure.
Through the above technical solution, the predicted click rate information of users on the target video, were the image frame used as its cover image, is determined according to the salient object feature information of the image frame. Users generally pay more attention to salient objects in a video, so determining the predicted click rate information of an image frame from its salient object feature information means that image frames in which salient objects appear tend to receive relatively higher predicted click rates. Consequently, when the cover is selected, image frames containing salient objects are more likely to be chosen as the cover image of the target video, the determined target cover image better matches users' browsing interests, the accuracy of cover image selection is improved, and publishing the target video with the target cover image as its cover can increase the click rate of the target video and the attention it receives from users.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow diagram illustrating a method for video cover determination in accordance with an exemplary embodiment.
FIG. 2 is a diagram illustrating a video cover determination model according to an exemplary embodiment.
Fig. 3 is a flow diagram illustrating a method of determining salient object feature information for an image frame in accordance with an exemplary embodiment.
Fig. 4 is a schematic diagram illustrating a process for processing an image frame according to an exemplary embodiment.
FIG. 5 is a flowchart illustrating a method of determining predicted click rate information from salient object feature information of an image frame in accordance with an exemplary embodiment.
FIG. 6 is a flow diagram illustrating a method of training a video cover determination model in accordance with an exemplary embodiment.
FIG. 7 is a schematic diagram illustrating a model training process in accordance with an exemplary embodiment.
Fig. 8 is a flow chart illustrating a method of determining loss function values for a set of training data according to an example embodiment.
FIG. 9 is a block diagram illustrating a video cover determination device in accordance with one illustrative embodiment.
Fig. 10 is a schematic structural diagram of an electronic device according to an exemplary embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a flowchart illustrating a video cover determination method according to an exemplary embodiment. The method may be applied to an electronic device with processing capability, such as a terminal or a server, and, as shown in fig. 1, may include S101 to S103.
In S101, a plurality of image frames in a target video are acquired.
The target video is a video for which a cover image needs to be determined. For example, the target video may be a video shot by a user in real time, a video stored in advance, or a video that has already been published to a network and whose cover image needs to be replaced. The plurality of image frames may be any plurality of image frames in the target video; the present disclosure does not particularly limit the number of image frames acquired from the target video.
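By way of illustration, the following sketch shows one possible way to implement S101 using OpenCV; the uniform sampling strategy, the frame count of 8, and the function name are assumptions of the example rather than requirements of the present disclosure.

```python
import cv2

def sample_candidate_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample a plurality of candidate image frames from a target video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * step)  # jump to an evenly spaced frame index
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```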
In S102, for each image frame, salient object feature information of the image frame is determined, and predicted click rate information of users on the target video if the image frame were used as a cover image of the target video is determined according to the salient object feature information.
A salient object may be, for example, a person or an item. Taking a short video as an example, a short video usually has a main shooting subject, so a relatively salient main object usually exists in the video; for example, if a person is the shooting subject, that person can be regarded as the salient object. Users generally pay more attention to salient objects in a video, so determining the predicted click rate information of an image frame from the salient object feature information of that frame means that image frames in which salient objects appear tend to receive relatively higher predicted click rates.
The predicted click rate information is the click rate that users are predicted, before the video is published, to produce on the target video if the image frame is used as its cover image. The higher the predicted click rate information, the better the image frame can attract users' attention as a cover image, and the higher the actual click rate of the target video may be. In an alternative embodiment, the predicted click rate information may be represented as a numerical value, for example a value in (0, 1).
In S103, a target cover image of the target video is determined from the plurality of image frames according to the predicted click rate information.
In one embodiment, the image frame with the highest predicted click rate information may be used as the target cover image of the target video. In another embodiment, the several image frames with the highest predicted click rate information may be presented to the user, and the image frame selected by the user is used as the target cover image of the target video according to the user's selection operation.
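The selection in S103 can be sketched as a simple ranking over the predicted click rates; the function below is illustrative only, and top_k is a hypothetical parameter covering both embodiments above.

```python
def choose_target_cover(frames, predicted_click_rates, top_k=1):
    """Rank candidate frames by predicted click rate and return the best one(s)."""
    order = sorted(range(len(frames)),
                   key=lambda i: predicted_click_rates[i],
                   reverse=True)
    # top_k = 1 takes the highest-scoring frame directly as the target cover image;
    # top_k > 1 returns several candidates that could be offered to the user.
    return [frames[i] for i in order[:top_k]]
```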
Through the above technical solution, the predicted click rate information of users on the target video, were the image frame used as its cover image, is determined according to the salient object feature information of the image frame. Users generally pay more attention to salient objects in a video, so determining the predicted click rate information of an image frame from its salient object feature information means that image frames in which salient objects appear tend to receive relatively higher predicted click rates. Consequently, when the cover is selected, image frames containing salient objects are more likely to be chosen as the cover image of the target video, the determined target cover image better matches users' browsing interests, the accuracy of cover image selection is improved, and publishing the target video with the target cover image as its cover can increase the click rate of the target video and the attention it receives from users.
In an optional embodiment, the predicted click rate information may be obtained by processing an image frame with a video cover determination model. The electronic device executing the video cover determination method provided by the present disclosure may be configured with a pre-trained video cover determination model; the electronic device may input an image frame into the model, and the model determines the salient object feature information of the image frame and then determines the predicted click rate information according to that salient object feature information. Fig. 2 is a schematic diagram of a video cover determination model according to an exemplary embodiment. It should be noted that the video cover determination model shown in fig. 2 is only an example and does not limit the embodiments of the disclosure; in practical applications, the form and structure of the model are not limited thereto.
Alternatively, an exemplary embodiment of determining the salient object feature information of the image frame in S102 may be as shown in fig. 3, including S301 and S302.
In S301, image overall feature information of the image frame is extracted.
Referring to fig. 2, the image overall feature information of an image frame may be extracted by an image feature extraction module in the model, which may be, for example, a convolution module. Each element in the image overall feature information corresponds to the feature information of a preset number of pixel points at a specified position in the image frame; the feature information includes, for example, color feature information, brightness feature information, and edge feature information of those pixel points.
Illustratively, the resolution of the image frame is M × N, where M denotes the number of pixel points along the length of the image frame and N denotes the number of pixel points along its width. The image overall feature information may be represented as a matrix, for example an H × W × C matrix, where H is the number of columns, W is the number of rows, and C is the dimension of the feature vector.
Fig. 4 is a schematic diagram illustrating the processing of an image frame according to an exemplary embodiment. It should be noted that fig. 4 is only an example to help those skilled in the art better understand the processing procedure, and should not be construed as limiting the embodiments of the present disclosure. Fig. 4 takes convolution processing of an image frame with a convolution kernel of size 3 × 3 as an example; the preset number is related to the convolution kernel size and the network depth, and the convolution computation itself may refer to the related art. As shown in fig. 4, taking the element Y11 in the image overall feature information as an example, the element Y11 corresponds to the combined feature information of the 9 pixel points X11, X12, X13, X21, X22, X23, X31, X32 and X33 in the image frame. The other elements are analogous and are not described in detail.
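As a concrete illustration of S301, the sketch below builds a small convolutional feature extractor in PyTorch; the layer sizes and strides are arbitrary assumptions and do not represent the actual image feature extraction module.

```python
import torch
import torch.nn as nn

# Toy image feature extraction module: each element of the resulting H x W x C
# feature map aggregates a neighbourhood of input pixels (3 x 3 receptive fields
# per layer), analogous to element Y11 aggregating X11..X33 in fig. 4.
image_feature_extractor = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)

frame = torch.randn(1, 3, 224, 224)                # one M x N RGB image frame
overall_features = image_feature_extractor(frame)  # shape (1, 128, 56, 56), i.e. C x H x W
```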
In S302, salient object feature information of the image frame is determined according to the image overall feature information.
Referring to fig. 2, salient object detection may be performed by a salient object detection module in the model according to the image overall feature information. Each element in the salient object feature information characterizes the confidence that the positions of a preset number of pixel points at a specified position in the image frame belong to a salient object; the larger the element value, the higher the probability that those pixel points belong to a salient object.
The salient object feature information may be represented as a matrix whose numbers of rows and columns are the same as those of the matrix of the image overall feature information, i.e., an H × W matrix. As shown in fig. 4, the element Z11 in the salient object feature information can characterize the likelihood that the 9 pixel points X11, X12, X13, X21, X22, X23, X31, X32 and X33 in the image frame belong to a salient object.
In this way, the salient object feature information of the image frame is determined from its image overall feature information, so that the confidence that the pixel positions in the image frame belong to a salient object can be determined accurately.
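One plausible realization of the salient object detection module, assumed for illustration only, is a 1 × 1 convolution followed by a sigmoid, so that each element of the H × W map is a confidence value in (0, 1):

```python
import torch
import torch.nn as nn

# Hypothetical salient object detection module: maps the C-dimensional overall
# features to a single-channel H x W confidence map; larger values indicate a
# higher probability that the corresponding pixel region belongs to a salient object.
saliency_detector = nn.Sequential(
    nn.Conv2d(128, 1, kernel_size=1),
    nn.Sigmoid(),
)

overall_features = torch.randn(1, 128, 56, 56)     # image overall feature information
salient_map = saliency_detector(overall_features)  # shape (1, 1, 56, 56), values in (0, 1)
```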
Fig. 5 is a flowchart illustrating a method for determining predicted click rate information according to salient object feature information of an image frame according to an exemplary embodiment, where, as shown in fig. 5, determining, in S102, predicted click rate information of a user on a target video if the image frame is taken as a cover image of the target video according to the salient object feature information may include S501 to S503.
In S501, image salient feature information of the image frame is determined from the image overall feature information and the salient object feature information of the image frame.
For example, the image overall feature information of the image frame may be extracted as in the embodiment of S301. Referring to fig. 2, the salient object feature enhancement module in the model may determine the image salient feature information from the image overall feature information and the salient object feature information of the image frame. The image salient feature information may be obtained by multiplying each element of the image overall feature information by the corresponding element of the salient object feature information. The image salient feature information may also be represented as a matrix, where corresponding elements are elements at the same coordinate position in the matrices. As shown in fig. 4, the element M11 in the image salient feature information may be the product of Y11 and Z11, and M12 may be the product of Y12 and Z12.
In this way, in the resulting image salient feature information of the image frame, the features of regions belonging to salient objects are strengthened and the features of regions not belonging to salient objects are weakened, so that the image salient feature information can better represent the salient objects in the image frame.
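The element-wise enhancement of S501 can be sketched as a broadcast multiplication; the tensor shapes follow the illustrative sketches above and are assumptions of the example.

```python
import torch

# Image salient feature information: multiply each element of the overall feature
# map by the corresponding salient-object confidence, so that regions belonging to
# salient objects are strengthened and other regions are weakened (M11 = Y11 * Z11).
overall_features = torch.randn(1, 128, 56, 56)     # image overall feature information
salient_map = torch.rand(1, 1, 56, 56)             # salient object feature information
salient_features = overall_features * salient_map  # broadcast over the C channels
```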
In S502, feature fusion is performed according to the image overall feature information and the image saliency feature information to obtain fusion feature information.
Referring to fig. 2, the feature fusion operation may be performed in the fusion module of the model. As shown in fig. 2, the image salient feature information output by the salient object feature enhancement module may be processed by a Global Average Pooling (GAP) module and a Fully Connected (FC) module to obtain processed image salient feature information F_s. The image overall feature information is likewise processed by a global average pooling module and a fully connected module to obtain processed image overall feature information F_b. The fusion module can then fuse F_s, obtained from the image salient feature information, with F_b, obtained from the image overall feature information, to obtain the fusion feature information.
In an alternative embodiment, the information F_s and the information F_b may be used directly as the fusion feature information.
Preferably, in another embodiment of the present disclosure, the step S502 may include: determining a first weight of the image overall characteristic information and a second weight of the image saliency characteristic information according to the image overall characteristic information and the image saliency characteristic information; and obtaining fusion characteristic information according to the first weight, the second weight, the image overall characteristic information and the image saliency characteristic information.
For example, the first weight and the second weight may be determined by the following formula (1), and the fusion feature information may be obtained by the following formula (2):

[Formula (1), reproduced in the original only as an image: the first weight λ is computed from F_s and F_b by applying the linear transformations Q and K, scaling by the constant d, and passing the result through a sigmoid function.]

F_a = λ · F_s + (1 - λ) · F_b   (2)

where λ denotes the first weight, (1 - λ) denotes the second weight, F_a denotes the fusion feature information, sigmoid is a known function, Q and K denote linear transformation operations, and d is a constant greater than 0; for example, d may take the value of the feature vector dimension C described above.
In this way, considering that the relative importance of the salient object and the image background often differs between image frames, and compared with performing feature fusion with preset weight values, the present disclosure adaptively determines the respective weights from the image salient feature information and the image overall feature information, so that the obtained fusion feature information better matches the characteristics of the image itself and is more accurate.
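The sketch below illustrates the fusion of S502 and the prediction of S503 in the spirit of fig. 2; since formula (1) is reproduced in the original only as an image, the scaled dot-product gate used here is an assumed reconstruction, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class AdaptiveFusionHead(nn.Module):
    """Hypothetical fusion module plus click-rate head, in the spirit of fig. 2."""

    def __init__(self, channels: int = 128, dim: int = 64):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)        # global average pooling (GAP)
        self.fc_s = nn.Linear(channels, dim)      # FC for the salient branch -> F_s
        self.fc_b = nn.Linear(channels, dim)      # FC for the overall branch -> F_b
        self.q = nn.Linear(dim, dim, bias=False)  # linear transformation Q
        self.k = nn.Linear(dim, dim, bias=False)  # linear transformation K
        self.head = nn.Linear(dim, 1)             # fully connected prediction head

    def forward(self, salient_features, overall_features):
        f_s = self.fc_s(self.gap(salient_features).flatten(1))
        f_b = self.fc_b(self.gap(overall_features).flatten(1))
        d = f_s.shape[-1]
        # Assumed form of formula (1): lambda = sigmoid(<Q(F_s), K(F_b)> / sqrt(d))
        lam = torch.sigmoid((self.q(f_s) * self.k(f_b)).sum(-1, keepdim=True) / d ** 0.5)
        f_a = lam * f_s + (1.0 - lam) * f_b       # formula (2)
        return torch.sigmoid(self.head(f_a))      # predicted click rate in (0, 1)

fusion = AdaptiveFusionHead()
ctr = fusion(torch.randn(1, 128, 56, 56), torch.randn(1, 128, 56, 56))
```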
In S503, the predicted click rate information is determined based on the fusion feature information.
As shown in fig. 2, information obtained after the fusion feature information is processed by the full-connection module may be used as the predicted click rate information corresponding to the image frame.
With this technical solution, considering that a relatively salient main object, such as a human face, generally exists in a video, salient object detection is performed on the image frames of the target video, and the image salient feature information of an image frame is determined according to its salient object feature information, so that the image salient feature information can better represent the main object in the image frame. Since users pay more attention to salient objects in image frames, a salient object detection module is added to the model, and when a cover is selected with the trained video cover determination model, image frames containing salient objects are more likely to be chosen as the cover image of the target video, so that the determined cover image better matches users' browsing interests.
The video cover determination model may be trained according to users' historical actual click rate information on historical videos. The historical videos may include videos published over a historical period of time, such as the past week or month. For example, the historical videos may include videos whose number of presentations or exposures exceeds a certain threshold; the historical actual click rate information of users on a historical video is the ratio of the number of times the historical video was clicked to the number of times it was presented.
The historical cover image of a historical video is the cover image used when the historical video was displayed. Whether a user clicks a video to watch it is directly related to the video's cover image; if users click a historical video to watch it, this indicates to some extent that its historical cover image attracted their attention. Therefore, the historical actual click rate information of users on a historical video can characterize how attractive its historical cover image is to users, and in the present disclosure the video cover determination model is trained according to users' historical actual click rate information on historical videos.
Fig. 6 is a flowchart illustrating a training method for a video cover determination model according to an exemplary embodiment, which may be applied to an electronic device with processing capability, such as a terminal or a server, and the electronic device performing the model training method may be the same as or different from the electronic device performing the video cover determination method. As shown in fig. 6, the method may include S601 to S607.
In S601, at least one set of training data is obtained from a training set. Each set of training data may include a first historical cover image of a first historical video and first historical actual click rate information of users on the first historical video, as well as a second historical cover image of a second historical video and second historical actual click rate information of users on the second historical video.
The history cover image and the history actual click rate information have been described above. The first historical video and the second historical video are two different historical videos. In the step, at least one group of training data can be randomly selected from the training set, and each group of training data can comprise the respective historical cover images and the historical actual click rate information of two different historical videos. The number of sets of training data obtained from the training set is not particularly limited in the present disclosure, and may be one or more sets.
In S602, for each set of training data, the first historical cover image and the second historical cover image included in that set are respectively used as input to the model, obtaining first target feature information of the first historical cover image, output after the model processes the first historical cover image, and second target feature information of the second historical cover image, output after the model processes the second historical cover image.
The first and second historical cover images may be input to the model in various ways; for example, they may be input to the model in a preset order, or they may be input simultaneously to two identical copies of the model.
Fig. 7 is a schematic diagram illustrating a model training process according to an exemplary embodiment. The two models shown in fig. 7 are identical and form a twin network; they differ only in the historical cover image they process. Training the model with two different historical cover images at the same time does not mean that two separate models are trained.
The model processes the first historical cover image and outputs the first target feature information, which can characterize the click rate information of the first historical video as predicted by the model; likewise, the model processes the second historical cover image and outputs the second target feature information, which can characterize the click rate information of the second historical video as predicted by the model. The target feature information output by the model may be represented as a numerical value, for example a value in (0, 1).
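A sketch of the twin-network forward pass follows; the same weights score both historical cover images, and CoverModel is only a stand-in for the full architecture of fig. 7.

```python
import torch
import torch.nn as nn

class CoverModel(nn.Module):
    """Stand-in for the cover model; outputs one value in (0, 1) per input image."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
        )

    def forward(self, x):
        return torch.sigmoid(self.backbone(x))

model = CoverModel()                    # a single set of weights, used twice ("twin" network)
cover_i = torch.randn(4, 3, 224, 224)   # batch of first historical cover images
cover_j = torch.randn(4, 3, 224, 224)   # batch of second historical cover images
y_i = model(cover_i)                    # first target feature information
y_j = model(cover_j)                    # second target feature information
```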
In S603, a loss function value corresponding to the set of training data is determined according to the first target feature information, the second target feature information, the first historical actual click rate information, the second historical actual click rate information, and a preset loss function.
The loss function can be preset, and the corresponding loss function value can represent difference information between the model predicted click rate information and the actual click rate information.
In S604, a target loss function value is determined according to the loss function value corresponding to each of the at least one set of training data, and the parameters of the model are updated according to the target loss function value.
If only one set of training data was obtained in S601, the target loss function value is the loss function value corresponding to that set. When several sets are used, the target loss function value can reflect the model's overall prediction on the multiple sets of training data, so the model can be trained more accurately according to the target loss function value. For example, the parameters of the model may be updated by gradient descent according to the target loss function value; as shown in fig. 7, the parameters of the image feature extraction module, the full-connection module, the salient object detection module and the fusion module in the model may be updated.
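A minimal sketch of the parameter update in S604 by gradient descent; the stand-in model and the learning rate are assumptions of the example.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)                                  # stand-in for the cover model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# One update step: the target loss function value is back-propagated and the
# parameters of every module in the model are adjusted by gradient descent.
target_loss = model(torch.randn(4, 8)).pow(2).mean()     # placeholder target loss value
optimizer.zero_grad()
target_loss.backward()
optimizer.step()
```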
In S605, it is determined whether the amount of training data used from the training set has reached a preset threshold. If not, S601 to S604 are executed again; if so, S606 is executed.
The training of the model may be divided into multiple rounds, each round comprising a training stage and a verification stage; the parameters of the model are updated in the training stage, and the performance and prediction accuracy of the model are verified in the verification stage. If the amount of training data used from the training set has not reached the preset threshold, the training stage of the current round is not yet finished, and S601 to S604 may be executed again to continue training the model. The preset threshold may be calibrated in advance, and the number of sets of training data obtained from the training set each time may be the same or different.
In S606, it is determined through the validation set whether model training is complete. If so, S607 is executed; if not, S601 to S606 are executed again.
When the amount of training data used from the training set reaches the preset threshold, the training stage of the current round can be regarded as complete, and the model enters the verification stage.
There are various ways to determine, through the validation set, whether model training is complete. For example, the validation set may include multiple sets of validation data, where each set includes the historical cover image of a historical video and the historical actual click rate information of users on that video, and the training set and the validation set may be disjoint. Using the currently trained model and the loss function, the average or a weighted value of the loss function values corresponding to the sets of validation data can be computed as a comprehensive loss function value. If, over several consecutive rounds of training, the decrease in the comprehensive loss function value obtained on the validation set is smaller than a preset value, the model can be considered converged and training determined to be complete. In another example, the validation set may include historical videos with relatively high historical actual click rates; cover images for these videos are selected by the currently trained model, and the accuracy of the model's cover selection is used as the criterion. If the cover images selected by the model are close to the actual historical cover images of the high-click-rate historical videos, the model's predictions can be considered accurate, its performance good, and training determined to be complete.
In S607, a video cover determination model is obtained in response to completion of model training.
If model training is complete, the video cover determination model is obtained; if not, S601 to S606 may be executed again for the next round of training until model training is complete.
With the above technical solution, the video cover determination model is trained according to users' historical actual click rate information on historical videos, and the model can predict users' click rate information on a video if a given image frame were used as its cover. The predicted click rate information output by the video cover determination model provides an accurate basis for determining the cover image of a video, so that the determined cover image can better attract users' attention, the accuracy of cover image selection is improved, and the video's click rate and the attention it receives from users are increased.
Next, the process by which the model processes a historical cover image is described. The historical cover image may be the first historical cover image or the second historical cover image; when it is the first historical cover image, the target feature information output by the model is the first target feature information, and when it is the second historical cover image, the target feature information output by the model is the second target feature information.
As shown in fig. 7, the image feature extraction module in the model may extract the image overall feature information of the historical cover image; the salient object detection module may determine the salient object feature information of the historical cover image according to its image overall feature information; and the salient object feature enhancement module may determine the image salient feature information of the historical cover image according to its image overall feature information and salient object feature information. The fusion module then performs feature fusion according to the image overall feature information and the image salient feature information of the historical cover image to obtain fusion feature information, and the target feature information of the historical cover image is determined according to the fusion feature information. The way the untrained model processes a historical cover image, for example its feature extraction and feature fusion, may be similar to the way the trained video cover determination model processes an image frame as described above.
After obtaining the first target feature information and the second target feature information output by the model, the loss function value corresponding to the set of training data may be determined, and an exemplary manner of determining the corresponding loss function value is described below.
Fig. 8 is a flowchart illustrating a method for determining a loss function value corresponding to a set of training data according to an exemplary embodiment, and as shown in fig. 8, S603 may include S801 to S803.
In S801, model prediction difference information between the first target feature information and the second target feature information is determined.
For example, the model output may be a number in (0, 1), and the model prediction difference information may be obtained by the following formula (3):

D_p = y_i - y_j   (3)

where D_p denotes the model prediction difference information, y_i denotes the first target feature information, and y_j denotes the second target feature information.
In S802, normalization processing is performed according to the first historical actual click rate information and the second historical actual click rate information, and target click rate difference information is determined.
The historical actual click rate is usually a very small number, such as 0.02. If the model were trained directly with the historical actual click rate information, it would be hard to train and slow to converge, and its predictions would not be accurate enough. To ensure that the model converges normally and that the trained model's predictions are accurate, in the present disclosure the click rate difference information is determined after a normalization step: normalization is performed according to the first historical actual click rate information and the second historical actual click rate information, the normalized value is positively correlated with, and larger than, the actual click rate, and the target click rate difference information is then determined.
Illustratively, S802 may include: determining the target click rate difference information according to a preset function and the ratio of the larger to the smaller of the first historical actual click rate information and the second historical actual click rate information.
If the first historical actual click rate information is larger than the second, the target click rate difference information is determined according to the ratio between the first and the second historical actual click rate information and a preset function; if the first historical actual click rate information is smaller than or equal to the second, the target click rate difference information is determined according to the ratio between the second and the first historical actual click rate information and the preset function. The preset function may be, for example, the tanh function, and the target click rate difference information may be determined, for example, by the following formula (4):
[Formula (4), reproduced in the original only as an image: D_g is obtained by applying the preset function, for example tanh, to the ratio of the larger of the two historical actual click rates to the smaller, with the two cases distinguished according to which click rate is larger.]

where D_g denotes the target click rate difference information, c_i denotes the first historical actual click rate information, and c_j denotes the second historical actual click rate information.
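The sketch below implements formula (3) directly; because formula (4) is reproduced in the original only as an image, the signed tanh-of-ratio expression is an assumed reconstruction from the surrounding description.

```python
import torch

def prediction_difference(y_i, y_j):
    """Formula (3): D_p = y_i - y_j."""
    return y_i - y_j

def target_click_rate_difference(c_i, c_j):
    """Assumed form of formula (4): the preset function (tanh) applied to the ratio
    of the larger historical actual click rate to the smaller, with the sign chosen
    according to which of the two click rates is larger."""
    c_i, c_j = torch.as_tensor(c_i), torch.as_tensor(c_j)
    ratio = torch.where(c_i > c_j, c_i / c_j, c_j / c_i)
    return torch.where(c_i > c_j, torch.tanh(ratio), -torch.tanh(ratio))

d_p = prediction_difference(torch.tensor(0.71), torch.tensor(0.42))
d_g = target_click_rate_difference(0.031, 0.018)
```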
In S803, the loss function value corresponding to the set of training data is determined according to the model prediction difference information, the target click rate difference information, and the preset loss function.
Illustratively, the preset loss function, formula (5), is as follows:

[Formula (5), reproduced in the original only as an image: the loss is computed from the model prediction difference information D_p and the target click rate difference information D_g using the max function and the preset thresholds m_s and m_d.]

where Loss denotes the loss function value, max denotes the known function taking the maximum value, m_s and m_d are preset thresholds, and 0 < m_s < m_d.
With the above technical solution, because the historical actual click rate is usually a very small number, the model would otherwise be hard to train and slow to converge. To ensure that the model converges normally and that the trained model's predictions are accurate, when the click rate difference information is determined, normalization is first performed according to the first and second historical actual click rate information to obtain the target click rate difference information, and the corresponding loss function value is then determined according to the model prediction difference information, the target click rate difference information and the preset loss function.
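Since formula (5) is also reproduced only as an image, the sketch below uses a generic double-margin pairwise loss built from max, D_p, D_g and the thresholds m_s < m_d; it illustrates the kind of loss described and is not the exact patented expression.

```python
import torch

def pairwise_cover_loss(d_p, d_g, m_s=0.1, m_d=0.5):
    """Illustrative pairwise loss in the spirit of formula (5): when the target click
    rate difference is small (|D_g| < m_d) the prediction difference is pushed toward
    zero, otherwise it is pushed to agree with the sign of D_g by at least margin m_s."""
    small_gap = torch.clamp(torch.abs(d_p) - m_s, min=0.0)         # max(0, |D_p| - m_s)
    large_gap = torch.clamp(m_s - d_p * torch.sign(d_g), min=0.0)  # max(0, m_s - D_p * sign(D_g))
    return torch.where(torch.abs(d_g) < m_d, small_gap, large_gap).mean()

loss = pairwise_cover_loss(torch.tensor([0.29]), torch.tensor([0.94]))
```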
Based on the same inventive concept, the present disclosure also provides a video cover determination apparatus, and fig. 9 is a block diagram illustrating a video cover determination apparatus according to an exemplary embodiment, and as shown in fig. 9, the apparatus 900 may include:
a first obtaining module 901, configured to obtain multiple image frames in a target video;
a first determining module 902, configured to determine, for each image frame, salient object feature information of the image frame, and determine, according to the salient object feature information, predicted click rate information of a user on the target video if the image frame is used as a cover image of the target video;
a second determining module 903, configured to determine, according to the predicted click rate information, a target cover image of the target video from the multiple image frames.
Optionally, the first determining module 902 may include: the extraction submodule is used for extracting the image overall characteristic information of the image frame, wherein each element in the image overall characteristic information corresponds to the characteristic information of a preset number of pixel points at a specified position in the image frame; the first determining submodule is used for determining the characteristic information of the salient objects of the image frame according to the overall characteristic information of the image, wherein each element in the characteristic information of the salient objects is used for representing the credibility that the positions of a preset number of pixel points at a specified position in the image frame are the salient objects.
Optionally, the first determining module 902 may include: the second determining submodule is used for determining the image salient feature information of the image frame according to the image overall feature information of the image frame and the salient object feature information; the fusion submodule is used for carrying out feature fusion according to the image overall feature information and the image saliency feature information to obtain fusion feature information; and the third determining submodule is used for determining the predicted click rate information according to the fusion characteristic information.
Optionally, the fusion submodule may include: the fourth determining submodule is used for determining a first weight of the image overall characteristic information and a second weight of the image saliency characteristic information according to the image overall characteristic information and the image saliency characteristic information; and the fifth determining submodule is used for obtaining the fusion characteristic information according to the first weight, the second weight, the image overall characteristic information and the image saliency characteristic information.
Optionally, the predicted click rate information is obtained by processing the image frame with a video cover determination model, which determines the salient object feature information of the image frame and then determines the predicted click rate information according to the salient object feature information. The video cover determination model is obtained by training with a training device for the video cover determination model, and the training device includes: a second acquisition module, configured to obtain at least one set of training data from a training set, wherein each set of training data includes a first historical cover image of a first historical video, first historical actual click rate information of users on the first historical video, a second historical cover image of a second historical video, and second historical actual click rate information of users on the second historical video; a third determining module, configured to, for each set of training data, respectively use the first historical cover image and the second historical cover image included in that set as input to the model, respectively obtain first target feature information of the first historical cover image output after the model processes the first historical cover image and second target feature information of the second historical cover image output after the model processes the second historical cover image, and determine a loss function value corresponding to that set of training data according to the first target feature information, the second target feature information, the first historical actual click rate information, the second historical actual click rate information and a preset loss function; a parameter updating module, configured to determine a target loss function value according to the loss function values respectively corresponding to the at least one set of training data, and update the parameters of the model according to the target loss function value; a triggering module, configured to, when the amount of training data used from the training set has not reached a preset threshold, trigger the second acquisition module to obtain at least one set of training data from the training set and the parameter updating module to determine a target loss function value according to the loss function values respectively corresponding to the at least one set of training data and update the parameters of the model accordingly; a fourth determining module, configured to determine, through a validation set, whether training of the model is complete when the amount of training data used from the training set reaches the preset threshold; and a model acquisition module, configured to obtain the video cover determination model in response to completion of model training.
Optionally, the third determining module includes: a prediction difference determination sub-module for determining model prediction difference information between the first target feature information and the second target feature information; the target difference value determining submodule is used for carrying out normalization processing according to the first historical actual click rate information and the second historical actual click rate information to determine target click rate difference value information; and the loss function value determining submodule is used for determining the loss function value corresponding to the training data according to the model prediction difference value information, the target click rate difference value information and the preset loss function.
Optionally, the target difference determination sub-module is configured to determine the target click rate difference information according to a preset function and the ratio of the larger to the smaller of the first historical actual click rate information and the second historical actual click rate information.
Referring now to FIG. 10, a block diagram of an electronic device 1000 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 10, the electronic device 1000 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 1001 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)1002 or a program loaded from a storage means 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the electronic apparatus 1000 are also stored. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Generally, the following devices may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 1007 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 1008 including, for example, magnetic tape, hard disk, and the like; and a communication device 1009. The communication device 1009 may allow the electronic device 1000 to communicate with other devices wirelessly or by wire to exchange data. While fig. 10 illustrates an electronic device 1000 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 1009, or installed from the storage means 1008, or installed from the ROM 1002. The computer program, when executed by the processing device 1001, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the client and the server may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a plurality of image frames in a target video; determining the characteristic information of a salient object of each image frame, and determining the predicted click rate information of a user on the target video if the image frame is taken as a cover image of the target video according to the characteristic information of the salient object; determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or by hardware. In some cases, the name of a module does not constitute a limitation of the module itself; for example, the first acquisition module may also be described as an "image frame acquisition module".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a video cover determination method, according to one or more embodiments of the present disclosure, the method including: acquiring a plurality of image frames in a target video; determining the characteristic information of a salient object of each image frame, and determining the predicted click rate information of a user on the target video if the image frame is taken as a cover image of the target video according to the characteristic information of the salient object; determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information.
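For illustration only, the following is a minimal Python sketch of the flow of Example 1. The uniform frame sampling via OpenCV, the number of sampled frames, and the cover_model callable (anything that maps a frame to a predicted click rate) are assumptions introduced for this sketch rather than requirements of the disclosure; the model sketched after Examples 2 to 4 is one possible choice of cover_model.

```python
# Hypothetical sketch: sample frames from the target video, score each frame
# with a click-rate model, and keep the highest-scoring frame as the cover.
import cv2
import numpy as np

def sample_frames(video_path, num_frames=8):
    """Uniformly sample a plurality of image frames from the target video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def select_cover(video_path, cover_model, num_frames=8):
    """Pick, as the target cover image, the frame with the highest predicted click rate."""
    frames = sample_frames(video_path, num_frames)
    if not frames:
        raise ValueError("no frames could be read from the target video")
    scores = [float(cover_model(frame)) for frame in frames]  # predicted click rate per frame
    best = int(np.argmax(scores))
    return frames[best], scores[best]
```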
Example 2 provides the method of example 1, the determining salient object feature information for the image frame, comprising: extracting image overall characteristic information of the image frame, wherein each element in the image overall characteristic information corresponds to the characteristic information of a preset number of pixel points at a specified position in the image frame; and determining the characteristic information of the salient objects of the image frame according to the overall characteristic information of the image, wherein each element in the characteristic information of the salient objects is used for representing the credibility that the positions of a preset number of pixel points at the specified positions in the image frame are the salient objects.
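A sketch of Example 2 in Python (PyTorch). The backbone, here torchvision's ResNet-18 used without pretrained weights, is an assumption; the point is only that each spatial cell of its output feature map, i.e. the image overall feature information, corresponds to a fixed-size patch of pixels in the frame, and that a 1x1 convolution with a sigmoid turns that map into a per-cell confidence that the patch belongs to a salient object.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SalientFeatureExtractor(nn.Module):
    """Backbone feature map = image overall feature information; each spatial
    cell covers a patch of pixels. A 1x1 conv head yields, per cell, the
    confidence that the patch belongs to a salient object."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Keep the convolutional trunk, drop pooling and the classifier.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.saliency_head = nn.Conv2d(512, 1, kernel_size=1)

    def forward(self, frame):            # frame: (B, 3, H, W)
        overall = self.backbone(frame)   # image overall feature info: (B, 512, H/32, W/32)
        saliency = torch.sigmoid(self.saliency_head(overall))  # confidences in [0, 1]
        return overall, saliency
```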
Example 3 provides the method of example 1, wherein determining, according to the salient object feature information, predicted click rate information of a user on the target video if the image frame is taken as a cover image of the target video comprises: determining image salient feature information of the image frame according to the image overall feature information of the image frame and the salient object feature information; performing feature fusion according to the image overall feature information and the image saliency feature information to obtain fusion feature information; and determining the predicted click rate information according to the fusion characteristic information.
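A sketch of Example 3 under the same assumptions. The saliency confidences mask the overall feature map to give the image salient feature information; the two maps are then fused, pooled and mapped to a predicted click rate. Plain element-wise addition stands in for the fusion step here; the weighted fusion of Example 4 is sketched after that example.

```python
import torch
import torch.nn as nn

class ClickRatePredictor(nn.Module):
    """Mask the overall features with the saliency confidences, fuse the two
    maps, and regress a predicted click rate in [0, 1]."""

    def __init__(self, channels=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Sequential(nn.Linear(channels, 1), nn.Sigmoid())

    def forward(self, overall, saliency):
        # overall: (B, C, H, W); saliency: (B, 1, H, W)
        salient = overall * saliency          # image salient feature information
        fused = overall + salient             # simple stand-in for feature fusion
        pooled = self.pool(fused).flatten(1)  # (B, C)
        return self.head(pooled)              # predicted click rate information
```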
Example 4 provides the method of example 3, and the performing feature fusion according to the image overall feature information and the image saliency feature information to obtain fused feature information includes: determining a first weight of the image overall characteristic information and a second weight of the image significance characteristic information according to the image overall characteristic information and the image significance characteristic information; and obtaining the fusion characteristic information according to the first weight, the second weight, the image overall characteristic information and the image significance characteristic information.
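A sketch of the weighted fusion of Example 4. The disclosure says only that the first weight (for the image overall features) and the second weight (for the image salient features) are determined from the two feature maps themselves; deriving them with a small 1x1-convolution gate followed by a softmax is an assumption of this sketch. An instance of this module could replace the plain addition used in the previous sketch.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Determine a first weight for the image overall features and a second
    weight for the image salient features, then combine the two maps."""

    def __init__(self, channels=512):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, 2, kernel_size=1),
            nn.Softmax(dim=1),               # two weights summing to 1 per cell
        )

    def forward(self, overall, salient):
        weights = self.gate(torch.cat([overall, salient], dim=1))
        w_overall = weights[:, 0:1]          # first weight
        w_salient = weights[:, 1:2]          # second weight
        return w_overall * overall + w_salient * salient   # fusion feature information
```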
Example 5 provides the method of example 1, wherein the predicted click rate information is obtained by processing the image frame through a video cover determination model, and the video cover determination model determines the salient object feature information of the image frame and determines the predicted click rate information according to the salient object feature information, wherein the video cover determination model is trained by: acquiring at least one group of training data from a training set, wherein each group of training data comprises a first historical cover image of a first historical video, first historical actual click rate information of a user on the first historical video, a second historical cover image of a second historical video and second historical actual click rate information of the user on the second historical video; for each group of training data, respectively taking the first historical cover image and the second historical cover image included in the group of training data as input of a model, and respectively acquiring first target feature information of the first historical cover image output after the model processes the first historical cover image and second target feature information of the second historical cover image output after the model processes the second historical cover image; determining a loss function value corresponding to the set of training data according to the first target feature information, the second target feature information, the first historical actual click rate information, the second historical actual click rate information and a preset loss function; determining a target loss function value according to the loss function values corresponding to the at least one group of training data, and updating the parameters of the model according to the target loss function values; under the condition that the used quantity of the training data in the training set does not reach a preset threshold value, re-executing the step of obtaining at least one group of training data from the training set to the step of determining a target loss function value according to the loss function values respectively corresponding to the at least one group of training data, and updating the parameters of the model according to the target loss function values; under the condition that the used quantity of the training data in the training set reaches the preset threshold, determining, through a verification set, whether training of the model is completed; and in response to completion of the model training, obtaining the video cover determination model.
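A sketch of one training pass of Example 5. The data loader is assumed to yield groups of (first historical cover image, first historical actual click rate, second historical cover image, second historical actual click rate); the choice of optimizer, the threshold on the amount of training data used, and the verification-set check are simplified away. The loss_fn argument is sketched after Example 7.

```python
import torch

def train_one_pass(model, loader, loss_fn, optimizer, device="cpu"):
    """Pairwise training: score both historical cover images, compare the
    predicted difference with the actual click-rate difference, and update.
    `model` is assumed to map a batch of images to predicted click rates."""
    model.train()
    for img_a, ctr_a, img_b, ctr_b in loader:
        img_a, img_b = img_a.to(device), img_b.to(device)
        ctr_a, ctr_b = ctr_a.to(device), ctr_b.to(device)

        score_a = model(img_a)   # first target feature information
        score_b = model(img_b)   # second target feature information

        # Per-group losses are averaged into the target loss for this batch.
        loss = loss_fn(score_a, score_b, ctr_a, ctr_b)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()         # update the parameters of the model
    return model
```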
Example 6 provides the method of example 5, and the determining, according to the first target feature information, the second target feature information, the first historical actual click rate information, the second historical actual click rate information, and a preset loss function, a loss function value corresponding to the set of training data includes: determining model prediction difference information between the first target feature information and the second target feature information; performing normalization processing according to the first historical actual click rate information and the second historical actual click rate information to determine target click rate difference information; and determining a loss function value corresponding to the set of training data according to the model prediction difference information, the target click rate difference information and the preset loss function.
Example 7 provides the method of example 6, wherein performing normalization processing according to the first historical actual click rate information and the second historical actual click rate information to determine target click rate difference information includes: and determining the target click rate difference value information according to the ratio information of the larger value and the smaller value in the first historical actual click rate information and the second historical actual click rate information and a preset function.
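A combined sketch of Examples 6 and 7. The model prediction difference is the gap between the two predicted scores, and the target click rate difference is derived from the ratio of the larger to the smaller actual click rate. Using the signed logarithm of that ratio as the "preset function" and mean squared error as the "preset loss function" are assumptions; the disclosure fixes neither choice.

```python
import torch
import torch.nn.functional as F

def target_ctr_difference(ctr_a, ctr_b, eps=1e-8):
    """Example 7 sketch: normalise the two actual click rates through the
    ratio of the larger to the smaller value, signed by which side is larger."""
    larger = torch.maximum(ctr_a, ctr_b)
    smaller = torch.minimum(ctr_a, ctr_b) + eps
    return torch.sign(ctr_a - ctr_b) * torch.log(larger / smaller)

def pairwise_loss(score_a, score_b, ctr_a, ctr_b):
    """Example 6 sketch: compare the model prediction difference with the
    normalised target click rate difference under a preset loss."""
    predicted_diff = score_a.reshape(-1) - score_b.reshape(-1)        # model prediction difference
    target_diff = target_ctr_difference(ctr_a.reshape(-1).float(),
                                        ctr_b.reshape(-1).float())   # target click rate difference
    return F.mse_loss(predicted_diff, target_diff)
```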
Example 8 provides, in accordance with one or more embodiments of the present disclosure, a video cover determination apparatus, the apparatus comprising: the first acquisition module is used for acquiring a plurality of image frames in a target video; the first determining module is used for determining the characteristic information of the salient objects of the image frames aiming at each image frame, and determining the predicted click rate information of the user on the target video if the image frames are used as cover images of the target video according to the characteristic information of the salient objects; and the second determining module is used for determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information.
Example 9 provides a computer readable medium having stored thereon a computer program that, when executed by a processing apparatus, performs the steps of the method of any of examples 1-7, in accordance with one or more embodiments of the present disclosure.
Example 10 provides, in accordance with one or more embodiments of the present disclosure, an electronic device comprising: a storage device having a computer program stored thereon; processing means for executing the computer program in the storage means to carry out the steps of the method of any of examples 1-7.
The foregoing description is merely an illustration of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by interchanging the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (10)

1. A method for video cover determination, the method comprising:
acquiring a plurality of image frames in a target video;
determining the characteristic information of a salient object of each image frame, and determining the predicted click rate information of a user on the target video if the image frame is taken as a cover image of the target video according to the characteristic information of the salient object;
determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information.
2. The method of claim 1, wherein said determining salient object feature information for the image frame comprises:
extracting image overall characteristic information of the image frame, wherein each element in the image overall characteristic information corresponds to the characteristic information of a preset number of pixel points at a specified position in the image frame;
and determining the characteristic information of the salient objects of the image frame according to the overall characteristic information of the image, wherein each element in the characteristic information of the salient objects is used for representing the credibility that the positions of a preset number of pixel points at the specified positions in the image frame are the salient objects.
3. The method of claim 1, wherein the determining, according to the salient object feature information, the predicted click rate information of the user on the target video if the image frame is taken as a cover image of the target video comprises:
determining image salient feature information of the image frame according to the image overall feature information of the image frame and the salient object feature information;
performing feature fusion according to the image overall feature information and the image saliency feature information to obtain fusion feature information;
and determining the predicted click rate information according to the fusion characteristic information.
4. The method according to claim 3, wherein the performing feature fusion according to the image overall feature information and the image saliency feature information to obtain fused feature information comprises:
determining a first weight of the image overall characteristic information and a second weight of the image significance characteristic information according to the image overall characteristic information and the image significance characteristic information;
and obtaining the fusion characteristic information according to the first weight, the second weight, the image overall characteristic information and the image significance characteristic information.
5. The method of claim 1, wherein the predicted click rate information is obtained by processing the image frame through a video cover determination model that determines the salient object feature information of the image frame and determines the predicted click rate information according to the salient object feature information,
wherein the video cover determination model is trained in the following way:
acquiring at least one group of training data from a training set, wherein each group of training data comprises a first historical cover image of a first historical video, first historical actual click rate information of a user on the first historical video, a second historical cover image of a second historical video and second historical actual click rate information of the user on the second historical video;
for each group of training data, respectively taking the first historical cover image and the second historical cover image included in the group of training data as input of a model, and respectively acquiring first target feature information of the first historical cover image output after the model processes the first historical cover image and second target feature information of the second historical cover image output after the model processes the second historical cover image; determining a loss function value corresponding to the set of training data according to the first target feature information, the second target feature information, the first historical actual click rate information, the second historical actual click rate information and a preset loss function;
determining a target loss function value according to the loss function values corresponding to the at least one group of training data, and updating the parameters of the model according to the target loss function values;
under the condition that the used quantity of the training data in the training set does not reach a preset threshold value, re-executing the step of obtaining at least one group of training data from the training set to the step of determining a target loss function value according to the loss function values respectively corresponding to the at least one group of training data, and updating the parameters of the model according to the target loss function values;
under the condition that the used quantity of the training data in the training set reaches the preset threshold, determining, through a verification set, whether training of the model is completed;
and in response to completion of the model training, obtaining the video cover determination model.
6. The method of claim 5, wherein determining the loss function value corresponding to the set of training data according to the first target feature information, the second target feature information, the first historical actual click rate information, the second historical actual click rate information, and a preset loss function comprises:
determining model prediction difference information between the first target feature information and the second target feature information;
performing normalization processing according to the first historical actual click rate information and the second historical actual click rate information to determine target click rate difference information;
and determining a loss function value corresponding to the set of training data according to the model prediction difference information, the target click rate difference information and the preset loss function.
7. The method of claim 6, wherein the performing normalization processing according to the first historical actual click rate information and the second historical actual click rate information to determine target click rate difference information comprises:
and determining the target click rate difference value information according to the ratio information of the larger value and the smaller value in the first historical actual click rate information and the second historical actual click rate information and a preset function.
8. A video cover determination apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a plurality of image frames in a target video;
the first determining module is used for determining the characteristic information of the salient objects of the image frames aiming at each image frame, and determining the predicted click rate information of the user on the target video if the image frames are used as cover images of the target video according to the characteristic information of the salient objects;
and the second determining module is used for determining a target cover image of the target video from the plurality of image frames according to the predicted click rate information.
9. A computer-readable medium, on which a computer program is stored, characterized in that the program, when being executed by processing means, carries out the steps of the method of any one of claims 1 to 7.
10. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1 to 7.
CN202110075978.4A 2021-01-20 2021-01-20 Video cover determining method, device, medium and equipment Active CN112800276B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110075978.4A CN112800276B (en) 2021-01-20 2021-01-20 Video cover determining method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN112800276A (en) 2021-05-14
CN112800276B (en) 2023-06-20

Family

ID=75810795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110075978.4A Active CN112800276B (en) 2021-01-20 2021-01-20 Video cover determining method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN112800276B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918656A (en) * 2017-11-17 2018-04-17 北京奇虎科技有限公司 Video front cover extracting method and device based on video title
CN108833942A (en) * 2018-06-28 2018-11-16 北京达佳互联信息技术有限公司 Video cover choosing method, device, computer equipment and storage medium
CN109165301A (en) * 2018-09-13 2019-01-08 北京字节跳动网络技术有限公司 Video cover selection method, device and computer readable storage medium
CN111491202A (en) * 2019-01-29 2020-08-04 广州市百果园信息技术有限公司 Video publishing method, device, equipment and storage medium
CN109862432A (en) * 2019-01-31 2019-06-07 厦门美图之家科技有限公司 Clicking rate prediction technique and device
CN110765882A (en) * 2019-09-25 2020-02-07 腾讯科技(深圳)有限公司 Video tag determination method, device, server and storage medium
CN110909594A (en) * 2019-10-12 2020-03-24 杭州电子科技大学 Video significance detection method based on depth fusion
CN110796204A (en) * 2019-11-01 2020-02-14 腾讯科技(深圳)有限公司 Video tag determination method and device and server
CN111984821A (en) * 2020-06-22 2020-11-24 汉海信息技术(上海)有限公司 Method and device for determining dynamic cover of video, storage medium and electronic equipment
CN111918130A (en) * 2020-08-11 2020-11-10 北京达佳互联信息技术有限公司 Video cover determining method and device, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343832A (en) * 2021-06-01 2021-09-03 北京奇艺世纪科技有限公司 Video cover judging method, device, equipment and computer readable medium
CN113343832B (en) * 2021-06-01 2024-02-02 北京奇艺世纪科技有限公司 Video cover distinguishing method, device, equipment and computer readable medium
CN113821678A (en) * 2021-07-21 2021-12-21 腾讯科技(深圳)有限公司 Video cover determining method and device
CN113821678B (en) * 2021-07-21 2024-04-12 腾讯科技(深圳)有限公司 Method and device for determining video cover

Also Published As

Publication number Publication date
CN112800276B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN109740018B (en) Method and device for generating video label model
CN109829432B (en) Method and apparatus for generating information
CN110021052B (en) Method and apparatus for generating fundus image generation model
US11514263B2 (en) Method and apparatus for processing image
CN110059623B (en) Method and apparatus for generating information
CN111459364B (en) Icon updating method and device and electronic equipment
CN111784712B (en) Image processing method, device, equipment and computer readable medium
CN110991373A (en) Image processing method, image processing apparatus, electronic device, and medium
CN112381717A (en) Image processing method, model training method, device, medium, and apparatus
CN113449070A (en) Multimodal data retrieval method, device, medium and electronic equipment
CN112800276B (en) Video cover determining method, device, medium and equipment
CN110211017B (en) Image processing method and device and electronic equipment
CN112330788A (en) Image processing method, image processing device, readable medium and electronic equipment
CN110008926B (en) Method and device for identifying age
CN109816023B (en) Method and device for generating picture label model
CN113256339B (en) Resource release method and device, storage medium and electronic equipment
CN112907628A (en) Video target tracking method and device, storage medium and electronic equipment
CN115757933A (en) Recommendation information generation method, device, equipment, medium and program product
CN111737575B (en) Content distribution method, content distribution device, readable medium and electronic equipment
CN115690845A (en) Motion trail prediction method and device
CN113220922B (en) Image searching method and device and electronic equipment
CN111680754B (en) Image classification method, device, electronic equipment and computer readable storage medium
CN113705386A (en) Video classification method and device, readable medium and electronic equipment
CN113222050A (en) Image classification method and device, readable medium and electronic equipment
CN112418233A (en) Image processing method, image processing device, readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant