CN116932788A - Cover image extraction method, device, equipment and computer storage medium - Google Patents


Info

Publication number
CN116932788A
Authority
CN
China
Prior art keywords: image, evaluation, candidate, resource, target
Legal status
Pending
Application number
CN202210337415.2A
Other languages
Chinese (zh)
Inventor
高洵
罗文寒
徐鲁辉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210337415.2A
Publication of CN116932788A


Classifications

    • G06F16/437 Administration of user profiles, e.g. generation, initialisation, adaptation, distribution (information retrieval of multimedia data)
    • G06F16/35 Clustering; Classification (information retrieval of unstructured textual data)
    • G06F16/45 Clustering; Classification (information retrieval of multimedia data)
    • G06F16/5846 Retrieval characterised by using metadata automatically derived from the content, using extracted text (information retrieval of still image data)


Abstract

The application discloses a cover image extraction method, device, equipment and computer storage medium, which can be applied to scenarios such as cloud technology, artificial intelligence, intelligent traffic, assisted driving, maps and the like. A plurality of candidate images are extracted from a target multimedia resource, and a corresponding image description text is generated for each candidate image. The resource contents of the various media forms included in the target multimedia resource, together with the image description texts, are input into a trained classification evaluation model to obtain the target resource type of the target multimedia resource and a model evaluation result for each candidate image. The model evaluation results are then updated based on a cover evaluation rule determined by the target resource type, yielding a target evaluation result for each candidate image, and the cover image is determined from the candidate images based on these target evaluation results.

Description

Cover image extraction method, device, equipment and computer storage medium
Technical Field
The application relates to the field of computer technology, and in particular to the field of artificial intelligence, and provides a cover image extraction method, device, equipment and computer storage medium.
Background
With the development of network technology, multimedia resources can be conveniently obtained from the network; for example, a search can be conducted on a resource provisioning platform to obtain a multimedia resource containing multiple images. When a multimedia resource containing multiple images is presented, its cover image is typically presented first, so the content shown by the cover image directly determines the first impression of the multimedia resource. For example, if the multimedia resource is a video or an atlas, information such as the content and style of the video can be perceived intuitively through its cover image, which to some extent determines whether the video is clicked and watched. It follows that the selection of cover images for multimedia resources such as videos and atlases is important.
Currently, when selecting a cover image, the images are generally evaluated under a uniform standard and then sorted and selected; for example, the subject information of each image is analyzed, an image evaluation value is obtained based on the subject's position, sharpness, picture quality and other information, and finally a cover image is selected based on the evaluation values.
However, this selection manner adopts a uniform evaluation standard for all multimedia resources, and in an actual scene, the uniform evaluation standard may not be applicable to all multimedia resources, and thus, inaccurate selection of the cover image may occur due to inaccurate image evaluation.
Disclosure of Invention
The embodiment of the application provides a cover image extraction method, device and equipment and a computer storage medium, which are used for improving the accuracy of an extracted cover image.
In one aspect, there is provided a cover image extraction method, the method including:
extracting a plurality of candidate images from each original image included in the target multimedia resource, and respectively generating corresponding image description texts based on the image content of each candidate image;
inputting the resource contents of the various media forms included in the target multimedia resource and each image description text into a trained classification evaluation model to respectively obtain the target resource type of the target multimedia resource and a model evaluation result for each candidate image, wherein each model evaluation result represents the degree to which the corresponding candidate image is recommended as the cover image;
determining a cover evaluation rule corresponding to the target multimedia resource based on the target resource type, and respectively updating each obtained model evaluation result based on the cover evaluation rule to obtain a target evaluation result for each candidate image;
and determining a cover image from the candidate images based on the target evaluation results of the candidate images.
In one aspect, there is provided a cover image extraction apparatus, the apparatus including:
an image extraction unit for extracting a plurality of candidate images from each original image included in the target multimedia resource;
an image description unit for generating corresponding image description text based on the image content of each candidate image;
the classification evaluation unit is used for inputting the resource contents of the multiple media forms included in the target multimedia resource and the image description texts into the trained classification evaluation model to respectively obtain the target resource type of the target multimedia resource and the model evaluation result of each candidate image, wherein each model evaluation result represents the degree to which the corresponding candidate image is recommended as the cover image;
the image selecting unit is used for determining a cover evaluation rule corresponding to the target multimedia resource based on the target resource type, and respectively updating the obtained model evaluation results based on the cover evaluation rule to obtain target evaluation results of the candidate images; and determining a cover image from the respective candidate images based on the target evaluation results of the respective candidate images.
Optionally, the image selecting unit is specifically configured to:
for each candidate image, the following steps are respectively executed:
for one candidate image, performing subject detection on that candidate image to obtain the target area where the image subject in that candidate image is located;
and obtaining a subject evaluation result of the one candidate image according to the overlapping degree between the target area and the central area of the one candidate image.
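For illustration only (this sketch is not part of the original disclosure), the overlap between the detected subject area and the central area of a candidate image could be scored as follows; the box format, the size of the central region and the normalisation by the subject area are all assumptions:

```python
def center_overlap_score(subject_box, image_w, image_h, center_ratio=0.5):
    """Score how strongly a detected subject box overlaps the image's central region.

    subject_box: (x1, y1, x2, y2) returned by some subject-detection step (assumed format).
    center_ratio: the central region is assumed to be a box of this relative size,
    centred in the image; both choices are illustrative, not taken from the patent.
    """
    cw, ch = image_w * center_ratio, image_h * center_ratio
    cx1, cy1 = (image_w - cw) / 2.0, (image_h - ch) / 2.0
    cx2, cy2 = cx1 + cw, cy1 + ch

    x1, y1, x2, y2 = subject_box
    ix1, iy1 = max(x1, cx1), max(y1, cy1)
    ix2, iy2 = min(x2, cx2), min(y2, cy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    subject_area = max(1e-6, (x2 - x1) * (y2 - y1))
    return inter / subject_area  # 1.0 when the subject lies fully inside the central region
```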
Optionally, each model evaluation result includes a recommendation degree of the corresponding candidate image with respect to each resource type; the image selecting unit is specifically configured to:
based on the target resource types, determining weights corresponding to the resource types in the cover evaluation rule;
and determining target evaluation results of the candidate images based on the recommendation degree of the candidate images relative to the resource types and the corresponding weights.
Optionally, the resource content includes resource description text of the original images and the target multimedia resource; the classification evaluation unit is specifically configured to:
extracting corresponding original image features from each original image respectively;
extracting first text features from the resource description texts, and extracting corresponding second text features from the image description texts respectively;
performing text feature fusion on the obtained first text features and each second text feature to obtain fused text features, and performing image feature fusion on each obtained original image feature to obtain fused image features;
and predicting the target resource type and each model evaluation result based on the fused text features and the fused image features.
Optionally, the classification evaluation model includes a text encoder and an image encoder, the text encoder and the image encoder sharing weight parameters; the classification evaluation unit is specifically configured to:
adopting the text encoder to perform text feature encoding on the obtained first text feature and each second text feature based on the weight parameter to obtain the fused text feature;
and adopting the image encoder to perform image feature encoding on each obtained original image feature based on the weight parameters to obtain the fused image features.
Optionally, the image extraction unit is specifically configured to:
extracting N primary selected images from each original image by adopting an equidistant extraction mode;
determining M primary selected images meeting the index conditions from the N primary selected images based on at least one basic image index of the N primary selected images respectively, wherein M is smaller than N;
clustering the M primary selected images based on the image similarity among the M primary selected images to obtain L class clusters, wherein each class cluster comprises at least one primary selected image with the similarity meeting the condition;
for the L class clusters, respectively selecting the primary selected image with the highest stillness in each class cluster to obtain L primary selected images, wherein L is smaller than M;
and selecting the candidate images based on the basic image indexes of the L primary selected images.
Optionally, the image description unit is specifically configured to:
for each candidate image, the following steps are respectively executed:
image coding is carried out on one candidate image based on an image coding part of an image description model to obtain coding characteristics corresponding to the one candidate image;
decoding the coding features based on a plurality of serialization decoding units included in the image description model to obtain a plurality of description words; each serialization decoding unit predicts based on the coding feature and the descriptive word output by the previous serialization decoding unit to obtain a corresponding descriptive word;
and obtaining the image description text based on the descriptive words.
Optionally, the apparatus further comprises a model training unit, configured to train to obtain the classification evaluation model through the following process:
acquiring a plurality of sample multimedia resources, and respectively acquiring corresponding resource classification labels aiming at each sample multimedia resource;
extracting a plurality of candidate images from the original images of each sample multimedia resource respectively, and acquiring a corresponding image evaluation grade for each candidate image, wherein each image evaluation grade represents the degree to which the corresponding candidate image, as determined based on prior knowledge, is recommended as the cover image;
and constructing a plurality of training samples based on the obtained sample multimedia resources, the resource classification labels and the image evaluation grades, and training the classification evaluation model based on the plurality of training samples.
Optionally, the model training unit is specifically configured to:
forward predicting each sample multimedia resource by adopting the classification evaluation model to obtain the prediction type of each sample multimedia resource and the prediction evaluation result of each candidate image;
constructing a classification loss function based on the obtained differences between the respective prediction types and the corresponding resource classification labels, and constructing an evaluation loss function based on the obtained differences between the prediction evaluation results and the corresponding image evaluation grades;
and carrying out parameter adjustment on the classification evaluation model based on the classification loss function and the evaluation loss function.
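As an illustration only, a joint objective of this kind could be sketched as below; treating both sub-tasks as cross-entropy classification and using a simple weighted sum of the two losses are assumptions, not details taken from the disclosure:

```python
import torch.nn.functional as F

def joint_loss(type_logits, type_labels, eval_logits, eval_grades, eval_weight=1.0):
    """Combine a resource-type classification loss with a candidate-image evaluation loss.

    type_logits: (batch, num_types); type_labels: (batch,)
    eval_logits: (num_candidates, num_grades); eval_grades: (num_candidates,)
    Cross-entropy for both terms and the 1 : eval_weight weighting are assumptions.
    """
    classification_loss = F.cross_entropy(type_logits, type_labels)
    evaluation_loss = F.cross_entropy(eval_logits, eval_grades)
    return classification_loss + eval_weight * evaluation_loss
```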
In one aspect, a computer device is provided comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods described above when the computer program is executed.
In one aspect, there is provided a computer storage medium having stored thereon computer program instructions which, when executed by a processor, perform the steps of any of the methods described above.
In one aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the steps of any of the methods described above.
In the embodiment of the application, when the cover image is extracted from the target multimedia resource, the resource type of the target multimedia resource is accurately predicted from the resource content in various media forms and the image description texts of the candidate images, and at the same time the classification evaluation model evaluates the candidate images and outputs corresponding model evaluation results. After the model evaluation results are updated according to the resource type, more accurate target evaluation results are obtained, and the cover image is then selected according to the target evaluation results.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings required in the description of the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained from the provided drawings without inventive effort by those skilled in the art.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 2 is a flowchart of a cover image extraction method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an image description result according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a data processing of a classification evaluation model according to an embodiment of the present application;
fig. 5 is a schematic flow chart of extracting candidate images according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of image description of candidate images according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an image description model according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a training process of a classification evaluation model according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a classification evaluation model according to an embodiment of the present application;
FIG. 10 is a schematic flow chart of resource classification and image evaluation according to an embodiment of the present application;
FIG. 11 is a schematic flow chart of obtaining a target evaluation result according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a subject evaluation process performed by a subject evaluation model according to an embodiment of the present application;
FIG. 13 is a schematic flow chart of another embodiment of the present application for obtaining a target evaluation result;
FIG. 14 is a schematic diagram of a cover image extracting apparatus according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of another electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application. Embodiments of the application and features of the embodiments may be combined with one another arbitrarily without conflict. Also, while a logical order is depicted in the flowchart, in some cases, the steps depicted or described may be performed in a different order than presented herein.
In order to facilitate understanding of the technical solution provided by the embodiments of the present application, some key terms used in the embodiments of the present application are explained here:
multimedia resources: the multimedia resource according to the embodiment of the present application refers to a multimedia resource containing a plurality of images, and may be, for example, a video or an atlas.
Image description (Image Captioning): the task is, given an image, to have a machine perceive the objects in the image, even capture the relationships within the picture, and finally generate a sentence describing it, similar to the human process of "describing what one sees". In this process, the machine is not only required to detect the objects in the image, but also to understand the interrelationships between the objects and finally express them in reasonable language, which is very challenging for a machine.
Media form: also called modality. In the embodiment of the present application, each media form of information may be referred to as a modality. For a video, its media forms may include audio, images and text; for an atlas, its media forms may include images and text.
Multi-modal multi-task machine learning (Multi-Modal Multi-Task Machine Learning): refers to the ability to process and understand multi-modal information and to predict multiple tasks through a machine learning approach. In the embodiment of the application, the multi-modal information mainly involves the texts and images of multimedia resources; the goal is to process and understand this multi-modal information, using the complementarity between modalities to eliminate redundancy between them, so as to learn better feature representations for the prediction of subsequent sub-tasks, which here mainly consist of a classification sub-task and an image evaluation sub-task based on the learned feature representation.
Subject detection: an image typically includes an image subject and an image background, and the image subject is usually the most interesting content of the image; for example, when a face or a human body is included in an image and occupies a large area of it, the face or body may be the image subject. Subject detection needs to identify the image subject from the image and output related information about it, such as its type and position.
The technical scheme of the embodiment of the application relates to artificial intelligence (Artificial Intelligence, AI) and machine learning (Machine Learning, ML) technologies. Artificial intelligence is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, enabling the machines to perceive, reason and make decisions.
Artificial intelligence technology is a comprehensive subject covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation and other directions.
Computer Vision (CV) is the science of how to make machines "see"; more specifically, it means using cameras and computers instead of human eyes to perform machine vision tasks such as recognition and measurement on targets, and further performing graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question answering, knowledge graph techniques and the like.
Machine learning is a multi-domain interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behaviors to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way of giving computers intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching learning.
With the research and progress of artificial intelligence technology, it has been researched and applied in many fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart medical care, smart customer service, the Internet of Vehicles and intelligent transportation. It is believed that with the development of technology, artificial intelligence technology will be applied in more fields and be of increasing value.
Machine learning is the core of artificial intelligence and the fundamental way of giving computers intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning and inductive learning. An artificial neural network (Artificial Neural Network, ANN) abstracts the human brain's neural network from the viewpoint of information processing, builds a simple model, and forms different networks according to different connection modes. A neural network is an operational model formed by a large number of interconnected nodes (or neurons); each node represents a specific output function, called an activation function, and each connection between two nodes carries a weight for the signal passing through it, which is equivalent to the memory of the artificial neural network. The output of the network differs according to the connection mode, the weight values and the activation functions, and the network itself is usually an approximation of some algorithm or function in nature, or may be an expression of a logical strategy.
In the embodiment of the application, the prediction of the resource type and the model evaluation of the candidate images are performed based on the resource content in various media forms of the multimedia resource and the image description texts of the candidate images, using an artificial neural network model based on deep learning. That is, the embodiment of the application obtains a classification evaluation model through machine learning: based on the machine-learning ability to process and understand the resource content of the various media forms of the video, the classification evaluation model learns a better resource feature representation and executes the classification sub-task and the evaluation sub-task based on that feature representation. Because multimedia resources involve resource content in various media forms, related technologies such as computer vision and natural language processing are involved in the model training and application processes.
Specifically, the classification evaluation model and the like related to the embodiment of the application can be divided into two parts, including a training part and an application part. The training part relates to the technical field of machine learning, and in the training part, an artificial neural network model (namely models such as a classification evaluation model mentioned later) is trained through the machine learning technology, so that the artificial neural network model is trained based on the resource content of various media forms of each sample multimedia resource, and model parameters are continuously adjusted through an optimization algorithm until the model converges; the application part is used for predicting the resource type of the target multimedia resource of the cover image to be extracted and evaluating the candidate image by using the artificial neural network model obtained by training in the training part so as to assist in selecting the cover image. In addition, it should be noted that the artificial neural network model in the embodiment of the present application may be online training or offline training, which is not limited herein. This is illustrated herein by way of example with offline training.
The following briefly describes the design concept of the embodiment of the present application:
At present, when selecting cover images, the images are generally evaluated under a uniform standard and then sorted and selected. This selection manner applies uniform evaluation criteria to all multimedia resources, and in an actual scenario the uniform criteria may not be applicable to all multimedia resources, so inaccurate cover image selection may occur due to inaccurate image evaluation. In addition, in the related art, evaluation and judgment are generally performed based on manually set standards, which are highly subjective, poorly interpretable, and ignore the differences between multimedia resources of different categories in terms of image selection criteria.
In view of this, the embodiment of the application provides a cover image extraction method. In this method, when a cover image is extracted from a target multimedia resource, the resource type of the target multimedia resource is accurately predicted from the resource content in various media forms and the image description texts of the candidate images, while the classification evaluation model evaluates the candidate images and outputs corresponding model evaluation results; after the model evaluation results are updated according to the resource type, more accurate target evaluation results are obtained, and a cover image is then selected according to the target evaluation results.
Specifically, the embodiment of the application uses a deep learning method to classify the multimedia resource based on the resource content of its various media forms so as to accurately obtain the resource type, and the evaluation dimensions are then weighted with an emphasis determined by that resource type, so that image evaluation can be performed accurately once the evaluation standard has been customized for the resource type.
Some brief descriptions of application scenarios to which the technical solution of the embodiment of the present application is applicable are given below. It should be noted that the application scenarios described below are only used to illustrate the embodiments of the present application and are not limiting. In the specific implementation process, the technical solution provided by the embodiment of the present application can be applied flexibly according to actual needs.
The scheme provided by the embodiment of the application can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent traffic, assisted driving and the like, and is particularly suitable for scenes related to extracting theme images of multimedia resources, such as a cover image extraction scene, a cover image recommendation scene and an image material library construction scene, which are not listed one by one. Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application, which may include a terminal device 101 and a server 102. The terminal device 101 in the embodiment of the present application may be provided with a client related to cover image extraction, and the server 102 may include a server related to cover image extraction. In addition, the client in the present application may be software, a web page, an applet, etc., and the server is a background server corresponding to the software, web page or applet, or a server dedicated to image processing, model training, etc., which is not particularly limited in the present application.
Note that, the cover image extraction method in the embodiment of the present application may be executed by the terminal device 101 or the server 102 alone, or may be executed by the server 102 and the terminal device 101 together. For example, the server 102 extracts a plurality of candidate images, inputs the resource contents in a plurality of media forms and the image description text of the candidate images into a classification evaluation model, accurately predicts the resource type of the target multimedia resource, and simultaneously evaluates the candidate images to output corresponding model evaluation results, so that the server 102 obtains more accurate target evaluation results after updating the model evaluation results according to the resource type, and further selects the cover image according to the target evaluation results. Alternatively, the above steps are performed by the terminal device 101. Or, the server 102 obtains the resource type and the model evaluation result of each candidate image based on the above steps, and then the terminal device updates each model evaluation result based on the resource type, and then selects a cover image according to the target evaluation result, and presents the selected cover image through the client, which is not limited herein, and the server 102 is mainly taken as an example for illustration.
Taking the example of the server 102 performing the steps described above, the server 102 may include one or more processors 1021, memory 1022, and I/O interfaces 1023 for interaction with terminals, etc. In addition, the server 102 may further configure a database 1024, and the database 1024 may be used to store the resource content of the sample multimedia resource, the model parameters obtained by training, and the like. The memory 1022 of the server 102 may further store program instructions of the cover image extraction method provided in the embodiment of the present application, where the program instructions, when executed by the processor 1021, can be used to implement steps of the cover image extraction method provided in the embodiment of the present application, so as to implement a cover image extraction process.
In a possible implementation manner, the method of the embodiment of the present application may be applied to a cover image extraction scene, and then, for a multimedia resource of a cover image to be extracted, based on the cover image extraction method provided by the embodiment of the present application, candidate images are extracted from the multimedia resource, and image description text is generated based on image content of each candidate image, and is input into a classification evaluation model provided by the embodiment of the present application, so as to obtain a resource type of the multimedia resource and a model evaluation result of each candidate image, and further, the model evaluation result is updated through the resource type, so that the cover image is extracted based on the obtained target evaluation result, and an effect of accurately selecting the cover image based on the resource type is achieved.
In a possible implementation manner, the method of the embodiment of the present application may be applied to a cover image recommendation scenario, and then the multimedia resource of the cover image to be extracted may be uploaded by a client on the terminal device 101, and the server 102 is requested to extract the cover image of the multimedia resource by using the cover image extraction method provided by the embodiment of the present application, and the selected cover image is pushed as a recommendation object to the client for displaying, so as to assist in selecting the cover image of the multimedia resource.
Taking a multimedia resource that is a video as an example, when uploading a new video it is generally necessary to select a cover image for the video. The method provided by the embodiment of the present application may then be used to extract the cover image of the video and use it directly as the video's cover, or the extracted image may be recommended to the uploader, who may choose to set it as the cover image of the video. Cover image extraction or recommendation for video may be applied to a video platform or a short-video platform. In addition, the extracted cover images are usually images of better quality within the video, so the cover image extraction method provided by the embodiment of the application can also be used as a way of extracting image material to expand an image database, and the same technical idea can be used to select diversified, personalized cover images tailored to different viewers.
In the embodiment of the present application, the terminal device 101 may be, for example, a mobile phone, a tablet personal computer (PAD), a notebook computer, a desktop computer, an intelligent home appliance, an intelligent vehicle-mounted device, an intelligent wearable device, an intelligent voice interaction device, an aircraft, and the like. The server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms, but is not limited thereto.
In the embodiment of the present application, the terminal device 101 and the server 102 may be directly or indirectly connected through one or more networks 103. The network 103 may be a wired network, or may be a Wireless network, for example, a mobile cellular network, or may be a Wireless-Fidelity (WIFI) network, or may be other possible networks, which are not limited in this embodiment of the present application.
It should be noted that, the number of terminal devices and servers shown in fig. 1 is merely illustrative, and the number of terminal devices and servers is not limited in practice, and is not particularly limited in the embodiment of the present application.
In one possible application scenario, related data (such as resource content, feature vectors, etc.) and model parameters involved in the embodiment of the present application may be stored using cloud storage technology. Cloud storage is a new concept extended and developed from the concept of cloud computing; a distributed cloud storage system refers to a storage system that, through functions such as cluster application, grid technology and a distributed storage file system, integrates a large number of storage devices (also called storage nodes) of different types in a network via application software or application interfaces to work cooperatively and jointly provide data storage and service access functions externally.
The cover image extraction method provided by the exemplary embodiment of the present application will be described below with reference to the accompanying drawings in conjunction with the above-described application scenario, and it should be noted that the above-described application scenario is only shown for the convenience of understanding the spirit and principle of the present application, and the embodiment of the present application is not limited in any way in this respect.
Referring to fig. 2, a flowchart of a cover image extraction method according to an embodiment of the present application is illustrated by taking a server as an execution body. The specific implementation flow of the method is as follows:
Step 201: extracting a plurality of candidate images from the original images included in the target multimedia resource, and generating a corresponding image description text based on the image content of each candidate image.
In the embodiment of the present application, the target multimedia resource refers to a resource containing a plurality of images, for example, the target multimedia resource may be a video or an album, the plurality of images contained in the target multimedia resource are called as original images, and the purpose of extracting the cover image is to select the image most suitable for the target multimedia resource from the original images.
In the embodiment of the application, the image description text is obtained by performing image description processing on a candidate image. Image description processing means analyzing the candidate image and then describing the image content expressed in it in text form, summarizing the content of the image in one sentence, namely "who", "where" and "doing what": "who" refers to the image subject, "where" expresses the environment or scene in the image, and "doing what" expresses the event occurring in the image. For example, referring to fig. 3, image A is input; following the way a person would describe what they see, two children and a snowman can be seen in the image, and the current scene is an outdoor snowfield, so an image description text such as "two children are building a snowman in the snow" can be obtained through the image description process.
Step 202: inputting the resource contents of the various media forms included in the target multimedia resource and each image description text into a trained classification evaluation model to respectively obtain the target resource type of the target multimedia resource and a model evaluation result for each candidate image, wherein each model evaluation result represents the degree to which the corresponding candidate image is recommended as the cover image.
In the embodiment of the application, the resource content in the various media forms may include content in any media form included in the target multimedia resource, for example, media forms such as text, images and audio; for the target multimedia resource, the resource content includes each original image and the resource description text of the resource.
Taking the target multimedia asset as an example of a video, the asset description text may include any possible text information associated with the video. Specifically, one or more of the following modes may be adopted to obtain the text information associated with the video:
(1) Associated text information is extracted from the descriptive text of the video, for example, the video title and the text information contained in the video profile.
(2) Associated text information is extracted from each video frame included in the video. Specifically, an image text recognition method may be adopted to extract the text contained in each video frame; for example, an OCR method may be used to extract lyrics, subtitles, product text and the like in the video frames, where product text refers to text (such as a product name, manufacturer, etc.) on a product presented in the video picture.
(3) Associated text information is extracted from the audio data included in the video. Specifically, a speech-to-text recognition method may be used to extract the text contained in the audio data; for example, automatic speech recognition (Automatic Speech Recognition, ASR) techniques may be employed to extract text from the audio included in the video (e.g., character dialogue in television episodes, lyrics in MVs, etc.).
In the embodiment of the application, the classification evaluation model adopts a multi-modal multi-task model architecture. As shown in fig. 4, which is a data processing schematic diagram of the classification evaluation model, feature extraction can be performed based on the resource contents of the multimedia resource (such as the resource description text, the original images and the extracted candidate images shown in fig. 4) and the image description texts to obtain the corresponding embeddings; the embeddings are encoded by two encoders (encoder1 and encoder2 in fig. 4), fused by a feature fusion layer (the concat layer in fig. 4), and then fed into the sub-task prediction layers (MLP1 and MLP2 in fig. 4) to predict the resource type of the multimedia resource and to evaluate whether each candidate image is suitable as the cover image of the multimedia resource. The classification evaluation model will be described in detail later, so it is not elaborated here.
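For illustration only, the two-encoder, two-head layout described for fig. 4 could be sketched as follows; the Transformer encoders, hidden sizes, mean pooling and the way candidate images are scored are all assumptions (the disclosure also mentions, as an option, that the text and image encoders may share weight parameters):

```python
import torch
import torch.nn as nn

class ClassifyAndEvaluateSketch(nn.Module):
    """Minimal sketch of the fig. 4 layout: text and image embeddings pass through two
    encoders (encoder1/encoder2), are concatenated, and feed two MLP heads
    (MLP1: resource type, MLP2: per-candidate evaluation). Dimensions are assumptions."""

    def __init__(self, dim=256, num_types=10, num_grades=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)   # encoder1
        self.image_encoder = nn.TransformerEncoder(layer, num_layers=2)  # encoder2
        self.type_head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                       nn.Linear(dim, num_types))        # MLP1
        self.eval_head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                       nn.Linear(dim, num_grades))       # MLP2

    def forward(self, text_emb, image_emb, candidate_emb):
        # text_emb: (B, T, dim), image_emb: (B, I, dim), candidate_emb: (B, K, dim)
        text_feat = self.text_encoder(text_emb).mean(dim=1)     # fused text feature
        image_feat = self.image_encoder(image_emb).mean(dim=1)  # fused image feature
        fused = torch.cat([text_feat, image_feat], dim=-1)      # concat fusion layer
        type_logits = self.type_head(fused)                     # resource-type prediction
        cand_feat = self.image_encoder(candidate_emb)           # per-candidate features (B, K, dim)
        ctx = text_feat.unsqueeze(1).expand_as(cand_feat)       # pair each candidate with the text context
        eval_logits = self.eval_head(torch.cat([cand_feat, ctx], dim=-1))
        return type_logits, eval_logits                         # (B, num_types), (B, K, num_grades)
```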
In one possible implementation, the resource type may be set according to the actual business scenario. For example, when the multimedia resource is a video, the resource type may be the first-level classification tag of the video, for example, categories such as TV drama, variety show, documentary, cartoon, children, life and news; when the multimedia resource is an atlas, it may be classified according to the topic of the atlas, for example, into categories such as flowers and plants, hand drawing, ancient style, photography, and parent-child. Of course, other classification schemes may be used, and the embodiments of the present application are not limited in this regard.
Step 203: determining the cover evaluation rule corresponding to the target multimedia resource based on the target resource type, and updating each obtained model evaluation result based on the cover evaluation rule to obtain the target evaluation result of each candidate image.
For multimedia resources of different resource types, cover selection tends to follow the preferences of the resource type to which the resource belongs, so different resource types may correspond to different emphases. Therefore, after model evaluation, the cover evaluation rule corresponding to the target multimedia resource can be determined from the predicted target resource type, and the model evaluation results are updated based on the cover evaluation rule to obtain the final target evaluation results, which serve as the reference basis for cover image selection.
Step 204: a cover image is determined from the respective candidate images based on the target evaluation results of the respective candidate images.
Specifically, the candidate images are sorted according to their target evaluation results, and the top-ranked candidate image or images are selected from the sorted candidate images as the cover image of the target multimedia resource.
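Combining steps 203 and 204, and assuming (as in the optional implementation described earlier) that each model evaluation result holds a recommendation degree per resource type and that the cover evaluation rule assigns a weight per resource type, the update and ranking could look like this illustrative sketch:

```python
def rank_candidates(model_scores, cover_rules, target_type, top_k=1):
    """model_scores: {candidate_id: {resource_type: recommendation_degree}}
    cover_rules:  {predicted_type: {resource_type: weight}}  -- the cover evaluation rule
    Both data layouts and the weighted-sum update are illustrative assumptions."""
    weights = cover_rules[target_type]
    target_scores = {
        cid: sum(weights.get(rtype, 0.0) * degree for rtype, degree in per_type.items())
        for cid, per_type in model_scores.items()
    }
    ranked = sorted(target_scores, key=target_scores.get, reverse=True)
    return ranked[:top_k]  # the top-ranked candidate image(s) become the cover
```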
In one possible implementation manner, the server may select a plurality of candidate images from the candidate images to recommend to the client in the terminal device, so that the final candidate image may be selected as the cover image of the target multimedia resource based on the selection in the client, thereby not only meeting the accuracy requirement of the cover image, but also selecting the cover image meeting the preference of the user.
In the embodiment of the application, when the cover image is extracted from the target multimedia resource, the resource type of the target multimedia resource is accurately predicted from the resource content in various media forms and the image description texts of the candidate images, while the classification evaluation model evaluates the candidate images and outputs corresponding model evaluation results; after each model evaluation result is updated according to the resource type, more accurate target evaluation results are obtained, and the cover image is then selected according to the target evaluation results.
In a possible implementation manner, the candidate image selecting part in step 201 may be implemented according to a schematic diagram shown in fig. 5, and fig. 5 is a schematic flow diagram of extracting a candidate image according to an embodiment of the present application, including the following steps:
step 2011: and performing frame extraction processing on each original image to obtain N primary selected images.
In the embodiment of the application, any one of the following frame extraction modes can be adopted:
(1) Equal interval extraction: the extraction frame interval is calculated according to the total number of original images, and then one primary selected image is extracted from the original images at every extraction frame interval.
Taking the extraction of 60 video frames as an example, for an input video segment the frame extraction strategy is: for a video within one minute, the frame extraction frequency is 1 fps, i.e. one frame is extracted per second; if the video is longer than one minute, the frame extraction interval is calculated so that 60 frames are sampled uniformly, which keeps the number of frames reasonable while covering the full length of the video as much as possible (see the sketch after this list).
(2) The random extraction mode is to randomly extract a specified number of primary images, such as 60 primary images, from the original images.
(3) For video, FFMPEG (a multimedia processing tool) may be employed to extract video frames of a particular length (e.g., 30 frames) in the video.
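As a sketch of strategy (1) only, the frame indices to extract could be computed as follows; the exact rounding behaviour and the handling of the 60-frame cap are assumptions:

```python
def select_frame_indices(total_frames, fps, max_frames=60):
    """Equal-interval sampling: roughly 1 frame per second for videos up to one minute,
    otherwise max_frames indices spread uniformly over the whole video."""
    duration_s = total_frames / fps
    n = int(duration_s) if duration_s <= 60 else max_frames
    n = max(1, min(n, total_frames))
    step = total_frames / n
    return [int(i * step) for i in range(n)]
```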
Step 2012: basic index screening, that is, determining M primary selected images that meet the index conditions from the N primary selected images based on at least one basic image index of each of the N primary selected images, where M is smaller than N.
In the embodiment of the application, basic analysis of the primary selected images can be performed with traditional image operators; lightweight algorithms are used because the number of primary selected images retained at this stage is still large.
In one possible embodiment, the basic image indexes may include, but are not limited to, a combination of one or more of the following:
(1) Image sharpness: by calculating the image sharpness, the primary selected images ranked at the top (e.g. the top 90%) are retained.
(2) Image brightness: by calculating the image brightness, the primary selected images whose brightness ranks at the top (e.g. the top 90%) are retained.
(3) Image entropy: the entropy represents the amount of information carried by the image, and the more information carried the better, so by calculating the image entropy, the primary selected images whose entropy ranks at the top (e.g. the top 90%) are retained.
(4) Image stillness: the stillness can be calculated by differencing adjacent frames; the smaller the difference, the higher the stillness, so the image stillness can be used for preferred selection within a class cluster.
Experimental verification shows that preliminarily filtering out ultra-low-quality pictures before clustering prevents such pictures from being selected as candidate images and improves the image quality of the candidate images. Taking 60 primary selected images as an example, basic index screening can theoretically retain around 32 of them.
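For illustration, the four basic indices above could be computed with common lightweight operators as in the sketch below; the specific operators (Laplacian variance for sharpness, mean intensity for brightness, histogram entropy, negative mean frame difference for stillness) are assumed choices, not ones mandated by the text:

```python
import cv2
import numpy as np

def basic_indices(gray, prev_gray=None):
    """Compute the four basic image indices on an 8-bit grayscale frame."""
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()          # higher = sharper
    brightness = float(gray.mean())
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    entropy = float(-(p[p > 0] * np.log2(p[p > 0])).sum())     # information carried by the image
    stillness = None
    if prev_gray is not None:
        diff = np.abs(gray.astype(np.int16) - prev_gray.astype(np.int16)).mean()
        stillness = -float(diff)                               # smaller difference = higher stillness
    return sharpness, brightness, entropy, stillness
```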
Step 2013: clustering and de-duplication, that is, dividing similar pictures into classes through clustering and selecting the optimal image within each class after clustering.
In the embodiment of the application, based on the image similarity among M primary selected images, the M primary selected images are clustered to obtain L class clusters, each class cluster comprises at least one primary selected image with the similarity meeting the condition, and the primary selected image with the highest stillness in each class cluster is selected for the L class clusters respectively to obtain L primary selected images, wherein L is less than M. The clustering can be performed by a machine learning method, and a plurality of clustering algorithms can be applied to achieve the clustering effect.
In one possible implementation, the above clustering process is performed with a hierarchical clustering algorithm. Specifically, the hash similarity between every two primary selected images is calculated, and during clustering the two closest primary selected images are merged into one class at each step until the target number of class clusters is reached; the number of class clusters can be set to more than twice the number of required candidate images, leaving screening room for different image scenes.
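As an illustrative sketch only: the text says "hash similarity" without naming a hash, so an average hash is assumed below, and SciPy's agglomerative (hierarchical) clustering is used to merge the closest pair at each step until the target number of class clusters is reached:

```python
import cv2
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def average_hash(gray, size=8):
    """Simple average hash of a grayscale image (an assumed choice of hash)."""
    small = cv2.resize(gray, (size, size), interpolation=cv2.INTER_AREA)
    return (small > small.mean()).astype(np.uint8).ravel()

def cluster_primary_images(gray_images, num_clusters):
    """Hierarchically cluster primary selected images into num_clusters class clusters."""
    hashes = np.stack([average_hash(g) for g in gray_images])
    Z = linkage(hashes, method="average", metric="hamming")   # merge closest pairs step by step
    return fcluster(Z, t=num_clusters, criterion="maxclust")  # cluster label per image
```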
Step 2014: a plurality of candidate images are selected based on the base image indices of the L primary selected images.
In the embodiment of the application, after the clustering and de-duplication steps are completed, the primary selected images in each class cluster are analyzed, the basic quality score of each picture is calculated, and the best frame is selected as a candidate frame; for example, 16 candidate images are extracted preferentially. In this way, after clustering, if the picture quality of an entire class cluster is poor, it can be filtered out by the ranking order in the subsequent refinement stage, so that both the degree of difference between candidate images and the image quality are taken into account.
In a possible implementation manner, the image description part in step 201 may be implemented according to the schematic diagram shown in fig. 6, which is a schematic flow diagram of image description of candidate images according to an embodiment of the present application. Since the image description process is similar for each candidate image, a candidate image A is used as an example here. The image description process comprises the following steps:
Step 601: Image coding is performed on candidate image A by the image coding part of the image description model to obtain the coding features corresponding to candidate image A.
In the embodiment of the present application, the image description process is performed with a deep-learning-based image description model. The image description model is essentially a translation model whose input is not a word sequence but an image, i.e. a series of pixel values; the corresponding visual features need to be extracted from the image and then decoded into an output sequence. An encoding-decoding (encoder-decoder) structure or a Universal Image-Text Representation learning (UNITER) structure may therefore be used, where the encoder part implements image encoding and may be any neural network capable of image feature extraction, such as a convolutional neural network (Convolutional Neural Networks, CNN) or a multi-layer deep neural network (Deep Neural Networks, DNN), and the decoder part may be any neural network capable of sequential feature decoding, such as a recurrent neural network (Recurrent Neural Network, RNN) or a Long Short-Term Memory network (LSTM).
In one possible implementation, the model architecture shown in fig. 7 may be used; fig. 7 is a schematic structural diagram of an image description model provided by an embodiment of the present application. In this architecture, the image encoding part is realized with a CNN network, and the decoding part is realized with an LSTM network composed of a plurality of serialization decoding units.
Specifically, the CNN network includes an input layer, an output layer, and multiple hidden layers between them; the input layer receives the candidate image provided by the embodiment of the present application, and the multiple hidden layers implement feature extraction over receptive fields of different sizes in the candidate image.
Considering that the image subject is usually an important factor for describing an image, an attention (Attention) mechanism may be combined during feature extraction. The attention mechanism extracts corresponding features for different positions of the candidate image, producing features that contain position information, and gives the decoding process the ability to select among these position features. The attention mechanism originates from the human visual attention mechanism: when perceiving a scene, human vision generally does not scan everything from beginning to end but observes a specific part according to need; once people find that something they want to observe often appears in that part, they learn to attend to it when similar scenes appear again. The attention mechanism is therefore essentially a means of screening high-value information from a large amount of information in which different pieces of information have different importance for the result; this importance can be represented by assigning weights of different magnitudes. In other words, the attention mechanism can be understood as a rule for assigning weights when synthesizing multiple sources.
Step 602: decoding the coding features based on a plurality of serialization decoding units included in the image description model to obtain a plurality of description words; each serialization decoding unit predicts based on the coding characteristics and the descriptive word output by the previous serialization decoding unit, and obtains the corresponding descriptive word.
In the embodiment of the application, the coding features output by the image coding part form a feature sequence, which is input into the serialization decoding part of the image description model for decoding. Referring to fig. 7, the serialization decoding part includes a plurality of serialization decoding units. The input of each serialization decoding unit is the output of the preceding serialization decoding unit together with the feature at the corresponding position in the coding features; the input of the first serialization decoding unit only includes the feature at the corresponding position in the coding features. The output of each serialization decoding unit is a probability over all words in the word list, and based on these probabilities the word with the highest probability can be selected as the descriptive word output by that serialization decoding unit.
As an example, for the second LSTM unit shown in FIG. 7, S0 represents the descriptive word obtained from the previous LSTM unit, and p1 represents the prediction probability output by the second LSTM unit; p1 is a 1×D vector, where D is the length of the word list, i.e. p1 gives the probability that the second descriptive word is each word in the word list. At the same time, the descriptive text obtained at each stage attempts to restore the image features, so that the generated text expresses the visual information as faithfully as possible.
Step 603: based on the plurality of descriptive segmentations, image descriptive text of the candidate image a is obtained.
The image description text of candidate image A can be obtained by combining the descriptive words output by the LSTM units; referring to FIG. 3, the image description text "two snowmen in the snow" can be obtained through the image description processing.
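The encode-decode structure of fig. 7 can be sketched roughly as follows; this is an illustrative skeleton only, in which a small ResNet backbone stands in for the CNN encoding part and a single-layer LSTM stands in for the serialization decoding units, not the exact model of the embodiment.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)  # illustrative CNN encoder
        # Strip the classification head and keep the feature extractor.
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.img_proj = nn.Linear(512, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Serialized decoding units: an LSTM predicting one descriptive word per step.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)        # (B, 512) image features
        img_token = self.img_proj(feats).unsqueeze(1)  # image used as the first input token
        word_tokens = self.embed(captions)             # previously generated descriptive words
        inputs = torch.cat([img_token, word_tokens], dim=1)
        out, _ = self.lstm(inputs)
        return self.word_head(out)                     # per-step scores over the word list
```

At inference time, the word with the highest score at each step would be fed back as the next input, corresponding to the greedy selection of descriptive words described in step 602.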
Using the image description text for both training and prediction of the classification evaluation model enriches, on the one hand, the data basis required for classification and evaluation and improves the accuracy of the classification and evaluation results; on the other hand, the image description simulates how a person understands an image, improves the understanding of the candidate image, and helps judge whether the candidate image is suitable as a cover image.
In a possible implementation manner, reference may be made to fig. 8, which is a schematic training flow diagram of a classification evaluation model provided by an embodiment of the present application, including the following steps:
step 801: and acquiring a plurality of sample multimedia resources, and respectively acquiring corresponding resource classification labels aiming at each sample multimedia resource.
In the embodiment of the application, the resource classification labels may be obtained by classifying with an existing classification model, or obtained from the existing labels of the multimedia resources. Taking video as an example, videos can be collected from an existing video platform as sample multimedia resources, and the first-level classification label of each video can be used as its resource classification label, which reduces the preparation workload for training samples and accelerates the training process.
Step 802: Extracting a plurality of candidate images from the original images of each sample multimedia resource, and acquiring a corresponding image evaluation grade for each candidate image, where the image evaluation grade characterizes: the recommendation degree, determined based on prior knowledge, with which the corresponding candidate image is recommended as a cover image.
In the embodiment of the application, the image evaluation level is mainly used for constructing the loss function of the subsequent evaluation subtask, and the image evaluation level of each candidate image can be determined based on priori knowledge.
In one possible implementation, the image evaluation level may be obtained by manually evaluating "whether the candidate frame is suitable as the cover image of the multimedia resource" on a 5-level scale, such as extremely suitable, very suitable, relatively suitable, generally suitable, and unsuitable. The process of extracting the candidate images may refer to the description of the corresponding parts above and is not repeated here.
Step 803: based on the obtained sample multimedia resources, the resource classification labels and the image evaluation level, a plurality of training samples are constructed.
In the embodiment of the application, each training sample comprises the following contents:
(1) The original images: for a sample video, each video frame contained in the sample video; for a sample atlas, each image contained in the sample atlas.
(2) The candidate images, i.e., the plurality of candidate images extracted in step 802 described above.
(3) The resource description text, which may be, for example, the title, brief introduction, or comments of the sample multimedia resource.
(4) The image description texts, i.e., the image description information of each of the plurality of candidate images extracted as described above.
(5) Resource classification labels.
(6) Image evaluation grade.
It can be seen that the embodiment of the application fully considers the abundant resource data of each multimedia resource, thereby ensuring that the resource classification and the image evaluation result are more accurate.
When constructing samples, positive and negative samples can be constructed randomly: a positive sample is one whose resource classification label and image evaluation grade both correspond to the current sample multimedia resource, while a negative sample is one whose resource classification label or image evaluation grade does not correspond to the current sample multimedia resource. The constructed training samples are then used to train the classification evaluation model. Since the training process includes multiple iterations and each iteration proceeds similarly, a single iteration is described here as an example.
Step 804: and respectively carrying out forward prediction on each sample multimedia resource by adopting a classification evaluation model to obtain the prediction type of each sample multimedia resource and the prediction evaluation result of each candidate image.
In one possible implementation, the classification evaluation model may employ a model architecture as shown in fig. 9, where the architecture includes a feature extraction layer, a coding layer, a fusion layer, and a subtask prediction layer.
(1) Feature extraction layer
The feature extraction layer may employ any neural network capable of feature extraction, such as a CNN or a large-scale contrastive language-image pre-training model (Contrastive Language-Image Pre-training, CLIP), in which both the text encoder and the image encoder are based on the Transformer structure.
(2) Coding layer
The encoding layer may include a text encoder and an image encoder for encoding text features and image features, respectively. As shown in fig. 9, the text encoder Transformer 1 and the image encoder Transformer 2 may use, for example, a Transformer structure or the encoder part of an encoder-decoder; the Transformer is a network structure composed of self-attention (self-attention) and a feedforward neural network (Feed Forward Neural Network) and is widely used in the NLP and CV fields.
In one possible implementation, the text encoder and the image encoder may share weight parameters.
(3) Fusion layer
The fusion layer fuses the text and/or image coding features and inputs the fused features into the subsequent subtask prediction layer for predicting each subtask.
(4) Subtask prediction layer
The subtask prediction layer comprises a classification subtask and an evaluation subtask. The classification subtask predicts the type of the sample multimedia resource from the fused features and outputs, for each sample multimedia resource, the probability of belonging to each resource type; the evaluation subtask evaluates whether each candidate image is suitable as a cover image relative to the sample multimedia resource from which it originates and outputs the model evaluation result of each candidate image.
Both the classification subtask and the evaluation subtask can be implemented with a network structure such as a multi-layer perceptron (Multilayer Perceptron, MLP). Fig. 9 shows one possible MLP structure comprising, in order, a linear layer, an activation layer (Rectified Linear Units, ReLU), and a linear layer.
In the embodiment of the present application, the classification evaluation model shown in fig. 9 is used to perform forward prediction based on the relevant data of each sample multimedia resource, so as to obtain the prediction type of each sample multimedia resource and the prediction evaluation result of each candidate image.
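For orientation only, the forward pass of fig. 9 might be sketched as below; the feature dimension, the number of resource types, the number of candidate images scored by the evaluation head, the single shared Transformer encoder realizing the weight sharing between the text and image encoders, and the simple concatenation fusion are all assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ClassifyEvaluateModel(nn.Module):
    def __init__(self, feat_dim=512, num_types=30, num_candidates=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        # A single encoder applied to both modalities stands in for the weight sharing
        # between the "text encoder" and the "image encoder" of fig. 9.
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=2)

        def mlp(out_dim):
            # Subtask head: linear -> ReLU -> linear, as in fig. 9.
            return nn.Sequential(nn.Linear(2 * feat_dim, feat_dim),
                                 nn.ReLU(),
                                 nn.Linear(feat_dim, out_dim))

        self.cls_head = mlp(num_types)        # classification subtask
        self.eval_head = mlp(num_candidates)  # evaluation subtask (one score per candidate image, an assumption)

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, T_text, D) features of the resource/image description words
        # image_feats: (B, T_img, D) features of the original and candidate images
        fused_text = self.shared_encoder(text_feats).mean(dim=1)
        fused_image = self.shared_encoder(image_feats).mean(dim=1)
        fused = torch.cat([fused_text, fused_image], dim=-1)  # simple concatenation fusion
        return self.cls_head(fused), self.eval_head(fused)
```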
Step 805: a classification loss function is constructed based on the differences between the obtained respective prediction types and the corresponding resource classification labels, and an evaluation loss function is constructed based on the differences between the obtained prediction evaluation results and the corresponding image evaluation levels.
In one possible implementation, the classification subtask may use a cross entropy loss function (Cross Entropy Loss Function), denoted L_cls, and the evaluation subtask may use a center loss function, denoted L_scr; the model loss function of the classification evaluation model may therefore be expressed as:
Loss = alpha * L_cls + beta * L_scr
where alpha is the loss weight of the classification subtask and beta is the loss weight of the evaluation subtask; both parameters are adjustable and take values in the range (0, 1). In practice, the weights may be normalized according to the magnitudes of the two losses: for example, if L_scr takes values in the range 0 to 1 while L_cls takes values in the range 1 to 2, the weight of L_cls may be reduced accordingly. The weights may also be set according to the importance of the subtasks: because the primary task is cover image selection and the secondary task is resource classification, the weight alpha corresponding to resource classification may be set to a smaller value (e.g., 0.1) and the image evaluation weight beta to a larger value (e.g., 0.9).
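A minimal sketch of the combined loss is given below; the weights follow the example above, and for brevity the center loss of the evaluation subtask is stood in by a simple regression against the labeled image evaluation grade, which is not the exact loss of the embodiment.

```python
import torch
import torch.nn.functional as F

# Illustrative weights: the primary task is image evaluation, so it gets the larger weight.
ALPHA = 0.1   # loss weight of the classification subtask
BETA = 0.9    # loss weight of the evaluation subtask

def total_loss(type_logits, type_labels, eval_scores, eval_levels):
    # Classification subtask: cross entropy against the resource classification label.
    l_cls = F.cross_entropy(type_logits, type_labels)
    # Evaluation subtask: the embodiment describes a center loss; a simple regression
    # against the 5-level image evaluation grade is used here as a stand-in.
    l_scr = F.mse_loss(eval_scores, eval_levels)
    return ALPHA * l_cls + BETA * l_scr
```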
Step 806: and judging whether the classification evaluation model converges or not.
In one possible embodiment, the convergence condition may include any one of the following conditions:
(1) The model loss is not greater than the set loss threshold.
(2) The iteration number reaches a set number threshold.
Step 807: If the determination result in step 806 is no, parameter adjustment is performed on the classification evaluation model based on the classification loss function and the evaluation loss function, and the next iteration is performed based on the adjusted classification evaluation model, i.e. the process returns to step 804. The model parameters mainly comprise the weight parameters of the above layers.
If the determination result in step 806 is yes, that is, the classification evaluation model has reached the convergence condition, the training process is ended, and then the classification evaluation model needs to be evaluated.
In the embodiment of the application, an evaluation data set needs to be prepared in advance before the evaluation. Taking a video data set as an example, the evaluation data set is constructed by selecting evaluation videos, performing the frame extraction, clustering and sorting processes of the picture selection procedure on each video, and retaining a number of candidate images (for example, 16 frames) per video; the candidate images of each video are then manually evaluated to select the three candidate images with the highest recommendation degrees, giving the evaluation data set. Then, with the method provided by the embodiment of the application, simulated image selection is performed on the 16 frames of each video in the evaluation data set, the three candidate images with the highest recommendation degrees are selected and compared with the three manually annotated images, and the ranking evaluation index Normalized Discounted Cumulative Gain (NDCG) is calculated; a higher index value indicates a better algorithm effect.
In the evaluation process of the embodiment of the application, the evaluation picture set is fixed in advance, which is more efficient than the usual approach of manually evaluating the effect after selecting pictures from videos. In practice, 5000 videos were screened from on-line data for candidate image extraction, several frames were extracted from each video to form the evaluation data set, the top three video frames were manually selected for this set, and finally the result selected by the method was compared with the manually annotated result and the agreement between the algorithm ranking and the annotation ranking was calculated. In terms of the final picture selection effect, compared with the scheme in the related art, the NDCG value is improved from 0.69 to 0.72, i.e. the picture selection effect is improved, which can further improve the attractiveness of the pictures and increase the Click-Through Rate (CTR) of the videos. Moreover, the evaluation mode provided by the application can be used to evaluate most evaluation algorithms, and while solving the problem of subjective data evaluation it shortens the evaluation period from the original one week to one day.
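The ranking comparison can be sketched as follows; the relevance values assigned to the three manually selected frames (3, 2, 1) and the example ids are assumptions of this illustration.

```python
import numpy as np

def ndcg_at_k(predicted_ids, relevance_by_id, k=3):
    # predicted_ids: candidate image ids ranked by the algorithm.
    # relevance_by_id: manual relevance, e.g. 3/2/1 for the annotators' top three, 0 otherwise.
    gains = [relevance_by_id.get(i, 0) for i in predicted_ids[:k]]
    dcg = sum(g / np.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted(relevance_by_id.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: annotators picked frames 7, 2, 5 (relevance 3, 2, 1); the algorithm predicted 7, 5, 9.
print(ndcg_at_k([7, 5, 9], {7: 3, 2: 2, 5: 1}))
```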
The process of step 202 is described below in conjunction with the model architecture shown in fig. 9, and referring to fig. 10, a flow chart for classifying resources and evaluating images is shown, which includes the following steps:
Step 2021: and extracting corresponding original image features from each original image.
In one possible implementation, feature extraction of CLIP vectors may be performed for each original image. It should be noted that, since the candidate images are selected from the original images, feature extraction is performed on each original image, which corresponds to acquiring CLIP vectors of each candidate image.
Step 2022: and extracting first text features from the resource description texts, and extracting corresponding second text features from the image description texts.
In the embodiment of the application, the extraction process of the resource description text is similar to that of the image description text, for example, after the resource description text or the image description text is segmented, a text segmentation sequence is obtained, and then feature extraction is performed on the text segmentation sequence through a serialization feature extraction method, so that corresponding text features are obtained.
In one possible implementation manner, the resource description text or the image description text may be segmented into words, and CLIP feature extraction may then be performed on each word to obtain the CLIP vector of each word, which is input into the subsequent Transformer.
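As an illustration, CLIP features can be obtained with an open-source implementation such as the transformers library; the model name is an assumption, and for simplicity this sketch extracts one vector per image and per text rather than the per-word vectors described above.

```python
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")      # illustrative checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_features(images):
    # images: list of PIL images (original images / candidate images).
    inputs = processor(images=images, return_tensors="pt")
    return model.get_image_features(**inputs)   # one CLIP vector per image

def clip_text_features(texts):
    # texts: resource description text and image description texts.
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    return model.get_text_features(**inputs)     # one CLIP vector per text
```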
Step 2023: and carrying out text feature fusion on the obtained first text features and each second text feature to obtain fused text features, and carrying out image feature fusion on each obtained original image feature to obtain fused image features.
Referring to fig. 9, the classification evaluation model includes a text encoder (Transformer 1) and an image encoder (Transformer 2) that share weight parameters. The text encoder performs text feature encoding on the obtained first text feature and each second text feature based on these weight parameters to obtain the fused text feature, and the image encoder performs image feature encoding on each obtained original image feature based on these weight parameters to obtain the fused image feature.
It should be noted that, referring to fig. 9, in order to accurately evaluate each candidate image, in addition to taking the original images as input, the embodiment of the present application additionally adds the candidate images as input, so that the fused image features can focus on the candidate images, which helps to determine later whether an image is suitable as a cover image.
Step 2024: and predicting the target resource type and each model evaluation result based on the fused text features and the fused image features.
In the embodiment of the application, the obtained fused text feature and fused image feature can be further fused through the fusion layer to obtain the fused resource feature. Feature fusion refers to integrating the information contained in the fused text feature and the fused image feature, and can be implemented in any one of the following ways (an illustrative sketch of these options is given after the list):
(1) Vector splicing (concatenate)
When feature fusion is performed by vector splicing, the fused text feature and the fused image feature are spliced into one vector in a set order, for example the image feature followed by the text feature.
Since the spliced feature vector obtained in this way has a higher dimensionality, dimension reduction can be applied to the spliced feature in a suitable manner to obtain the fused resource feature.
(2) Feature pooling (pooling)
When feature fusion is performed by feature pooling, pooling is applied to the fused text feature and the fused image feature, for example max-pooling or mean-pooling, which is not limited in the embodiment of the present application.
(3) Convolution processing
When feature fusion is performed by convolution processing, a convolution layer performs a convolution operation with a set stride over the feature matrix formed by the fused text feature and the fused image feature.
(4) Fully connected processing
When feature fusion is performed by fully connected processing, the fused text feature and the fused image feature are mapped through fully connected layers (FC) to obtain the fused resource feature.
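The sketch below illustrates, under assumed feature dimensions, the four fusion manners listed above; it is an example only, not the fusion layer of the embodiment.

```python
import torch
import torch.nn as nn

text_feat = torch.randn(4, 512)    # fused text features  (batch, D), dims assumed
image_feat = torch.randn(4, 512)   # fused image features (batch, D)

# (1) Vector splicing, optionally followed by a linear layer for dimension reduction.
spliced = torch.cat([image_feat, text_feat], dim=-1)          # (batch, 2D)
reduced = nn.Linear(1024, 512)(spliced)

# (2) Feature pooling: stack the two modalities and take max or mean per dimension.
stacked = torch.stack([image_feat, text_feat], dim=1)         # (batch, 2, D)
max_pooled = stacked.max(dim=1).values
mean_pooled = stacked.mean(dim=1)

# (3) Convolution over the stacked feature matrix with a set stride.
conv_fused = nn.Conv1d(2, 1, kernel_size=3, padding=1, stride=1)(stacked).squeeze(1)

# (4) Fully connected mapping of the concatenated features.
fc_fused = nn.Sequential(nn.Linear(1024, 512), nn.ReLU())(spliced)
```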
In the embodiment of the application, the obtained fused resource feature is output to the different subtasks for prediction, and the corresponding resource type and model evaluation results are obtained.
Referring to fig. 9, taking resource classification as an example, the fused resource feature is fed to the classification subtask and processed in turn by a linear layer, an activation layer and a linear layer to obtain the classification probabilities, which represent the probability that the target multimedia resource belongs to each resource type; the resource type with the largest probability value is selected as the target resource type of the target multimedia resource.
In the embodiment of the application, besides the description text and the image characteristics included in the multimedia resources, the image description text and the image characteristics of the candidate images are additionally added, and the rich resource data of each multimedia resource are fully considered, so that the resource classification and the image evaluation result are more accurate.
In the embodiment of the application, when the model evaluation results are updated according to the target resource type, the update can be performed in several manners, which are described one by one below. Since this process is the same for each candidate image, one candidate image is taken as an example in the following description.
(1) First mode
For different types of multimedia resources, cover selection may focus on different evaluation dimensions; candidate images can therefore be evaluated from several evaluation dimensions, with the weight of each dimension determined by the resource type obtained by the model.
Referring to fig. 11, a flow chart for obtaining the target evaluation result is shown. In addition to the model evaluation performed with the classification evaluation model, the evaluation can be combined with other manners, so that the recommendation degree of each candidate image as a cover image is evaluated comprehensively from multiple aspects and dimensions.
In a possible implementation manner, referring to fig. 11, the evaluation of the candidate images may include two aspects: on one hand, a comprehensive model evaluation is performed through the classification evaluation model, which has been described above and is not repeated; on the other hand, a subject evaluation of each candidate image is performed through a subject evaluation model to obtain a corresponding subject evaluation result, and the final target evaluation result is obtained by combining the two aspects.
It should be noted that the subject evaluation is only one possible evaluation mode, and may be performed in combination with other evaluation modes in the practical application process, which is not limited in the embodiment of the present application.
In one possible implementation, the subject evaluation process may be performed using a subject evaluation model as shown in FIG. 12.
Specifically, for a candidate image A, image subject detection is performed to obtain the target area where the image subject in candidate image A is located, and the degree of overlap between this target area and the central area of candidate image A is then determined; the degree of overlap may be measured, for example, by the Intersection-over-Union (IoU), giving the subject evaluation result of candidate image A. In general, the central area of the image may be the central 1/3 of the picture; the center is the part of the image that attracts the most attention and therefore expresses the image content more easily. Whether the image content is expressed accurately can thus be measured by whether the image subject lies in the center: for example, in the snowman image of fig. 3, the two snowmen being located in the central area expresses the content of the image more intuitively and is also more pleasing from an aesthetic point of view. Therefore, the higher the intersection-over-union, the higher the subject evaluation value, indicating that the image is more suitable as a cover image.
In a specific implementation, the image subject detection may focus on face detection and human body detection: the face and human body are detected, target boxes marking their positions are obtained, and the area overlap with the central 1/3 of the picture is computed, so that effective and ineffective areas are estimated and the subject evaluation result is obtained.
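A minimal sketch of the overlap computation described above, assuming axis-aligned boxes and taking the central third of the picture as the central area:

```python
def box_iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns the Intersection-over-Union.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def subject_score(subject_box, width, height):
    # Central area: the middle third of the picture (an assumption of this sketch).
    center_box = (width / 3, height / 3, 2 * width / 3, 2 * height / 3)
    return box_iou(subject_box, center_box)
```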
Continuing with fig. 11, when several evaluation manners are used, the cover evaluation rule may include weight parameters for these evaluation manners, and the weight parameters may be related to the resource type. The weights of the respective evaluation results in the cover evaluation rule are set according to the obtained target resource type, for example the weights of the model evaluation result and the subject evaluation result are determined; then, based on the model evaluation result and the subject evaluation result of each candidate image and the obtained weights, the target evaluation result of each candidate image is obtained. The target evaluation result P may be expressed as follows:
P = a*x1 + b*x2
where x1 denotes the model evaluation result, x2 denotes the subject evaluation result, a denotes the weight of the model evaluation result, and b denotes the weight of the subject evaluation result. The values of a and b are determined according to the target resource type described above: for example, for film or variety videos, a and b may both be set to 0.5, while for documentary videos a and b may be set to 1 and 0. The evaluation rule can thus be adjusted appropriately for a specific type, so that it better matches the actual scene and improves the accuracy of the image evaluation; this dynamic adjustment strategy is simple, making the actual image selection process more efficient and faster.
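The type-dependent weighting can be sketched as follows; the weight table and the default weights are illustrative assumptions.

```python
# Illustrative type-dependent weights (a, b) for the model and subject evaluation results.
WEIGHTS_BY_TYPE = {
    "film": (0.5, 0.5),
    "variety": (0.5, 0.5),
    "documentary": (1.0, 0.0),
}

def target_score(resource_type, model_result, subject_result, default=(0.7, 0.3)):
    # `default` is an assumed fallback for resource types not listed above.
    a, b = WEIGHTS_BY_TYPE.get(resource_type, default)
    return a * model_result + b * subject_result
```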
(2) Second mode
In this manner, each model evaluation result may include the recommendation degree of the corresponding candidate image with respect to each resource type; that is, for each candidate image, a model evaluation result containing one probability per resource type may be obtained, where each probability characterizes the recommendation degree of that candidate image as a cover image for the corresponding resource type.
The classification evaluation model may output a recommendation degree matrix, see fig. 13, whose rows correspond to the candidate images and whose columns correspond to the resource types; for example, the entry in the first row and first column is the recommendation degree of candidate image 1 for resource type 1.
The cover evaluation rule may then include a weight parameter for each resource type, and these weight parameters may be related to the resource type; a weight value can be determined for each resource type according to the target resource type of the target multimedia resource, and the target evaluation result of each candidate image is then determined based on the recommendation degrees of that candidate image with respect to the resource types and the corresponding weights. In other words, by assigning a weight to each resource type and computing a weighted sum of the recommendation degrees of each candidate image, the recommendation degrees are fused into a final target evaluation result, which can serve as the reference basis for the subsequent cover selection.
In a possible implementation manner, the weight value of the target resource type may be set to 1 and the weight values of the other resource types to 0, which is equivalent to taking, from each model evaluation result, the recommendation degree corresponding to the target resource type and using it as the target evaluation result of the corresponding candidate image. Referring to fig. 13, the column corresponding to the target resource type is found, the recommendation degrees in that column are the final evaluation results of the candidate images, and the cover image can be selected based on these results.
In another possible implementation manner, the weight value of each resource type may be set according to its degree of association with the target resource type: the higher the association between a resource type and the target resource type, the higher its weight value. For example, if the video types include food, variety, drama, documentary and movie, then because both drama and movie are centered on plot, their association is high; when the target resource is a drama, the weight value of movie may be higher than that of the other resource types. A weighted sum over the recommendation degrees of the resource types then gives the target evaluation result of each candidate image.
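A sketch of this second manner is given below, assuming a recommendation matrix whose rows are candidate images and whose columns are resource types; the example values and weights are illustrative.

```python
import numpy as np

def fuse_recommendations(rec_matrix, type_weights):
    # rec_matrix: (num_candidates, num_types), recommendation degree per candidate and type.
    # type_weights: (num_types,), e.g. 1 for the target type and 0 elsewhere,
    # or graded by the association of each type with the target type (e.g. drama vs. movie).
    return rec_matrix @ np.asarray(type_weights)   # target evaluation result per candidate image

rec = np.array([[0.8, 0.1, 0.3],
                [0.4, 0.6, 0.2]])
print(fuse_recommendations(rec, [0.0, 1.0, 0.3]))  # weights are illustrative
```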
In summary, in the embodiment of the present application, deep-learning classification is performed on the associated data provided with the multimedia resource to obtain the resource classification, removing the dependence on manually annotated categories. Picture evaluation criteria designed for different verticals (for example, aesthetics, whether the picture matches the video, whether the picture matches the title, etc.) are then considered comprehensively, so that evaluation dimensions can be emphasized differently per category and the machine obtains multi-dimensional information and category-customized evaluation criteria for picture evaluation. Taking video as an example, in the actual business process the video cover is selected very early: a suitable cover image must be generated before the video has been operated on at all, at which point the related video classification information cannot yet be acquired.
Referring to fig. 14, based on the same inventive concept, an embodiment of the present application further provides a content recommendation device 140, including:
an image extraction unit 1401 for extracting a plurality of candidate images from respective original images included in the target multimedia asset;
an image description unit 1402 for generating corresponding image description text based on image contents of the respective candidate images, respectively;
A classification evaluation unit 1403, configured to input the resource contents in multiple media forms included in the target multimedia resource and the respective image description texts into a trained classification evaluation model, and obtain a target resource type of the target multimedia resource and a model evaluation result of each candidate image, where each model evaluation result represents: the corresponding candidate image is recommended as the recommendation degree of the cover image;
the image selecting unit 1404 is configured to determine a cover evaluation rule corresponding to the target multimedia resource based on the target resource type, and update each obtained model evaluation result based on the cover evaluation rule, so as to obtain a target evaluation result of each candidate image; and determining a cover image from the respective candidate images based on the target evaluation results of the respective candidate images.
Optionally, the image selecting unit 1404 is specifically configured to:
performing cover image evaluation based on the position information of the image main body in each candidate image respectively to obtain a main body evaluation result of each candidate image;
determining weights of a model evaluation result and a main body evaluation result in the cover evaluation rule based on the target resource type;
and obtaining target evaluation results of the candidate images based on the model evaluation results and the subject evaluation results of the candidate images and the obtained weights.
Optionally, the image selecting unit 1404 is specifically configured to:
for each candidate image, the following steps are performed:
aiming at a candidate image, carrying out main body detection on the candidate image to obtain a target area where an image main body in the candidate image is located;
and obtaining a subject evaluation result of one candidate image according to the overlapping degree between the target area and the central area of the one candidate image.
Optionally, each model evaluation result includes a recommendation degree of the corresponding candidate image with respect to each resource type; the image selecting unit 1404 is specifically configured to:
based on the target resource types, determining weights corresponding to the resource types in the cover evaluation rule;
and determining target evaluation results of the candidate images based on the recommendation degree of the candidate images relative to the resource types and the corresponding weights.
Optionally, the resource content comprises the original images and the resource description text of the target multimedia resource; the classification evaluation unit 1403 is specifically configured to:
extracting corresponding original image features from each original image respectively;
extracting first text features from the resource description texts, and extracting corresponding second text features from each image description text respectively;
Performing text feature fusion on the obtained first text features and each second text feature to obtain fused text features, and performing image feature fusion on each obtained original image feature to obtain fused image features;
and predicting the target resource type and each model evaluation result based on the fused text features and the fused image features.
Optionally, the classification evaluation model includes a text encoder and an image encoder, the text encoder and the image encoder sharing weight parameters; the classification evaluation unit 1403 is specifically configured to:
adopting a text encoder to encode the text features of the obtained first text features and each second text feature based on the weight parameters to obtain a fusion text feature;
and (3) adopting an image encoder to perform image feature encoding on each obtained original image feature based on the weight parameters to obtain a fusion image feature.
Optionally, the image extraction unit 1401 is specifically configured to:
extracting N primary selected images from each original image by adopting an equidistant extraction mode;
determining M primary selected images meeting the index conditions from the N primary selected images based on at least one basic image index of the N primary selected images respectively, wherein M is smaller than N;
Clustering the M primary selected images based on the image similarity among the M primary selected images to obtain L class clusters, wherein each class cluster comprises at least one primary selected image with the similarity meeting the condition;
aiming at L class clusters, respectively selecting a primary selected image with highest stationarity in each class cluster to obtain L primary selected images, wherein L is smaller than M;
a plurality of candidate images are selected based on the base image indices of the L primary selected images.
Optionally, the image description unit 1402 is specifically configured to:
for each candidate image, the following steps are performed:
image coding is carried out on one candidate image based on an image coding part of the image description model to obtain coding characteristics corresponding to the candidate image;
decoding the coding features based on a plurality of serialization decoding units included in the image description model to obtain a plurality of description words; each serialization decoding unit predicts based on the coding characteristics and the descriptive word output by the previous serialization decoding unit to obtain a corresponding descriptive word;
based on the plurality of descriptive segmentations, an image descriptive text is obtained.
Optionally, the apparatus further comprises a model training unit 1405 for training to obtain a classification evaluation model by:
Acquiring a plurality of sample multimedia resources, and respectively acquiring corresponding resource classification labels aiming at each sample multimedia resource;
extracting a plurality of candidate images from original images of each sample multimedia resource respectively, and acquiring corresponding image evaluation grades aiming at each candidate image, wherein the image evaluation grades are characterized in that: the corresponding candidate images determined based on the priori knowledge are recommended as recommendation degrees of the cover images;
and constructing a plurality of training samples based on the obtained sample multimedia resources, the resource classification labels and the image evaluation grades, and training a classification evaluation model based on the plurality of training samples.
Optionally, the model training unit 1405 is specifically configured to:
respectively carrying out forward prediction on each sample multimedia resource by adopting a classification evaluation model to obtain the prediction type of each sample multimedia resource and the prediction evaluation result of each candidate image;
constructing a classification loss function based on the obtained differences between the respective prediction types and the corresponding resource classification labels, and constructing an evaluation loss function based on the obtained differences between the prediction evaluation results and the corresponding image evaluation levels;
and carrying out parameter adjustment on the classification evaluation model based on the classification loss function and the evaluation loss function.
With the above apparatus, when a cover image is extracted from a target multimedia resource, the resource type of the target multimedia resource is accurately predicted from the resource contents in multiple media forms and the image description texts of the candidate images; at the same time, the classification evaluation model evaluates the candidate images and outputs corresponding model evaluation results. The model evaluation results are then updated according to the resource type to obtain more accurate target evaluation results, and the cover image is selected according to the target evaluation results.
For convenience of description, the above parts are described as being divided by function into units (or modules). Of course, when implementing the present application, the functions of each unit (or module) may be implemented in one or more pieces of software or hardware.
Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, method, or program product. Accordingly, aspects of the application may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.) or an embodiment combining hardware and software aspects may be referred to herein as a "circuit," module "or" system.
The apparatus may be used to perform the methods shown in the embodiments of the present application, and therefore, the description of the foregoing embodiments may be referred to for the functions that can be implemented by each functional module of the apparatus, and the like, which are not repeated.
Referring to fig. 15, based on the same technical concept, the embodiment of the application further provides a computer device. In one embodiment, the computer device may be the server shown in FIG. 1, and as shown in FIG. 15, includes a memory 1501, a communication module 1503, and one or more processors 1502.
A memory 1501 for storing computer programs executed by the processor 1502. The memory 1501 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a program required for running an instant communication function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 1501 may be a volatile memory (RAM) such as a random-access memory (RAM); the memory 1501 may also be a nonvolatile memory (non-volatile memory), such as a read-only memory, a flash memory (flash memory), a hard disk (HDD) or a Solid State Drive (SSD); or memory 1501, is any other medium capable of carrying or storing desired program code in the form of instructions or data structures and capable of being accessed by a computer, but is not limited thereto. The memory 1501 may be a combination of the above memories.
The processor 1502 may include one or more central processing units (central processing unit, CPU) or digital processing units, or the like. A processor 1502 for implementing the cover image extraction method described above when calling a computer program stored in the memory 1501.
The communication module 1503 is used for communicating with the terminal device and other servers.
The specific connection medium between the memory 1501, the communication module 1503 and the processor 1502 is not limited in the embodiment of the present application. In fig. 15, the memory 1501 and the processor 1502 are connected by a bus 1504, which is depicted with a bold line; the connections between the other components are merely illustrative and not limiting. The bus 1504 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one bold line is depicted in fig. 15, but this does not mean that there is only one bus or one type of bus.
The memory 1501 stores therein a computer storage medium in which computer executable instructions for implementing the cover image extraction method of the embodiment of the present application are stored. The processor 1502 is configured to execute the cover image extraction method of each of the above embodiments.
In another embodiment, the computer device may also be other computer devices, such as the terminal device shown in FIG. 1. In this embodiment, the structure of the computer device may include, as shown in fig. 16: communication component 1610, memory 1620, display unit 1630, camera 1640, sensor 1650, audio circuitry 1660, bluetooth module 1670, processor 1680, and the like.
The communication component 1610 is used for communicating with the server. In some embodiments, it may include a wireless fidelity (Wireless Fidelity, WiFi) module; WiFi is a short-range wireless transmission technology, and the computer device can help the user send and receive information through the WiFi module.
Memory 1620 may be used to store software programs and data. The processor 1680 performs various functions of the terminal device and data processing by executing software programs or data stored in the memory 1620. The memory 1620 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. The memory 1620 stores an operating system that enables the terminal device to operate. The memory 1620 may store an operating system and various application programs, and may also store codes for executing the cover image extraction method according to the embodiment of the present application.
The display unit 1630 may be used to display information input by a user or information provided to the user, as well as a graphical user interface (graphical user interface, GUI) of the various menus of the terminal device. Specifically, the display unit 1630 may include a display screen 1632 disposed on the front surface of the terminal device. The display screen 1632 may be configured in the form of a liquid crystal display, light emitting diodes, or the like. The display unit 1630 may be used to display the cover image or a recommendation page in the embodiment of the present application.
The display unit 1630 may also be used to receive input numeric or character information and to generate signal inputs related to user settings and function control of the terminal device. Specifically, the display unit 1630 may include a touch screen 1631 disposed on the front of the terminal device, which may collect touch operations by the user on or near it, such as clicking buttons, dragging scroll boxes, and the like.
The touch screen 1631 may cover the display screen 1632, or the touch screen 1631 may be integrated with the display screen 1632 to implement input and output functions of the terminal device, and after integration, the touch screen may be abbreviated as touch screen. The display unit 1630 may display application programs and corresponding operation steps in the present application.
The camera 1640 may be used to capture still images, and a user may post comments on the image captured by the camera 1640 through an application. The camera 1640 may be one or a plurality of cameras. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive elements convert the optical signals to electrical signals, which are then passed to the processor 1680 for conversion to digital image signals.
The terminal device may further include at least one sensor 1650, such as an acceleration sensor 1651, a distance sensor 1652, a fingerprint sensor 1653, a temperature sensor 1654. The terminal device may also be configured with other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, light sensors, motion sensors, and the like.
Audio circuitry 1660, speakers 1661, and microphone 1662 may provide an audio interface between the user and the terminal device. The audio circuit 1660 may transmit the received electrical signal converted from audio data to the speaker 1661, and convert the electrical signal into an audio signal by the speaker 1661 to be output. The terminal device may also be configured with a volume button for adjusting the volume of the sound signal. On the other hand, the microphone 1662 converts the collected sound signals into electrical signals, which are received by the audio circuit 1660 and converted into audio data, which are output to the communication component 1610 for transmission to, for example, another terminal device, or to the memory 1620 for further processing.
The bluetooth module 1670 is used to exchange information with other bluetooth devices having bluetooth modules through bluetooth protocols. For example, the terminal device may establish a bluetooth connection with a wearable computer device (e.g., a smartwatch) that also has a bluetooth module via bluetooth module 1670 for data interaction.
The processor 1680 is a control center of the terminal device, connects various parts of the entire terminal using various interfaces and lines, and performs various functions of the terminal device and processes data by running or executing software programs stored in the memory 1620 and calling data stored in the memory 1620. In some embodiments, the processor 1680 may include one or more processing units; the processor 1680 may also integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., and a baseband processor that primarily handles wireless communications. It will be appreciated that the baseband processor described above may not be integrated into the processor 1680. The processor 1680 of the present application may run an operating system, an application, a user interface display, and a touch response, as well as the cover image extraction method of the present application. In addition, a processor 1680 is coupled to the display unit 1630.
In some possible embodiments, aspects of the cover image extraction method provided by the present application may also be implemented in the form of a program product, which includes program code for causing a computer device to perform the steps of the cover image extraction method according to various exemplary embodiments of the present application described above when the program product is run on the computer device, for example, the computer device may perform the steps of the embodiments.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code and may run on a computing device. However, the program product of the present application is not limited thereto, and in the present application, the readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's equipment, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this is not required to either imply that the operations must be performed in that particular order or that all of the illustrated operations be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (15)

1. A cover image extraction method, characterized in that the method comprises:
extracting a plurality of candidate images from each original image included in the target multimedia resource, and respectively generating corresponding image description texts based on the image content of each candidate image;
inputting the resource contents of various media forms included in the target multimedia resource and each image description text into a trained classification evaluation model, to respectively obtain a target resource type of the target multimedia resource and a model evaluation result of each candidate image, wherein each model evaluation result represents a recommendation degree of the corresponding candidate image being recommended as the cover image;
determining a cover evaluation rule corresponding to the target multimedia resource based on the target resource type, and respectively updating each obtained model evaluation result based on the cover evaluation rule to obtain target evaluation results of each candidate image;
and determining a cover image from the candidate images based on the target evaluation results of the candidate images.
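For orientation only, the following Python sketch arranges the four steps of claim 1 into one function; every helper it takes (extract_candidates, generate_caption, evaluate, rule_for_type, apply_rule) is a hypothetical callable standing in for the corresponding step and is not defined by the claim.

def select_cover_image(resource_contents, original_images,
                       extract_candidates, generate_caption,
                       evaluate, rule_for_type, apply_rule):
    # Step 1: candidate images plus one description text per candidate.
    candidates = extract_candidates(original_images)
    captions = [generate_caption(img) for img in candidates]
    # Step 2: one model returns the resource type and a score per candidate.
    resource_type, model_scores = evaluate(resource_contents, captions)
    # Step 3: the resource type selects a cover evaluation rule that updates
    # every model score into a target score.
    rule = rule_for_type(resource_type)
    target_scores = [apply_rule(rule, score, img)
                     for score, img in zip(model_scores, candidates)]
    # Step 4: the candidate with the best target score becomes the cover.
    best = max(range(len(candidates)), key=lambda i: target_scores[i])
    return candidates[best]
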
2. The method of claim 1, wherein determining a cover evaluation rule corresponding to the target multimedia resource based on the target resource type, and updating each obtained model evaluation result based on the cover evaluation rule, respectively, to obtain target evaluation results of each candidate image, comprises:
performing cover image evaluation based on the position information of the image subjects in each candidate image respectively to obtain subject evaluation results of each candidate image;
determining weights of the model evaluation result and the main body evaluation result in the cover evaluation rule based on the target resource type;
and obtaining target evaluation results of the candidate images based on the model evaluation results and the subject evaluation results of the candidate images and the obtained weights.
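A minimal sketch of the weighted combination in claim 2, assuming two example resource types and invented weight values; the claim itself does not fix the set of types or the weights.

RULE_WEIGHTS = {
    # resource type -> (weight of the model evaluation, weight of the subject evaluation)
    "film_clip": (0.7, 0.3),
    "news": (0.5, 0.5),
}

def combined_target_score(resource_type, model_score, subject_score):
    w_model, w_subject = RULE_WEIGHTS.get(resource_type, (0.5, 0.5))
    return w_model * model_score + w_subject * subject_score
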
3. The method of claim 2, wherein performing cover image evaluation based on the position information of the image subjects in the respective candidate images, respectively, to obtain subject evaluation results of the respective candidate images, comprises:
For each candidate image, the following steps are respectively executed:
for one candidate image, performing subject detection on the one candidate image to obtain a target area in which an image subject of the one candidate image is located;
and obtaining a subject evaluation result of the one candidate image according to the degree of overlap between the target area and a central area of the one candidate image.
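As a non-authoritative illustration of the subject evaluation in claim 3, the sketch below scores a candidate image by how much a detected subject box overlaps a central region of the frame; the box format (x1, y1, x2, y2), the half-size central region, and the overlap measure are assumptions, not details recited in the claim.

def center_region(width, height, ratio=0.5):
    dx, dy = width * ratio / 2.0, height * ratio / 2.0
    return (width / 2.0 - dx, height / 2.0 - dy, width / 2.0 + dx, height / 2.0 + dy)

def overlap_ratio(box, region):
    x1, y1 = max(box[0], region[0]), max(box[1], region[1])
    x2, y2 = min(box[2], region[2]), min(box[3], region[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    box_area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / box_area if box_area > 0 else 0.0

def subject_score(subject_box, width, height):
    # Higher when the detected subject sits closer to the centre of the frame.
    return overlap_ratio(subject_box, center_region(width, height))
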
4. The method of claim 1, wherein each model evaluation result comprises recommendation degrees of the respective candidate image with respect to respective resource types;
determining a cover evaluation rule corresponding to the target multimedia resource based on the target resource type, and respectively updating each obtained model evaluation result based on the cover evaluation rule to obtain target evaluation results of each candidate image, wherein the method comprises the following steps:
based on the target resource type, determining weights corresponding to the respective resource types in the cover evaluation rule;
and determining target evaluation results of the candidate images based on the recommendation degree of the candidate images relative to the resource types and the corresponding weights.
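Claim 4 can be read as a weighted sum over per-type recommendation degrees; the sketch below assumes the cover evaluation rule simply assigns one weight per resource type according to the target resource type, which is one possible reading rather than the claimed rule itself.

def per_type_target_score(per_type_degrees, type_weights):
    # per_type_degrees: {resource_type: recommendation degree} from the model
    # type_weights: {resource_type: weight} chosen according to the target type
    return sum(type_weights.get(t, 0.0) * degree
               for t, degree in per_type_degrees.items())

# e.g. a "sports" target type could emphasise the "sports" degree:
# per_type_target_score({"sports": 0.9, "news": 0.4}, {"sports": 0.8, "news": 0.2})
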
5. The method of claim 1, wherein the resource content comprises the respective original images and resource description text of the target multimedia resource;
inputting the resource content of the target multimedia resource in various media forms and each image description text into the trained classification evaluation model to respectively obtain the target resource type of the target multimedia resource and the model evaluation result of each candidate image comprises:
extracting corresponding original image features from each original image respectively;
extracting first text features from the resource description texts, and extracting corresponding second text features from the image description texts respectively;
performing text feature fusion on the obtained first text features and each second text feature to obtain fused text features, and performing image feature fusion on each obtained original image feature to obtain fused image features;
and predicting the target resource type and each model evaluation result based on the fused text features and the fused image features.
6. The method of claim 5, wherein the classification evaluation model comprises a text encoder and an image encoder, the text encoder and the image encoder sharing weight parameters;
the text feature fusion is performed on the obtained first text feature and each second text feature to obtain a fused text feature, and the image feature fusion is performed on each obtained original image feature to obtain a fused image feature, including:
performing, by the text encoder, text feature encoding on the obtained first text feature and each second text feature based on the weight parameters to obtain the fused text feature;
and performing, by the image encoder, image feature encoding on each obtained original image feature based on the weight parameters to obtain the fused image feature.
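A minimal PyTorch sketch of the fusion and prediction stages in claims 5-6, assuming the shared weight parameters take the form of one transformer encoder reused by both the text branch and the image branch; the dimensions, pooling, and prediction heads are illustrative choices, not taken from the claims.

import torch
import torch.nn as nn

class SharedFusionModel(nn.Module):
    def __init__(self, dim=256, num_resource_types=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        # One encoder instance: its parameters are shared by text and image encoding.
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.type_head = nn.Linear(dim, num_resource_types)   # target resource type
        self.score_head = nn.Linear(dim, 1)                    # per-candidate evaluation

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, num_texts, dim)  first + second text features
        # image_feats: (batch, num_images, dim) original image / candidate features
        fused_text = self.shared_encoder(text_feats).mean(dim=1)    # fused text feature
        fused_image = self.shared_encoder(image_feats).mean(dim=1)  # fused image feature
        joint = fused_text + fused_image
        type_logits = self.type_head(joint)
        candidate_scores = self.score_head(self.shared_encoder(image_feats)).squeeze(-1)
        return type_logits, candidate_scores
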
7. The method according to any one of claims 1 to 6, wherein extracting a plurality of candidate images from respective original images included in the target multimedia asset comprises:
extracting N primary selected images from the original images in an equidistant extraction manner;
determining, from the N primary selected images, M primary selected images meeting index conditions based on at least one basic image index of each of the N primary selected images, wherein M is smaller than N;
clustering the M primary selected images based on the image similarity among the M primary selected images to obtain L class clusters, wherein each class cluster comprises at least one primary selected image whose similarity meets a condition;
for the L class clusters, respectively selecting the primary selected image with the highest stationarity in each class cluster to obtain L primary selected images, wherein L is smaller than M;
and selecting the plurality of candidate images based on the basic image indexes of the L primary selected images.
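The candidate extraction pipeline of claim 7, sketched with NumPy only. The mean-brightness index, the histogram similarity, the greedy threshold clustering, and the gradient-variance proxy used in place of "stationarity" are all illustrative assumptions; the claim leaves the concrete basic image indexes, similarity measure, and clustering method open, and a final index-based selection over the per-cluster picks would follow this sketch.

import numpy as np

def equidistant_sample(frames, n):
    idx = np.linspace(0, len(frames) - 1, n).astype(int)
    return [frames[i] for i in idx]

def passes_basic_index(img, low=30, high=225):
    return low < img.mean() < high              # drop near-black / near-white frames

def similarity(a, b, bins=32):
    ha, _ = np.histogram(a, bins=bins, range=(0, 255))
    hb, _ = np.histogram(b, bins=bins, range=(0, 255))
    ha = ha / max(ha.sum(), 1)
    hb = hb / max(hb.sum(), 1)
    return 1.0 - 0.5 * np.abs(ha - hb).sum()    # 1.0 means identical histograms

def stationarity(img):
    return np.var(np.diff(img.astype(float), axis=1))   # crude sharpness proxy

def extract_candidates(frames, n=32, sim_threshold=0.9):
    primary = [f for f in equidistant_sample(frames, n) if passes_basic_index(f)]
    clusters = []                               # greedy threshold clustering
    for img in primary:
        for cluster in clusters:
            if similarity(img, cluster[0]) >= sim_threshold:
                cluster.append(img)
                break
        else:
            clusters.append([img])
    # keep the best-scoring frame of each cluster as a candidate image
    return [max(cluster, key=stationarity) for cluster in clusters]
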
8. The method according to any one of claims 1 to 6, wherein generating the corresponding image description text based on the image content of each candidate image, respectively, comprises:
for each candidate image, the following steps are respectively executed:
performing image encoding on one candidate image based on an image encoding part of an image description model to obtain an encoding feature corresponding to the one candidate image;
decoding the encoding feature based on a plurality of serialization decoding units included in the image description model to obtain a plurality of descriptive words, wherein each serialization decoding unit performs prediction based on the encoding feature and the descriptive word output by the previous serialization decoding unit to obtain a corresponding descriptive word;
and obtaining the image description text based on the obtained descriptive words.
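A sketch of the serialized decoding described in claim 8: each decoding step consumes the image encoding together with the word produced by the previous step. Here encode_image and decode_step are hypothetical callables standing in for the image encoding part and for one serialization decoding unit of the image description model.

def generate_caption(image, encode_image, decode_step, max_words=20,
                     start_token="<bos>", end_token="<eos>"):
    encoding = encode_image(image)                  # encoding feature of the candidate image
    words, previous = [], start_token
    for _ in range(max_words):
        previous = decode_step(encoding, previous)  # next descriptive word
        if previous == end_token:
            break
        words.append(previous)
    return " ".join(words)                          # the image description text
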
9. The method according to any one of claims 1 to 6, wherein the classification evaluation model is trained by:
acquiring a plurality of sample multimedia resources, and respectively acquiring corresponding resource classification labels aiming at each sample multimedia resource;
extracting a plurality of candidate images from the original images of each sample multimedia resource respectively, and acquiring a corresponding image evaluation grade for each candidate image, wherein each image evaluation grade represents a recommendation degree, determined based on prior knowledge, of the corresponding candidate image being recommended as the cover image;
and constructing a plurality of training samples based on the obtained sample multimedia resources, the resource classification labels and the image evaluation grades, and training the classification evaluation model based on the plurality of training samples.
10. The method of claim 9, wherein training the classification evaluation model based on the plurality of training samples comprises:
performing forward prediction on each sample multimedia resource by using the classification evaluation model to obtain a prediction type of each sample multimedia resource and a prediction evaluation result of each candidate image;
constructing a classification loss function based on the obtained differences between the respective prediction types and the corresponding resource classification labels, and constructing an evaluation loss function based on the obtained differences between the prediction evaluation results and the corresponding image evaluation levels;
and carrying out parameter adjustment on the classification evaluation model based on the classification loss function and the evaluation loss function.
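One way to realise the joint objective of claims 9-10 in PyTorch: a cross-entropy loss against the resource classification label plus an evaluation loss against the image evaluation grades. The mean-squared evaluation loss and the weighting factor alpha are assumptions; the claims only require losses built from the respective prediction/label differences.

import torch
import torch.nn.functional as F

def training_step(model, optimizer, sample, alpha=1.0):
    # sample: one training sample built from a sample multimedia resource.
    type_logits, predicted_scores = model(sample["text_feats"], sample["image_feats"])
    classification_loss = F.cross_entropy(type_logits, sample["type_label"])
    evaluation_loss = F.mse_loss(predicted_scores, sample["evaluation_grades"])
    loss = classification_loss + alpha * evaluation_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
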
11. A cover image extraction apparatus, characterized in that the apparatus comprises:
an image extraction unit for extracting a plurality of candidate images from each original image included in the target multimedia resource;
an image description unit for generating corresponding image description text based on the image content of each candidate image;
a classification evaluation unit for inputting the resource contents of the multiple media forms included in the target multimedia resource and the image description texts into the trained classification evaluation model to respectively obtain the target resource type of the target multimedia resource and the model evaluation results of the candidate images, wherein each model evaluation result represents a recommendation degree of the corresponding candidate image being recommended as the cover image;
an image selecting unit for determining a cover evaluation rule corresponding to the target multimedia resource based on the target resource type, and respectively updating the obtained model evaluation results based on the cover evaluation rule to obtain target evaluation results of the candidate images; and determining a cover image from the respective candidate images based on the target evaluation results of the respective candidate images.
12. The apparatus of claim 11, wherein the image selecting unit is specifically configured to:
performing cover image evaluation based on the position information of the image subjects in each candidate image respectively to obtain subject evaluation results of each candidate image;
determining weights of the model evaluation result and the main body evaluation result in the cover evaluation rule based on the target resource type;
and obtaining target evaluation results of the candidate images based on the model evaluation results and the subject evaluation results of the candidate images and the obtained weights.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that,
the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 10.
14. A computer storage medium having stored thereon computer program instructions, characterized in that,
wherein the computer program instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 10.
15. A computer program product comprising computer program instructions, characterized in that,
wherein the computer program instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 10.
CN202210337415.2A 2022-03-31 2022-03-31 Cover image extraction method, device, equipment and computer storage medium Pending CN116932788A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210337415.2A CN116932788A (en) 2022-03-31 2022-03-31 Cover image extraction method, device, equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210337415.2A CN116932788A (en) 2022-03-31 2022-03-31 Cover image extraction method, device, equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN116932788A true CN116932788A (en) 2023-10-24

Family

ID=88383087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210337415.2A Pending CN116932788A (en) 2022-03-31 2022-03-31 Cover image extraction method, device, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN116932788A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination