CN114625897A - Multimedia resource processing method and device, electronic equipment and storage medium - Google Patents

Multimedia resource processing method and device, electronic equipment and storage medium

Info

Publication number
CN114625897A
CN114625897A (application CN202210281719.1A)
Authority
CN
China
Prior art keywords
text
multimedia
information
sample
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210281719.1A
Other languages
Chinese (zh)
Inventor
康战辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210281719.1A
Publication of CN114625897A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a multimedia resource processing method, a multimedia resource processing device, electronic equipment and a storage medium. The method comprises the following steps: acquiring first text information to be searched and a plurality of multimedia resources; respectively carrying out text coding and image coding on second text information and image frames included in each multimedia resource to obtain second text characteristics and image characteristics of each multimedia resource; performing cross-modal feature fusion processing on the second text features and the image features of the multimedia resources to obtain multi-modal features corresponding to the multimedia resources; performing feature correlation processing on the multi-modal feature and the first text feature (text feature corresponding to the first text information), and determining correlation information of each of the plurality of multimedia resources and the first text information; and screening out the target multimedia resource matched with the first text information from the plurality of multimedia resources according to the correlation information. According to the technical scheme of the application, the search precision of the multimedia resources can be improved.

Description

Multimedia resource processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a multimedia resource processing method and apparatus, an electronic device, and a storage medium.
Background
Searching multimedia resources (for example, short-video search) works like text search for general news, information, and the like: a search term (query) is first input, and the search engine backend recalls a series of candidate documents containing the query term according to a pre-established text index (built over text fields such as the titles, tags, and introductions of the multimedia resources). The candidate documents are then ranked by computing text relevance scores between the query and the candidate document set. This search mode uses only the text relevance feature, that is, it considers only the text relevance between the search query and the title of the multimedia resource, so search precision is poor when the title text of a multimedia resource is short or differs greatly in semantics from the video content.
Disclosure of Invention
In view of the above technical problems, the present application provides a multimedia resource processing method, an apparatus, an electronic device, and a storage medium.
According to an aspect of the present application, there is provided a multimedia resource processing method, including:
acquiring first text information to be searched and a plurality of multimedia resources;
respectively carrying out text coding and image coding on second text information and image frames included in each multimedia resource to obtain second text characteristics and image characteristics of each multimedia resource;
performing cross-modal feature fusion processing on the second text features and the image features of the multimedia resources to obtain multi-modal features corresponding to the multimedia resources;
performing feature correlation processing on the multi-modal feature and a first text feature, and determining correlation information between each of the plurality of multimedia resources and the first text information, wherein the first text feature is a text feature corresponding to the first text information; the correlation information represents the matching degree of the content of each multimedia resource and the first text information;
and screening out target multimedia resources matched with the first text information from the plurality of multimedia resources according to the correlation information.
According to another aspect of the present application, there is provided a multimedia resource processing apparatus including:
the acquisition module is used for acquiring first text information to be searched and a plurality of multimedia resources; the plurality of multimedia resources are recalled multimedia resources corresponding to the first text information;
the encoding module is used for respectively carrying out text encoding and image encoding on second text information and image frames included in each multimedia resource to obtain second text characteristics and image characteristics of each multimedia resource;
the multi-modal processing module is used for performing cross-modal feature fusion processing on the second text features and the image features of the multimedia resources to obtain multi-modal features corresponding to the multimedia resources;
the relevance prediction module is used for performing feature relevance processing on the multi-modal feature and a first text feature and determining relevance information of each of the plurality of multimedia resources and the first text information, wherein the first text feature is a text feature corresponding to the first text information; the correlation information represents the matching degree of the content of each multimedia resource and the first text information;
and the target multimedia resource determining module is used for screening out the target multimedia resources matched with the first text information from the plurality of multimedia resources according to the correlation information.
According to another aspect of the present application, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the above method.
According to another aspect of the application, a non-transitory computer-readable storage medium is provided, having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the above-described method.
The method comprises the steps of performing text coding and image coding on second text information and image frames included in multimedia resources to obtain second text features and image features of the multimedia resources, performing cross-modal feature fusion processing on the second text features and the image features of the multimedia resources to obtain multi-modal features corresponding to the multimedia resources, performing feature correlation processing on the multi-modal features and the first text features, determining correlation information of the multimedia resources and the first text information, and determining a target multimedia resource matched with the first text information from the multimedia resources based on the correlation information. The search mode of the multimedia resource enables the search words to be matched with text features and visual features in the multimedia resource at the same time, and improves the search accuracy; and under the condition that texts such as titles, brief introduction and the like of multimedia resources are short or the difference between semantics and video contents is large, the searching precision is not influenced.
Other features and aspects of the present application will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the application and, together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic diagram illustrating an application system according to an embodiment of the present application.
Fig. 2 shows a flowchart of a multimedia resource processing method according to an embodiment of the present application.
Fig. 3 is a flowchart illustrating a method for performing text coding and image coding on second text information and an image frame included in each multimedia resource to obtain a second text feature and an image feature of each multimedia resource according to an embodiment of the present application.
Fig. 4 is a flowchart illustrating a method for extracting second text information corresponding to each multimedia resource from a plurality of image frames corresponding to each multimedia resource based on an optical character recognition technology according to an embodiment of the present application.
FIG. 5 illustrates a schematic diagram of a multi-modal relevance model provided in accordance with an embodiment of the present application.
Fig. 6 is a flowchart illustrating a method for training a relevance prediction model according to an embodiment of the present application.
Fig. 7 is a block diagram of a multimedia resource processing apparatus according to an embodiment of the present application.
Fig. 8 shows a block diagram of an electronic device for multimedia asset processing according to an embodiment of the present application.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present application.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application system according to an embodiment of the present application. The application system can be used for the multimedia resource processing method. As shown in fig. 1, the application system may include at least a server 01 and a terminal 02.
In this embodiment of the application, the server 01 may be used for multimedia resource processing, for example, multimedia resource search processing. The server 01 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
In this embodiment, the terminal 02 may receive and display the target multimedia resource. The terminal 02 may include a smart phone, a desktop computer, a tablet computer, a notebook computer, a smart speaker, a digital assistant, an Augmented Reality (AR)/Virtual Reality (VR) device, a smart wearable device, and other types of physical devices. The physical device may also include software running in it, such as an application program. The operating system running on terminal 02 in this embodiment of the present application may include, but is not limited to, Android, iOS, Linux, Windows, and the like.
In the embodiment of the present disclosure, the terminal 02 and the server 01 may be directly or indirectly connected by a wired or wireless communication method, and the present disclosure is not limited thereto.
In a specific embodiment, when the server 01 is a distributed system, the distributed system may be a blockchain system. In that case, the distributed system may be formed by a plurality of nodes (computing devices in any form in an access network, such as servers and user terminals) that form a Peer-to-Peer (P2P) network among themselves; the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). In a distributed system, any machine, such as a server or a terminal, can join to become a node, which includes a hardware layer, an intermediate layer, an operating system layer, and an application layer. Specifically, the functions of each node in the blockchain system may include:
1) routing, a basic function of a node, used to support communication between nodes.
Besides the routing function, the node may also have the following functions:
2) the application, which is deployed in the blockchain to implement specific services according to actual service requirements; it records data related to those functions to form record data, carries a digital signature in the record data to indicate the source of the task data, and sends the record data to other nodes in the blockchain system, so that the other nodes add the record data to a temporary block once the source and integrity of the record data are verified successfully.
It should be noted that, in specific implementations of the present application, where data related to user information is involved, user permission or consent must be obtained when the following embodiments are applied to specific products or technologies, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Fig. 2 shows a flowchart of a multimedia resource processing method according to an embodiment of the present application. As shown in fig. 2, the method may include:
s201, first text information to be searched and a plurality of multimedia resources are obtained.
In this embodiment of the present specification, the first text information may be a search term input on the terminal side; accordingly, the search engine may acquire the first text information (the search term) to be searched in response to a search request, and then search multimedia resources based on it. In an example, the plurality of multimedia resources may be recalled multimedia resources corresponding to the first text information; that is, a preliminary screening of multimedia resources may be performed first. For example, the text description information of each of a plurality of multimedia resources to be matched may be obtained, and the recalled multimedia resources corresponding to the first text information may be determined from the multimedia resources to be matched based on that text description information, i.e., a preliminary screening based on text matching. Specifically, the multimedia resources to be matched whose text description information matches the first text information may be taken as the plurality of multimedia resources.
The text description information of the multimedia resource may refer to description information set for the multimedia resource in advance, and may include, for example, title information, tag information, profile information, and the like, which is not limited in this disclosure. The plurality of multimedia resources to be matched may be all multimedia resources in the multimedia resource platform, or may be a set of multimedia resources for searching, which is not limited in this disclosure. Through preliminary screening, the data volume of subsequent processing can be reduced, and therefore efficiency can be improved.
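As an illustration, the following Python sketch shows one way the preliminary text-matching recall could look. The resource fields (title, tags, profile) follow the description above, while the tokenize() helper and the list scan are simplifying assumptions; a production search engine would use a pre-built inverted index.

```python
def tokenize(text):
    # naive whitespace tokenizer; a real system would use a CJK-aware segmenter
    return set(text.lower().split())

def recall_candidates(query, resources):
    """Return the resources whose text description shares at least one term
    with the query (the preliminary screening based on text matching)."""
    query_terms = tokenize(query)
    recalled = []
    for res in resources:
        # text description information set in advance: title, tags, profile
        description = " ".join((res["title"], res["tags"], res["profile"]))
        if query_terms & tokenize(description):
            recalled.append(res)
    return recalled
```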
S203, respectively carrying out text coding and image coding on second text information and image frames included in each multimedia resource to obtain second text characteristics and image characteristics of each multimedia resource;
and S205, performing cross-modal feature fusion processing on the second text features and the image features of the multimedia resources to obtain multi-modal features corresponding to the multimedia resources.
In this embodiment of the present specification, text information may be extracted from the image frames included in each multimedia resource; for example, text in the image frames (such as subtitles, bullet-screen comments, and comments) is recognized based on Optical Character Recognition (OCR) technology, and the recognized second text information is extracted. Text coding and image coding can then be performed on the extracted second text information and the image frames, that is, extraction processing of text features and image features (visual features), to obtain the second text features and image features of the multimedia resources. Cross-modal fusion processing (cross-modal cross processing) is performed on the second text features and the image features of the multimedia resources to obtain the multi-modal features, i.e., the multimedia features, corresponding to each multimedia resource. The multi-modal features may be 512-dimensional vectors, which is not limited by this disclosure.
The second text feature and the image feature may be extracted based on corresponding machine learning models. For example, the second text information in the multimedia resource may be processed by a text feature extraction model obtained by fine-tuning a text pre-training model, where the text pre-training model may be a Natural Language Processing (NLP) model, for example a BERT (Bidirectional Encoder Representations from Transformers) model, an SBERT model (Sentence-BERT, a twin-network-based BERT model), a VisualBERT model, an ERNIE (Enhanced Representation through Knowledge Integration) model, and the like. The image features may be extracted from the multimedia resources based on an image feature extraction model, which may be obtained by fine-tuning an image pre-training model; the image pre-training model may include a visual feature prediction model that connects text and images (e.g., the CLIP model, Contrastive Language-Image Pre-training). The CLIP model may be obtained by fine-tuning an existing CLIP pre-training model, for example, by training the CLIP pre-training model on existing massive video cover data.
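For illustration, a minimal sketch of the two encoders using the Hugging Face transformers library is given below. The specific checkpoints, the use of the [CLS] vector as the text feature, and the mean-pooling over frames are all assumptions; the patent only requires a fine-tuned BERT-style text encoder and a CLIP-style image encoder.

```python
import torch
from transformers import AutoTokenizer, AutoModel, CLIPProcessor, CLIPModel

# hypothetical checkpoints; the patent does not name specific ones
text_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
text_encoder = AutoModel.from_pretrained("bert-base-chinese")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_encoder = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

def encode_ocr_text(ocr_text):
    """Second text feature: [CLS] vector of the OCR text."""
    inputs = text_tokenizer(ocr_text, return_tensors="pt",
                            truncation=True, max_length=512)
    with torch.no_grad():
        outputs = text_encoder(**inputs)
    return outputs.last_hidden_state[:, 0]   # (1, hidden_size)

def encode_frames(pil_frames):
    """Image feature: CLIP image embeddings pooled over the sampled frames."""
    inputs = clip_processor(images=pil_frames, return_tensors="pt")
    with torch.no_grad():
        frame_features = clip_encoder.get_image_features(**inputs)  # (n, 512)
    return frame_features.mean(dim=0, keepdim=True)  # (1, 512)
```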
S207, performing feature correlation processing on the multi-modal feature and the first text feature, and determining correlation information of each of the plurality of multimedia resources and the first text information, wherein the first text feature is a text feature corresponding to the first text information; the correlation information represents the matching degree of the content of each multimedia resource and the first text information.
In one possible implementation, the distance between the multi-modal feature of each multimedia resource and the first text feature may be calculated, and this distance may be taken as the relevance information between each multimedia resource and the first text information. The distance may be a number between 0 and 1, where a distance of 0 may characterize strong correlation and a distance of 1 may characterize no correlation. The present disclosure is not limited thereto.
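The following sketch shows one way to compute such a distance. Mapping cosine similarity into [0, 1] is an assumption made for illustration; the patent only requires a distance where 0 means strongly related and 1 means unrelated.

```python
import torch.nn.functional as F

def relevance_distance(first_text_feature, multimodal_feature):
    """Distance in [0, 1]: 0 ~ strongly related, 1 ~ unrelated."""
    cos = F.cosine_similarity(first_text_feature, multimodal_feature, dim=-1)
    return (1.0 - cos) / 2.0   # cosine in [-1, 1] mapped to [0, 1]
```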
In another possible implementation manner, the step S207 may include: inputting the first text feature, the first multi-modal feature and the second multi-modal feature into a relevance prediction model, and performing relevance processing on the first text feature and the multi-modal feature to obtain relevance information; the correlation information here may include correlation information of the first multimedia asset with the first text information, and correlation information of the second multimedia asset with the first text information. The relevance information may be a probability value between 0 and 1, and the higher the probability value is, the stronger the relevance can be represented. Alternatively, the relevance information herein may refer to ranking information of any two multimedia resources, i.e. ranking information of the first multimedia resource and the second multimedia resource, for example, the higher the probability value, the higher the ranking.
The first multi-modal feature can be a multi-modal feature corresponding to the first multimedia resource, and the second multi-modal feature can be a multi-modal feature corresponding to the second multimedia resource; the first multimedia asset and the second multimedia asset may be any two multimedia assets of a plurality of multimedia assets. The relevance prediction model may be a ranking model, such as a pairwise model, and the disclosure is not limited thereto.
Optionally, the first text information may be input into the first text processing model to perform text feature extraction processing, so as to obtain the first text feature. The first text processing model may be an NLP model; on this basis, the first text processing model may be obtained by fine-tuning any one of a BERT model, an SBERT model, a VisualBERT model, and an ERNIE model, which is not limited in this disclosure. Extracting text features through a neural network can improve the efficiency and accuracy of text extraction.
S209, according to the correlation information, selecting the target multimedia resource matched with the first text information from the plurality of multimedia resources.
In this embodiment of the present specification, the correlation information between each multimedia resource and the first text information has already been obtained in S207; on this basis, the multiple multimedia resources may be sorted according to the correlation information, and the target multimedia resources may be screened out based on the sorting result. The target multimedia resources may be a portion of the plurality of multimedia resources. As an example, when the sorting result is a rank number and the correlation information is negatively correlated with the rank number (that is, the higher the correlation information, the smaller the rank number and the earlier the position in the ranking), a preset number of multimedia resources may be taken in order of rank number from small to large as the target multimedia resources, so as to obtain the target multimedia resources most relevant to the first text information.
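A minimal sketch of this screening step follows; the top_k cutoff is an illustrative stand-in for the "preset number" above.

```python
def select_target_resources(resources, relevance_scores, top_k=10):
    """Sort candidates so that higher relevance gets a smaller rank number,
    then keep the first top_k as the target multimedia resources."""
    ranked = sorted(zip(resources, relevance_scores),
                    key=lambda pair: pair[1], reverse=True)
    return [res for res, _ in ranked[:top_k]]
```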
Optionally, corresponding correction may also be performed for the multimedia resources other than the target multimedia resources among the plurality of multimedia resources. For example, multimedia resources to be corrected may be obtained, i.e., the multimedia resources other than the target multimedia resources among the plurality of multimedia resources; the target text description information of each multimedia resource to be corrected is determined based on the second text information corresponding to each multimedia resource to be corrected (which can be obtained through the following step S303 and is not described herein again); for example, keywords may be extracted from the second text information and used as the target text description information. Further, the text description information of the multimedia resource to be corrected can be corrected according to the target text description information. For example, the original text description information of the multimedia resource can be replaced by the target text description information, or the intersection of the target text description information and the original text description information can be used as the text description information of the multimedia resource. In this way the text description information of the multimedia resource can be updated, the recall accuracy of text description information such as the title and introduction of the multimedia resource can be improved, the number of recalls can be reduced, processing pressure can be relieved, and processing resources can be saved.
The method comprises the steps of performing text coding and image coding on second text information and image frames included in multimedia resources to obtain second text features and image features of the multimedia resources, performing cross-modal feature fusion processing on the second text features and the image features of the multimedia resources to obtain multi-modal features corresponding to the multimedia resources, performing feature correlation processing on the multi-modal features and the first text features, determining correlation information of the multimedia resources and the first text information, and determining a target multimedia resource matched with the first text information from the multimedia resources based on the correlation information. The search mode of the multimedia resources enables the search words to be matched with the text features and the visual features in the multimedia resources at the same time, and the search accuracy is improved; and under the condition that the texts such as the titles, the brief descriptions and the like of the multimedia resources are short or the difference between the semantics and the video content is large, the searching precision is not influenced.
Fig. 3 is a flowchart illustrating a method for performing text coding and image coding on second text information and an image frame included in each multimedia resource to obtain a second text feature and an image feature of each multimedia resource according to an embodiment of the present application. In one possible implementation manner, as shown in fig. 3, the step S203 may include:
s301, a plurality of image frames corresponding to each multimedia resource are extracted from each multimedia resource.
In practical application, all image frames may be extracted from each multimedia resource as the plurality of image frames corresponding to that resource. Alternatively, considering the processing pressure on machine learning models and the like in a real scene, a plurality of image frames may be extracted at random intervals or at a fixed sampling interval.
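A minimal sketch of fixed-interval frame sampling with OpenCV is shown below; the interval of 30 frames is an assumption chosen only to illustrate limiting the downstream model load.

```python
import cv2

def sample_frames(video_path, every_n_frames=30):
    """Extract one frame every `every_n_frames` frames from the video."""
    capture = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            # OpenCV decodes to BGR; convert to RGB for downstream models
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        index += 1
    capture.release()
    return frames
```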
And S303, extracting second text information corresponding to each multimedia resource from a plurality of image frames corresponding to each multimedia resource based on an optical character recognition technology.
In this embodiment of the present specification, text recognition may be performed on subtitles, comments, barrages, and the like in a plurality of image frames corresponding to each multimedia resource based on an optical character recognition technology, so that second text information corresponding to each multimedia resource may be obtained, that is, a plurality of second text information may be obtained.
Optionally, considering that the OCR-recognized text may contain noise, it may be preprocessed, for example by removing duplicates and filtering out text of preset language categories. The preset categories may refer to non-Chinese and non-English text. As an example, as shown in fig. 4, S303 may include the following steps:
s401, perform region division processing on each image frame to obtain a plurality of region images corresponding to each image frame.
In this embodiment of the present description, based on a preset region type, region division processing may be performed on each image frame to obtain a plurality of region images corresponding to each image frame.
For example, when the multimedia resource is a short video, the region types may include: a top region, a middle region, and a bottom region. The top region may be the text introduction area of the streamer, the middle region may be the content text area of the short video, and the bottom region may be the comment and bullet-screen area. Thus each image frame can correspond to 3 region images: a top region image, a middle region image, and a bottom region image.
Alternatively, in the case where the multimedia asset is a long video, such as a movie, the region types may include: a non-bottom region and a bottom region. The bottom area may be a subtitle area, and the non-bottom area may be a bullet screen area. Thus each image frame can correspond to 2 area images: a non-bottom area image and a bottom area image.
It should be noted that the present disclosure does not limit the region division; regions may be divided according to the actual application scene, as long as text can be processed region by region, for example, text in the subtitle area is processed together and text in the non-subtitle area is processed together.
And S403, performing text extraction processing on the plurality of area images based on the optical character recognition technology to obtain area texts of the plurality of area images.
In this embodiment of the present specification, the text in each area image may be subjected to recognition processing based on an optical character recognition technology, so that the area text of each area image may be extracted. The region text may refer to text in one region image.
Optionally, the plurality of region images corresponding to each image frame may be filtered; for example, region images whose text proportion is smaller than a preset proportion threshold may be filtered out. Specifically, the ratio of the text area to the region image area may be determined for each region image, and the region images whose area ratio is greater than or equal to a preset area-ratio threshold among the plurality of region images may be used as the target region images corresponding to each image frame. Text extraction processing can then be performed on the target region images based on the optical character recognition technology to obtain the region text of each target region image. The text area may refer to the area occupied by text in a region image, and the region image area may refer to the area of the region image itself.
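For illustration, the sketch below divides a frame into the three regions described for short videos, estimates the text-area ratio from OCR word boxes, and extracts text only from regions above the threshold. pytesseract is one possible OCR backend, and the split proportions, threshold, and language setting are all assumptions.

```python
import pytesseract  # one possible OCR backend; any OCR engine would do

def split_regions(frame):
    """Top / middle / bottom regions of a short-video frame (example split)."""
    h = frame.shape[0]
    return {
        "top": frame[: h // 4],
        "middle": frame[h // 4 : 3 * h // 4],
        "bottom": frame[3 * h // 4 :],
    }

def ocr_regions(frame, min_text_area_ratio=0.01):
    """OCR each region, skipping regions whose text-area ratio is too small."""
    texts = {}
    for name, region in split_regions(frame).items():
        data = pytesseract.image_to_data(
            region, lang="chi_sim+eng",  # assumes these language packs exist
            output_type=pytesseract.Output.DICT)
        text_area = sum(w * h for w, h, c in
                        zip(data["width"], data["height"], data["conf"])
                        if float(c) > 0)  # count recognized word boxes only
        region_area = region.shape[0] * region.shape[1]
        if text_area / region_area >= min_text_area_ratio:
            texts[name] = pytesseract.image_to_string(region, lang="chi_sim+eng")
    return texts
```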
S405, splicing and preprocessing the region texts in the same region to obtain third text information corresponding to various regions.
In this embodiment of the present specification, text information may be counted in a plurality of image frames of one multimedia resource in a partitioned manner, that is, region texts are clustered based on region types, where a homogeneous region may refer to a region located at the same position in the plurality of image frames, that is, a region with the same region type. For example, the bottom area of each of the plurality of image frames may be regarded as a homogeneous area.
Taking the bottom region as an example: suppose multimedia resource H includes 20 image frames. The region texts of the bottom region images of the 20 image frames, that is, the region texts of 20 bottom region images, can be spliced, and preprocessing such as deduplication and filtering of preset language categories can be performed to obtain the third text information of multimedia resource H for the bottom-region type, that is, the aggregated text information of the bottom region.
And S407, splicing the third text information corresponding to the various regions to obtain second text information corresponding to each multimedia resource.
Further, the third text information corresponding to various regions can be spliced to obtain the second text information corresponding to each multimedia resource. That is, after the text is processed in the regions, the third text information corresponding to each type of region may be spliced, so as to obtain the second text information corresponding to the multimedia resource, i.e., the complete text information of each multimedia resource.
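The following sketch illustrates S405 and S407 together: region texts are aggregated per region type across frames, deduplicated, filtered by language, and then spliced. The regular-expression language filter is an illustrative stand-in for the "preset language category" preprocessing.

```python
import re

def build_second_text(per_frame_region_texts):
    """per_frame_region_texts: one {region_type: text} dict per image frame.
    Returns the spliced second text information of the multimedia resource."""
    merged = {}
    for frame_texts in per_frame_region_texts:
        for region, text in frame_texts.items():
            lines = merged.setdefault(region, [])
            for line in text.splitlines():
                line = line.strip()
                # keep lines containing at least one CJK or Latin character
                if line and re.search(r"[\u4e00-\u9fffA-Za-z]", line):
                    if line not in lines:        # deduplicate within a region
                        lines.append(line)
    # splice the third text information of each region type (S405), then
    # splice across region types into the second text information (S407)
    return " ".join(" ".join(lines) for lines in merged.values())
```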
S305, inputting the second text information into a second text processing model for text feature extraction processing, and obtaining second text features corresponding to each multimedia resource.
In this embodiment of the present specification, the second text information may be input into the second text processing model to obtain the second text features corresponding to the respective multimedia resources. The second text processing model may be obtained by fine-tuning any one of a BERT model, an SBERT model, a VisualBERT model, and an ERNIE (Enhanced Representation through Knowledge Integration) model, which is not limited in this disclosure.
And S307, inputting a plurality of image frames corresponding to each multimedia resource into a visual feature prediction model connecting the text and the image to perform image feature extraction processing, so as to obtain the image features corresponding to each multimedia resource.
As an example, the visual feature prediction model may be a CLIP model, and based on this, a plurality of image frames corresponding to each multimedia resource may be input into the CLIP model, and the visual feature extraction processing may be performed to obtain image features corresponding to each multimedia resource.
Accordingly, the step S205 may include: and inputting the second text characteristic and the image characteristic into a multi-modal characteristic fusion model, and performing multi-modal characteristic fusion processing to obtain multi-modal characteristics corresponding to each multimedia resource.
The image features may be 512-dimensional vectors, and the multi-modal features may also be 512-dimensional vectors. As one example, the multi-modal feature fusion model may be a Transformer model, such as a 3-layer Transformer. The present disclosure is not limited thereto.
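A minimal PyTorch sketch of such a fusion model follows. Treating the OCR embedding and the CLIP embedding as a two-token sequence and mean-pooling the encoder output are assumptions; the patent specifies only a Transformer-style fusion (e.g., 3 layers) producing a 512-dimensional multi-modal feature.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Fuse the second text feature (OCR embedding) and the image feature
    (CLIP embedding) into one 512-dim multi-modal feature."""
    def __init__(self, dim=512, num_layers=3, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, ocr_embedding, clip_embedding):
        # (batch, 2, dim): one token per modality
        tokens = torch.stack([ocr_embedding, clip_embedding], dim=1)
        fused = self.encoder(tokens)
        return fused.mean(dim=1)   # (batch, dim) multi-modal feature
```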
Fig. 5 illustrates a schematic diagram of a multi-modal correlation model provided according to an embodiment of the present application. As shown in fig. 5, the multi-modal correlation model may include a first text processing model (a first BERT model), a second text processing model (a second BERT model), a visual feature prediction model (CLIP model), and a multi-modal feature fusion model (a Transformer model, which may be a 3-layer Transformer). The multi-modal correlation model may be a two-tower model: the first tower (left tower) may be the first BERT model, and the second tower (right tower) may be the second BERT model, the CLIP model, and the Transformer model. Based on fig. 5, a multimedia resource processing procedure of the present application is introduced, which specifically includes:
the first text information (query) can be input into a first BERT model for text coding (text feature extraction processing), and the vector representation of the first text information is output: a 512-dimensional feature vector (512-dimensional query vector). For each multimedia resource, a plurality of image frames in the multimedia resource can be extracted, OCR texts, i.e. second text information, can be recognized from the plurality of image frames based on OCR technology, and the second text information can be input into a second BERT model for text coding, so as to obtain a vector characterization of the second text information: a second text feature, OCR embedding; on the right side of the right tower, a plurality of image frames may be directly input to the CLIP model, subjected to a visual encoding process (image encoding process), and output an image feature, i.e., CLIP embedding. And can input a transform model by OCR embedding and CLIP embedding, and output a vector representation of a text-media cross modality: a 512-dimensional multimedia vector, which may correspond to a text-video pair of OCR text of an image frame and an image frame in a multimedia asset.
Further, the relevance of the first text information to each multimedia resource can be predicted through a pairwise model. The pairwise model does not focus on accurately predicting the correlation between the text and each individual multimedia resource; rather, it mainly concerns the relative order between two multimedia resources, i.e., it is closer to the concept of ordering. Based on this, the input of the pairwise model may be the first text feature together with the multi-modal features of any two multimedia resources, i.e., an input triple. For example, take any two multimedia resources among the plurality: multimedia M1 and multimedia M2; the multi-modal feature of multimedia M1 is denoted V1, that of multimedia M2 is denoted V2, and the first text feature is denoted Q. The triple corresponding to the first text feature, multimedia M1, and multimedia M2 can then be represented as (Q, V1, V2). Therefore, (Q, V1, V2) can be input into the pairwise model to obtain the ranking information of multimedia M1 and multimedia M2, and this ranking information can be used as the relevance information. The ranking information may characterize the relevance (degree of match) of a multimedia resource to the first text information.
In this way, the triplet is input into the pairwise model in an iterative manner, and when the sequence of the multiple multimedia resources can be determined, the iteration can be ended, so that the overall ranking information of the multiple multimedia resources, namely the target ranking information, can be obtained, and the target multimedia resources matched with the first text information can be screened out from the multiple multimedia resources based on the target ranking information.
It should be noted that the ranking information includes first relevance information between multimedia M1 and the first text information and second relevance information between multimedia M2 and the first text information, where the first relevance information and the second relevance information may each be a numerical value (score) between 0 and 1 that is positively correlated with relevance: the higher the value, the higher the relevance between the multimedia resource and the first text information. The present disclosure is not limited thereto. Through the CLIP-based multi-modal relevance model, the content of the video frames can effectively assist in judging the accuracy of the relevance between the query and the video, and the efficiency of video search can be improved.
Optionally, the relevance information obtained by inputting (Q, V1, V2) into the pairwise model may be further input into a classification model, such as a softmax model, to obtain classification information; that is, the softmax model may be connected after the pairwise model. The classification information may characterize the ordering of M1 and M2. The classification information may be binary, including 1 and 0, where 1 may indicate that M1 ranks higher than M2, i.e., the two multimedia resources corresponding to (Q, V1, V2) are in positive order, and 0 may indicate that M1 ranks lower than M2, i.e., the two multimedia resources corresponding to (Q, V1, V2) are in reverse order. Based on this, if the classification information corresponding to (Q, V1, V2) is 1, it may be determined that M1 ranks before M2, i.e., the correlation of M1 with the query is higher than that of M2; if the classification information corresponding to (Q, V1, V2) is 0, it may be determined that M1 ranks after M2, i.e., the correlation of M1 with the query is lower than that of M2.
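A minimal sketch of such a pairwise ranker with a softmax head is shown below. The shared MLP scorer is an assumption; the patent requires only a pairwise model over the triple (Q, V1, V2) followed by a softmax classification of the order.

```python
import torch
import torch.nn as nn

class PairwiseRanker(nn.Module):
    """Given (Q, V1, V2), output the probability that M1 ranks before M2."""
    def __init__(self, dim=512):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, q, v1, v2):
        s1 = self.scorer(torch.cat([q, v1], dim=-1))  # relevance score of M1
        s2 = self.scorer(torch.cat([q, v2], dim=-1))  # relevance score of M2
        # softmax over the two scores: index 0 ~ positive order (M1 first)
        return torch.softmax(torch.cat([s1, s2], dim=-1), dim=-1)
```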
By fusing the text features and image features of the multimedia resource with the Transformer model and then correlating them with the text feature of the query, the participation of the image features in relevance prediction can be ensured, making the relevance information more accurate. Optionally, the 512-dimensional query vector, the OCR embedding, and the CLIP embedding may be directly input into the pairwise model without using the Transformer model; the relevance prediction may then favor text relevance, but the neural network architecture is simpler, and the accuracy of the relevance information can still be ensured. In the case where the 512-dimensional query vector, the OCR embedding, and the CLIP embedding are directly input into the pairwise model, each multimedia resource can be represented by its OCR embedding and CLIP embedding.
Fig. 6 is a flowchart illustrating a method for training a relevance prediction model according to an embodiment of the present application. As shown in fig. 6, may include:
S601, obtaining a plurality of training samples and sample label information of each training sample, wherein each training sample comprises sample search text information and a corresponding sample multimedia resource pair. The sample label information can represent the relevance of each of the two sample multimedia resources in the sample multimedia resource pair to the sample search text information. The sample label information may be represented by sample type information, such as positive sample and negative sample; in one example, the sample label information of a sample multimedia resource pair may be the ordering information of the two sample multimedia resources in the pair. For example, a positive sample may mean that the ordering of the two sample multimedia resources in the pair is in positive order, and a negative sample may mean that the ordering is in reverse order. Accordingly, the sample label information here corresponds to the classification information in the online application and may be represented by 1 and 0, for example, 1 representing a positive sample and 0 a negative sample.
As an example, five relevance levels may be preset: level 1 to level 5, in increasing order, where level 1 is the lowest and level 5 the highest. Taking video as the multimedia resource, for example, levels 1 to 5 may be as follows:
level 5: the text semantics (namely the text description information of the video, the same as the text description information above) and the video content completely match the query, fully satisfying the main search requirement;
level 4: the text semantics and the video content are highly matched with the query, highly satisfying the user's search requirement;
level 3: the text semantics are partially related to the query and the video content is partially related to the query, satisfying part or a small part of the search requirement, without semantic drift;
level 2: the text semantics are partially related to the query (the search requirement falls in the rough field), but the semantics of the video content drift and do not match the search intent;
level 1: neither the text semantics nor the video content is relevant to the query.
In practical applications, a plurality of training samples can be obtained from historical search records based on the divided levels, with the acquisition split into positive and negative samples. For a positive sample, one query (denoted Q1) may be obtained, and two sample videos under the search Q1 may be drawn from the above levels (such as level 5 and level 3): M1 (from level 5) and M2 (from level 3); that is, the correlation of M1 with Q1 is higher than that of M2 with Q1. These two sample videos can be regarded as a sample multimedia resource (video) pair corresponding to Q1. Based on this, M1 from level 5 may be ranked before M2 from level 3, resulting in a positive training sample: <Q1, M1, M2>; the corresponding sample type information is a positive sample, which may be represented by 1, and the ordering of the two sample multimedia resources in the positive sample is in positive order, i.e., the one more relevant to Q1 is ranked first.
For a negative sample, one query (e.g., Q1) may likewise be obtained, and two sample videos under the search Q1 may be drawn from the above levels (e.g., level 3 and level 1): M3 (from level 3) and M4 (from level 1); that is, the correlation of M3 with Q1 is higher than that of M4 with Q1. These two sample videos can be regarded as a sample multimedia resource (video) pair corresponding to Q1. Based on this, M3 from level 3 may be ranked after M4 from level 1, resulting in a negative training sample: <Q1, M4, M3>; the corresponding sample type information is a negative sample, which may be represented by 0, and the ordering of the two sample multimedia resources in the negative sample is in reverse order, i.e., the one more relevant to Q1 is ranked last.
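The following sketch shows how such training samples could be assembled from graded judgments; the dict-of-levels input and the random sampling are illustrative assumptions, with the level choices mirroring the examples above.

```python
import random

def build_training_samples(query, videos_by_level):
    """videos_by_level: {level (1-5): [video, ...]} from historical searches.
    Returns one positive and one negative pairwise training sample."""
    m1 = random.choice(videos_by_level[5])   # highly relevant to the query
    m2 = random.choice(videos_by_level[3])   # partially relevant
    m3 = random.choice(videos_by_level[3])   # partially relevant
    m4 = random.choice(videos_by_level[1])   # irrelevant
    positive = {"sample": (query, m1, m2), "label": 1}  # positive order
    negative = {"sample": (query, m4, m3), "label": 0}  # reverse order
    return positive, negative
```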
S603, respectively carrying out text coding and image coding on sample text information and sample images included in each sample multimedia resource in the sample multimedia resource pair to obtain sample text characteristics and sample image characteristics;
S605, performing cross-modal feature fusion processing on the sample text features and the sample image features of the sample multimedia resources to obtain the sample multi-modal features corresponding to the sample multimedia resources;
S607, inputting the sample search text features, the first sample multi-modal features and the second sample multi-modal features into an initial correlation model, and performing correlation processing on the sample search text information and the sample multi-modal features to obtain correlation prediction information between each sample multimedia resource and the sample search text information; the sample search text features are the text features corresponding to the sample search text information, the first sample multi-modal features are the sample multi-modal features corresponding to the first sample multimedia resource, and the second sample multi-modal features are the sample multi-modal features corresponding to the second sample multimedia resource; the first sample multimedia resource and the second sample multimedia resource are the two sample multimedia resources in any sample multimedia resource pair.
The implementation manners of S603 to S607 may refer to the corresponding contents of S203 to S207, which are not described herein again.
And S609, obtaining the contrastive loss information according to the correlation prediction information of each sample multimedia resource pair and the sample type information of each sample multimedia resource pair.
In an embodiment of the present specification, a contrastive loss function may be selected to determine the contrastive loss information (contrastive loss). The correlation prediction information corresponding to one training sample (sample multimedia resource pair) may include first correlation prediction information corresponding to the first sample multimedia resource and second correlation prediction information corresponding to the second sample multimedia resource. The difference or distance between the first and second correlation prediction information can thus be determined and denoted D_ij, where i and j refer to the two sample multimedia resources in any training sample. Based on this, the D_ij of the plurality of multimedia resource pairs can be accumulated by the following formula (1) to determine the contrastive loss information:
L = Σ_(i,j) [ y_ij · D_ij² + (1 − y_ij) · max(m − D_ij, 0)² ]        (1)
where y_ij = 1 indicates that the ith sample multimedia resource and the jth sample multimedia resource form a positive sample, and y_ij = 0 indicates that they form a negative sample. m may be a preset threshold: since the ordering of the two sample multimedia resources in a negative training sample is undesired, their distance is expected to be large. Based on this, when the distance between the two sample multimedia resources in a negative training sample is greater than or equal to m, the prediction of the pairwise model can be considered sufficiently accurate and training can end; when the distance is smaller than m, the prediction is not accurate enough and training needs to continue. That is, m is used to determine the end condition of the iterative training. The loss calculation formula (1) is therefore determined by jointly considering iterative training on both positive and negative samples.
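A minimal PyTorch sketch of formula (1) follows; the margin value and the mean reduction are assumptions, since the patent leaves m as a preset threshold.

```python
import torch

def contrastive_loss(d, y, margin=1.0):
    """d: tensor of distances D_ij; y: tensor of labels y_ij (1 = positive
    sample pair, 0 = negative sample pair), per formula (1)."""
    positive_term = y * d.pow(2)                                     # pull together
    negative_term = (1 - y) * torch.clamp(margin - d, min=0).pow(2)  # push apart
    return (positive_term + negative_term).mean()
```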
S611, training the initial correlation model based on the contrastive loss information to obtain the correlation prediction model.
In this embodiment of the present description, the initial correlation model may be trained based on the contrastive loss information; for example, gradients may be back-propagated based on the contrastive loss information so as to update the model parameters of the initial correlation model, and through iterative training until the contrastive loss information satisfies a preset condition, the initial correlation model at that point may be used as the correlation prediction model. The preset condition may be that the contrastive loss information is smaller than a loss threshold, or that the number of iterations reaches a preset number, and the like, which is not limited in the present disclosure.
Fig. 7 is a block diagram of a multimedia resource processing apparatus according to an embodiment of the present application.
As shown in fig. 7, the apparatus may include:
an obtaining module 701, configured to obtain first text information to be searched and a plurality of multimedia resources;
the encoding module 703 is configured to perform text encoding and image encoding on the second text information and the image frame included in each multimedia resource, respectively, to obtain a second text feature and an image feature of each multimedia resource;
the multi-modal processing module 705 is configured to perform cross-modal feature fusion processing on the second text feature and the image feature of each multimedia resource to obtain a multi-modal feature corresponding to each multimedia resource;
a relevance prediction module 707, configured to perform feature relevance processing on the multi-modal feature and a first text feature, and determine relevance information of each of the multiple multimedia resources and the first text information, where the first text feature is a text feature corresponding to the first text information; the correlation information represents the matching degree of the content of each multimedia resource and the first text information;
a target multimedia resource determining module 709, configured to screen a target multimedia resource matching the first text message from the multiple multimedia resources according to the correlation information.
In a possible implementation manner, the encoding module 703 may include:
the image frame extraction unit is used for extracting a plurality of image frames corresponding to each multimedia resource from each multimedia resource;
the text recognition unit is used for extracting second text information corresponding to each multimedia resource from a plurality of image frames corresponding to each multimedia resource based on an optical character recognition technology;
the second text feature acquisition unit is used for inputting the second text information into a second text processing model to perform text feature extraction processing to obtain second text features corresponding to the multimedia resources;
and the image feature acquisition unit is used for inputting a plurality of image frames corresponding to each multimedia resource into the visual feature prediction model connecting text and images to perform image feature extraction processing, so as to obtain the image features corresponding to each multimedia resource.
In one possible implementation, the multi-modal processing module 705 may include:
and the multi-modal feature processing unit is used for inputting the second text features and the image features into the multi-modal feature fusion model, and performing cross-modal feature fusion processing to obtain the multi-modal features corresponding to the multimedia resources.
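The embodiment does not fix the architecture of the multi-modal feature fusion model; the sketch below assumes a simple cross-attention design in PyTorch, which is one common way of realizing cross-modal feature fusion.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative fusion model: text features attend over image features."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.projection = nn.Linear(dim, dim)

    def forward(self, text_feature: torch.Tensor, image_feature: torch.Tensor) -> torch.Tensor:
        # text_feature: (batch, text_len, dim); image_feature: (batch, num_frames, dim)
        fused, _ = self.cross_attention(text_feature, image_feature, image_feature)
        # Pool over the text positions to obtain one multi-modal feature per resource.
        return self.projection(fused.mean(dim=1))
```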
In a possible implementation manner, the text recognition unit may include:
the image area dividing subunit is used for carrying out area dividing processing on each image frame to obtain a plurality of area images corresponding to each image frame;
the regional text acquisition subunit is used for performing text extraction processing on the plurality of regional images based on an optical character recognition technology to obtain regional texts of the plurality of regional images;
the third text information acquisition subunit is used for splicing and preprocessing the region texts in the same region to obtain third text information corresponding to various regions; the homogeneous region refers to a region at the same position in a plurality of image frames;
and the second text information acquisition subunit is used for splicing the third text information corresponding to each type of area to obtain the second text information corresponding to each multimedia resource.
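To make the region-division and splicing steps concrete, here is a hedged sketch; the uniform grid layout, the pytesseract OCR call, and deduplication as the "preprocessing" step are assumptions of this sketch rather than requirements of the embodiment.

```python
from typing import List
from PIL import Image
import pytesseract

def extract_second_text(frames: List[Image.Image], rows: int = 3, cols: int = 3) -> str:
    # Region division: split each frame into a rows x cols grid of region images.
    region_texts = {}  # (row, col) -> texts of the same-position region across all frames
    for frame in frames:
        w, h = frame.size
        for r in range(rows):
            for c in range(cols):
                box = (c * w // cols, r * h // rows, (c + 1) * w // cols, (r + 1) * h // rows)
                text = pytesseract.image_to_string(frame.crop(box)).strip()
                if text:
                    region_texts.setdefault((r, c), []).append(text)
    # Splice and preprocess the texts of each homogeneous region -> third text information.
    # Deduplication stands in for preprocessing (e.g. a static caption repeated in every frame).
    third_texts = [" ".join(dict.fromkeys(texts)) for texts in region_texts.values()]
    # Splice the third text information of all regions -> second text information.
    return " ".join(third_texts)
```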
In a possible implementation manner, the correlation prediction module 707 may include:
the relevance prediction unit is used for inputting the first text feature, the first multi-modal feature and the second multi-modal feature into a relevance prediction model, and performing relevance processing on the first text feature and the multi-modal features to obtain the correlation information;
the first multi-modal feature is the multi-modal feature corresponding to the first multimedia resource, and the second multi-modal feature is the multi-modal feature corresponding to the second multimedia resource; the first multimedia resource and the second multimedia resource are any two multimedia resources of the plurality of multimedia resources.
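As an illustration of this pairwise design, the sketch below scores the two candidate resources against the query feature; the scoring head (cosine similarity over learned projections) is an assumption, since the embodiment leaves the internals of the relevance prediction model open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseRelevanceModel(nn.Module):
    """Illustrative pairwise model: scores two multimedia resources against one query."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.resource_proj = nn.Linear(dim, dim)

    def score(self, text_feature: torch.Tensor, multimodal_feature: torch.Tensor) -> torch.Tensor:
        # Correlation information as cosine similarity in a shared space.
        return F.cosine_similarity(self.query_proj(text_feature),
                                   self.resource_proj(multimodal_feature), dim=-1)

    def forward(self, first_text_feature, first_mm_feature, second_mm_feature):
        # One relevance score per resource in the pair.
        return (self.score(first_text_feature, first_mm_feature),
                self.score(first_text_feature, second_mm_feature))
```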
In a possible implementation manner, the apparatus may further include:
a to-be-corrected multimedia acquiring module, configured to acquire a multimedia resource to be corrected, where the multimedia resource to be corrected is a multimedia resource other than the target multimedia resource among the multiple multimedia resources;
the target text description information determining module is used for determining the target text description information of the multimedia resource to be corrected based on the second text information corresponding to the multimedia resource to be corrected;
and the correcting module is used for correcting the text description information of the multimedia resource to be corrected according to the target text description information.
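A hedged sketch of the correction step follows; the record layout ("id", "description", "second_text") and the truncation rule for deriving the target description are assumptions invented for this example.

```python
from typing import Dict, List, Set

def correct_descriptions(resources: List[Dict], target_ids: Set[str]) -> List[Dict]:
    for res in resources:
        if res["id"] not in target_ids:                    # a multimedia resource to be corrected
            # Target text description information derived from the second text information.
            res["description"] = res["second_text"][:80]
    return resources
```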
In a possible implementation manner, the obtaining module 701 may include:
the text description information acquisition unit is used for acquiring the text description information of each of the plurality of multimedia resources to be matched;
and the recall unit is used for determining a recall multimedia resource corresponding to the first text information from the plurality of multimedia resources to be matched based on the text description information.
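A hedged sketch of the recall step: recall is modeled here as token overlap between the first text information and each candidate's text description; a production system would more likely use an inverted index, which the embodiment does not prescribe.

```python
from typing import Dict, List

def recall_resources(query: str, candidates: List[Dict], top_k: int = 50) -> List[Dict]:
    # candidates: assumed list of {"id": ..., "description": ...} records.
    query_tokens = set(query.lower().split())
    scored = []
    for cand in candidates:
        overlap = len(query_tokens & set(cand["description"].lower().split()))
        if overlap > 0:
            scored.append((overlap, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)    # most overlapping descriptions first
    return [cand for _, cand in scored[:top_k]]
```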
In one possible implementation, the apparatus may further include:
and the first text feature acquisition module is used for inputting the first text information into a first text processing model to perform text feature extraction processing to obtain a first text feature.
In one possible implementation, the apparatus may further include:
the training sample acquisition module is used for acquiring a plurality of training samples and sample type information of each training sample, wherein each training sample comprises sample search text information and a corresponding sample multimedia resource pair; the sample label information can represent the correlation between two sample multimedia resources in the sample multimedia resource pair and the sample search text information;
the sample coding module is used for respectively carrying out text coding and image coding on sample text information and sample images included in each sample multimedia resource in the sample multimedia resource pair to obtain sample text characteristics and sample image characteristics;
the sample multi-modal characteristic acquisition module is used for performing cross-modal characteristic fusion processing on sample text characteristics and sample image characteristics of each sample multimedia resource to obtain sample multi-modal characteristics corresponding to each sample multimedia resource;
the correlation prediction information acquisition module is used for inputting the sample search text characteristics, the first sample multi-modal characteristics and the second sample multi-modal characteristics into an initial correlation model, and performing correlation processing on the sample search text information and the multi-modal characteristics to obtain correlation prediction information of each sample multimedia resource pair; the sample search text features are text features corresponding to the sample search text information, the first sample multi-modal features are multi-modal features corresponding to a first sample multimedia resource, and the second multi-modal features are multi-modal features corresponding to a second sample multimedia resource; the first sample multimedia asset and the second sample multimedia asset are two sample multimedia assets in any sample multimedia asset pair.
The loss acquisition module is used for acquiring comparative loss information according to the correlation prediction information of each sample multimedia resource pair and the sample type information of each sample multimedia resource pair;
and the training module is used for training the initial correlation model based on the comparison loss information to obtain the correlation prediction model.
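To tie these training modules together, one training sample can be represented as the following data structure; the field names are assumptions for illustration.

```python
from dataclasses import dataclass
import torch

@dataclass
class TrainingSample:
    query_text: str             # sample search text information
    mm_feature_a: torch.Tensor  # multi-modal feature of the first sample multimedia resource
    mm_feature_b: torch.Tensor  # multi-modal feature of the second sample multimedia resource
    label: int                  # sample label information: 1 = positive pair, 0 = negative pair
```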
With regard to the apparatus in the above-described embodiment, the specific manner in which the respective modules and units perform operations has been described in detail in the embodiment related to the method, and will not be elaborated upon here.
Fig. 8 shows a block diagram of an electronic device for multimedia resource processing according to an embodiment of the present application. The electronic device may be a server, and its internal structure may be as shown in fig. 8. The electronic device includes a processor, a memory, and a network interface connected by a system bus. The processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the electronic device is used to connect and communicate with an external terminal through a network. The computer program, when executed by the processor, implements a multimedia resource processing method.
It will be understood by those skilled in the art that the structure shown in fig. 8 is a block diagram of only a portion of the structure related to the present application and does not constitute a limitation on the electronic device to which the present application is applied; a particular electronic device may include more or fewer components than shown in the drawings, combine certain components, or have a different arrangement of components.
In an exemplary embodiment, there is also provided an electronic device including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the multimedia resource processing method as in the embodiment of the present application.
In an exemplary embodiment, there is also provided a storage medium, and when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the multimedia resource processing method in the embodiment of the present application.
In an exemplary embodiment, a computer program product containing instructions is also provided, which when run on a computer causes the computer to execute the multimedia resource processing method in the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (12)

1. A method for processing multimedia resources, the method comprising:
acquiring first text information to be searched and a plurality of multimedia resources;
respectively carrying out text coding and image coding on second text information and image frames included in each multimedia resource to obtain second text characteristics and image characteristics of each multimedia resource;
performing cross-modal feature fusion processing on the second text features and the image features of the multimedia resources to obtain multi-modal features corresponding to the multimedia resources;
performing feature correlation processing on the multi-modal feature and a first text feature, and determining correlation information between each of the plurality of multimedia resources and the first text information, wherein the first text feature is a text feature corresponding to the first text information; the correlation information represents the matching degree of the content of each multimedia resource and the first text information;
and screening out target multimedia resources matched with the first text information from the plurality of multimedia resources according to the correlation information.
2. The method according to claim 1, wherein the text encoding and the image encoding are performed on the second text information and the image frame included in each multimedia resource respectively to obtain a second text feature and an image feature of each multimedia resource, and the method comprises:
extracting a plurality of image frames corresponding to each multimedia resource from each multimedia resource;
extracting second text information corresponding to each multimedia resource from a plurality of image frames corresponding to each multimedia resource based on an optical character recognition technology;
inputting the second text information into a second text processing model to perform text feature extraction processing to obtain second text features corresponding to the multimedia resources;
and inputting the plurality of image frames corresponding to each multimedia resource into a visual feature prediction model connecting text and images to perform image feature extraction processing, so as to obtain the image features corresponding to each multimedia resource.
3. The method according to claim 2, wherein the performing cross-modal feature fusion processing on the second text feature and the image feature of each multimedia resource to obtain a multi-modal feature corresponding to each multimedia resource comprises:
and inputting the second text characteristic and the image characteristic into a multi-modal characteristic fusion model, and performing cross-modal characteristic fusion processing to obtain the multi-modal characteristic corresponding to each multimedia resource.
4. The method according to claim 2 or 3, wherein the extracting second text information corresponding to each multimedia resource from a plurality of image frames corresponding to each multimedia resource based on the optical character recognition technology comprises:
carrying out region division processing on each image frame to obtain a plurality of region images corresponding to each image frame;
performing text extraction processing on the plurality of area images based on an optical character recognition technology to obtain respective area texts of the plurality of area images;
splicing and preprocessing the region texts in the same region to obtain third text information corresponding to various regions; the homogeneous region refers to a region at the same position in a plurality of image frames;
and splicing the third text information corresponding to the various regions to obtain the second text information corresponding to each multimedia resource.
5. The method according to any one of claims 1-3, wherein said performing feature correlation processing on said multi-modal features and said first textual features to determine correlation information of each of said plurality of multimedia assets with said first textual information comprises:
inputting the first text feature, the first multi-modal feature and the second multi-modal feature into a relevance prediction model, and performing relevance processing on the first text feature and the multi-modal feature to obtain relevance information;
the first multi-modal feature is the multi-modal feature corresponding to the first multimedia resource, and the second multi-modal feature is the multi-modal feature corresponding to the second multimedia resource; the first multimedia resource and the second multimedia resource are any two multimedia resources of the plurality of multimedia resources.
6. The method of claim 3, further comprising:
acquiring a multimedia resource to be corrected, wherein the multimedia resource to be corrected is a multimedia resource except the target multimedia resource in the plurality of multimedia resources;
determining respective target text description information of the multimedia resources to be corrected based on respective corresponding second text information of the multimedia resources to be corrected;
and correcting the text description information of the multimedia resource to be corrected according to the target text description information.
7. The method of claim 1, wherein the obtaining the plurality of multimedia assets comprises:
acquiring respective text description information of a plurality of multimedia resources to be matched;
and determining the plurality of multimedia resources corresponding to the first text information from the plurality of multimedia resources to be matched based on the text description information.
8. The method of claim 1, further comprising:
and inputting the first text information into a first text processing model to perform text feature extraction processing to obtain a first text feature.
9. The method of claim 5, further comprising:
obtaining a plurality of training samples and sample label information of each training sample, wherein each training sample comprises sample search text information and a corresponding sample multimedia resource pair, and the sample label information represents the correlation between two sample multimedia resources in the sample multimedia resource pair and the sample search text information;
respectively carrying out text coding and image coding on sample text information and sample images included in each sample multimedia resource in the sample multimedia resource pair to obtain sample text characteristics and sample image characteristics;
performing cross-modal feature fusion processing on the sample text features and the sample image features of the sample multimedia resources to obtain sample multi-modal features corresponding to the sample multimedia resources;
inputting the sample search text features, the first sample multi-modal features and the second sample multi-modal features into an initial correlation model, and performing correlation processing on the sample search text information and the sample multi-modal features to obtain correlation prediction information between each sample multimedia resource and the sample search text information; the sample search text features are the text features corresponding to the sample search text information, the first sample multi-modal features are the sample multi-modal features corresponding to the first sample multimedia resource, and the second sample multi-modal features are the sample multi-modal features corresponding to the second sample multimedia resource; the first sample multimedia resource and the second sample multimedia resource are two sample multimedia resources in any sample multimedia resource pair;
obtaining contrastive loss information according to the correlation prediction information and the sample label information;
and training the initial correlation model based on the contrastive loss information to obtain the correlation prediction model.
10. A multimedia resource processing apparatus, comprising:
the system comprises an acquisition module, a search module and a search module, wherein the acquisition module is used for acquiring first text information to be searched and a plurality of multimedia resources;
the encoding module is used for respectively carrying out text encoding and image encoding on second text information and image frames included in each multimedia resource to obtain second text characteristics and image characteristics of each multimedia resource;
the multi-modal processing module is used for performing cross-modal feature fusion processing on the second text features and the image features of the multimedia resources to obtain multi-modal features corresponding to the multimedia resources;
the relevance prediction module is used for performing feature relevance processing on the multi-modal feature and a first text feature and determining relevance information of each of the plurality of multimedia resources and the first text information, wherein the first text feature is a text feature corresponding to the first text information; the correlation information represents the matching degree of the content of each multimedia resource and the first text information;
and the target multimedia resource determining module is used for screening out the target multimedia resources matched with the first text information from the plurality of multimedia resources according to the correlation information.
11. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the method of any of claims 1 to 9.
12. A non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method of any one of claims 1 to 9.
CN202210281719.1A 2022-03-21 2022-03-21 Multimedia resource processing method and device, electronic equipment and storage medium Pending CN114625897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210281719.1A CN114625897A (en) 2022-03-21 2022-03-21 Multimedia resource processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114625897A true CN114625897A (en) 2022-06-14

Family

ID=81903921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210281719.1A Pending CN114625897A (en) 2022-03-21 2022-03-21 Multimedia resource processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114625897A (en)

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019852A (en) * 2017-12-27 2019-07-16 上海全土豆文化传播有限公司 Multimedia resource searching method and device
CN110598046A (en) * 2019-09-17 2019-12-20 腾讯科技(深圳)有限公司 Artificial intelligence-based identification method and related device for title party
CN111626049A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Title correction method and device for multimedia information, electronic equipment and storage medium
CN111767461A (en) * 2020-06-24 2020-10-13 北京奇艺世纪科技有限公司 Data processing method and device
CN112000822A (en) * 2020-08-21 2020-11-27 北京达佳互联信息技术有限公司 Multimedia resource sequencing method and device, electronic equipment and storage medium
CN112015928A (en) * 2020-08-26 2020-12-01 北京达佳互联信息技术有限公司 Information extraction method and device of multimedia resource, electronic equipment and storage medium
CN112100442A (en) * 2020-11-13 2020-12-18 腾讯科技(深圳)有限公司 User tendency recognition method, device, equipment and storage medium
CN112685565A (en) * 2020-12-29 2021-04-20 平安科技(深圳)有限公司 Text classification method based on multi-mode information fusion and related equipment thereof
CN113204659A (en) * 2021-03-26 2021-08-03 北京达佳互联信息技术有限公司 Label classification method and device for multimedia resources, electronic equipment and storage medium
CN113377971A (en) * 2021-05-31 2021-09-10 北京达佳互联信息技术有限公司 Multimedia resource generation method and device, electronic equipment and storage medium
CN113377976A (en) * 2021-08-16 2021-09-10 北京达佳互联信息技术有限公司 Resource searching method and device, computer equipment and storage medium
CN113392265A (en) * 2021-02-05 2021-09-14 腾讯科技(深圳)有限公司 Multimedia processing method, device and equipment
CN113590849A (en) * 2021-01-27 2021-11-02 腾讯科技(深圳)有限公司 Multimedia resource classification model training method and multimedia resource recommendation method
CN113590850A (en) * 2021-01-29 2021-11-02 腾讯科技(深圳)有限公司 Multimedia data searching method, device, equipment and storage medium
CN113704508A (en) * 2021-04-09 2021-11-26 腾讯科技(深圳)有限公司 Multimedia information identification method and device, electronic equipment and storage medium
CN113722514A (en) * 2021-09-03 2021-11-30 江苏熙枫教育科技有限公司 Internet education video image screening and extracting method based on deep learning
CN113762052A (en) * 2021-05-13 2021-12-07 腾讯科技(深圳)有限公司 Video cover extraction method, device, equipment and computer readable storage medium
CN113868519A (en) * 2021-09-18 2021-12-31 北京百度网讯科技有限公司 Information searching method and device, electronic equipment and storage medium
CN114005121A (en) * 2021-10-14 2022-02-01 上海探寻信息技术有限公司 Text recognition method and equipment for mobile terminal

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115155058A (en) * 2022-09-06 2022-10-11 北京澜舟科技有限公司 Face pinching method, face pinching system and storage medium
CN115155058B (en) * 2022-09-06 2023-02-03 北京澜舟科技有限公司 Face pinching method, face pinching system and storage medium
CN115220608A (en) * 2022-09-20 2022-10-21 深圳市人马互动科技有限公司 Method and device for processing multimedia data in interactive novel
CN115220608B (en) * 2022-09-20 2022-12-20 深圳市人马互动科技有限公司 Method and device for processing multimedia data in interactive novel
CN117216308A (en) * 2023-11-09 2023-12-12 天津华来科技股份有限公司 Searching method, system, equipment and medium based on large model
CN117216308B (en) * 2023-11-09 2024-04-26 天津华来科技股份有限公司 Searching method, system, equipment and medium based on large model

Similar Documents

Publication Publication Date Title
CN109933802B (en) Image-text matching method, image-text matching device and storage medium
CN114625897A (en) Multimedia resource processing method and device, electronic equipment and storage medium
US9449271B2 (en) Classifying resources using a deep network
CN112749326B (en) Information processing method, information processing device, computer equipment and storage medium
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN111400548B (en) Recommendation method and device based on deep learning and Markov chain
CN111625715B (en) Information extraction method and device, electronic equipment and storage medium
CN111258995A (en) Data processing method, device, storage medium and equipment
CN111639230B (en) Similar video screening method, device, equipment and storage medium
US11361759B2 (en) Methods and systems for automatic generation and convergence of keywords and/or keyphrases from a media
CN114329051B (en) Data information identification method, device, apparatus, storage medium and program product
CN112364184A (en) Method, device, server and storage medium for ordering multimedia data
CN111881943A (en) Method, device, equipment and computer readable medium for image classification
CN114428910A (en) Resource recommendation method and device, electronic equipment, product and medium
CN111259223B (en) News recommendation and text classification method based on emotion analysis model
CN117093686A (en) Intelligent question-answer matching method, device, terminal and storage medium
CN116881462A (en) Text data processing, text representation and text clustering method and equipment
CN116977701A (en) Video classification model training method, video classification method and device
CN110020078B (en) Method and related device for generating relevance mapping dictionary and verifying relevance
CN115690795A (en) Resume information extraction method and device, electronic equipment and storage medium
CN114780757A (en) Short media label extraction method and device, computer equipment and storage medium
CN112559810B (en) Method and device for generating hash code by utilizing multi-layer feature fusion
CN111738171B (en) Video clip detection method and device, electronic equipment and storage medium
CN115114462A (en) Model training method and device, multimedia recommendation method and device and storage medium
CN113641855A (en) Video recommendation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination