CN117540221B - Image processing method and device, storage medium and electronic equipment - Google Patents

Image processing method and device, storage medium and electronic equipment

Info

Publication number
CN117540221B
CN117540221B (application CN202410029993.9A)
Authority
CN
China
Prior art keywords
text
image
prompt information
target
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410029993.9A
Other languages
Chinese (zh)
Other versions
CN117540221A (en)
Inventor
辛毅
杜俊珑
鄢科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410029993.9A priority Critical patent/CN117540221B/en
Publication of CN117540221A publication Critical patent/CN117540221A/en
Application granted granted Critical
Publication of CN117540221B publication Critical patent/CN117540221B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image processing method and apparatus, a storage medium and an electronic device. The method comprises: acquiring a target image and a set of text information; inputting the target image and the set of text information into a pre-trained target multi-mode matching model to obtain a target image characterization vector and a set of text characterization vectors, where the text prompt information and the image prompt information used by the target multi-mode matching model are calculated from shared source prompt information; calculating the similarity between the target image characterization vector and each text characterization vector; and determining the target content category indicated by the text characterization vector that satisfies a preset similarity condition as the content category of the target image. The method and apparatus solve the technical problem of low efficiency when performing image processing with a multi-mode matching model. Embodiments of the application can be applied to scenes such as cloud technology, artificial intelligence, intelligent traffic and assisted driving.

Description

Image processing method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of computers, and in particular, to an image processing method and apparatus, a storage medium, and an electronic device.
Background
In the related art, the content category of an image to be audited is determined by inputting the image into a multi-mode matching model, and whether the image violates regulations is judged according to that content category. However, processing images with such a model in this way is inefficient.
For the above problem, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the application provides an image processing method and device, a storage medium and electronic equipment, which are used for at least solving the technical problem that the efficiency of image processing by using a multi-mode matching model is low.
According to an aspect of an embodiment of the present application, there is provided an image processing method including: acquiring a target image and a predetermined set of text information, wherein one text information in the set of text information is used for representing one content category in a preset content category set; inputting the target image and the group of text information into a pre-trained target multi-mode matching model to obtain a target image characterization vector and a group of text characterization vectors, wherein the target multi-mode matching model comprises a text encoder, an image encoder, text prompt information and image prompt information, the text prompt information and the image prompt information are obtained by calculating source prompt information, the image prompt information is used for being input into the image encoder together with the target image to obtain the target image characterization vector, and the text prompt information is used for being input into the text encoder together with the group of text information to obtain the group of text characterization vectors; and determining the similarity between the target image representation vector and each text representation vector in the set of text representation vectors through the target multi-mode matching model, and determining the target content category indicated by the text representation vector with the similarity meeting the preset similarity condition as the content category corresponding to the target image representation vector.
According to another aspect of the embodiments of the present application, there is also provided an image processing apparatus including: the system comprises an acquisition module, a storage module and a display module, wherein the acquisition module is used for acquiring a target image and a predetermined set of text information, wherein one text information in the set of text information is used for representing one content category in a preset content category set; the training module is used for inputting the target image and the group of text information into a pre-trained target multi-mode matching model to obtain a target image characterization vector and a group of text characterization vectors, wherein the target multi-mode matching model comprises a text encoder, an image encoder, text prompt information and image prompt information, the text prompt information and the image prompt information are obtained by calculating source prompt information, the image prompt information is used for being input into the image encoder together with the target image to obtain the target image characterization vector, and the text prompt information is used for being input into the text encoder together with the group of text information to obtain the group of text characterization vectors; the determining module is used for determining the similarity between the target image representation vector and each text representation vector in the set of text representation vectors through the target multi-mode matching model, and determining the target content category indicated by the text representation vector with the similarity meeting the preset similarity condition as the content category corresponding to the target image representation vector.
Optionally, the device is configured to input the target image and the set of text information into a pre-trained target multi-modal matching model to obtain a target image token vector and a set of text token vectors by: performing matrix transformation processing on the source prompt information and an image scaling matrix to determine the image prompt information, wherein the matrix transformation processing is used for determining an element of a first position in a corresponding matrix of the image prompt information as a product of the element of the first position in the source prompt information and the element of the first position in the image scaling matrix; performing stitching operation on the target image and the image prompt information to determine an image stitching vector; and mapping the image stitching vector to a target embedding space, and determining the target image characterization vector.
Optionally, the device is configured to perform a stitching operation on the target image and the image prompt information, and determine an image stitching vector by: dividing the target image into a plurality of image patches; projecting the plurality of image patches into an image embedding space to obtain a plurality of patch coding vectors; and performing splicing operation on the patch coding vectors and the image prompt information to determine the image splicing vector.
Optionally, the device is configured to input the set of text information into a pre-trained target multi-modal matching model to obtain a set of text token vectors by: performing matrix conversion processing on the source prompt information and a text scaling matrix to determine the text prompt information, wherein the matrix conversion processing is used for determining an element at a second position in a text prompt information corresponding matrix as a product of the element at the second position in the source prompt information and the element at the second position in the text scaling matrix; respectively executing splicing operation on the group of text information and the text prompt information to determine a group of text splicing vectors; and mapping the text splicing vectors to a target embedded space respectively, and determining the text characterization vectors.
Optionally, the device is configured to perform a stitching operation on the set of text information and the text prompt information, and determine a set of text stitching vectors by: performing splicing operation on the group of text information and the text prompt information respectively to determine a group of text splicing vectors, wherein the text information of each execution of the splicing operation is regarded as current text information, and the text splicing vector obtained by each execution of the splicing operation is regarded as current text splicing vector: executing word segmentation operation on the current text information to obtain a group of word segmentation; and projecting the group of component words into a text embedding space to obtain the current text splicing vector.
Optionally, the device is further configured to: acquiring a sample image and a group of sample text information, wherein the sample image is pre-marked with a text information corresponding to a target sample, and the group of sample text information comprises the target sample text information; inputting the sample image and the group of sample text information into an initial multi-mode matching model to be trained to obtain a sample image characterization vector and a group of sample text characterization vector, wherein the initial multi-mode matching model comprises the text encoder, the image encoder, initial text prompt information and initial image prompt information, and the initial text prompt information and the initial image prompt information are obtained by calculating initial source prompt information; determining sample similarity between the sample image characterization vector and each sample text characterization vector in the set of sample text characterization vectors by the initial multi-modal matching model; and calculating a loss parameter based on the sample similarity, and adjusting the initial source prompt information and the initial scaling matrix by using the loss parameter until the initial multi-modal matching model is trained into the target multi-modal matching model.
Optionally, the device is configured to input the sample image and the set of sample text information into an initial multimodal matching model to be trained, to obtain a sample image token vector and a set of sample text token vectors by: acquiring the initial source prompt information, the initial image scaling matrix and the initial text scaling matrix, wherein the initial source prompt information, the initial image scaling matrix and the initial text scaling matrix are parameters which allow adjustment in the process of training the initial multi-modal matching model; performing matrix transformation processing on the initial source prompt information and the initial image scaling matrix to determine the initial image prompt information, wherein the matrix transformation processing is used for determining an element at a third position in the initial image prompt information corresponding matrix as a product of the element at the third position in the initial source prompt information and the element at the third position in the initial image scaling matrix; performing matrix conversion processing on the initial source prompt information and the initial text scaling matrix to determine the initial text prompt information, wherein the matrix conversion processing is used for determining an element at a fourth position in the initial text prompt information corresponding matrix as a product of the element at the fourth position in the initial source prompt information and the element at the fourth position in the initial text scaling matrix; performing a stitching operation on the sample image and the initial image prompt information to determine a sample image stitching vector, and performing a stitching operation on the set of sample text information and the initial text prompt information to determine a set of sample text stitching vectors; and mapping the sample image stitching vector and the set of sample text stitching vectors to an initial embedding space, and determining the sample image characterization vector and the set of sample text characterization vectors.
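For illustration only, a rough training-loop sketch of the above procedure is given below. It assumes a model object whose trainable parameters are the initial source prompt information and the two initial scaling matrices and whose forward pass returns the sample image characterization vector and the set of sample text characterization vectors; the attribute names, the data loader and the hyper-parameters are assumptions, not values fixed by the application.

    import torch
    import torch.nn.functional as F

    # Rough sketch: only the initial source prompt and the two scaling matrices are
    # optimised; the text and image encoders stay frozen. `model` and `train_loader`
    # are assumed to exist as described above.
    optimizer = torch.optim.AdamW(
        [model.source_prompt, model.text_scale, model.image_scale], lr=1e-3)

    for images, sample_texts, target_idx in train_loader:      # target_idx: index of the pre-marked target sample text
        img_vec, txt_vecs = model(images, sample_texts)         # (B, d) and (C, d) characterization vectors
        sims = F.normalize(img_vec, dim=-1) @ F.normalize(txt_vecs, dim=-1).t()
        loss = F.cross_entropy(sims / 0.01, target_idx)         # loss parameter computed from the sample similarity
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()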
Optionally, the device is configured to input the target image and the set of text information into a pre-trained target multi-modal matching model to obtain a target image token vector and a set of text token vectors by: setting corresponding source prompt information and a scaling matrix for each layer of image matching submodel under the condition that the target multi-mode matching model comprises K layers of image matching submodels, wherein the output of an ith layer of image matching submodel in the K layers of image matching submodels and the image prompt information corresponding to an (i+1) th layer of image matching submodel are used as the input of the (i+1) th layer of image matching submodel together, K is a positive integer greater than 1, and i is a positive integer smaller than K; setting corresponding source prompt information and the scaling matrix for each layer of text matching sub-model under the condition that the target multi-mode matching model comprises a K-layer text matching sub-model, wherein the output of an ith layer of text matching sub-model in the K-layer text matching sub-model and the text prompt information corresponding to an (i+1) th layer of text matching sub-model are used as the input of the (i+1) th layer of text matching sub-model together.
Optionally, the device is configured to determine, through the target multi-mode matching model, a similarity between the target image token vector and each text token vector in the set of text token vectors, and determine, as a content category corresponding to the target image token vector, a target content category indicated by a text token vector whose similarity satisfies a preset similarity condition: acquiring a first cosine distance between the target image characterization vector and a first text characterization vector and a second cosine distance between the target image characterization vector and a second text characterization vector, wherein the set of text characterization vectors comprises the first text characterization vector and the second text characterization vector; determining the first text token vector as a target text token vector when the first cosine distance is smaller than or equal to a cosine distance threshold, wherein the content category of the target image is the target content category; and under the condition that the second cosine distance is larger than the cosine distance threshold, determining that the second text token vector is not the target text token vector, and the content category of the target image is not the target content category.
Optionally, the device is further configured to: determining that the target image is a violation image under the condition that the target content category belongs to a violation content category set, wherein the preset content category set comprises the violation content category set; and generating violation prompt information in response to determining that the target image is the violation image, wherein the violation prompt information is used for indicating that the target image audit is not passed.
Optionally, the device is further configured to: perform a Kronecker product operation on the source prompt information and an image scaling matrix to obtain the image prompt information, wherein the scaling matrix comprises the image scaling matrix; and perform the Kronecker product operation on the source prompt information and a text scaling matrix to obtain the text prompt information, wherein the scaling matrix comprises the text scaling matrix.
According to still another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described image processing method when run.
According to yet another aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the image processing method as above.
According to still another aspect of the embodiments of the present application, there is also provided an electronic device including a memory in which a computer program is stored, and a processor configured to execute the image processing method described above by the computer program.
In the embodiment of the application, the target image and a predetermined set of text information are acquired, wherein one text information in the set of text information is used for representing one content category in a preset content category set; inputting a target image and a group of text information into a pre-trained target multi-mode matching model to obtain a target image characterization vector and a group of text characterization vectors, wherein the target multi-mode matching model comprises a text encoder, an image encoder, text prompt information and image prompt information, the text prompt information and the image prompt information are obtained by calculating source prompt information, the image prompt information is used for being input into the image encoder together with the target image to obtain the target image characterization vector, and the text prompt information is used for being input into the text encoder together with the group of text information to obtain the group of text characterization vectors; the similarity between the target image characterization vector and each text characterization vector in the set of text characterization vectors is determined through the target multi-mode matching model, the target content category indicated by the text characterization vector with the similarity meeting the preset similarity condition is determined to be the content category corresponding to the target image characterization vector, the obtained target image and a set of predetermined text information are input into the pre-trained target multi-mode matching model to output the target image characterization vector and a set of text characterization vectors, the similarity is calculated on the target image characterization vector and each text characterization vector in the set of text characterization vectors, and finally the target content category indicated by the target text characterization vector meeting the preset similarity condition is used as the content category of the target image characterization vector.
In addition, the text prompt information and the image prompt information are determined through the same source prompt information in the target multi-mode matching model, the source prompt information and the scaling matrix are utilized for calculation, the text prompt information and the image prompt information are obtained, the relevance between the text prompt information and the image prompt information is improved, the calculated text prompt information and image prompt information can reduce the number of parameters required to be trained by the target multi-mode matching model, and therefore training efficiency and model performance of the target multi-mode matching model are improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a schematic illustration of an application environment of an alternative image processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of an alternative image processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative image processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of yet another alternative image processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of yet another alternative image processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of yet another alternative image processing method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of yet another alternative image processing method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of yet another alternative image processing method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of yet another alternative image processing method according to an embodiment of the present application;
fig. 10 is a schematic structural view of an alternative image processing apparatus according to an embodiment of the present application;
FIG. 11 is a schematic diagram of the structure of an alternative image processing product according to an embodiment of the present application;
fig. 12 is a schematic structural view of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will be made in detail and with reference to the accompanying drawings in the embodiments of the present application, it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The present application is described below with reference to examples:
According to an aspect of the embodiments of the present application, an image processing method is provided. Optionally, in this embodiment, the image processing method may be applied to a hardware environment constituted by the server 101 and the terminal device 103 shown in fig. 1. As shown in fig. 1, the server 101 is connected to the terminal device 103 via a network and can provide services to the terminal device or to an application 107 installed on it, such as a video application, an instant messaging application, a browser application, an educational application or a game application. The database 105 may be provided on the server or separately from it to provide data storage services for the server 101, for example game data storage services. The network may include, but is not limited to, a wired network or a wireless network, where the wired network includes local area networks, metropolitan area networks and wide area networks, and the wireless network includes Bluetooth, WIFI and other wireless communication networks. The terminal device 103 may be a terminal configured with an application program and may include, but is not limited to, at least one of the following: mobile phones (such as Android or iOS phones), notebook computers, tablet computers, palmtop computers, MIDs (Mobile Internet Devices), PADs, desktop computers, smart televisions, smart voice interaction devices, smart home appliances, vehicle-mounted terminals, aircraft, Virtual Reality (VR) terminals, Augmented Reality (AR) terminals, Mixed Reality (MR) terminals and other computer devices. The server may be a single server, a server cluster composed of multiple servers, or a cloud server.
As shown in connection with fig. 1, the above-mentioned image processing method may be performed by an electronic device, which may be a terminal device or a server, and the above-mentioned image processing method may be implemented by the terminal device or the server, respectively, or by both the terminal device and the server.
The above is merely an example, and the present embodiment is not particularly limited.
Optionally, as an optional embodiment, as shown in fig. 2, the image processing method includes:
s202, acquiring a target image to be audited and a set of text information which is preset, wherein one text information in the set of text information is used for representing one content category in a preset content category set;
Alternatively, in the embodiment of the present application, the target image may include, but is not limited to, an image uploaded by any user, such as a person image, a pet image, a vehicle image or a landscape image. The target image may be obtained, for example, by capturing an image in real time with an image capturing device and uploading it, or by capturing a video with the image capturing device, uploading the video to an image processing device, and having the image processing device extract a video frame as the target image. The image capturing device may include, but is not limited to, a smart phone, a computer, a notebook computer, a smart wearable device, a video camera and the like, and the image processing device may include, but is not limited to, a distributed server, a cloud server, a computer, a notebook computer, a smart phone, a tablet computer, a digital camera, a scanner, a printer and the like. Of course, an image produced by the image processing device itself may also be used.
The Cloud server is a virtual server based on a Cloud computing technology, computing resources and storage space are provided through a Cloud platform, and the Cloud technology (Cloud technology) refers to a hosting technology for integrating hardware, software, networks and other series of resources in a wide area network or a local area network to realize computing, storage, processing and sharing of data.
Cloud computing refers to the delivery and usage mode of an IT infrastructure, meaning that the required resources are obtained in an on-demand, easily scalable manner through a network; generalized cloud computing refers to the delivery and usage pattern of services, meaning that the required services are obtained in an on-demand, easily scalable manner over a network. Such services may be IT, software or internet related, or other services. Cloud computing is a product of the fusion of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization and load balancing.
With the development of the internet, real-time data flow and diversification of connected devices, and the promotion of demands of search services, social networks, mobile commerce, open collaboration and the like, cloud computing is rapidly developed. Unlike the previous parallel distributed computing, the generation of cloud computing will promote the revolutionary transformation of the whole internet mode and enterprise management mode in concept.
It should be noted that the above-mentioned set of text information can be understood as a plurality of sentences describing different content categories, and the preset content category set may include, but is not limited to, portrait categories, pet categories, landscape categories and the like. For example, a set of text information may contain three sentences, "an image including a dog", "an image including a cat" and "an image including a lion", each of which corresponds to one content category in the preset content category set; the text information "an image including a dog" obviously corresponds to the content category "dog". The number of content categories in the preset content category set may be greater than the number of pieces of text information in the set of text information, and each piece of text information in the set can find a corresponding content category in the preset content category set.
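For illustration only, a minimal Python sketch of such a predetermined set of text information is given below; the prompt wording and category names are hypothetical examples, not values fixed by the application.

    # Hypothetical predetermined set of text information: each sentence describes
    # one content category in the preset content category set.
    text_information = [
        "an image including a dog",
        "an image including a cat",
        "an image including a lion",
    ]
    # Content categories indicated by the sentences above; the preset category set
    # may contain more categories than there are sentences.
    preset_content_categories = ["dog", "cat", "lion", "landscape", "portrait"]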
It should be noted that in the field of content auditing, the content categories may include, but are not limited to, whether an image violates regulations, for example illegal images and legal images. Of course, further subdivisions may be made according to the specific law or clause violated, for example pornographic content and similar categories.
Further, fig. 3 is a schematic diagram of an alternative image processing method according to an embodiment of the present application, as shown in fig. 3, in a case where the target image 302 to be audited is obtained, a plurality of text messages exist in a set of text messages 304, where one or more text messages exist to describe the image content of the target image 302 and correspond to one or more content categories in a preset content category set 306, so that one or more content categories are determined according to the one or more text messages, that is, the content category corresponding to the target image 302 is indicated, and the audit of the target image is completed.
S204, inputting a target image and a group of text information into a pre-trained target multi-mode matching model to obtain a target image characterization vector and a group of text characterization vectors, wherein the target multi-mode matching model comprises a text encoder, an image encoder, text prompt information and image prompt information, the text prompt information and the image prompt information are obtained by calculation through source prompt information, the image prompt information is used for being input into the image encoder together with the target image to obtain the target image characterization vector, and the text prompt information is used for being input into the text encoder together with the group of text information to obtain the group of text characterization vectors;
Optionally, in the embodiment of the present application, the target multi-mode matching model may include, but is not limited to, a CLIP (Contrastive Language-Image Pre-training) model. The CLIP model is based on contrastive learning, which enables the model to learn the matching relationship between text and image. It is widely used in various areas of the vision field, such as image recognition, object detection and image segmentation, has strong pre-training capability, and exhibits excellent zero-shot generalization performance on downstream tasks; that is, the CLIP model can be transferred to a specific downstream task, usually one with less data, to improve task processing performance. The downstream task may include, but is not limited to, image processing, image classification, object detection, image caption generation and the like.
Illustratively, assuming that the target multimodal matching model comprises a CLIP model, the CLIP model may be trained to obtain the pre-trained target multimodal matching model described above, wherein the target multimodal matching model comprises a text encoder and an image encoder, the training of the target multimodal matching model comprises training the text encoder, training the image encoder, and the text encoder and the image encoder may each use different models. Of course, the learning parameters may be preset for the target multi-mode matching model, and the parameters of the text encoder and the image encoder may be fixed, and the learning parameters may be trained to obtain the target multi-mode matching model.
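As a rough, non-authoritative sketch of this setup in PyTorch, the snippet below freezes the pre-trained encoders and registers only the prompt-related tensors as learnable parameters; the dimensions and attribute names are illustrative assumptions rather than values prescribed by the application.

    import torch
    import torch.nn as nn

    class PromptTunedCLIP(nn.Module):
        """Sketch: wrap a pre-trained CLIP-like model, freeze its encoders and
        train only the shared source prompt and the two scaling matrices."""

        def __init__(self, clip_model, prompt_len=4, src_dim=64,
                     text_dim=512, image_dim=768):
            super().__init__()
            self.clip = clip_model
            for p in self.clip.parameters():          # text/image encoder weights stay fixed
                p.requires_grad = False
            # learnable parameters of the target multi-mode matching model
            self.source_prompt = nn.Parameter(torch.randn(1, src_dim))
            self.text_scale = nn.Parameter(torch.randn(prompt_len, text_dim // src_dim))
            self.image_scale = nn.Parameter(torch.randn(prompt_len, image_dim // src_dim))

        def prompts(self):
            # text and image prompt information derived from the same source prompt
            p_l = torch.kron(self.source_prompt, self.text_scale)   # (prompt_len, text_dim)
            p_v = torch.kron(self.source_prompt, self.image_scale)  # (prompt_len, image_dim)
            return p_l, p_v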
For example, fig. 4 is a schematic diagram of yet another alternative image processing method according to an embodiment of the present application. The text encoder may include, but is not limited to, a Transformer model or another pre-trained language model. As shown in fig. 4, training of the text encoder 402 may be accomplished by, but is not limited to, employing the CoOp (Context Optimization) method, which can be used to learn language context cues: text prompt information 404 is introduced in the language branch of the CLIP model and embedded into the input set of text information 406, and as the text encoder 402 propagates forward, the introduced text prompt information 404 is used for training. Further, inputting the text prompt information to the text encoder may include, but is not limited to, the following steps:
S1, the text encoder, which consists of K Transformer layers (K being a positive integer greater than 0), tokenizes the input set of text information: each input text (such as a sentence or a paragraph) is decomposed into smaller text units (such as words, sub-words or characters), and each decomposed unit is called a token;

S2, the tokens are projected into the word embedding space to obtain W_0 = [w_1, w_2, …, w_N] ∈ R^{N×d_l}, where W_0 represents the encoded set of text information and corresponds to the fixed text input encoding of the text encoder, each token is mapped to a real-valued vector that attempts to capture the semantic information of the token and its relationship with other tokens, N is the number of tokens (the size of the vocabulary) and d_l is the dimension of the text embedding. W_k is then input directly into the (k+1)-th layer L_{k+1} of the text encoder;

S3, the text prompt information P_l is introduced into each Transformer block of the text encoder and processed together with the word embeddings:

    [ _, W_{k+1} ] = L_{k+1}([P_l, W_k]),  k = 0, 1, …, K−1

where L_{k+1} represents the (k+1)-th Transformer layer of the text encoder, [·,·] represents the splicing operation, and "_" represents the prompt output of the layer, which is discarded. After the text prompt information has been introduced, each subsequent layer processes the text output of the previous layer. The text characterization vector z is then calculated by applying TextProj, the text projection, to the embedding associated with the last token of the final Transformer layer L_K:

    z = TextProj(w_K^N)

TextProj is a linear mapping on the text dimension. Since the dimensions of the characterization vectors generated by the text encoder and the image encoder may not be consistent, a linear mapping is required so that the text characterization vector has the same dimension as the image characterization vector and the similarity can be calculated.
Similarly, fig. 5 is a schematic diagram of another alternative image processing method according to an embodiment of the present application. The image encoder may include, but is not limited to, a convolutional neural network model such as a ResNet model, or a Vision Transformer structure such as a ViT model. The ViT (Vision Transformer) model is a deep learning model for computer vision tasks; unlike a conventional convolutional neural network (CNN), the ViT model uses a Transformer architecture that divides an image into small patches and inputs them as a sequence, learning the relationships and features of different regions of the image through an attention mechanism. The ViT model performs well on tasks such as image classification, object detection and image segmentation, is particularly strong on large-scale data sets, can effectively integrate global and local information of an image, and has good scalability and generalization capability.
Taking the ViT model as an example, as shown in fig. 5, the purpose of training the image encoder 502 may be achieved by, but is not limited to, adopting a VPT (Visual Prompt Tuning) style method: the image prompt information 504 is introduced into the image branch of the CLIP model and embedded into the input target image 506, and as the image encoder 502 propagates forward, the introduced image prompt information 504 is used for training. Inputting the image prompt information to the image encoder may include, but is not limited to, the following steps:
S1, given an input image I ∈ R^{H×W×3}, the image encoder V, which consists of K Transformer layers, divides the image into M fixed-size image patches, where H represents the height of the input image I, W its width, and 3 the number of colour channels (red, green, blue). For example, if each image patch contains A² pixels, the input image I is divided into M image patches, where M is the product of H and W integer-divided by A²:

    M = (H×W) / A²

For instance, a 224×224 image with patch size A = 16 is divided into M = (224×224)/256 = 196 patches;
S2, each image patch is projected into a d_v-dimensional embedding space to form the initial image patch encoding E_0; the conversion of the pixel values of each image patch into a depth-d_v vector may be accomplished by a trainable matrix. Thus, if the image is partitioned into M image patches, the initial encoding E_0 consists of M d_v-dimensional vectors, i.e. E_0 ∈ R^{M×d_v}, where d_v represents the dimension of the image embedding;
S3, the image patch encoding E_k, the class token c_k of the image and the calculated image prompt information P_v are input together to the (k+1)-th layer V_{k+1} of the image encoder and processed sequentially by the subsequent Transformer layers:

    [ c_{k+1}, E_{k+1}, _ ] = V_{k+1}([c_k, E_k, P_v]),  k = 0, 1, …, K−1

where V_{k+1} represents the (k+1)-th Transformer layer of the image encoder, [·,·] represents the splicing operation, and "_" represents the prompt output of the layer, which is discarded. After the image prompt information has been introduced, each subsequent layer processes the output of the previous layer. The image characterization vector x is then calculated by applying ImageProj, a projection operation, to the class token associated with the final Transformer layer V_K:

    x = ImageProj(c_K)

ImageProj is a linear mapping on the image dimension. Since the dimensions of the characterization vectors generated by the text encoder and the image encoder may not be consistent, a linear mapping is required so that the image characterization vector has the same dimension as the text characterization vector and the similarity can be calculated.
Specifically, fig. 6 is a schematic diagram of another alternative image processing method according to an embodiment of the present application. As shown in fig. 6, using a CLIP model as the target multi-mode matching model 602, the text prompt information and the image prompt information can be calculated from the source prompt information 604 together with the scaling matrix 606 and the scaling matrix 608, respectively. Intuitively, if the source prompt information is thought of as a story outline, the text prompt information would be the detailed story written around that outline and the image prompt information would be an illustration or scene drawn from it; for example, if the source prompt information is the name of an application program, the generated text prompt information could be a detailed description of the application and the image prompt information an image related to it. The process of obtaining the text prompt information may include, but is not limited to:
S1, acquiring the source prompt information, which may include, but is not limited to, images, texts, videos and the like;

S2, initializing a scaling matrix M_l ∈ R^{m×n}, where R represents the real number field (the elements of M_l belong to the real number field), m represents the number of rows of the scaling matrix M_l and n its number of columns; the resulting text prompt information has the shape P_l ∈ R^{b×d_l}, where b represents the length of the text prompt information and d_l the text branch dimension of the CLIP model;

S3, generating the text prompt information P_l from the source prompt information P_s using the Kronecker product:

    P_l = P_s ⊗ M_l

Here the scaling matrix is a linear transformation that can change the position and size of each point in a matrix. Scaling matrices, which can scale objects along coordinate axes to change their size and shape, are often used in computer graphics and computer vision to adjust the size and scale of images, or to change the position and size of objects in space; their role mainly includes changing the size, scale, shape and position of objects. The Kronecker product, also known as the direct product or tensor product, is a widely used operation in matrix algebra, multi-linear algebra and abstract algebra; it combines two matrices into one large matrix and is useful for constructing linear transformations in high-dimensional space. For example, given two matrices A and B, where A is an m×n matrix and B is a p×q matrix, the Kronecker product A ⊗ B is an mp×nq matrix in which each element a_ij (i-th row, j-th column of A) is multiplied by all elements of matrix B to form a block that replaces the original element a_ij of A.
Similarly, the process of obtaining the image prompt information may include, but is not limited to:
S1, acquiring the source prompt information, which may include, but is not limited to, images, videos and the like;

S2, initializing a scaling matrix M_v ∈ R^{m×n}, where R represents the real number field (the elements of M_v belong to the real number field); the resulting image prompt information has the shape P_v ∈ R^{b×d_v}, where b represents the length of the image prompt information and d_v the image branch dimension of the CLIP model;

S3, generating the image prompt information P_v from the same source prompt information P_s using the Kronecker product:

    P_v = P_s ⊗ M_v
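For concreteness, the following small sketch shows the Kronecker-product construction of both prompts from one shared source prompt using torch.kron; all dimensions are illustrative assumptions.

    import torch

    P_s = torch.randn(1, 64)     # source prompt information (shared by both branches)
    M_l = torch.randn(4, 8)      # text scaling matrix
    M_v = torch.randn(4, 12)     # image scaling matrix

    P_l = torch.kron(P_s, M_l)   # text prompt information,  shape (4, 512) = (b, d_l)
    P_v = torch.kron(P_s, M_v)   # image prompt information, shape (4, 768) = (b, d_v)

Because only P_s, M_l and M_v need to be learned, the two prompts stay coupled through the shared source prompt while the number of trainable parameters remains small.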
Illustratively, fig. 7 is a schematic diagram of still another alternative image processing method according to an embodiment of the present application. As shown in fig. 7, after the text prompt information 702 and the image prompt information 704 have been obtained and training of the target multi-mode matching model 706 has been completed, the target image 708 and the set of text information 710 are input into the target multi-mode matching model 706; the target image characterization vector is produced by the image encoder and the set of text characterization vectors is produced by the text encoder. The target image characterization vector may include feature information of the target image, and each text characterization vector in the set may include feature information of one piece of text information in the set of text information.
S206, determining the similarity between the target image characterization vector and each text characterization vector in the set of text characterization vectors through the target multi-mode matching model, and determining the target content category indicated by the text characterization vector with the similarity meeting the preset similarity condition as the content category corresponding to the target image characterization vector;
Optionally, in the embodiment of the present application, the similarity between the target image characterization vector and each text characterization vector may be determined by, but is not limited to, calculating a cosine similarity score between the target image characterization vector and each text characterization vector in the set of text characterization vectors, obtaining a set of cosine similarity scores, and taking the content category represented by the text characterization vector with the largest cosine similarity score as the content category of the target image. Calculating a cosine similarity score is a method for measuring the similarity between two vectors and can be used to compare the feature vector of an image with the feature vector of a known category: given an image, its feature vector is first extracted, and then the cosine similarity between it and the text feature vector of each category is calculated; the category with the highest cosine similarity score is taken as the predicted content category for the given image. Assuming two vectors A and B, their cosine similarity can be calculated by the following formula:

    cos(A, B) = (A·B) / (‖A‖ · ‖B‖)

where A·B represents the dot product of vector A and vector B, ‖A‖ represents the modulus of vector A and ‖B‖ the modulus of vector B.
The specific calculation steps are as follows:
S1, calculate the dot product of vector A and vector B: A·B = Σ_{i=1}^{n} A_i·B_i, where n is the dimension of vectors A and B, i.e. the number of elements they contain (the dot product requires A and B to have the same number of elements);

S2, calculate the moduli ‖A‖ and ‖B‖ of vector A and vector B respectively;

S3, calculate the cosine similarity of vector A and vector B: cos(A, B) = (A·B) / (‖A‖ · ‖B‖).
it should be noted that, the range of values of the cosine similarity calculated in S3 may be between [ -1,1], where a value closer to 1 indicates that the two vector directions are more similar, a value closer to-1 indicates that the two vector directions are opposite, and a value of 0 indicates that the two vector directions are perpendicular.
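A minimal sketch of steps S1–S3 in PyTorch is shown below; the vector dimension and the number of text characterization vectors are arbitrary examples.

    import torch

    def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return torch.dot(a, b) / (a.norm() * b.norm())   # cos(A, B) = A.B / (|A||B|)

    x = torch.randn(512)                     # target image characterization vector
    z = torch.randn(3, 512)                  # a set of three text characterization vectors
    scores = torch.stack([cosine_similarity(x, zi) for zi in z])
    best = int(scores.argmax())              # index of the most similar text characterization vector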
Further, the above-mentioned preset similarity condition may be understood as follows: when the cosine distance between the target image characterization vector and one of the text characterization vectors in the set is less than or equal to a flexibly set cosine distance threshold, the content category of the target image is the above-mentioned target content category, which is determined by that target text characterization vector; that is, the text characterization vector whose similarity satisfies the preset similarity condition indicates the target content category. The target content category may include, but is not limited to, a pet category, a portrait category, a landscape category, a vehicle category, and the like.
Illustratively, fig. 8 is a schematic diagram of yet another alternative image processing method according to an embodiment of the present application. As shown in fig. 8, taking a CLIP model as the target multi-modal matching model 802, the content category corresponding to the target image characterization vector may be obtained by, but is not limited to, the following:

    p(y_i | x) = exp(sim(x, z_i)/τ) / Σ_{j=1}^{C} exp(sim(x, z_j)/τ)

where p(y_i | x) represents the probability that the target image belongs to the content category corresponding to the i-th text characterization vector, sim(·,·) represents the cosine similarity calculation, x represents the target image characterization vector, z_i represents one text characterization vector of the set of C text characterization vectors, and τ is the learned temperature coefficient 804 in the target multi-modal matching model. It should be noted that the temperature coefficient 804 controls how sharply the similarity scores are converted into the probability distribution over content categories: a lower temperature concentrates the prediction on the most similar text characterization vector, while a higher temperature spreads the probability more evenly.
Further, a target audit result may be generated according to the target content category, where the target audit result is used to indicate whether the target content category belongs to the preset content category set.
Optionally, in this embodiment of the present application, the target review result may include, but is not limited to, a target content category, image information included in a target image, and the like, and the target review result is compared with a content category in the preset content category set, where it is assumed that a content category identical to the target content category exists in the preset content category, that is, the target content category belongs to the preset content category set, and otherwise, a content category identical to the target content category does not exist in the preset content category, that is, the target content category does not belong to the preset content category set.
In an exemplary embodiment, fig. 9 is a schematic diagram of another alternative image processing method according to an embodiment of the present application, where the image processing method set forth in the present application may be applied in a background image auditing system of a social application, and it may be determined, according to a target auditing result finally generated by the method, whether an image uploaded by an object account of a social application may be displayed on a terminal interface of another object account of the social application, and specific implementation steps may be as shown in fig. 9, including, but not limited to:
S902, logging in a social application program by an object account through terminal equipment, and uploading a target image, wherein the terminal equipment can comprise, but is not limited to, a smart phone, a computer, a notebook, a smart wearable device and the like;
S904, detecting the target image by a background image auditing system of the social application program, and further acquiring the target image, wherein the image auditing system may include, but is not limited to, a distributed server, a cloud server, and the like;
S906, acquiring a set of text information predetermined in the image auditing system, wherein the text information includes a plurality of descriptive characters, sentences, paragraphs, and the like;
S908, inputting the target image and the set of text information into a pre-trained target multi-modal matching model in the image auditing system to obtain a target image characterization vector and a set of text characterization vectors, wherein the target multi-modal matching model may include, but is not limited to, a CLIP model and the like;
S910, calculating the similarity between the target image characterization vector and each text characterization vector in the set of text characterization vectors using the cosine similarity score;
S912, determining the text characterization vector with the maximum similarity to the target image characterization vector calculated in S910 as the text characterization vector whose similarity satisfies the preset similarity condition;
S914, taking the content category indicated by the text characterization vector obtained in S912 as the content category of the target image;
S916, generating a target auditing result according to the content category of the target image.
Further, if the target review result indicates that the content category of the target image does not belong to the preset content category set of the image review system, the target image uploaded by the object account cannot be displayed on the terminal interfaces of the other object accounts.
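For readers who wish to reproduce steps S902-S916 informally, a minimal sketch is given below. It assumes the openly available CLIP package (clip.load / clip.tokenize / encode_image / encode_text) is used; the category phrases, the allowed category set, and the file name are illustrative assumptions rather than values prescribed by the present application:

```python
import torch
import clip                      # openai/CLIP package (assumed environment)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# One text description per content category in the preset category set (illustrative).
categories = ["a photo of a pet", "a photo of a person",
              "a photo of a landscape", "a photo of a vehicle"]
allowed = {"a photo of a landscape"}              # hypothetical preset (allowed) category set

image = preprocess(Image.open("uploaded.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(categories).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)  # cosine similarities (S910)

best = int(sims.argmax())                         # S912
target_category = categories[best]                # S914
audit_result = {"category": target_category,
                "approved": target_category in allowed}   # S916
print(audit_result)
```

In this sketch the argmax over cosine similarities plays the role of the preset similarity condition; a thresholded variant based on the cosine distance is sketched later in connection with the cosine distance threshold.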
In still another exemplary embodiment, the image processing method provided in the present application may be applied to a background video auditing system of a video playing application, and it may be determined whether a video uploaded by an object account of a video playing application may be displayed on a terminal interface of another object account of the video playing application according to a target auditing result finally generated by the method, and specific implementation steps may be as follows, as shown in fig. 9, including but not limited to:
S1, an object account logs in to a video playing application program through a terminal device and uploads a target video, wherein the terminal device may include, but is not limited to, a smart phone, a computer, a notebook, a smart wearable device, and the like;
S2, a background video auditing system of the video playing application program detects the target video and further acquires a target image according to video frames of the target video, wherein the video auditing system may include, but is not limited to, a distributed server, a cloud server, and the like; the target image may be any video frame image of the target video, and a plurality of video frame images may also be determined as target images, for which the following steps are executed respectively;
S3, acquiring a set of text information predetermined in the video auditing system, the text information including a plurality of descriptive characters, sentences, paragraphs, and the like;
S4, inputting the target image and the set of text information into a pre-trained target multi-modal matching model in the video auditing system to obtain a target image characterization vector and a set of text characterization vectors, wherein the target multi-modal matching model may include, but is not limited to, a CLIP model and the like;
S5, calculating the similarity between the target image characterization vector and each text characterization vector in the set of text characterization vectors using the cosine similarity score;
S6, determining the text characterization vector with the maximum similarity to the target image characterization vector obtained in S5 as the text characterization vector whose similarity satisfies the preset similarity condition;
S7, taking the content category indicated by the text characterization vector obtained in S6 as the content category of the target image;
S8, generating a target auditing result according to the content category of the target image.
Further, if the target review result indicates that the content category of the target image does not belong to the preset content category set of the video review system, the target video uploaded by the object account cannot be displayed on the terminal interfaces of the other object accounts.
According to the embodiment of the present application, the target image to be audited and a set of predetermined text information are acquired, wherein one piece of text information in the set of text information is used for representing one content category in a preset content category set. The target image and the set of text information are input into a pre-trained target multi-modal matching model to obtain a target image characterization vector and a set of text characterization vectors, wherein the target multi-modal matching model includes a text encoder, an image encoder, text prompt information and image prompt information; the text prompt information and the image prompt information are obtained by calculation from source prompt information, the image prompt information is input into the image encoder together with the target image to obtain the target image characterization vector, and the text prompt information is input into the text encoder together with the set of text information to obtain the set of text characterization vectors. The similarity between the target image characterization vector and each text characterization vector in the set of text characterization vectors is determined through the target multi-modal matching model, and the target content category indicated by the text characterization vector whose similarity satisfies the preset similarity condition is determined as the content category corresponding to the target image characterization vector. Further, the content category of the target image can be used as a basis to generate the target auditing result, thereby achieving the purpose of rapidly distinguishing the content category of the target image, shortening the time needed to distinguish the content category of the target image, achieving the technical effect of improving the content auditing efficiency for the target image, and solving the technical problem of low efficiency when image processing is performed using a multi-modal matching model.
In addition, the text prompt information and the image prompt information are determined through the same source prompt information in the target multi-mode matching model, the source prompt information and the scaling matrix are utilized for calculation, the text prompt information and the image prompt information are obtained, the relevance between the text prompt information and the image prompt information is improved, the calculated text prompt information and image prompt information can reduce the number of parameters required to be trained by the target multi-mode matching model, and therefore training efficiency and model performance of the target multi-mode matching model are improved.
As an alternative, the inputting the target image and the set of text information into a pre-trained target multi-mode matching model to obtain a target image characterization vector and a set of text characterization vectors, including: performing matrix conversion processing on the source prompt information and the image scaling matrix to determine the image prompt information, wherein the matrix conversion processing is used for determining an element at a first position in the image prompt information corresponding matrix as a product of the element at the first position in the source prompt information and the element at the first position in the image scaling matrix; performing stitching operation on the target image and the image prompt information to determine an image stitching vector; and mapping the image stitching vector to a target embedding space, and determining the target image characterization vector.
Alternatively, in an embodiment of the present application, the image scaling matrix may be expressed as M_V, a matrix with m rows and n columns whose elements belong to the real number field R, where b represents the length of the image prompt information, d_v represents the image branch dimension of the target multi-modal matching model, m represents the number of rows of the scaling matrix M_V, and n represents its number of columns. The scaling matrix may include a plurality of scaling matrices, one of which is the image scaling matrix. The first position may be understood as the position of any matrix element in the matrix corresponding to the image prompt information, which likewise includes m rows and n columns of elements; A represents the source prompt information, A_ij represents the element at the first position in the source prompt information, that is, the position of the element A_ij is the first position, with i smaller than m and j smaller than n.
Illustratively, the matrix conversion processing may be understood as extracting the element A_ij from the source prompt information and combining it with the element M_ij at the first position of the image scaling matrix, including, but not limited to, performing a product operation on the element A_ij and the element M_ij to obtain a product result and determining the product result as the element at the first position in the matrix corresponding to the image prompt information.
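A minimal sketch of this element-wise matrix conversion processing, under the assumption that the source prompt information and the image scaling matrix have the same m × n shape, could look as follows:

```python
import torch

# Assumed shapes: the source prompt A and the image scaling matrix M_V are both m x n,
# matching the element-wise "matrix conversion processing" described above.
m, n = 4, 8
A = torch.randn(m, n)      # source prompt information
M_V = torch.randn(m, n)    # image scaling matrix

P_V = A * M_V              # element (i, j) of the image prompt equals A[i, j] * M_V[i, j]
assert torch.equal(P_V[2, 3], A[2, 3] * M_V[2, 3])
print(P_V.shape)
```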
Further, the position of each element in the matrix corresponding to the image prompt information is determined as the first position in turn, and the value of each element in the matrix corresponding to the image prompt information is obtained through the matrix conversion processing. A stitching operation is then performed on the target image and the image prompt information to obtain an image stitching vector, where the image stitching vector includes the target image information and the image prompt information; the image stitching vector is then mapped to the target embedding space, which may include, but is not limited to, performing projection processing using ImageProj to generate the target image characterization vector. The stitching operation may be performed on the target image and the image prompt information in the following manner:

[c_0, E_0, P_v]

that is, the image prompt information P_v is spliced with the classification token c_0 and the patch encodings E_0 that form the input of the image encoder.
Further, after the image prompt information has been introduced and processed by the K-th Transformer layer, each subsequent layer processes the image prompt output by the previous layer; the target image characterization vector x is then obtained by using ImageProj to project the embedding output by the final Transformer layer into the target embedding space:

x = ImageProj(c)

where c denotes the classification token output by the last Transformer layer of the image encoder.
It should be noted that, taking the CLIP model as an example of the target multi-modal matching model, the target embedding space may be understood as the feature vector space corresponding to the image stitching vector obtained through ImageProj projection after the last Transformer layer of the image encoder in the CLIP model; this feature vector space represents the abstract representation obtained by encoding the input image, and different image stitching vectors correspond to different specific vector representations.
As an optional solution, the performing a stitching operation on the target image and the image prompt information to determine an image stitching vector includes: dividing the target image into a plurality of image patches; projecting the plurality of image patches into an image embedding space to obtain a plurality of patch coding vectors; and performing a stitching operation on the patch code vectors and the image prompt information to determine the image stitching vector.
In this embodiment of the present application, an image patch may be understood as the image information contained in one image area of the target image. After the target image is divided to obtain a plurality of image patches, the plurality of image patches may be projected into the image embedding space, for example but not limited to by using ImageProj, to obtain a plurality of patch code vectors, where one patch code vector represents the image information contained in one image area of the target image. Specifically, after the plurality of patch code vectors are obtained, a stitching operation is performed on the patch code vectors together with the image prompt information to obtain the image stitching vector.
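As a hedged sketch of the patch handling described above (a ViT-style convolutional patch projection is assumed, and the prompt length and dimensions are arbitrary example values):

```python
import torch
import torch.nn as nn

class PatchStitcher(nn.Module):
    """Split an image into patches, project them to patch code vectors,
    and stitch the image prompt tokens onto the sequence (illustrative only)."""
    def __init__(self, patch: int = 16, dim: int = 768, prompt_len: int = 4):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patch projection
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim))        # image prompt tokens

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        patches = self.proj(img)                       # (B, dim, H/p, W/p)
        patches = patches.flatten(2).transpose(1, 2)   # (B, num_patches, dim) patch code vectors
        prompt = self.prompt.expand(img.size(0), -1, -1)
        return torch.cat([patches, prompt], dim=1)     # image stitching vector (sequence)

stitched = PatchStitcher()(torch.randn(2, 3, 224, 224))
print(stitched.shape)   # (2, 196 + 4, 768)
```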
According to the embodiment of the application, matrix conversion processing is performed on the source prompt information and the image scaling matrix to obtain the value of each element in the corresponding matrix of the image prompt information, then the splicing operation is performed on the target image and the image prompt information to obtain the image splicing vector, the image splicing vector is mapped to the target embedded space, the reliability of the source of the image prompt information is ensured in a mode of generating the target image characterization vector, and therefore accurate training is performed on the target multi-mode matching model by using more accurate image prompt information, and the technical effect of improving the rate of acquiring the image auditing result according to the target multi-mode matching model is achieved.
As an alternative, the inputting the set of text information into the pre-trained target multi-modal matching model to obtain a set of text token vectors includes: performing matrix conversion processing on the source prompt information and the text scaling matrix to determine the text prompt information, wherein the matrix conversion processing is used for determining an element at a second position in the text prompt information corresponding matrix as a product of the element at the second position in the source prompt information and the element at the second position in the text scaling matrix; respectively executing splicing operation on the group of text information and the text prompt information to determine a group of text splicing vectors; and mapping the set of text splicing vectors to a target embedded space respectively, and determining the set of text characterization vectors.
Alternatively, in the embodiment of the present application, the text scaling matrix may be expressed as M_l, a matrix with m rows and n columns whose elements belong to the real number field R, where b represents the length of the text prompt information, d_l represents the text branch dimension of the target multi-modal matching model, m represents the number of rows of the scaling matrix M_l, and n represents its number of columns. The scaling matrix may include a plurality of scaling matrices, one of which is the text scaling matrix. The second position may be understood as the position of any matrix element in the matrix corresponding to the text prompt information, which likewise includes m rows and n columns of elements; B represents the source prompt information, B_ij represents the element at the second position in the source prompt information, that is, the position of the element B_ij is the second position, with i smaller than m and j smaller than n.
Illustratively, the matrix conversion processing may be understood as extracting the element B_ij from the source prompt information and combining it with the element M_ij at the second position of the text scaling matrix, including, but not limited to, performing a product operation on the element B_ij and the element M_ij to obtain a product result and determining the product result as the element at the second position in the matrix corresponding to the text prompt information.
Further, the position of each element in the matrix corresponding to the text prompt information is determined as the second position in turn, and the value of each element in the matrix corresponding to the text prompt information is obtained through the matrix conversion processing. A splicing operation is then performed on the set of text information and the text prompt information to obtain the text splicing vectors, where a text splicing vector includes the text information and the text prompt information; the text splicing vectors are then mapped to the target embedding space, which may include, but is not limited to, performing projection processing using TextProj, so as to generate the text characterization vectors among which the one whose similarity satisfies the preset similarity condition is later selected. The splicing operation may be performed on the text information and the text prompt information in the following manner:

[P_l, W_0]

where L_k represents the k-th layer of the text encoder and [,] indicates the splicing operation; that is, the text prompt information P_l is spliced with the word embeddings W_0 that form the input of the text encoder. Further, after the text prompt information has been introduced into and processed by the K-th Transformer layer, each subsequent layer processes the text prompt output by the previous layer, and the text characterization vector z is calculated as:

z = TextProj(W_K)

where W_K denotes the text embedding output by the final Transformer layer of the text encoder.
It should be noted that, taking the CLIP model as an example of the target multi-modal matching model, the target embedding space may be understood as the feature vector space corresponding to the text splicing vector obtained through TextProj projection after the last Transformer layer of the text encoder in the CLIP model; this feature vector space represents the abstract representation obtained by encoding the input text, and different text splicing vectors correspond to different specific vector representations. TextProj is a linear mapping operation: because the dimensions of the characterization vectors generated by the text encoder and the image encoder may not be consistent, a linear mapping is needed to obtain a text characterization with the same dimension as the image characterization, so as to facilitate the calculation of the similarity.
As an alternative, the performing a stitching operation on the set of text information and the text prompt information, to determine a set of text stitching vectors includes: the method comprises the steps of respectively executing splicing operation on the group of text information and the text prompt information to determine a group of text splicing vectors, wherein the text information of each execution of the splicing operation is regarded as current text information, and the text splicing vector obtained by each execution of the splicing operation is regarded as current text splicing vector: executing word segmentation operation on the current text information to obtain a group of word segmentation; and projecting the group of component words into a text embedding space to obtain the current text splicing vector.
Illustratively, in the embodiment of the present application, the above set of text splicing vectors correspond to the different pieces of text information in the above set of text information, and the word segmentation operation is performed on each piece of text information to obtain a set of words. For example, the set of text information includes text information 1: "this is an image of a black and white spot dog", text information 2: "this is an image of a black and white striped shirt", text information 3, and so on. When the current text information is text information 1, the word segmentation operation is performed on text information 1 to obtain a set of words including ["this is", "one", "black and white", "spot", "dog", "image"] and the like, and the word segmentation operation is then performed on text information 2 and text information 3 in turn. It should be noted that the word segmentation operation on one piece of text information may flexibly divide the text information into a set of words, and the content, form, and the like of the specific segmentation are not limited in the present application.
Further, after the above set of words is obtained, the set of words may be projected into the text embedding space, for example but not limited to by using TextProj, to obtain the corresponding word embedding vectors, and one text splicing vector then represents the text information contained in one piece of text information in the set of text information. Specifically, for each piece of text information, a splicing operation is performed on its word embedding vectors and the text prompt information, so as to obtain the set of text splicing vectors.
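A toy sketch of the word segmentation, text embedding, and splicing steps described above is given below; the whitespace tokenizer, on-the-fly vocabulary, embedding size, and prompt length are simplifying assumptions:

```python
import torch
import torch.nn as nn

def word_segmentation(text: str) -> list:
    # Stand-in for the word segmentation operation; a real system may use any tokenizer.
    return text.lower().split()

vocab: dict = {}                           # toy vocabulary built on the fly
embed = nn.Embedding(1000, 64)             # projection into the text embedding space
text_prompt = torch.randn(4, 64)           # text prompt information (assumed length 4)

def text_splicing_vector(text: str) -> torch.Tensor:
    words = word_segmentation(text)
    ids = torch.tensor([vocab.setdefault(w, len(vocab)) for w in words])
    word_vecs = embed(ids)                               # one embedding per word
    return torch.cat([text_prompt, word_vecs], dim=0)    # splice prompt and word embeddings

v = text_splicing_vector("this is an image of a black and white spot dog")
print(v.shape)                             # (4 + number of words, 64)
```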
According to the embodiment of the application, matrix conversion processing is performed on the source prompt information and the text scaling matrix to obtain the value of each element in the text prompt information corresponding matrix, then a group of text information and the text prompt information are subjected to splicing operation to obtain text splicing vectors, the text splicing vectors are mapped to the target embedded space, the text characterization vectors with the similarity meeting the preset similarity condition are generated, the reliability of the source of the text prompt information is ensured, and therefore accurate training is performed on the target multi-mode matching model by using more accurate text prompt information, and the technical effect of improving the speed of acquiring text auditing results according to the target multi-mode matching model is achieved.
As an alternative, the method further includes: acquiring a sample image and a group of sample text information, wherein the sample image is pre-marked with a text information corresponding to a target sample, and the group of sample text information comprises the target sample text information; inputting the sample image and the set of sample text information into an initial multi-mode matching model to be trained to obtain a sample image characterization vector and a set of sample text characterization vector, wherein the initial multi-mode matching model comprises the text encoder, the image encoder, initial text prompt information and initial image prompt information, and the initial text prompt information and the initial image prompt information are obtained by calculating initial source prompt information; determining sample similarity between the sample image characterization vector and each sample text characterization vector in the set of sample text characterization vectors by the initial multi-modal matching model; and calculating a loss parameter based on the sample similarity, and adjusting the initial source prompt information and the initial scaling matrix by using the loss parameter until the initial multi-modal matching model is trained into the target multi-modal matching model.
Optionally, in this embodiment of the present application, the sample image may include, but is not limited to, a portrait image, a pet image, a landscape image, and the like, the set of sample text information may include, but is not limited to, a sentence with a descriptive property, a paragraph, a text character, and the like, and the sample image has a label, which indicates that the sample image has a correspondence with the target sample text information, for example, the sample image is labeled with an image category that is a pet, that is, the image content category indicated by the target sample text information is also a pet.
Specifically, after the sample image and the set of sample text information are acquired, the initial multi-modal matching model is trained by using the sample image and the set of sample text information; the initial multi-modal matching model may include, but is not limited to, a CLIP model. The sample image and the set of sample text information are input into the initial multi-modal matching model to obtain a sample image characterization vector and a set of sample text characterization vectors, and the initial multi-modal matching model is trained based on the sample image characterization vector and the set of sample text characterization vectors, where the training process of the initial multi-modal matching model includes training the text encoder and the image encoder.
Further, the initial source prompt is illustrated as follows, and assuming that the initial source prompt is a story line, the initial text prompt may be a detailed story written around the story line, and the initial image prompt may be an illustration or a scene graph of the story line, for example, the source prompt is a name of an application program, and the generated initial text prompt may be detailed description text about the application program based on the source prompt, and the initial image prompt may be an image related to the application program, and the acquiring process of the initial text prompt may include, but is not limited to:
S1, acquiring initial source prompt information, which may include, but is not limited to, images, texts, videos, and the like;
S2, initializing an initial text scaling matrix M_l with m rows and n columns of elements belonging to the real number field R, where b represents the length of the initial text prompt information, d_l represents the text branch dimension of the CLIP model, m represents the number of rows of the scaling matrix M_l, and n represents its number of columns;
S3, generating the initial text prompt information P_l by using the Kronecker inner product, as follows:

P_l = M_l ⊗ P_S

where P_S denotes the initial source prompt information.
Similarly, the process of obtaining the initial image prompt information may include, but is not limited to:
S1, acquiring the initial source prompt information, which may include, but is not limited to, images, videos, and the like;
S2, initializing an initial image scaling matrix M_v with m rows and n columns of elements belonging to the real number field R, where b represents the length of the initial image prompt information and d_v represents the image branch dimension of the CLIP model;
S3, generating the initial image prompt information P_v by using the Kronecker inner product, as follows:

P_v = M_v ⊗ P_S
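Under the assumption that the shapes are chosen so that the Kronecker products yield the desired prompt sizes, steps S1-S3 above can be sketched with torch.kron as follows:

```python
import torch

# Assumed sizes only: P_S is the shared source prompt; M_l and M_v are the text and image
# scaling matrices; torch.kron implements the Kronecker product used in steps S3 above.
P_S = torch.randn(2, 16)       # initial source prompt information
M_l = torch.randn(2, 32)       # initial text scaling matrix
M_v = torch.randn(2, 48)       # initial image scaling matrix

P_l = torch.kron(M_l, P_S)     # initial text prompt information,  shape (4, 512)
P_v = torch.kron(M_v, P_S)     # initial image prompt information, shape (4, 768)

# Only P_S, M_l and M_v need to be stored and learned; P_l and P_v are derived from them.
print(P_l.shape, P_v.shape)
```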
After the initial text prompt information and the initial image prompt information are obtained, the initial multi-modal matching model can be trained: the sample image and the set of sample text information are input into the initial multi-modal matching model, the sample image characterization vector is processed and output by the image encoder, and the set of sample text characterization vectors is processed and output by the text encoder, where the sample image characterization vector may include feature information of the sample image, and one sample text characterization vector in the set of sample text characterization vectors may include feature information of one piece of sample text information in the set of sample text information.
Optionally, in this embodiment of the present application, the similarity between the sample image characterization vector and each sample text characterization vector may be determined by, but not limited to, calculating the cosine similarity score between the sample image characterization vector and each sample text characterization vector in the set of sample text characterization vectors, and using this set of cosine similarity scores as the sample similarity. Adjusting the initial source prompt information and the initial scaling matrix by using the loss parameter means adjusting the content of the initial source prompt information as well as the matrix elements, the size, and the like of the initial scaling matrix; the initial multi-modal matching model corresponding to the initial source prompt information and the initial scaling matrix obtained when training ends is updated to the target multi-modal matching model, which can output a prediction result of the content category of the target image. The loss parameter of the initial multi-modal matching model may be calculated from the sample similarity in order to train the initial multi-modal matching model, including, but not limited to, the following steps:
S1, determining a loss function: first determining a loss function for the initial multi-modal matching model, such as a cross-entropy loss function or a mean square error loss function;
S2, calculating the similarity between the sample image characterization vector and each sample text characterization vector, which may include, but is not limited to, calculating the similarity by a method such as cosine similarity, Euclidean distance, or correlation coefficient;
S3, adjusting the loss parameters: adjusting the parameters of the loss function according to the calculated sample similarity, and weighting the loss function according to the similarity so that samples with higher similarity contribute more to the loss and samples with lower similarity contribute less;
S4, training the model: training the multi-modal matching model by using the adjusted loss parameters, where, during training, the loss function is adjusted according to the sample similarity, that is, the content of the initial source prompt information and the matrix elements, size, and the like of the initial scaling matrix are adjusted.
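A highly simplified training-loop sketch of steps S1-S4 is given below; the frozen encoders are replaced by stub features so the example stays self-contained, and only the source prompt and the two scaling matrices are optimized, which is the intent described above (all sizes and the learning rate are assumptions):

```python
import torch
import torch.nn.functional as F

P_S = torch.randn(2, 16, requires_grad=True)   # initial source prompt information
M_l = torch.randn(2, 32, requires_grad=True)   # initial text scaling matrix
M_v = torch.randn(2, 48, requires_grad=True)   # initial image scaling matrix
optimizer = torch.optim.Adam([P_S, M_l, M_v], lr=1e-3)

frozen_text_feat = torch.randn(4, 512)         # stand-in for frozen text-encoder features
frozen_image_feat = torch.randn(8, 512)        # stand-in for frozen image-encoder features
labels = torch.randint(0, 4, (8,))             # index of the target sample text per image
temperature = 0.07

for step in range(10):
    P_l = torch.kron(M_l, P_S)                 # text prompt derived from the source prompt
    P_v = torch.kron(M_v, P_S)                 # image prompt derived from the source prompt
    # The stub "encoders" add a slice of the prompt so a gradient path back to P_S exists.
    txt = F.normalize(frozen_text_feat + P_l.flatten()[:512], dim=-1)
    img = F.normalize(frozen_image_feat + P_v.flatten()[:512], dim=-1)
    logits = img @ txt.T / temperature         # sample similarities (cosine, scaled)
    loss = F.cross_entropy(logits, labels)     # loss parameter computed from similarities
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(float(loss))
```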
According to the embodiment of the application, the sample image and a group of sample text information are used for generating the sample image characterization vector and a group of sample text characterization vector, the sample similarity between the sample image characterization vector and each sample text characterization vector is calculated, the loss parameter of the initial multi-mode matching model is calculated according to the sample similarity, the initial source prompt information and the initial scaling matrix are adjusted according to the loss parameter, the purpose of training the initial multi-mode matching model to obtain the target multi-mode matching model is achieved, the content category of the target image is output through the target multi-mode matching model, and further the technical effect of improving the accuracy of the prediction result of the content category of the target image is achieved.
As an alternative, the inputting the sample image and the set of sample text information into an initial multimodal matching model to be trained to obtain a sample image token vector and a set of sample text token vectors includes: acquiring the initial source prompt information, the initial image scaling matrix and the initial text scaling matrix, wherein the initial source prompt information, the initial image scaling matrix and the initial text scaling matrix are parameters which allow adjustment in the process of training the initial multi-mode matching model; performing matrix conversion processing on the initial source prompt information and the initial image scaling matrix to determine the initial image prompt information, wherein the matrix conversion processing is used for determining an element at a third position in the initial image prompt information corresponding matrix as a product of the element at the third position in the initial source prompt information and the element at the third position in the initial image scaling matrix; performing the matrix conversion process on the initial source prompt information and the initial text scaling matrix to determine the initial text prompt information, wherein the matrix conversion process is used for determining an element at a fourth position in the initial text prompt information corresponding matrix as a product of the element at the fourth position in the initial source prompt information and the element at the fourth position in the initial text scaling matrix; performing a stitching operation on the sample image and the initial image prompt information to determine a sample image stitching vector, and performing a stitching operation on the set of sample text information and the initial text prompt information to determine a set of sample text stitching vectors; and mapping the sample image stitching vector and the set of sample text stitching vectors to an initial embedding space, and determining the sample image characterization vector and the set of sample text characterization vectors.
Optionally, in the embodiment of the present application, in training the initial multi-modal matching model, training of the initial multi-modal matching model may be achieved by adjusting the initial image scaling matrix and the initial text scaling matrix.
Further, the position of each element in the matrix corresponding to the initial image prompt information is determined as the third position in turn, and the value of each element in that matrix is obtained through the matrix conversion processing. A stitching operation is performed on the sample image and the initial image prompt information to obtain a sample image stitching vector, where the sample image stitching vector includes the sample image information and the initial image prompt information; the sample image stitching vector is then mapped to the initial embedding space, which may include, but is not limited to, performing projection processing using ImageProj to generate the sample image characterization vector. After the initial image prompt information has been introduced and processed by the K-th Transformer layer, each subsequent layer processes the image prompt output by the previous layer, and the sample image characterization vector is calculated. Similarly, the value of each element in the matrix corresponding to the initial text prompt information is obtained through the matrix conversion processing; a splicing operation is performed on the sample text information and the initial text prompt information to obtain a sample text splicing vector, where the sample text splicing vector includes the sample text information and the initial text prompt information; the sample text splicing vector is then mapped to the initial embedding space, which may include, but is not limited to, performing projection processing using TextProj to generate the sample text characterization vector. After the initial text prompt information has been introduced and processed by the K-th Transformer layer, each subsequent layer processes the text prompt output by the previous layer, and the sample text characterization vector is calculated.
As an alternative, the inputting the target image and the set of text information into a pre-trained target multi-mode matching model to obtain a target image characterization vector and a set of text characterization vectors, including: setting the corresponding source prompt information and a scaling matrix for each layer of image matching sub-model under the condition that the target multi-mode matching model comprises K layers of image matching sub-models, wherein the output of an ith layer of image matching sub-model in the K layers of image matching sub-models and the image prompt information corresponding to an (i+1) th layer of image matching sub-model are used as the input of the (i+1) th layer of image matching sub-model together, K is a positive integer greater than 1, and i is a positive integer smaller than K; setting the corresponding source prompt information and the scaling matrix for each layer of text matching sub-model under the condition that the target multi-mode matching model comprises a K-layer text matching sub-model, wherein the output of an ith layer of text matching sub-model in the K-layer text matching sub-model and the text prompt information corresponding to an (i+1) th layer of text matching sub-model are used as the input of the (i+1) th layer of text matching sub-model together.
Optionally, in this embodiment of the present application, the target multi-mode matching model includes K layers of image matching sub-models, each layer of image matching sub-model has one source hint information and a scaling matrix, that is, the number of source hint information is K, the number of scaling matrices is K, it should be noted that the source hint information of each layer may be different or the same, and the scaling matrix of each layer may be different or the same.
Further, the output information of any one layer of the K layers of image matching sub-models is used, together with the image prompt information of the next layer of image matching sub-model, as the input information of that next layer; that is, the output result of the previous layer of image matching sub-model influences the image information processing of the next layer of image matching sub-model, so that each layer of image matching sub-model can process and match the input information more deeply. For example, suppose there are 50 layers of image matching sub-models in total; if the output result of the 10th layer of image matching sub-model is A, then the input information of the 11th layer of image matching sub-model is A together with the image prompt information corresponding to the 11th layer of image matching sub-model.
Similarly, the target multi-modal matching model includes K layers of text matching sub-models, each layer of text matching sub-model having its own source prompt information and scaling matrix, that is, the number of source prompt information items is K and the number of scaling matrices is K. It should be noted that the source prompt information of each layer may be different or the same, and the scaling matrix of each layer may be different or the same. The output information of any one layer of the K layers of text matching sub-models is used, together with the text prompt information of the next layer of text matching sub-model, as the input information of that next layer; that is, the output result of the previous layer of text matching sub-model influences the text information processing of the next layer of text matching sub-model, so that each layer of text matching sub-model can process and match the input information more deeply. For example, suppose there are 100 layers of text matching sub-models in total; if the output result of the 20th layer of text matching sub-model is B, then the input information of the 21st layer of text matching sub-model is B together with the text prompt information corresponding to the 21st layer of text matching sub-model.
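A hedged sketch of this layer-wise arrangement follows: each layer owns its source prompt and scaling matrix, derives its prompt via the Kronecker product, and splices it onto the output of the previous layer. The sizes, layer count, and the use of a standard Transformer encoder block are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PromptedLayer(nn.Module):
    """One (image or text) matching sub-model layer with its own source prompt and
    scaling matrix; the prompt is spliced onto whatever the previous layer produced."""
    def __init__(self, dim: int = 64, src_rows: int = 2, src_cols: int = 8):
        super().__init__()
        self.P_S = nn.Parameter(torch.randn(src_rows, src_cols))   # per-layer source prompt
        self.M = nn.Parameter(torch.randn(1, dim // src_cols))     # per-layer scaling matrix
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        prompt = torch.kron(self.M, self.P_S)                 # (src_rows, dim) prompt tokens
        prompt = prompt.expand(tokens.size(0), -1, -1)
        x = torch.cat([tokens, prompt], dim=1)                # previous output + this layer's prompt
        out = self.block(x)
        return out[:, : tokens.size(1)]                       # keep only the token positions

layers = nn.ModuleList(PromptedLayer() for _ in range(3))     # K = 3 layers for the sketch
tokens = torch.randn(2, 10, 64)
for layer in layers:                                          # output of layer i feeds layer i + 1
    tokens = layer(tokens)
print(tokens.shape)
```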
According to the embodiment of the application, the corresponding source prompt information and the scaling matrix are set for each layer of image matching sub-model of the target multi-mode matching model, and the input of each layer of image matching sub-model comprises the output information of the image matching sub-model of the last layer, so that the whole multi-mode matching model can fuse information of different modes layer by layer, the aim of improving the relevance between the image information and the text information is achieved, and therefore the performance and the accuracy of the target multi-mode matching model are improved.
As an optional solution, the determining, by using the target multi-mode matching model, a similarity between the target image token vector and each text token vector in the set of text token vectors, and determining, as a content category corresponding to the target image token vector, a target content category indicated by a text token vector whose similarity satisfies a preset similarity condition, includes: acquiring a first cosine distance between the target image characterization vector and a first text characterization vector and a second cosine distance between the target image characterization vector and a second text characterization vector, wherein the set of text characterization vectors comprises the first text characterization vector and the second text characterization vector; determining the first text token vector as a target text token vector when the first cosine distance is smaller than or equal to a cosine distance threshold, wherein the content category of the target image is the target content category; and if the second cosine distance is greater than the cosine distance threshold, determining that the second text token vector is not the target text token vector, and determining that the content category of the target image is not the target content category.
Optionally, in this embodiment of the present application, the first text token vector and the second text token vector are text token vectors in the set of text token vectors, the first cosine distance may be used to represent a similarity between the target image and one text message in the set of text messages indicated by the first text token vector, the second cosine distance may be used to represent a similarity between the target image and one text message in the set of text messages indicated by the second text token vector, the first cosine distance may be obtained by calculating a cosine similarity score of the target image token vector and the first text token vector, and the second cosine distance may be obtained by calculating a cosine similarity score of the target image token vector and the second text token vector.
For example, after the first cosine distance and the second cosine distance are obtained, if the first cosine distance is smaller than or equal to the cosine distance threshold, the first text token vector may be determined as the target text token vector, the target content category corresponding to the target image may be determined by the first text token vector, and if the second cosine distance is greater than the cosine distance threshold, the second text token vector may not be determined as the target text token vector, and the second text token vector may not be used to determine the target content category corresponding to the target image, where the value of the cosine distance threshold may be flexibly set, which is not specifically limited in this application.
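A small sketch of the cosine distance threshold decision described above (the threshold value is an arbitrary assumption):

```python
import torch
import torch.nn.functional as F

def is_target_category(image_vec: torch.Tensor, text_vec: torch.Tensor,
                       distance_threshold: float = 0.3) -> bool:
    """Cosine distance = 1 - cosine similarity; a text characterization vector at or below
    the threshold is treated as the target text characterization vector."""
    cosine_distance = 1.0 - F.cosine_similarity(image_vec, text_vec, dim=0)
    return bool(cosine_distance <= distance_threshold)

x = torch.randn(512)                     # target image characterization vector
z1, z2 = torch.randn(512), torch.randn(512)
print(is_target_category(x, z1), is_target_category(x, z2))
```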
As an alternative, the method further includes: determining that the target image is a violation image when the target content category belongs to a violation content category set, wherein the preset content category set comprises the violation content category set; and generating violation prompt information in response to determining that the target image is the violation image, wherein the violation prompt information is used for indicating that the target image is not approved.
Optionally, in this embodiment of the present application, the set of offensive content categories may include, but is not limited to, a portrait category, a pet category, a landscape category, and the like, and the set of preset content categories includes a set of offensive content categories, that is, assuming that after the content category of the target image is determined according to the target text token vector, the set of offensive content categories includes the content category of the target image, which indicates that the target image is an offensive image, in other words, whether the target image is offensive is determined by determining the content category of the target image.
Specifically, after the target image is determined to be a violation image, a target audit result may be generated, where the target audit result includes the violation prompt information, indicates that the target image is a violation image, includes the related violation information and the cause of the violation of the target image, and indicates the subsequent processing of the target image. For example, suppose the category of dogs is included in the violation content category set. If the target account uploads an image of a black and white spot dog through the network, the image passes through the target multi-modal matching model, which outputs a target audit result indicating that the image content category is a dog; prompt information in the form of a popup window may then be displayed on the terminal interface of the target account, indicating that the content category of the image is a dog and that the image is a violation image. As another example, if the target account uploads 10 images through the network, among which there is an image of a black and white spot dog, the 10 images are processed and a target audit result is output indicating that the content category of that image is a dog; a popup window displayed on the terminal interface of the target account indicates that the content category of the image is a dog and that it belongs to the violation images, and the image may be deleted so that the upload operation can be performed again.
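As an informal sketch of generating the target audit result with violation prompt information (the violation category set and the message wording are assumptions of the example, not values prescribed by the present application):

```python
# Illustrative only: the category names and message wording are assumptions of this sketch.
VIOLATION_CATEGORIES = {"dog"}                 # hypothetical violation content category set

def build_audit_result(target_category: str) -> dict:
    violating = target_category in VIOLATION_CATEGORIES
    result = {"category": target_category, "approved": not violating}
    if violating:
        result["violation_prompt"] = (
            f"The image was classified as '{target_category}' and did not pass review; "
            "please delete or replace it and upload again."
        )
    return result

print(build_audit_result("dog"))
print(build_audit_result("landscape"))
```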
According to the embodiment of the application, the violation prompt information can be generated after the content category of the target image is obtained, and the violation prompt information is utilized to prompt that the image audit uploaded by the object account fails, so that the technical effect of improving the processing rate of the image audit is achieved, and further the object account can adjust the image according to the violation prompt information, and the operation flow of the object account is simplified.
As an alternative, the method further includes: performing a Kronecker inner product operation on the source prompt information and an image scaling matrix to obtain the image prompt information, wherein the scaling matrix comprises the image scaling matrix; and executing the Kronecker inner product operation on the source prompt information and a text scaling matrix to obtain the text prompt information, wherein the scaling matrix comprises the text scaling matrix.
Illustratively, performing a kronecker product operation on the source hint information and the image scaling matrix means that each information contained in the source hint information is multiplied by all elements of the image scaling matrix to form a new matrix, which is the image hint information, and similarly performing a kronecker product operation on the source hint information and the text scaling matrix means that each information contained in the source hint information is multiplied by all elements of the text scaling matrix to form a new matrix, which is the text hint information.
In addition, in the structural design of multi-modal aligned prompt learning, the Kronecker inner product operation can reduce the trainable parameters in the text encoder and the image encoder, thereby improving the performance of multi-modal aligned prompt learning on downstream tasks; at the same time, less storage space is needed to store training resources when training the text encoder and the image encoder, achieving the technical effect of saving resources. The image processing method provided by the present application can overcome the drawbacks present in the related art when CLIP is fine-tuned and the single-modality or multi-modality solution space is adjusted; in particular, the method realizes a tight connection between the text modality and the image modality through the source prompt.
In addition, the image processing method designs a corresponding Kronecker inner product structure, so as to reduce the number of trainable parameters and thereby improve the training efficiency and performance of the model.
In an exemplary embodiment, the image processing method provided by the application can be applied to an application scene of content auditing, as shown in fig. 9, taking a content auditing system as an example, judging whether the content in the content auditing system has a violation condition, realizing the technical effect of improving the accuracy of the auditing result of the content auditing, achieving the purpose of improving the operation efficiency of the content auditing system, and also carrying out rapid iterative update on an online model to achieve a better model effect.
It should be noted that the CLIP model is taken as an example of the target multi-modal matching model. The CLIP model structure may include, but is not limited to, a CLIP image encoder, a CLIP text encoder, and the like, where the image encoder may use a CNN model, such as a ResNet, or an image Transformer structure, such as a ViT model. In the present application, the ViT model may be used as an image encoder compatible with image prompts. The image encoder V, consisting of K Transformer layers, divides an input image into M fixed-size patches and projects them into patch encodings E_0. The patch encodings E_k, together with a learnable classification token c_k, are input to the (k+1)-th layer V_{k+1} of the image encoder and processed sequentially through the subsequent Transformer layers:

[c_k, E_k] = V_k([c_{k-1}, E_{k-1}]),  k = 1, 2, ..., K

where the classification token c_K from the last Transformer layer of the image encoder is projected into the target embedding space by ImageProj, and the image representation output by the last layer of the image encoder is denoted by x:

x = ImageProj(c_K)
In addition, the text encoder uses a Transformer that also contains K layers; the input words are tokenized and projected into word embeddings W_0, and W_k is input directly to the (k+1)-th layer L_{k+1} of the text encoder, as follows:

[W_k] = L_k([W_{k-1}]),  k = 1, 2, ..., K

The final text representation z is obtained by projecting the text embedding output by the final Transformer layer into the target embedding space via TextProj:

z = TextProj(W_K)
For zero-shot prediction, hand-crafted prompts are introduced in the language branch of CLIP by wrapping each class name associated with a downstream task into a template, e.g., "a photo of [CLASS]", to construct the text input, and the category with the highest cosine similarity score among the plurality of categories is selected as the predicted label for the given image, namely:

p(y = i | x) = exp(sim(x, z_i) / τ) / Σ_{j=1}^{C} exp(sim(x, z_j) / τ)

where sim(·,·) represents the calculation of cosine similarity, τ is the temperature coefficient learned by CLIP, and C is the total number of categories. A ViT model can be used as the image encoder. The CoOp model introduces a learnable text prompt into the text encoder; however, merely adjusting the image modality or the text modality can damage the text-image matching structure of CLIP, resulting in poor adaptability to downstream tasks. In the multi-modal prompt learning method MaPLe, the text prompt is used to generate the image prompt through an MLP; however, this approach still has limitations in terms of the image modality and model efficiency.
Therefore, the present application provides an image processing method that simultaneously generates a text prompt P_l and an image prompt P_v, as shown in fig. 8, where b represents the length of the text prompt and of the image prompt, and d_l and d_v represent the dimensions of the text branch and the image branch of the CLIP model, respectively. First, the source prompt P_S shared by the two modalities is initialized;
secondly, the scaling matrices of the two modalities, M_l and M_v, each with m rows and n columns, are initialized separately;
the text prompt of the text encoder and the image prompt of the image encoder are then generated using the Kronecker inner product:

P_l = M_l ⊗ P_S,   P_v = M_v ⊗ P_S

where m represents the number of rows of the scaling matrices M_l and M_v, and n represents their number of columns. First, the use of the Kronecker inner product preserves the information of the source prompt P_S to the greatest extent, which is beneficial for the alignment between the text prompt and the image prompt; secondly, the number of learnable parameters of the text prompt P_l and the image prompt P_v is reduced from that required when the prompts are learned independently for each of the K Transformer layers to the entries of the scaling matrices (m·n each) plus the parameters of the shared source prompt, where K represents the number of Transformer layers. This reduction in parameters not only makes the model more efficient but also reduces the risk of overfitting.
Further, a learnable token P_v is introduced in the image branch of CLIP to learn image prompts. The input encoding of the image encoder then takes the form [c_0, E_0, P_v], where E_0 is the fixed image input encoding fed into the image encoder, and a new learnable token is further introduced into each Transformer block V_k of the image encoder:

[c_k, E_k, _] = V_k([c_{k-1}, E_{k-1}, P_v]),  k = 1, 2, ..., K

where [,] above refers to the splicing operation. After the K-th Transformer layer, each subsequent layer processes the image prompt output by the previous layer, and the final image representation x is obtained from the classification token output by the last layer; the calculation formula may be as follows:

x = ImageProj(c)
Similarly, to learn text context prompts, a learnable token P_l may be introduced into the language branch of the CLIP model. The input encoding of the text encoder then takes the form [P_l, W_0], where W_0 is the fixed text input encoding fed into the text encoder, and a new learnable token is further introduced into each Transformer block L_k of the text encoder:

[_, W_k] = L_k([P_l, W_{k-1}]),  k = 1, 2, ..., K

where [,] refers to the splicing operation. After the K-th Transformer layer, each subsequent layer processes the text prompt output by the previous layer, and the final text representation z is obtained from the embedding output by the last layer; the calculation formula may be as follows:

z = TextProj(W_K)
Optionally, in an embodiment of the present application, the aim of adjusting the target multi-modal matching model can be achieved by fine-tuning only the source prompt P_S and the scaling matrices M_l and M_v; that is, the whole target multi-modal matching model does not need to be updated. The source prompt P_S and the scaling matrices M_l and M_v are used to generate the text prompt P_l and the image prompt P_v respectively; P_l is embedded into the text branch to enable training of the text encoder, and P_v is embedded into the image branch to enable training of the image encoder. In this case, the number of learnable parameters of the text prompt information P_l and the image prompt information P_v is reduced, compared with the number of learnable parameters required by the text prompt P_l and the image prompt information P_v of the CLIP model in the related art, to the scaling matrix parameters m and n plus the number of learnable parameters determined by the source prompt P_S. Through the image processing method provided by the present application, the number of learnable parameters of the text prompt P_l and the image prompt information P_v of the target multi-modal matching model is therefore greatly reduced, while the performance on downstream tasks remains essentially the same as that of a fully tuned target multi-modal matching model. In addition, the Kronecker inner product could be replaced by an MLP (Multi-Layer Perceptron); however, this would greatly increase the number of learnable parameters and would not preserve the information of the source prompt to the greatest extent, nor ensure the connection between the image prompt information and the text prompt information, which can otherwise be initialized and updated separately. If the image prompt information and the text prompt information had no connection during the updating of the target multi-modal matching model, the structure of matching between images and texts in the target multi-modal matching model would be destroyed.
It will be appreciated that in the specific embodiments of the present application, related data such as user information is referred to, and when the above embodiments of the present application are applied to specific products or technologies, user permissions or consents need to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
According to another aspect of the embodiments of the present application, there is also provided an image processing apparatus for implementing the above-described image processing method. As shown in fig. 10, the apparatus includes:
an obtaining module 1002, configured to obtain a target image and a predetermined set of text information, where one text information in the set of text information is used to characterize one content category in a preset content category set;
the training module 1004 is configured to input a target image and a set of text information into a pre-trained target multi-mode matching model to obtain a target image characterization vector and a set of text characterization vectors, where the target multi-mode matching model includes a text encoder, an image encoder, text prompt information and image prompt information, the text prompt information and the image prompt information are obtained by calculating source prompt information, the image prompt information is used to input the image encoder together with the target image to obtain the target image characterization vector, and the text prompt information is used to input the text encoder together with the set of text information to obtain the set of text characterization vectors;
the determining module 1006 is configured to determine, according to the target multimodal matching model, a similarity between the target image token vector and each text token vector in the set of text token vectors, and determine, as a content category corresponding to the target image token vector, a target content category indicated by the text token vector whose similarity satisfies a preset similarity condition.
As an alternative, the apparatus is configured to input the target image and the set of text information into a pre-trained target multi-modal matching model to obtain a target image token vector and a set of text token vectors by: performing matrix conversion processing on the source prompt information and the image scaling matrix to determine the image prompt information, wherein the matrix conversion processing is used for determining an element at a first position in the image prompt information corresponding matrix as a product of the element at the first position in the source prompt information and the element at the first position in the image scaling matrix; performing stitching operation on the target image and the image prompt information to determine an image stitching vector; and mapping the image stitching vector to a target embedding space, and determining the target image characterization vector.
As an alternative, the device is configured to perform a stitching operation on the target image and the image prompt information to determine an image stitching vector by: dividing the target image into a plurality of image patches; projecting the plurality of image patches into an image embedding space to obtain a plurality of patch coding vectors; and performing a stitching operation on the patch code vectors and the image prompt information to determine the image stitching vector.
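By way of illustration only, a minimal sketch of the image branch described above: the target image is divided into patches, each patch is projected into the image embedding space, and the image prompt tokens are concatenated ("stitched") with the patch coding vectors before the result enters the image encoder. The patch size, embedding dimension and convolutional projection are assumptions; the encoder itself is not shown.

```python
import torch
import torch.nn as nn

patch, dim = 16, 768                                        # assumed patch size / width
proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patch projection

def build_image_stitching_vector(image: torch.Tensor,
                                 image_prompt: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W); image_prompt: (n_prompt, dim)."""
    patches = proj(image)                                # (1, dim, H/patch, W/patch)
    patches = patches.flatten(2).transpose(1, 2)         # (1, n_patches, dim)
    prompt = image_prompt.unsqueeze(0)                   # (1, n_prompt, dim)
    return torch.cat([prompt, patches], dim=1)           # image stitching vector
```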
As an alternative, the apparatus is configured to input the set of text information into a pre-trained target multimodal matching model to obtain a set of text token vectors by: performing matrix conversion processing on the source prompt information and the text scaling matrix to determine the text prompt information, wherein the matrix conversion processing is used for determining an element at a second position in the text prompt information corresponding matrix as a product of the element at the second position in the source prompt information and the element at the second position in the text scaling matrix; respectively executing splicing operation on the group of text information and the text prompt information to determine a group of text splicing vectors; and mapping the set of text splicing vectors to a target embedded space respectively, and determining the set of text characterization vectors.
As an alternative, the apparatus is configured to perform a stitching operation on the set of text information and the text prompt information, respectively, to determine a set of text stitching vectors in the following manner, where the text information processed in each stitching operation is regarded as the current text information, and the text stitching vector obtained from each stitching operation is regarded as the current text stitching vector: performing a word segmentation operation on the current text information to obtain a group of word segments; and projecting the group of word segments into a text embedding space to obtain the current text stitching vector.
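By way of illustration only, a minimal sketch of the text branch: each current text information entry is segmented into tokens, the tokens are projected into the text embedding space, and the text prompt tokens are concatenated in front of them to form the current text stitching vector. The vocabulary size and embedding dimension are assumptions, and the token ids are assumed to come from whatever tokenizer the text encoder uses.

```python
import torch
import torch.nn as nn

vocab_size, dim = 49408, 512                      # assumed vocabulary size / width
token_embedding = nn.Embedding(vocab_size, dim)   # projection into the text space

def build_text_stitching_vector(token_ids: torch.Tensor,
                                text_prompt: torch.Tensor) -> torch.Tensor:
    """token_ids: (seq_len,) integer ids; text_prompt: (n_prompt, dim)."""
    tokens = token_embedding(token_ids)             # (seq_len, dim)
    return torch.cat([text_prompt, tokens], dim=0)  # current text stitching vector
```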
As an alternative, the above device is further configured to: acquiring a sample image and a group of sample text information, wherein the sample image is pre-marked with a text information corresponding to a target sample, and the group of sample text information comprises the target sample text information; inputting the sample image and the set of sample text information into an initial multi-mode matching model to be trained to obtain a sample image characterization vector and a set of sample text characterization vector, wherein the initial multi-mode matching model comprises the text encoder, the image encoder, initial text prompt information and initial image prompt information, and the initial text prompt information and the initial image prompt information are obtained by calculating initial source prompt information; determining sample similarity between the sample image characterization vector and each sample text characterization vector in the set of sample text characterization vectors by the initial multi-modal matching model; and calculating a loss parameter based on the sample similarity, and adjusting the initial source prompt information and the initial scaling matrix by using the loss parameter until the initial multi-modal matching model is trained into the target multi-modal matching model.
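By way of illustration only, a minimal training sketch for the optional training procedure above, in which only the initial source prompt information and the scaling matrices receive gradient updates while both encoders stay frozen; the cross-entropy loss over the similarities and the temperature value are assumptions about one common way to compute the loss parameter, not the required loss.

```python
import torch
import torch.nn.functional as F

# the only learnable tensors: the shared source prompt and the two scaling
# matrices (placeholder shapes, matching the Kronecker sketch given earlier)
P_s = torch.nn.Parameter(torch.randn(4, 64))
M_l = torch.nn.Parameter(torch.randn(4, 12))
M_v = torch.nn.Parameter(torch.randn(4, 12))
optimizer = torch.optim.AdamW([P_s, M_l, M_v], lr=1e-3)

def training_step(image_vec: torch.Tensor,
                  text_vecs: torch.Tensor,
                  target_index: int) -> float:
    """image_vec (dim,) and text_vecs (num_classes, dim) must be produced by the
    frozen encoders using prompts built from P_s, M_l and M_v, so that gradients
    flow back to those parameters; target_index is the labelled sample category."""
    sims = F.cosine_similarity(image_vec.unsqueeze(0), text_vecs, dim=-1)
    loss = F.cross_entropy(sims.unsqueeze(0) / 0.07,          # assumed temperature
                           torch.tensor([target_index]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```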
As an alternative, the apparatus is configured to input the sample image and the set of sample text information into an initial multimodal matching model to be trained, to obtain a sample image token vector and a set of sample text token vectors in the following manner: acquiring the initial source prompt information, the initial image scaling matrix and the initial text scaling matrix, wherein the initial source prompt information, the initial image scaling matrix and the initial text scaling matrix are parameters which allow adjustment in the process of training the initial multi-mode matching model; performing matrix conversion processing on the initial source prompt information and the initial image scaling matrix to determine the initial image prompt information, wherein the matrix conversion processing is used for determining an element at a third position in the initial image prompt information corresponding matrix as a product of the element at the third position in the initial source prompt information and the element at the third position in the initial image scaling matrix; performing the matrix conversion process on the initial source prompt information and the initial text scaling matrix to determine the initial text prompt information, wherein the matrix conversion process is used for determining an element at a fourth position in the initial text prompt information corresponding matrix as a product of the element at the fourth position in the initial source prompt information and the element at the fourth position in the initial text scaling matrix; performing a stitching operation on the sample image and the initial image prompt information to determine a sample image stitching vector, and performing a stitching operation on the set of sample text information and the initial text prompt information to determine a set of sample text stitching vectors; and mapping the sample image stitching vector and the set of sample text stitching vectors to an initial embedding space, and determining the sample image characterization vector and the set of sample text characterization vectors.
As an alternative, the apparatus is configured to input the target image and the set of text information into a pre-trained target multi-modal matching model to obtain a target image token vector and a set of text token vectors by: setting the corresponding source prompt information and a scaling matrix for each layer of image matching sub-model under the condition that the target multi-mode matching model comprises K layers of image matching sub-models, wherein the output of an ith layer of image matching sub-model in the K layers of image matching sub-models and the image prompt information corresponding to an (i+1) th layer of image matching sub-model are used as the input of the (i+1) th layer of image matching sub-model together, K is a positive integer greater than 1, and i is a positive integer smaller than K; setting the corresponding source prompt information and the scaling matrix for each layer of text matching sub-model under the condition that the target multi-mode matching model comprises a K-layer text matching sub-model, wherein the output of an ith layer of text matching sub-model in the K-layer text matching sub-model and the text prompt information corresponding to an (i+1) th layer of text matching sub-model are used as the input of the (i+1) th layer of text matching sub-model together.
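By way of illustration only, a minimal sketch of the layer-wise variant: when the model contains K matching sub-model layers, each layer is given its own source prompt and scaling matrices, and the prompt computed for layer i+1 is fed into that layer together with the output of layer i. The layer count and shapes are assumptions.

```python
import torch
import torch.nn as nn

K = 12                                             # assumed number of layers
s_len, s_dim, r_len, r_dim = 4, 64, 4, 12          # assumed prompt shapes

source_prompts = nn.ParameterList(
    [nn.Parameter(torch.randn(s_len, s_dim)) for _ in range(K)])
image_scales = nn.ParameterList(
    [nn.Parameter(torch.randn(r_len, r_dim)) for _ in range(K)])
text_scales = nn.ParameterList(
    [nn.Parameter(torch.randn(r_len, r_dim)) for _ in range(K)])

def prompts_for_layer(i: int):
    """Image and text prompts fed, with the output of layer i, into layer i + 1."""
    return (torch.kron(source_prompts[i], image_scales[i]),
            torch.kron(source_prompts[i], text_scales[i]))
```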
As an optional solution, the apparatus is configured to determine, by using the target multimodal matching model, a similarity between the target image token vector and each text token vector in the set of text token vectors, and determine, as a content category corresponding to the target image token vector, a target content category indicated by the text token vector whose similarity satisfies a preset similarity condition: acquiring a first cosine distance between the target image characterization vector and a first text characterization vector and a second cosine distance between the target image characterization vector and a second text characterization vector, wherein the set of text characterization vectors comprises the first text characterization vector and the second text characterization vector; determining the first text token vector as a target text token vector when the first cosine distance is smaller than or equal to a cosine distance threshold, wherein the content category of the target image is the target content category; and if the second cosine distance is greater than the cosine distance threshold, determining that the second text token vector is not the target text token vector, and determining that the content category of the target image is not the target content category.
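By way of illustration only, a minimal sketch of the similarity test above, using the cosine distance (one minus the cosine similarity) between the target image characterization vector and each text characterization vector against a threshold; the threshold value is an assumption.

```python
import torch
import torch.nn.functional as F

def matching_category_indices(image_vec: torch.Tensor,
                              text_vecs: torch.Tensor,
                              distance_threshold: float = 0.3) -> list:
    """Return indices of text characterization vectors whose cosine distance to
    the target image characterization vector is at or below the threshold."""
    sims = F.cosine_similarity(image_vec.unsqueeze(0), text_vecs, dim=-1)
    distances = 1.0 - sims                       # cosine distance per category
    return (distances <= distance_threshold).nonzero(as_tuple=True)[0].tolist()
```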
As an alternative, the above device is further configured to: determining that the target image is a violation image when the target content category belongs to a violation content category set, wherein the preset content category set comprises the violation content category set; and generating violation prompt information in response to determining that the target image is the violation image, wherein the violation prompt information is used for indicating that the target image is not approved.
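By way of illustration only, a small sketch of the moderation step: if the matched target content category belongs to the violating-category set, the image fails review and violation prompt information is produced. The category names are placeholders.

```python
VIOLATION_CATEGORIES = {"violent content", "prohibited goods"}   # assumed set

def review_image(target_category: str) -> str:
    """Return violation prompt information when the category is a violating one."""
    if target_category in VIOLATION_CATEGORIES:
        return f"Audit not passed: image matched violating category '{target_category}'."
    return "Audit passed."
```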
As an alternative, the above device is further configured to: performing a Kronecker inner product operation on the source prompt information and an image scaling matrix to obtain the image prompt information, wherein the scaling matrix comprises the image scaling matrix; and executing the Kronecker inner product operation on the source prompt information and a text scaling matrix to obtain the text prompt information, wherein the scaling matrix comprises the text scaling matrix.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function, and works together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method, and will not be described again here.
According to one aspect of the present application, a computer program product is provided, the computer program product comprising a computer program which, when executed by a processor, implements the image processing method provided in the above embodiments.
The foregoing embodiment numbers of the present application are merely for description and do not represent the relative merits of the embodiments.
Fig. 11 schematically shows a block diagram of a computer system for implementing an electronic device according to an embodiment of the present application.
It should be noted that, the computer system 1100 of the electronic device shown in fig. 11 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 11, the computer system 1100 includes a central processing unit 1101 (Central Processing Unit, CPU) that can execute various appropriate actions and processes according to a program stored in a read-only memory 1102 (Read-Only Memory, ROM) or a program loaded from a storage section 1108 into a random access memory 1103 (Random Access Memory, RAM). In the random access memory 1103, various programs and data necessary for the system operation are also stored. The CPU 1101, the ROM 1102, and the RAM 1103 are connected to each other via a bus 1104. An Input/Output interface 1105 (i.e., an I/O interface) is also connected to the bus 1104.
The following components are connected to the input/output interface 1105: an input section 1106 including a keyboard, a mouse, and the like; an output portion 1107 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and a speaker; a storage section 1108 including a hard disk or the like; and a communication section 1109 including a network interface card such as a local area network card, a modem, and the like. The communication section 1109 performs communication processing via a network such as the internet. The drive 1110 is also connected to the input/output interface 1105 as needed. Removable media 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in drive 1110, so that a computer program read therefrom is installed as needed in storage section 1108.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1109, and/or installed from the removable media 1111. The computer program, when executed by the central processor 1101, performs the various functions defined in the system of the present application.
According to still another aspect of the embodiments of the present application, there is also provided an electronic device for implementing the above image processing method, where the electronic device may be a terminal device or a server as shown in fig. 1. The present embodiment is described taking the electronic device as a terminal device as an example. As shown in fig. 12, the electronic device comprises a memory 1202 and a processor 1204, the memory 1202 storing a computer program, the processor 1204 being arranged to perform the steps of any of the method embodiments described above by means of the computer program.
Alternatively, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the method in the embodiments of the present application by a computer program.
Alternatively, it will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 12 is merely illustrative, and that fig. 12 is not intended to limit the configuration of the electronic device described above. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in FIG. 12, or have a different configuration than shown in FIG. 12.
The memory 1202 may be used to store software programs and modules, such as program instructions/modules corresponding to the image processing methods and apparatuses in the embodiments of the present application, and the processor 1204 executes the software programs and modules stored in the memory 1202 to perform various functional applications and data processing, that is, implement the image processing methods described above. Memory 1202 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1202 may further include memory located remotely from the processor 1204, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 1202 may be used for storing information such as a target image, image prompt information, text prompt information, etc., but is not limited to the above. As an example, as shown in fig. 12, the memory 1202 may include, but is not limited to, the acquisition module 1002, the training module 1004, the determination module 1006, and the generation module 1008 in the image processing apparatus. In addition, other module units in the image processing apparatus may be included, but are not limited to, and are not described in detail in this example.
Optionally, the transmission device 1206 is configured to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission means 1206 comprises a network adapter (Network Interface Controller, NIC) that can be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 1206 is a Radio Frequency (RF) module for communicating wirelessly with the internet.
In addition, the electronic device further includes: a display 1208 for displaying the target image, the image prompt information, the text prompt information, etc.; and a connection bus 1210 for connecting the respective module parts in the above-described electronic apparatus.
In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting the plurality of nodes through a network communication. The nodes may form a peer-to-peer network, and any type of computing device, such as a server, a terminal, etc., may become a node in the blockchain system by joining the peer-to-peer network.
According to one aspect of the present application, there is provided a computer-readable storage medium storing computer instructions. A processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes them, causing the electronic device to perform the image processing method provided in the various alternative implementations of the above-described image processing aspects.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a program for executing the method in the embodiments of the present application.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing a terminal device to execute the steps, where the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing embodiment numbers of the present application are merely for description and do not represent the relative merits of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or part of the technical solution that contributes to the prior art, in the form of a software product, which is stored in a storage medium, comprising several instructions for causing one or more electronic devices to perform all or part of the steps of the methods described in the various embodiments of the present application.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed subject matter may be implemented in other ways. The apparatus embodiments described above are merely exemplary; the division into units is merely a logical functional division, and other divisions may be used in actual implementations. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed between components may be implemented through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present application and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present application and are intended to be comprehended within the scope of the present application.

Claims (22)

1. An image processing method, comprising:
acquiring a target image and a predetermined set of text information, wherein one text information in the set of text information is used for representing one content category in a preset content category set;
inputting the target image and the group of text information into a pre-trained target multi-mode matching model to obtain a target image characterization vector and a group of text characterization vectors, wherein the target multi-mode matching model comprises a text encoder, an image encoder, text prompt information and image prompt information, the text prompt information and the image prompt information are obtained by calculating source prompt information, the image prompt information is used for being input into the image encoder together with the target image to obtain the target image characterization vector, and the text prompt information is used for being input into the text encoder together with the group of text information to obtain the group of text characterization vectors;
Determining the similarity between the target image representation vector and each text representation vector in the group of text representation vectors through the target multi-mode matching model, and determining the target content category indicated by the text representation vector with the similarity meeting the preset similarity condition as the content category corresponding to the target image representation vector;
inputting the target image and the set of text information into a pre-trained target multi-mode matching model to obtain a target image characterization vector and a set of text characterization vectors, wherein the method comprises the following steps of: performing matrix transformation processing on the source prompt information and an image scaling matrix to determine the image prompt information, wherein the matrix transformation processing is used for determining an element of a first position in a corresponding matrix of the image prompt information as a product of the element of the first position in the source prompt information and the element of the first position in the image scaling matrix; performing stitching operation on the target image and the image prompt information to determine an image stitching vector; and mapping the image stitching vector to a target embedding space, and determining the target image characterization vector.
2. The method of claim 1, wherein performing a stitching operation on the target image and the image cues to determine an image stitching vector comprises:
dividing the target image into a plurality of image patches;
projecting the plurality of image patches into an image embedding space to obtain a plurality of patch coding vectors;
and performing splicing operation on the patch coding vectors and the image prompt information to determine the image splicing vector.
3. The method of claim 1, wherein said inputting the set of text information into a pre-trained target multimodal matching model results in a set of text token vectors, comprising:
performing matrix conversion processing on the source prompt information and a text scaling matrix to determine the text prompt information, wherein the matrix conversion processing is used for determining an element at a second position in a text prompt information corresponding matrix as a product of the element at the second position in the source prompt information and the element at the second position in the text scaling matrix;
respectively executing splicing operation on the group of text information and the text prompt information to determine a group of text splicing vectors;
And mapping the text splicing vectors to a target embedded space respectively, and determining the text characterization vectors.
4. The method of claim 3, wherein performing a stitching operation on the set of text information and the text prompt information, respectively, determines a set of text stitching vectors, comprising:
performing splicing operation on the group of text information and the text prompt information respectively to determine a group of text splicing vectors, wherein the text information of each execution of the splicing operation is regarded as current text information, and the text splicing vector obtained by each execution of the splicing operation is regarded as current text splicing vector:
executing a word segmentation operation on the current text information to obtain a group of word segments;
and projecting the group of word segments into a text embedding space to obtain the current text splicing vector.
5. The method according to claim 1, wherein the method further comprises:
acquiring a sample image and a group of sample text information, wherein the sample image is pre-marked with a text information corresponding to a target sample, and the group of sample text information comprises the target sample text information;
Inputting the sample image and the group of sample text information into an initial multi-mode matching model to be trained to obtain a sample image characterization vector and a group of sample text characterization vector, wherein the initial multi-mode matching model comprises the text encoder, the image encoder, initial text prompt information and initial image prompt information, and the initial text prompt information and the initial image prompt information are obtained by calculating initial source prompt information;
determining sample similarity between the sample image characterization vector and each sample text characterization vector in the set of sample text characterization vectors by the initial multi-modal matching model;
and calculating a loss parameter based on the sample similarity, and adjusting the initial source prompt information and an initial scaling matrix by using the loss parameter until the initial multi-modal matching model is trained into the target multi-modal matching model.
6. The method of claim 5, wherein said inputting the sample image and the set of sample text information into an initial multimodal matching model to be trained results in a sample image characterization vector and a set of sample text characterization vectors, comprising:
Acquiring the initial source prompt information, an initial image scaling matrix and an initial text scaling matrix, wherein the initial source prompt information, the initial image scaling matrix and the initial text scaling matrix are parameters which allow adjustment in the process of training the initial multi-modal matching model;
performing matrix transformation processing on the initial source prompt information and the initial image scaling matrix to determine the initial image prompt information, wherein the matrix transformation processing is used for determining an element at a third position in the initial image prompt information corresponding matrix as a product of the element at the third position in the initial source prompt information and the element at the third position in the initial image scaling matrix;
performing matrix conversion processing on the initial source prompt information and the initial text scaling matrix to determine the initial text prompt information, wherein the matrix conversion processing is used for determining an element at a fourth position in the initial text prompt information corresponding matrix as a product of the element at the fourth position in the initial source prompt information and the element at the fourth position in the initial text scaling matrix;
Performing a stitching operation on the sample image and the initial image prompt information to determine a sample image stitching vector, and performing a stitching operation on the set of sample text information and the initial text prompt information to determine a set of sample text stitching vectors;
and mapping the sample image stitching vector and the set of sample text stitching vectors to an initial embedding space, and determining the sample image characterization vector and the set of sample text characterization vectors.
7. The method of claim 1, wherein said inputting the target image and the set of text information into a pre-trained target multi-modal matching model results in a target image characterization vector and a set of text characterization vectors, comprising:
setting corresponding source prompt information and a scaling matrix for each layer of image matching submodel under the condition that the target multi-mode matching model comprises K layers of image matching submodels, wherein the output of an ith layer of image matching submodel in the K layers of image matching submodels and the image prompt information corresponding to an (i+1) th layer of image matching submodel are used as the input of the (i+1) th layer of image matching submodel together, K is a positive integer greater than 1, and i is a positive integer smaller than K;
Setting corresponding source prompt information and the scaling matrix for each layer of text matching sub-model under the condition that the target multi-mode matching model comprises a K-layer text matching sub-model, wherein the output of an ith layer of text matching sub-model in the K-layer text matching sub-model and the text prompt information corresponding to an (i+1) th layer of text matching sub-model are used as the input of the (i+1) th layer of text matching sub-model together.
8. The method according to claim 1, wherein determining, by the target multimodal matching model, a similarity between the target image token vector and each text token vector in the set of text token vectors, and determining, as the content category corresponding to the target image token vector, a target content category indicated by the text token vector whose similarity satisfies a preset similarity condition, includes:
acquiring a first cosine distance between the target image characterization vector and a first text characterization vector and a second cosine distance between the target image characterization vector and a second text characterization vector, wherein the set of text characterization vectors comprises the first text characterization vector and the second text characterization vector;
Determining the first text token vector as a target text token vector when the first cosine distance is smaller than or equal to a cosine distance threshold, wherein the content category of the target image is the target content category;
and under the condition that the second cosine distance is larger than the cosine distance threshold, determining that the second text token vector is not the target text token vector, and the content category of the target image is not the target content category.
9. The method according to claim 1, wherein the method further comprises:
determining that the target image is a violation image under the condition that the target content category belongs to a violation content category set, wherein the preset content category set comprises the violation content category set;
and generating violation prompt information in response to determining that the target image is the violation image, wherein the violation prompt information is used for indicating that the target image audit is not passed.
10. The method according to any one of claims 1 to 9, further comprising:
performing a Kronecker inner product operation on the source prompt information and an image scaling matrix to obtain the image prompt information, wherein the scaling matrix comprises the image scaling matrix;
and executing the Kronecker inner product operation on the source prompt information and a text scaling matrix to obtain the text prompt information, wherein the scaling matrix comprises the text scaling matrix.
11. An image processing apparatus, comprising:
the system comprises an acquisition module, a storage module and a display module, wherein the acquisition module is used for acquiring a target image and a predetermined set of text information, wherein one text information in the set of text information is used for representing one content category in a preset content category set;
the training module is used for inputting the target image and the group of text information into a pre-trained target multi-mode matching model to obtain a target image characterization vector and a group of text characterization vectors, wherein the target multi-mode matching model comprises a text encoder, an image encoder, text prompt information and image prompt information, the text prompt information and the image prompt information are obtained by calculating source prompt information, the image prompt information is used for being input into the image encoder together with the target image to obtain the target image characterization vector, and the text prompt information is used for being input into the text encoder together with the group of text information to obtain the group of text characterization vectors;
The determining module is used for determining the similarity between the target image characterization vector and each text characterization vector in the set of text characterization vectors through the target multi-mode matching model, and determining the target content category indicated by the text characterization vector with the similarity meeting the preset similarity condition as the content category corresponding to the target image characterization vector;
inputting the target image and the set of text information into a pre-trained target multi-mode matching model to obtain a target image characterization vector and a set of text characterization vectors, wherein the method comprises the following steps of: performing matrix transformation processing on the source prompt information and an image scaling matrix to determine the image prompt information, wherein the matrix transformation processing is used for determining an element of a first position in a corresponding matrix of the image prompt information as a product of the element of the first position in the source prompt information and the element of the first position in the image scaling matrix; performing stitching operation on the target image and the image prompt information to determine an image stitching vector; and mapping the image stitching vector to a target embedding space, and determining the target image characterization vector.
12. The apparatus of claim 11, wherein the apparatus is configured to determine an image stitching vector by performing a stitching operation on the target image and the image prompt information by:
dividing the target image into a plurality of image patches;
projecting the plurality of image patches into an image embedding space to obtain a plurality of patch coding vectors;
and performing splicing operation on the patch coding vectors and the image prompt information to determine the image splicing vector.
13. The apparatus of claim 11, wherein the apparatus is configured to input the set of textual information into a pre-trained target multimodal matching model to obtain a set of text token vectors by:
performing matrix conversion processing on the source prompt information and a text scaling matrix to determine the text prompt information, wherein the matrix conversion processing is used for determining an element at a second position in a text prompt information corresponding matrix as a product of the element at the second position in the source prompt information and the element at the second position in the text scaling matrix;
respectively executing splicing operation on the group of text information and the text prompt information to determine a group of text splicing vectors;
And mapping the text splicing vectors to a target embedded space respectively, and determining the text characterization vectors.
14. The apparatus of claim 13, wherein the apparatus is configured to perform a stitching operation on the set of text messages and the text prompt message, respectively, to determine a set of text stitching vectors by:
performing splicing operation on the group of text information and the text prompt information respectively to determine a group of text splicing vectors, wherein the text information of each execution of the splicing operation is regarded as current text information, and the text splicing vector obtained by each execution of the splicing operation is regarded as current text splicing vector:
executing a word segmentation operation on the current text information to obtain a group of word segments;
and projecting the group of word segments into a text embedding space to obtain the current text splicing vector.
15. The apparatus of claim 11, wherein the apparatus is further configured to:
acquiring a sample image and a group of sample text information, wherein the sample image is pre-marked with a text information corresponding to a target sample, and the group of sample text information comprises the target sample text information;
Inputting the sample image and the group of sample text information into an initial multi-mode matching model to be trained to obtain a sample image characterization vector and a group of sample text characterization vector, wherein the initial multi-mode matching model comprises the text encoder, the image encoder, initial text prompt information and initial image prompt information, and the initial text prompt information and the initial image prompt information are obtained by calculating initial source prompt information;
determining sample similarity between the sample image characterization vector and each sample text characterization vector in the set of sample text characterization vectors by the initial multi-modal matching model;
and calculating a loss parameter based on the sample similarity, and adjusting the initial source prompt information and an initial scaling matrix by using the loss parameter until the initial multi-modal matching model is trained into the target multi-modal matching model.
16. The apparatus of claim 15, wherein the apparatus is configured to input the sample image and the set of sample text information into an initial multimodal matching model to be trained to obtain a sample image characterization vector and a set of sample text characterization vectors by:
Acquiring the initial source prompt information, an initial image scaling matrix and an initial text scaling matrix, wherein the initial source prompt information, the initial image scaling matrix and the initial text scaling matrix are parameters which allow adjustment in the process of training the initial multi-modal matching model;
performing matrix transformation processing on the initial source prompt information and the initial image scaling matrix to determine the initial image prompt information, wherein the matrix transformation processing is used for determining an element at a third position in the initial image prompt information corresponding matrix as a product of the element at the third position in the initial source prompt information and the element at the third position in the initial image scaling matrix;
performing matrix conversion processing on the initial source prompt information and the initial text scaling matrix to determine the initial text prompt information, wherein the matrix conversion processing is used for determining an element at a fourth position in the initial text prompt information corresponding matrix as a product of the element at the fourth position in the initial source prompt information and the element at the fourth position in the initial text scaling matrix;
Performing a stitching operation on the sample image and the initial image prompt information to determine a sample image stitching vector, and performing a stitching operation on the set of sample text information and the initial text prompt information to determine a set of sample text stitching vectors;
and mapping the sample image stitching vector and the set of sample text stitching vectors to an initial embedding space, and determining the sample image characterization vector and the set of sample text characterization vectors.
17. The apparatus of claim 11, wherein the apparatus is configured to input the target image and the set of text information into a pre-trained target multi-modal matching model to obtain a target image characterization vector and a set of text characterization vectors by:
setting corresponding source prompt information and a scaling matrix for each layer of image matching submodel under the condition that the target multi-mode matching model comprises K layers of image matching submodels, wherein the output of an ith layer of image matching submodel in the K layers of image matching submodels and the image prompt information corresponding to an (i+1) th layer of image matching submodel are used as the input of the (i+1) th layer of image matching submodel together, K is a positive integer greater than 1, and i is a positive integer smaller than K;
Setting corresponding source prompt information and the scaling matrix for each layer of text matching sub-model under the condition that the target multi-mode matching model comprises a K-layer text matching sub-model, wherein the output of an ith layer of text matching sub-model in the K-layer text matching sub-model and the text prompt information corresponding to an (i+1) th layer of text matching sub-model are used as the input of the (i+1) th layer of text matching sub-model together.
18. The apparatus of claim 11, wherein the apparatus is configured to determine a similarity between the target image token vector and each text token vector in the set of text token vectors by the target multimodal matching model, and determine a target content category indicated by the text token vector for which the similarity satisfies a preset similarity condition as the content category corresponding to the target image token vector by:
acquiring a first cosine distance between the target image characterization vector and a first text characterization vector and a second cosine distance between the target image characterization vector and a second text characterization vector, wherein the set of text characterization vectors comprises the first text characterization vector and the second text characterization vector;
Determining the first text token vector as a target text token vector when the first cosine distance is smaller than or equal to a cosine distance threshold, wherein the content category of the target image is the target content category;
and under the condition that the second cosine distance is larger than the cosine distance threshold, determining that the second text token vector is not the target text token vector, and the content category of the target image is not the target content category.
19. The apparatus of claim 11, wherein the apparatus is further configured to:
determining that the target image is a violation image under the condition that the target content category belongs to a violation content category set, wherein the preset content category set comprises the violation content category set;
and generating violation prompt information in response to determining that the target image is the violation image, wherein the violation prompt information is used for indicating that the target image audit is not passed.
20. The apparatus according to any one of claims 11 to 19, further adapted to:
performing a Kronecker inner product operation on the source prompt information and an image scaling matrix to obtain the image prompt information, wherein the scaling matrix comprises the image scaling matrix;
and executing the Kronecker inner product operation on the source prompt information and a text scaling matrix to obtain the text prompt information, wherein the scaling matrix comprises the text scaling matrix.
21. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program, wherein the computer program is executable by an electronic device to perform the method of any one of claims 1 to 10.
22. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 10 by means of the computer program.
CN202410029993.9A 2024-01-09 2024-01-09 Image processing method and device, storage medium and electronic equipment Active CN117540221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410029993.9A CN117540221B (en) 2024-01-09 2024-01-09 Image processing method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410029993.9A CN117540221B (en) 2024-01-09 2024-01-09 Image processing method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN117540221A CN117540221A (en) 2024-02-09
CN117540221B true CN117540221B (en) 2024-04-09

Family

ID=89796208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410029993.9A Active CN117540221B (en) 2024-01-09 2024-01-09 Image processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117540221B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117974081A (en) * 2024-02-21 2024-05-03 社培科技(广东)有限公司 Simulation interview teaching method and system based on AI large model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116259060A (en) * 2023-02-17 2023-06-13 马上消费金融股份有限公司 Training method and device for image classification model
CN116664857A (en) * 2023-06-12 2023-08-29 平安科技(深圳)有限公司 Image fine granularity identification method and device, storage medium and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2586265B (en) * 2019-08-15 2023-02-15 Vision Semantics Ltd Text based image search

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116259060A (en) * 2023-02-17 2023-06-13 马上消费金融股份有限公司 Training method and device for image classification model
CN116664857A (en) * 2023-06-12 2023-08-29 平安科技(深圳)有限公司 Image fine granularity identification method and device, storage medium and computer equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Pu Y C et al. Variational autoencoder for deep learning of images, labels and captions. Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016, pp. 2360-2368. *
Generating Chinese descriptions of images based on a multimodal neural network; Chen Xing; Computer Systems & Applications; 2020-09-15 (No. 9); pp. 195-201 *

Also Published As

Publication number Publication date
CN117540221A (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN113627447B (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN115115913A (en) Data processing method and device, electronic equipment and storage medium
CN113095346A (en) Data labeling method and data labeling device
CN117540221B (en) Image processing method and device, storage medium and electronic equipment
CN115565238A (en) Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product
CN116935170A (en) Processing method and device of video processing model, computer equipment and storage medium
CN113569068B (en) Descriptive content generation method, visual content encoding and decoding method and device
CN116127080A (en) Method for extracting attribute value of description object and related equipment
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
CN110852066B (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
CN116956183A (en) Multimedia resource recommendation method, model training method, device and storage medium
CN114419514B (en) Data processing method, device, computer equipment and storage medium
CN112287159A (en) Retrieval method, electronic device and computer readable medium
CN113505246B (en) Data processing method, device, terminal equipment and storage medium
CN117173530B (en) Target abnormality detection method and device
CN117876940B (en) Video language task execution and model training method, device, equipment and medium thereof
CN116824308B (en) Image segmentation model training method and related method, device, medium and equipment
CN117711001B (en) Image processing method, device, equipment and medium
CN116661803B (en) Processing method and device for multi-mode webpage template and computer equipment
CN113779225B (en) Training method of entity link model, entity link method and device
CN118135466A (en) Data processing method, device, computer, storage medium and program product
CN116976345A (en) Data processing method, device, equipment and storage medium
CN117034133A (en) Data processing method, device, equipment and medium
CN116883882A (en) Video processing method, apparatus, computer device, storage medium, and program product
CN117725236A (en) Method, device, equipment and medium for processing multimedia content

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant