CN117197569A - Image auditing method, image auditing model training method, device and equipment


Info

Publication number: CN117197569A
Application number: CN202311169765.3A
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Pending
Prior art keywords: image, text, checked, condition, training
Inventors: 蔡俊贤, 何俊烽, 陈曦, 黄展鹏
Applicant and current assignee: Tencent Technology (Shenzhen) Co., Ltd.

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to an image auditing method, an image auditing model training method, an image auditing apparatus, a computer device, a storage medium and a computer program product, and relates to computer vision and natural language processing technology. The image auditing method comprises the following steps: acquiring an image to be audited and an audit condition set comprising the respective tag matching conditions of a plurality of candidate tags; performing semantic analysis on the image to be audited based on the image elements it contains, to obtain semantic information of the image to be audited; determining, from the tag matching conditions, a target condition matched with the semantic information; performing context encoding processing on the text characterizing the target condition according to the semantic information, to obtain a matching reason between the image to be audited and the target condition; and determining an audit result of the image to be audited based on the matching reason and the target tag corresponding to the target condition among the candidate tags. By adopting the method, the working efficiency of the image auditing process can be improved.

Description

Image auditing method, image auditing model training method, device and equipment
Technical Field
The present application relates to the field of computer technology, and in particular, to an image auditing method, an image auditing model training method, an apparatus, a computer device, a storage medium, and a computer program product.
Background
With the rapid development of computer technology, content interaction platforms for spreading image information have become increasingly popular. To avoid offending content being included in the propagated image information, it is usually necessary to audit an image before it is brought online.
In the traditional image auditing method, for each image to be audited, auditing personnel and the platform need to interact many times to complete the audit. Faced with a huge number of images to be audited, the auditing process depends on long periods of human-computer interaction and consumes considerable computing and processing resources as well as labor cost. The process therefore suffers from low working efficiency.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an image auditing method, an image auditing model training method, an apparatus, a computer device, a storage medium, and a computer program product that can improve the working efficiency.
In a first aspect, the present application provides an image review method. The method comprises the following steps:
acquiring an audit condition set and an image to be audited; the audit condition set comprises a plurality of candidate tags and the respective tag matching condition of each candidate tag; a tag matching condition refers to a condition that an image conforming to the candidate tag needs to satisfy;
performing semantic analysis on the image to be audited based on the image elements contained in the image to be audited, to obtain semantic information of the image to be audited;
determining a target condition matched with the semantic information from the tag matching conditions;
performing context encoding processing on the text characterizing the target condition according to the semantic information, to obtain a matching reason between the image to be audited and the target condition;
and determining an audit result of the image to be audited based on the matching reason and the target tag corresponding to the target condition among the candidate tags.
In a second aspect, the application further provides an image auditing model training method. The method comprises the following steps:
acquiring a service condition set and a service image sample carrying a service tag; the service condition set comprises the respective candidate matching condition of each of a plurality of candidate service tags;
determining a selected condition matched with the service tag from the candidate matching conditions;
determining a matching reason between the service image sample and the selected condition;
determining an input part based on the service image sample and the selected condition, determining an output part based on the matching reason and the service tag, and constructing an instruction fine-tuning training sample;
performing instruction fine-tuning training on a pre-trained image-text dialogue model using a training data set comprising an instruction fine-tuning data set and a general image-text dialogue data set, to obtain an image auditing model; the instruction fine-tuning data set comprises the instruction fine-tuning training samples corresponding to a plurality of service image samples.
In a third aspect, the application further provides an image auditing device. The device comprises:
the acquisition module is used for acquiring the audit condition set and the image to be audited; the audit condition set comprises a plurality of candidate tags and the respective tag matching condition of each candidate tag; a tag matching condition refers to a condition that an image conforming to the candidate tag needs to satisfy;
the semantic analysis module is used for performing semantic analysis on the image to be audited based on the image elements contained in the image to be audited, to obtain semantic information of the image to be audited;
the target condition determining module is used for determining a target condition matched with the semantic information from the tag matching conditions;
the encoding module is used for performing context encoding processing on the text characterizing the target condition according to the semantic information, to obtain a matching reason between the image to be audited and the target condition;
and the audit result determining module is used for determining the audit result of the image to be audited based on the matching reason and the target tag corresponding to the target condition among the candidate tags.
In a fourth aspect, the application further provides an image auditing model training device. The device comprises:
the image sample acquisition module is used for acquiring a service condition set and a service image sample carrying a service tag; the service condition set comprises the respective candidate matching condition of each of a plurality of candidate service tags;
the condition matching module is used for determining a selected condition matched with the service tag from the candidate matching conditions;
the matching reason determining module is used for determining a matching reason between the service image sample and the selected condition;
the training sample construction module is used for determining an input part based on the service image sample and the selected condition, determining an output part based on the matching reason and the service tag, and constructing an instruction fine-tuning training sample;
the instruction fine-tuning training module is used for performing instruction fine-tuning training on a pre-trained image-text dialogue model using a training data set comprising an instruction fine-tuning data set and a general image-text dialogue data set, to obtain an image auditing model; the instruction fine-tuning data set comprises the instruction fine-tuning training samples corresponding to a plurality of service image samples.
In a fifth aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of the above method when the processor executes the computer program.
In a sixth aspect, the present application also provides a computer readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the above method.
In a seventh aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the above method.
The image auditing method, the image auditing model training method, the apparatus, the computer device, the storage medium and the computer program product acquire an image to be audited and an audit condition set comprising the respective tag matching conditions of a plurality of candidate tags; perform semantic analysis on the image to be audited based on the image elements it contains, to obtain semantic information of the image to be audited; determine a target condition matched with the semantic information from the tag matching conditions; perform context encoding processing on the text characterizing the target condition according to the semantic information, to obtain a matching reason between the image to be audited and the target condition; and determine the audit result of the image to be audited based on the matching reason and the target tag corresponding to the target condition among the candidate tags. In this way, image auditing against a variety of tag matching conditions can be completed automatically, which improves working efficiency. Moreover, because the audit result is determined based on the matching reason between the image to be audited and the target condition, the audit result can carry a larger amount of information: on one hand, it is convenient for the image material provider to make targeted rectifications; on the other hand, the processing efficiency of subsequent audit result checks can be improved. Therefore, adopting the method can improve the working efficiency of the image auditing process.
Drawings
FIG. 1 is a diagram of an application environment for the image auditing method and the image auditing model training method in one embodiment;
FIG. 2 is a flow diagram of an image auditing method in one embodiment;
FIG. 3 is a schematic diagram of candidate tags for ecological auditing in one embodiment;
FIG. 4 is a schematic diagram of an image to be audited in one embodiment;
FIG. 5 is a block diagram of a Transformer in one embodiment;
FIG. 6 is a flow chart of an image auditing method according to another embodiment;
FIG. 7 is a flow chart of an image auditing model training method in one embodiment;
FIG. 8 is a schematic diagram of an image auditing model training process in one embodiment;
FIG. 9 is a diagram showing the composition of an instruction fine-tuning training sample in one embodiment;
FIG. 10 is a schematic diagram of the structure of a pre-trained image-text dialogue model in one embodiment;
FIG. 11 is a diagram of an instruction fine-tuning training process in one embodiment;
FIG. 12 is a graph comparing the performance of an original VisualGLM model and the image auditing model obtained after instruction fine-tuning training in one embodiment;
FIG. 13 is a schematic diagram of a display interface of an image auditing model in one embodiment;
FIG. 14 is a block diagram of an image auditing apparatus in one embodiment;
FIG. 15 is a block diagram of an image auditing model training apparatus in one embodiment;
FIG. 16 is an internal structural diagram of a computer device in one embodiment;
FIG. 17 is an internal structural diagram of a computer device in another embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The image auditing method and the image auditing model training method provided by the application can be based on artificial intelligence; for example, the image auditing model in the application can be a neural network model, and the image auditing method in the application can be a process of auditing the image to be audited by using the image auditing model. Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include research directions such as computer vision technology, speech processing technology, and natural language processing technology.
Computer Vision (CV) is the science of how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking and measurement on a target, and further performing graphic processing so that the computer produces an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, among others.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, and knowledge graph techniques.
The scheme provided by the embodiments of the application relates to artificial intelligence computer vision and natural language processing technologies, and is specifically described by the following embodiments:
The image auditing method and the image auditing model training method provided by the embodiments of the application can be applied to the application environment shown in FIG. 1. The application scenario may include a terminal 102 and a server 104, which may communicate via a communication network. The communication network may be a wired network or a wireless network. Accordingly, the terminal 102 and the server 104 may be directly or indirectly connected through wired or wireless communication. For example, the terminal 102 may be indirectly connected to the server 104 through a wireless access point, or the terminal 102 may be directly connected to the server 104 through the Internet, although the application is not limited in this respect.
The terminal 102 may be, but is not limited to, a desktop computer, a notebook computer, a mobile phone, a tablet computer, an Internet-of-Things device, or a portable wearable device; the Internet-of-Things device may be a smart speaker, a smart television, a smart air conditioner, a smart vehicle-mounted device, or the like, and the portable wearable device may be a smart watch, a smart bracelet, a headset, or the like. The embodiments of the application can be applied to many scenarios associated with image auditing, including but not limited to cloud technology, artificial intelligence, intelligent traffic, assisted driving, and audio and video scenarios. The terminal 102 may have installed on it a client associated with the creation of image content, which may be software (e.g., a browser, or image or video software), a web page, an applet, or the like. The server 104 is the background server corresponding to the software, web page or applet, or a server dedicated to performing image auditing or image auditing model training; in some embodiments, image auditing and image auditing model training may also be implemented by the same server, which is not limited in this disclosure. Further, the server 104 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data, and artificial intelligence platforms. The data storage system may store data that the server 104 needs to process. The data storage system may be provided separately, may be integrated on the server 104, or may be located on a cloud or another server.
It should be noted that the image auditing method and the image auditing model training method in the embodiments of the present application may be executed by the terminal 102 or the server 104 alone, or jointly by the terminal 102 and the server 104. Taking independent execution by the server 104 as an example, in the process of training the image auditing model, the server 104 may obtain, from the terminal 102, a service condition set comprising the respective candidate matching condition of each of a plurality of candidate service tags, and a service image sample carrying a service tag. Then, a selected condition matched with the service tag is determined from the candidate matching conditions; a matching reason between the service image sample and the selected condition is determined; an input part is determined based on the service image sample and the selected condition, an output part is determined based on the matching reason and the service tag, and an instruction fine-tuning training sample is constructed; and instruction fine-tuning training is performed on a pre-trained image-text dialogue model using a training data set comprising an instruction fine-tuning data set and a general image-text dialogue data set, to obtain the image auditing model. The instruction fine-tuning data set comprises the instruction fine-tuning training samples corresponding to the service image samples.
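For illustration, the following Python sketch assembles one instruction fine-tuning sample from a service image sample, the selected condition, the matching reason, and the service tag. The field names, prompt template, and example values are illustrative assumptions, not the patent's actual sample format:

```python
# Sketch of instruction fine-tuning sample construction. All field names and
# the prompt template are illustrative assumptions, not the patent's format.

def build_instruction_sample(image_path, condition_set, selected_condition,
                             matching_reason, service_tag):
    """Input part: service image sample + selected condition;
    output part: matching reason + service tag (as in the embodiment above)."""
    return {
        "image": image_path,
        # Input part, phrased as an instruction for the image-text dialogue model.
        "instruction": (
            "Audit conditions: " + "; ".join(condition_set) + ". "
            f"Does this image hit the condition '{selected_condition}'?"
        ),
        # Output part: the matching reason followed by the service tag.
        "output": f"{matching_reason} Therefore the image hits the tag '{service_tag}'.",
    }

sample = build_instruction_sample(
    image_path="samples/banknotes.jpg",  # hypothetical path
    condition_set=["displaying a large number of banknotes",
                   "displaying a large number of luxury cars"],
    selected_condition="displaying a large number of banknotes",
    matching_reason="The image shows piles of real banknotes, which is usually "
                    "considered a display of wealth.",
    service_tag="flaunting wealth",
)
print(sample["instruction"])
print(sample["output"])
```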
After the image auditing model is obtained through training, the server 104 may acquire the image to be audited and an audit condition set comprising the respective tag matching conditions of a plurality of candidate tags, and apply the image auditing model to perform image auditing. The audit condition set may include the candidate matching conditions used in the training process, as well as newly added matching conditions that were not trained on. In the process in which the server 104 applies the image auditing model to implement image auditing: semantic analysis is performed on the image to be audited based on the image elements contained in the image to be audited, to obtain semantic information of the image to be audited; a target condition matched with the semantic information is determined from the tag matching conditions; context encoding processing is performed on the text characterizing the target condition according to the semantic information, to obtain a matching reason between the image to be audited and the target condition; and the audit result of the image to be audited is determined based on the matching reason and the target tag corresponding to the target condition among the candidate tags.
In one embodiment, as shown in FIG. 2, an image auditing method is provided. The method may be performed by a computer device, which may be the terminal or the server shown in FIG. 1. In this embodiment, the method is described, by way of example, as applied to the server in FIG. 1, and includes the following steps:
Step S202, acquiring an audit condition set and an image to be audited.
The image to be audited is an image that needs to be audited, and the purpose of the audit is to determine the audit result of the image. The image to be audited may be, for example, a picture or a video frame in a video. The audit condition set includes a plurality of candidate tags and the respective tag matching condition of each candidate tag. A candidate tag is an image tag serving as a candidate. In practical applications, multiple levels of image tags may be configured, with one tag of an upper level comprising multiple tags of the next level. For example, the first-level tag "ecologically extremely high risk" may include the second-level tags "offending marketing", "web storm", and "social negatives". It can be understood that, in this case, the tag matching conditions of the first-level tag "ecologically extremely high risk" include the tag matching conditions respectively corresponding to the second-level tags "offending marketing", "web storm", and "social negatives". As another example, the tag matching condition of the candidate tag "induced discomfort" may include "contains bloody or violent elements". In a specific embodiment, the server may construct a label tree according to the hierarchical relationship between image tags, and determine the image tags at the leaf nodes of the label tree as the candidate tags, so that finer-grained tag matching can be achieved and the accuracy of the image audit result improved.
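A minimal sketch of the label-tree idea follows; the tree contents are illustrative assumptions based on the examples above, and the traversal simply collects the leaf nodes as candidate tags:

```python
# Sketch: candidate tags are the leaf nodes of a label tree built from the
# hierarchical relationship between image tags. Tree contents are illustrative.

LABEL_TREE = {
    "ecologically extremely high risk": {
        "offending marketing": {},
        "web storm": {},
        "social negatives": {},
    },
    "induced discomfort": {},  # a tag with no sub-tags is itself a leaf
}

def collect_leaf_tags(tree):
    """Depth-first walk that returns the leaf labels as candidate tags."""
    leaves = []
    for label, children in tree.items():
        leaves.extend(collect_leaf_tags(children) if children else [label])
    return leaves

print(collect_leaf_tags(LABEL_TREE))
# ['offending marketing', 'web storm', 'social negatives', 'induced discomfort']
```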
A tag matching condition refers to a condition that an image conforming to the candidate tag needs to satisfy. For example, the audit condition set may include a candidate tag A and a candidate tag B, a tag matching condition a satisfied by images conforming to candidate tag A, and a tag matching condition b satisfied by images conforming to candidate tag B. In practical applications, audit rules can be sorted along multiple dimensions such as safety, ecology and quality, to determine an audit condition set comprising candidate tags of these dimensions. An ecological audit condition set may cover candidate tags as shown in FIG. 3, and different candidate tags may correspond to different risk levels. In a specific embodiment, a tag matching condition may include a plurality of sub-matching conditions. For example, the tag matching conditions of a "flaunting wealth" tag may include several sub-matching conditions such as "displaying a large number of banknotes", "displaying a large number of luxury cars", and "displaying luxury bags, luxury watches, gold, or jewelry"; the tag matching condition of a "vulgarity" tag may include several sub-matching conditions such as "highlighting sensitive organs of the human body" and "large areas of nudity".
Specifically, the audit rules may be sorted and summarized before image auditing, to determine an audit condition set comprising a plurality of candidate tags and the respective tag matching condition of each candidate tag. The server may then acquire the audit condition set and the image to be audited, and audit the image to be audited based on the audit condition set. Further, the server may acquire the audit condition set and the image to be audited either actively or passively, which is not limited here.
In a specific embodiment, the server may use a trained image auditing model to audit the images to be audited. In this embodiment, the audit condition set obtained by the server may include a set of service conditions for a specific service scenario used in the image auditing model training process, and a newly added condition set not used in the training process. The service condition set may include the tag matching conditions corresponding to service tags such as "identification mark", "flaunting wealth and money worship", and "vulgarity"; the newly added condition set may include, for example, the tag matching conditions corresponding to newly added tags such as "two-dimensional code" and "child soft pornography".
Step S204, performing semantic analysis on the image to be audited based on the image elements contained in the image to be audited, to obtain semantic information of the image to be audited.
The image elements characterize the objects contained in the image to be audited, such as pedestrians, umbrellas, buildings, and the like. The semantic information of the image to be audited refers to the meaning of the image to be audited. The semantic information may be expressed in a language, which may include natural language, symbolic language, mathematical language, and the like. That is, the expression of semantic information may include all the ways in which the human visual system understands an image. For example, for a cat image, the image semantics may include the word "cat" or a symbol representing the cat image. In a specific embodiment, the semantic information may be keywords or sentences that can characterize the semantics of the image to be audited. For example, as shown in FIG. 4, if the image elements contained in the image to be audited are banknotes, the semantic information of the image to be audited may be "money", "banknotes", "a large number of banknotes", "a mountain of banknotes in the figure", or the like.
In one embodiment, the server may use an image semantic segmentation technique to classify each pixel in the image to be processed and determine the element to which each pixel belongs, thereby segmenting the image to be audited into a plurality of regions, each region containing one class of image element. Then, the server performs semantic analysis on each image element to determine the object it characterizes, and determines the semantic information of the image to be audited by combining the objects contained in the image. For example, for an image to be audited containing pedestrians, umbrellas, buildings and vehicles, the semantic information may include keywords such as "pedestrian", "umbrella", "building" and "vehicle", and may also include sentences such as "a street in the rain" or "on a rainy day, the street is crowded with pedestrians and shuttling vehicles".
In one embodiment, the server may use a trained Image Encoder to perform semantic analysis on the image to be audited and determine the semantic information jointly characterized by the image elements in the image. The specific network structure of the image encoder may be, for example, a convolutional neural network (CNN) or a recurrent neural network (RNN).
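As a sketch of this step, the following uses a pretrained torchvision ResNet as a stand-in image encoder; the patent does not specify the encoder, so the model choice and embedding size are assumptions:

```python
# Sketch: a trained image encoder maps the image to be audited to a single
# embedding that summarizes its image elements. A torchvision ResNet stands in
# for the patent's unspecified encoder.
import torch
from torchvision import models, transforms
from PIL import Image

encoder = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder.fc = torch.nn.Identity()  # drop the classifier head, keep 2048-d features
encoder.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def encode_image(path):
    """Return the semantic embedding jointly characterized by the image elements."""
    image = Image.open(path).convert("RGB")
    with torch.no_grad():
        return encoder(preprocess(image).unsqueeze(0)).squeeze(0)  # shape: (2048,)
```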
In one embodiment, the image to be audited carries text information. In this case, the server may perform image semantic analysis on the image to be audited based on the image elements it contains, to obtain the image semantics of the image to be audited; perform text semantic analysis on the image to be audited based on the text information it carries, to obtain the text semantics of the image to be audited; and fuse the image semantics and the text semantics to determine the semantic information of the image to be audited.
Step S206, determining a target condition matched with the semantic information from the tag matching conditions.
Specifically, the server may determine the respective condition semantics of each tag matching condition, calculate the semantic similarity between each condition semantics and the semantic information of the image to be audited, and determine the tag matching condition with the highest semantic similarity as the target condition matched with the semantic information. The semantic similarity may be characterized, for example, by cosine similarity or Euclidean distance.
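A minimal sketch of this selection step, assuming the condition semantics and the image's semantic information have already been embedded as vectors (by any text or image encoder):

```python
# Sketch: pick the tag matching condition whose condition semantics are most
# similar to the image's semantic information, using cosine similarity.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_target_condition(image_semantics, condition_embeddings):
    """condition_embeddings maps each tag matching condition (text) to a vector."""
    return max(condition_embeddings,
               key=lambda cond: cosine_similarity(image_semantics,
                                                  condition_embeddings[cond]))
```

Euclidean distance could be substituted by replacing the key function with a negated distance.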
In a specific embodiment, step S206 includes: acquiring the respective feature vocabulary of each tag matching condition; comparing the semantic information with each feature word for similarity, and determining the target word with the highest similarity to the semantic information; and determining the tag matching condition to which the target word belongs as the target condition matched with the semantic information.
The feature vocabulary of a tag matching condition consists of the words in the condition that can characterize its semantics. In one possible implementation, the feature vocabulary may include a noun in the tag matching condition and the modifiers of that noun. A modifier may be, for example, an adjective or a quantifier. For example, the feature vocabulary in the tag matching condition "displaying a large number of luxury cars" may be "a large number of luxury cars", which includes the noun "cars", the adjective "luxury", and the quantifier "a large number of".
Specifically, the server may perform feature vocabulary extraction on each tag matching condition, to obtain the respective feature vocabulary of each condition. Then, the semantic information is compared with each feature word for similarity, the target word with the highest similarity to the semantic information is determined from the feature words, and the tag matching condition to which the target word belongs is determined as the target condition matched with the semantic information.
In a specific application, the semantic information may include a plurality of keywords. The server may compare each keyword with each feature word for similarity, determine from the feature words the target word matched by each keyword, and then determine the tag matching condition containing the most target words as the target condition matched with the semantic information.
In this embodiment, the target condition is determined through feature vocabulary matching, so that the image auditing algorithm can be further simplified while the semantic match between the target condition and the image to be audited is preserved, improving the working efficiency of the image auditing process.
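A sketch of this feature-vocabulary variant, where exact set membership stands in for the similarity comparison (a simplifying assumption):

```python
# Sketch: each semantic keyword is compared with the feature words of every
# condition, and the condition containing the most matched target words wins.

def pick_condition_by_keywords(keywords, feature_words):
    """feature_words maps each tag matching condition to its feature vocabulary."""
    return max(feature_words,
               key=lambda cond: sum(kw in feature_words[cond] for kw in keywords))

conditions = {
    "displaying a large number of banknotes": {"banknotes", "money", "cash"},
    "displaying a large number of luxury cars": {"cars", "luxury", "vehicle"},
}
print(pick_condition_by_keywords(["banknotes", "money"], conditions))
# -> displaying a large number of banknotes
```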
Step S208, performing context encoding processing on the text characterizing the target condition according to the semantic information, to obtain the matching reason between the image to be audited and the target condition.
Context encoding is an important way to extend a text. Specifically, the server may determine the text characterizing the target condition as the text to be processed, and, starting from the text to be processed, perform context encoding around the semantic information of the image to be audited, to obtain the matching reason between the image to be audited and the target condition. The server may use an unsupervised learning technique to search a lexicon for the word most related to the text to be processed, splice that word onto the text, and take the spliced result as the new text to be processed, until the spliced result forms a complete sentence, or the number of words in the spliced text reaches a set number, or the spliced result fully characterizes the matching reason between the image to be audited and the target condition. For example, the reason why FIG. 4 matches the tag matching condition of "flaunting wealth" may be that "displaying a large number of real banknotes is usually considered to show a tendency to flaunt wealth, because people usually associate money with wealth, and flaunting wealth usually means displaying one's own financial power".
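The iterative splicing described above can be sketched as follows; `relatedness` is a placeholder for the unsupervised relatedness measure, which the patent leaves unspecified, and the stopping heuristics are assumptions:

```python
# Sketch: repeatedly find the lexicon word most related to the current text,
# splice it on, and stop once a sentence is complete or a word budget is hit.

def expand_to_reason(text, lexicon, relatedness, max_words=40):
    """relatedness(text, word) -> float; higher means more related."""
    words = text.split()
    while len(words) < max_words:
        best = max(lexicon, key=lambda w: relatedness(" ".join(words), w))
        words.append(best)
        if best.endswith("."):  # the splice now forms a complete sentence
            break
    return " ".join(words)
```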
Step S210, determining the audit result of the image to be audited based on the matching reason and the target tag corresponding to the target condition among the candidate tags.
Specifically, the server may determine, as the audit result of the image to be audited, the target tag corresponding to the target condition among the candidate tags together with the matching reason; alternatively, the audit result may be determined by combining other information with the target tag and the matching reason. This other information may include, for example, the risk level of the target tag and the associated tags of the target tag. The associated tags of the target tag may include at least one of its upper-level tag or its lower-level tags.
In the image auditing method, an image to be audited and an audit condition set comprising the respective tag matching conditions of a plurality of candidate tags are acquired; semantic analysis is performed on the image to be audited based on the image elements it contains, to obtain semantic information of the image to be audited; a target condition matched with the semantic information is determined from the tag matching conditions; context encoding processing is performed on the text characterizing the target condition according to the semantic information, to obtain a matching reason between the image to be audited and the target condition; and the audit result of the image to be audited is determined based on the matching reason and the target tag corresponding to the target condition among the candidate tags. In this way, image auditing against a variety of tag matching conditions can be completed automatically, which improves working efficiency. Moreover, because the audit result is determined based on the matching reason between the image to be audited and the target condition, the audit result can carry a larger amount of information: on one hand, it is convenient for the image material provider to make targeted rectifications; on the other hand, the processing efficiency of subsequent audit result checks can be improved. Therefore, adopting the method can improve the working efficiency of the image auditing process.
In one embodiment, the image to be audited carries text information. In this case, step S204 includes: performing image semantic analysis on the image to be audited based on the image elements contained in it, to obtain the image semantics of the image to be audited; performing text semantic analysis on the image to be audited based on the text information it carries, to obtain the text semantics of the image to be audited; and fusing the image semantics and the text semantics to determine the semantic information of the image to be audited.
The text information carried by the image to be audited may include text information that exists independently of the image, such as the image title and the image description, and may also include text information corresponding to the text elements within the image. In a specific embodiment, the text information carried by the image to be audited may further include prompt information in a dialogue scenario, for example, "Does this picture involve or hit flaunting wealth?". Further, the image semantics, i.e., the meaning of the image content, refer to the semantics characterized by the image elements in the image to be audited; the text semantics, i.e., the meaning of the text content, are the semantics characterized by the text information carried by the image to be audited.
Specifically, on the one hand, the server may perform image semantic analysis on the image to be audited based on the image elements it contains, to obtain the image semantics of the image to be audited. The server may segment the image elements of the image to be audited using an image semantic segmentation technique, determine the image elements contained in the image, and then perform image semantic analysis on each element to obtain the image semantics. The server may also use a trained image encoder to perform semantic analysis on the image to be audited and determine the semantic information jointly characterized by the image elements. On the other hand, the server may perform text semantic analysis on the image to be audited based on the text information it carries, to obtain the text semantics of the image to be audited. The server may perform word-level or sentence-level semantic analysis on the text information based on natural language processing technology to determine the text semantics; it may also use a trained semantic analysis model to perform the text semantic analysis.
After obtaining the image semantics and the text semantics, the server may fuse them to determine the semantic information of the image to be audited. The specific fusion algorithm may include at least one of splicing, superposition, and other algorithms. In one embodiment, the image semantics of the image to be audited include the respective element semantics of each image element. The server may analyze the correlation between each element's semantics and the text semantics, determine the associated elements related to the text semantics among the image elements, and determine the semantic information of the image to be audited by combining the number of associated elements in the image with the semantics of each associated element. For example, where the text semantics are "flaunting wealth and money worship", the image element "banknote" contained in the image to be audited is an associated element of the text semantics. Depending on the number of associated elements in the image to be audited, the semantic information can be determined to include "a large number of banknotes", "a small number of banknotes", or the like.
In this embodiment, semantics are extracted from both the image-modality and the text-modality information in the image to be audited, so that multi-modal semantic information extraction can be realized and the accuracy of the semantic information improved.
In one embodiment, performing image semantic analysis on the image to be audited based on the image elements contained in it, to obtain the image semantics of the image to be audited, includes: splitting the image to be audited into a plurality of image blocks; performing image semantic analysis on each image block based on the image elements it contains, to obtain the respective image block semantics of each image block; and determining the image semantics comprising the semantics of each image block.
Specifically, a U-Net-style image segmentation network or a traditional image segmentation method may be adopted to segment the image to be audited into a plurality of image blocks. Furthermore, before the segmentation, background removal processing may be performed on the image to be audited to extract its effective area, and the segmentation may then be performed on the effective area. Specific algorithms for the background removal may be the Otsu algorithm, algorithms provided by OpenCV, and the like. Illustratively, the blank area in FIG. 4 is the background, and the area outside the background is the effective area. After obtaining the plurality of image blocks, the server may perform image semantic analysis on each image block based on the image elements it contains, to obtain the respective image block semantics of each block, and thereby determine the image semantics comprising the image block semantics. In practical applications, the server may analyze an image block together with its adjacent blocks, so that the accuracy of the image block semantics can be ensured even when the image elements within a single block are incomplete.
In this embodiment, the image to be audited is split into a plurality of image blocks and image semantic analysis is performed on each block separately, so that finer-grained image block semantics can be extracted and the accuracy of the image semantics improved.
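A sketch of the block-splitting step; the grid size is an illustrative choice, as the patent does not fix one:

```python
# Sketch: split the image to be audited into a regular grid of image blocks.
from PIL import Image

def split_into_blocks(path, rows=4, cols=4):
    image = Image.open(path).convert("RGB")
    w, h = image.size
    bw, bh = w // cols, h // rows
    return [image.crop((c * bw, r * bh, (c + 1) * bw, (r + 1) * bh))
            for r in range(rows) for c in range(cols)]  # row-major block list
```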
In one embodiment, fusing the image semantics and the text semantics to determine the semantic information of the image to be audited includes: performing correlation analysis between the text semantics and each image block's semantics, and determining the associated image blocks of the text semantics from the image blocks; and determining the semantic information of the image to be audited based on the number of associated image blocks and their image block semantics.
Specifically, the server may perform the correlation analysis by calculating the semantic similarity between the text semantics and each image block's semantics, and determine the image blocks whose semantic similarity satisfies a correlation condition as the associated image blocks of the text semantics. That the semantic similarity satisfies the correlation condition may mean that it is greater than, or greater than or equal to, a similarity threshold. Then, the server determines the semantic information of the image to be audited based on the number of associated image blocks and their semantics. For example, where the text semantics are "flaunting wealth", the more associated image blocks of "flaunting wealth" the image to be audited contains, the higher the degree of wealth-flaunting and the worse the guidance it provides. Assuming the image block semantics of each associated block include "banknote", the semantic information of the image to be audited can be determined to include "a large number of banknotes", "a small number of banknotes", or the like, depending on the number of associated blocks.
It can be understood that, since the text information carried by an image is generally a summary of the image's subject, determining the semantic information of the image to be audited according to the number of associated image blocks of the text semantics and their image block semantics can eliminate the influence of non-associated blocks, extract from the image the semantic information most correlated with the text semantics, and thus improve the accuracy of the semantic information.
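A sketch of this block-level fusion under assumed embeddings; the similarity threshold, the count cut-off, and the phrasing rule are illustrative assumptions:

```python
# Sketch: blocks whose semantics are sufficiently similar to the text semantics
# count as associated blocks, and their number drives the semantic phrasing.
import numpy as np

def fuse_block_semantics(text_sem, block_sems, threshold=0.5):
    """block_sems: list of (semantic label, embedding) pairs, one per image block."""
    cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    associated = [label for label, emb in block_sems
                  if cos(text_sem, emb) > threshold]
    if not associated:
        return "no content associated with the text semantics"
    amount = "a large number of" if len(associated) >= 3 else "a small number of"
    return f"{amount} {associated[0]}"  # e.g. "a large number of banknotes"
```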
In one embodiment, step S208 includes: determining, from the text characterizing the target condition, a text to be processed whose characterized semantics are similar to the semantic information; and performing context encoding processing on the text to be processed, to obtain the matching reason between the image to be audited and the target condition.
Specifically, the server may determine, from the text characterizing the target condition and according to the condition semantics it characterizes, a text to be processed whose semantics are similar to the semantic information. Context encoding processing is then performed on the text to be processed, to obtain the matching reason between the image to be audited and the target condition.
In one specific application, the server may determine a word or sentence in the target condition whose characterized semantics are similar to the semantic information of the image to be audited as the text to be processed. For example, where the target condition includes "displaying a large number of banknotes", "displaying a large number of luxury cars", and "displaying luxury bags, luxury watches, gold, or jewelry", the text to be processed similar to the semantics characterized by the semantic information of FIG. 4 may include "banknotes". Performing context encoding on "banknotes" may yield the matching reason "displaying banknotes is usually considered to show a tendency to flaunt wealth".
In one specific application, where the target condition includes a plurality of sub-matching conditions, the condition semantics characterized by the target condition may include the respective sub-condition semantics of each sub-matching condition. The server may therefore determine, from the texts characterizing each sub-matching condition and according to each sub-condition's semantics, the text to be processed whose semantics are similar to the semantic information. For example, where the target condition includes "displaying a large number of banknotes", "displaying a large number of luxury cars", and "displaying luxury bags, luxury watches, gold, or jewelry", the text to be processed similar to the semantics characterized by the semantic information of FIG. 4 may be "displaying a large number of banknotes". Performing context encoding on "displaying a large number of banknotes" may yield the matching reason "displaying a large number of banknotes is usually considered to show a tendency to flaunt wealth, because people usually associate money with wealth, and flaunting wealth usually means displaying one's own financial power".
In the above embodiment, the text to be processed whose semantics are similar to the semantic information is first determined from the text characterizing the target condition, and context encoding is then performed on it to obtain the matching reason between the image to be audited and the target condition. This eliminates the influence of content in the target condition that is irrelevant to the semantic information, improving the accuracy of the matching reason.
In one embodiment, performing context encoding processing on the text to be processed to obtain the matching reason between the image to be audited and the target condition includes: performing word segmentation and conversion on the text to be processed, to obtain a plurality of word features; performing masked self-encoding processing on the text to be processed in sequence according to the position, in the text, of the word characterized by each word feature, and determining the contextual features of the text to be processed; and splicing the text to be processed with the context information obtained by decoding the contextual features, and determining the spliced text as the matching reason between the image to be audited and the target condition.
The contextual features include at least one of the preceding-context feature of the first word and the following-context feature of the last word in the text to be processed. That is, at least one of the preceding context information and the following context information of the text to be processed can be obtained by encoding, so that text expansion based on the text to be processed can be realized, thereby obtaining the matching reason.
Specifically, the server may segment the text to be processed into a plurality of words using a dictionary-based, understanding-based or statistics-based word segmentation method, and perform feature extraction on each word to obtain its word feature. Then, according to the position in the text of the word characterized by each word feature, masked self-encoding processing is performed on the text in sequence to determine its contextual features, and the contextual features are decoded to obtain the context information. Finally, the server splices the text to be processed with the context information and determines the spliced text as the matching reason between the image to be audited and the target condition. The dictionary-based word segmentation method requires a dictionary compiled in advance: the text to be processed is scanned and matched against the entries in the dictionary; whenever an entry is matched, a word is segmented out, and so on, until no further segmentation is possible and the words contained in the text are obtained. The understanding-based word segmentation method uses artificial intelligence technology, combined with grammatical, semantic and psychological knowledge, to simulate a person's understanding of a text and segment it into words. The statistics-based word segmentation method calculates the probability of adjacent co-occurrence between characters and determines consecutive characters with a high co-occurrence probability as one word.
In the above embodiment, the context encoding processing combines word segmentation with a masked self-encoding algorithm, which amounts to expanding the matching reason at the lexical level, so that the accuracy of the generated matching reason can be ensured.
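A sketch of the dictionary-based method using forward maximum matching (one classic realization of the scan-and-match procedure described above); the mini-dictionary is an illustrative assumption:

```python
# Sketch: dictionary-based word segmentation by forward maximum matching --
# at each position, take the longest dictionary entry that matches.

def forward_max_match(text, dictionary, max_len=5):
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:  # fall back to one character
                words.append(text[i:j])
                i = j
                break
    return words

print(forward_max_match("展示大量钞票", {"展示", "大量", "钞票"}))
# ['展示', '大量', '钞票']
```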
The specific manner of performing the masked self-encoding processing is not unique. In a specific embodiment, the contextual features include the preceding-context feature of the first word in the text to be processed. The server may determine the preceding positions, in the text to be processed, of the word characterized by each word feature as that feature's mask positions; perform masked self-encoding processing on the text in sequence according to each word feature's mask positions, to obtain the respective preceding-context feature of each word feature; decode the preceding-context feature of the first word in the text to obtain the preceding context information of the text; and splice the preceding context information with the text to be processed, determining the spliced text as the matching reason between the image to be audited and the target condition. Here, for every word feature except that of the first word, the expected preceding context information characterized by its preceding-context feature satisfies a text similarity condition with the actual preceding context of the word in the text to be processed.
In one embodiment, the contextual features include the following-context feature of the last word in the text to be processed. In this case, performing masked self-encoding processing on the text to be processed in sequence according to the position of the word characterized by each word feature, and determining the contextual features of the text, includes: for each word feature, determining the following positions, in the text to be processed, of the word characterized by that feature as its mask positions; performing masked self-encoding processing on the text in sequence according to each word feature's mask positions, to obtain the respective following-context feature of each word feature; and determining, from the respective following-context features, the following-context feature of the last word in the text.
Here, for every word feature except that of the last word, the expected following context information characterized by its following-context feature satisfies a text similarity condition with the actual following context of the word in the text to be processed. Specifically, the server may determine, for each word feature, the positions of the words after the word characterized by that feature as its mask positions. That is, for each word, only the words preceding it are visible during the masked self-encoding processing. The server then performs masked self-encoding processing on the text in sequence according to each word feature's mask positions, obtains the respective following-context feature of each word feature, and determines the following-context feature of the last word in the text. Encoding is thus learned in the masked self-encoding passes corresponding to every word feature except the last, so the predictive capability for the following-context feature can be improved through multiple encodings, which in turn improves the accuracy of the finally determined matching reason.
In a specific embodiment, as shown in fig. 5, a transform structure is used to perform context encoding processing on the text to be processed, so as to obtain a matching reason between the image to be checked and the target condition. Specifically, for an input text to be processed, each word in the text is first converted into a word vector x= (X) 1 ,x 2 ,...,x n ) Word vectors are then encoded by a masked multi-headed self-attention mechanism to let each word vector obtain context information, the masked multi-headed self-attention mechanism being calculated as follows:
Q = W_q·X, K = W_k·X, V = W_v·X

Attention(Q, K, V) = softmax(Q·K^T / √d_k + M)·V
In the above formulas, W_q, W_k and W_v are three trainable parameter matrices, namely the query matrix, the key matrix and the value matrix; d_k is the dimension of the Q and K vectors; and M is the mask matrix built from the mask positions of the tokens. In this embodiment, the mask positions of a word are the words following it. It will be appreciated that in other embodiments, the mask positions may be the words preceding a word, or may include both the words preceding and the words following it. As shown in fig. 5, the output of the masked multi-head self-attention mechanism is added to the input vector and fed into a normalization layer. The normalized vector then enters the forward layer, which may consist of a linear layer and an activation layer. Finally, after another addition and normalization, the encoded vector is obtained. This calculation process can be repeated multiple times; multiple rounds of encoding enhance the expressive capacity of the Transformer and improve the accuracy of the finally output predicted text, that is, of the context information obtained by decoding the context features.
Further, in this embodiment, the encoded representation h_n of the last word in the text to be processed may be used to predict the word to be generated, calculated as: P_n = softmax(W_h·h_n). Here P_n is the predicted probability distribution over a plurality of candidate words, and W_h is a trainable parameter matrix that maps h_n to that distribution. In practical applications, the predicted word with the highest probability can be selected and spliced onto the end of the input text, and the whole sentence is decoded in this cyclic, word-by-word manner.
In the above embodiment, the expected following information characterized by the following feature of each word feature other than the last satisfies the text similarity condition with the actual following text of the word characterized by that word feature in the text to be processed. This means that encoding is learned in the mask self-encoding pass corresponding to every word feature other than the last, so the multiple encoding passes improve the ability to predict following features, which in turn improves the accuracy of the finally determined matching reason.
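Purely as an illustration of the masked self-attention and cyclic decoding described above, the following is a minimal single-head NumPy sketch; the toy dimensions and the greedy decoding loop are simplifying assumptions made for brevity, not details of the actual model:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X, Wq, Wk, Wv):
    # X: (n, d) word vectors; Q = Wq·X, K = Wk·X, V = Wv·X (row convention)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    n, dk = Q.shape
    # The mask positions of a word are the words following it: an upper
    # triangle of -inf hides them, so each word sees only what precedes it.
    M = np.triu(np.full((n, n), -np.inf), k=1)
    A = softmax(Q @ K.T / np.sqrt(dk) + M)  # softmax(Q·K^T/√d_k + M)
    return A @ V

def greedy_decode(encode, Wh, embed, ids, steps):
    # Repeatedly take the encoding h_n of the last word, predict
    # P_n = softmax(Wh·h_n), and splice the most probable word onto the text.
    for _ in range(steps):
        H = encode(embed[ids])         # (n, d) encoded word vectors
        p = softmax(Wh @ H[-1])        # distribution over candidate words
        ids = ids + [int(p.argmax())]  # append the predicted word
    return ids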
In one embodiment, as shown in fig. 6, an image auditing method is provided. The method may be performed by a computer device, which may be the terminal or the server shown in fig. 1. Taking the computer device being a server as an example, in this embodiment the method includes the following steps:
Step S601, acquiring an audit condition set and an image to be audited;
the auditing condition set comprises a plurality of candidate labels and respective label matching conditions of each candidate label; the label matching condition refers to a condition which is met by the image conforming to the candidate label; the image to be checked carries text information;
step S602, segmenting an image to be checked into a plurality of image blocks;
step S603, based on the image elements contained in each image block, respectively performing image semantic analysis on each image block to obtain the respective image block semantics of each image block;
step S604, determining the image semantics including the semantics of each image block;
step S605, text semantic analysis is carried out on the image to be checked based on text information carried by the image to be checked, so as to obtain text semantics of the image to be checked;
step S606, the text semantics are respectively subjected to correlation analysis with each image block semantics, and the associated image block of the text semantics is determined from the image blocks;
step S607, determining semantic information of the image to be checked based on the number of the associated image blocks and the image block semantics of the associated image blocks;
step S608, obtaining respective characteristic words of the tag matching conditions;
Step S609, respectively carrying out similarity comparison on the semantic information and each characteristic word, and determining a target word with the highest similarity with the semantic information;
step S610, determining the tag matching condition of the target vocabulary as a target condition matched with the semantic information;
step S611, determining a text to be processed similar to the semantic represented by the semantic information from the text representing the target condition;
step S612, word segmentation conversion processing is carried out on the text to be processed, so as to obtain a plurality of word characteristics;
step S613, corresponding to each word feature, determining the following positions of the word characterized by that word feature in the text to be processed as the mask positions corresponding to the word feature;
step S614, according to the mask positions corresponding to each word feature, sequentially performing mask self-coding processing on the text to be processed to obtain the respective following feature of each word feature;
the expected following information characterized by the following feature of each word feature other than the last, and the actual following text of the word characterized by that word feature in the text to be processed, satisfy the text similarity condition;
and step S615, splicing the text to be processed and the following information obtained by decoding the following feature of the last word in the text to be processed, and determining the spliced text as the matching reason of the image to be checked and the target condition.
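Read end to end, steps S601–S615 amount to the pipeline sketched below. Every helper name in this Python sketch (segment_into_blocks, image_block_semantics, and so on) is a hypothetical placeholder for the processing the corresponding step describes, not an API defined by the application:

def audit_image(image, audit_conditions):
    # S602-S605: split the image into blocks and analyze semantics
    blocks = segment_into_blocks(image)                       # hypothetical
    block_sems = [image_block_semantics(b) for b in blocks]   # S603-S604
    text_sem = text_semantics(carried_text(image))            # S605
    # S606-S607: fuse the semantics of the text-related image blocks
    related = [s for s in block_sems if correlated(text_sem, s)]
    semantic_info = fuse_semantics(related, count=len(related))
    # S608-S610: pick the label matching condition whose feature word
    # is most similar to the semantic information
    target_word = max(feature_words(audit_conditions),
                      key=lambda w: similarity(semantic_info, w))
    target = condition_of(target_word, audit_conditions)
    # S611-S615: context-encode the condition text into a matching reason
    text = most_similar_span(target.text, semantic_info)      # S611
    reason = text + decode_following(mask_self_encode(text))  # S612-S615
    return target.label, reason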
The image auditing method can automatically complete image auditing for a variety of label matching conditions, which helps improve working efficiency. Moreover, because the auditing result is determined based on the matching reason of the image to be audited and the target condition, the auditing result can carry a larger amount of information: on the one hand, this makes it convenient for the image material provider to make targeted rectifications; on the other hand, it improves the processing efficiency of the subsequent review of auditing results. Therefore, adopting this method improves the working efficiency of the image auditing process.
The image auditing method can be realized through an image auditing model, specifically, the server can input the acquired image to be audited into the image auditing model, and the image auditing model audits the image to be audited based on the method to obtain auditing results of the image to be audited. The training process of the image review model is described below.
In some embodiments, as shown in fig. 7, the present application further provides an image auditing model training method. The method may be executed by a computer device, which may be the terminal or the server shown in fig. 1. Taking the method being applied to the server in fig. 1 as an example, in this embodiment the method includes the following steps:
Step S702, a service condition set and a service image sample carrying a service tag are obtained.
The service condition set comprises candidate matching conditions of each of a plurality of candidate service labels, and the service label refers to one of the candidate service labels. The candidate service labels may refer to labels in a specific service scenario, and the service image samples refer to service images serving as samples in the service scenario. The candidate matching condition of the candidate service label refers to a condition which is satisfied by the image conforming to the candidate service label.
Specifically, the server may obtain a set of business conditions and a business image sample carrying a business label. For example, as shown in fig. 8, the set of business conditions may include ecological auditing rules for images, and the business image sample is a picture labeled with an ecological auditing label. The ecological auditing rules may include a "flaunting wealth rule" corresponding to the candidate business label "flaunting wealth and money worship": displaying a large number of banknotes; displaying a large number of luxury vehicles; displaying luxury bags, famous watches, gold, and jewelry.
Step S704, determining a selected condition matching the service tag from the candidate matching conditions.
Specifically, the server may determine, from the candidate matching conditions, the selected condition matching the service tag according to the mapping relationship between candidate service tags and candidate matching conditions in the service condition set. For example, in the case where the service condition set includes candidate service tags such as "flaunting wealth and money worship", "illegal operation" and "cyberbullying", and the service tag carried by the service image sample is "flaunting wealth and money worship", the server may determine the candidate matching condition of "flaunting wealth and money worship" as the selected condition matching the service tag of the service image sample.
Step S706, determining the matching reason of the business image sample and the selected condition.
Specifically, the server can perform image semantic analysis on the service image sample to determine its semantic information, perform context encoding processing on the selected condition according to that semantic information, and so determine the matching reason of the service image sample and the selected condition. Alternatively, by providing prompt information that includes the selected condition, a model developer can be guided to compile the matching reason of the service image sample and the selected condition on the basis of that prompt information.
Taking the case that the service image sample is fig. 4 as an example, the prompt information may be:
"now you are an ecological auditor, dazzle the rules:
a) Displaying a plurality of banknotes;
b) Displaying a large number of luxury vehicles;
c) Displaying luxury bags, famous watches, gold, jewelry, and the like;

Is this picture related to, or does it hit, flaunting wealth?".
The matching reasons may be:
"YES (YES), this picture hits the following rule of dazzling:
a) A large number of banknotes are presented. Displaying a large number of real banknotes is often considered to be rich in nature, as people often tie money to money and are rich in financial effort, which often means displaying themselves. ".
In step S708, the instruction fine-tuning training sample is constructed by taking the business image sample and the selected condition as inputs and the matching reason and the business label as outputs.
Instruction fine-tuning is the process of collecting instruction fine-tuning training samples according to the requirements of a downstream task, and further training a pre-trained base model using an instruction fine-tuning dataset comprising a plurality of such training samples. An instruction fine-tuning training sample comprises an input part and an output part: the input part is an instruction issued by a human to the machine, or a question posed to the machine, and may also contain background knowledge; the output part is the machine's reply to that instruction or question.
Specifically, the server may determine the input portion based on the business image samples and the selected conditions, determine the output portion based on the matching reason and the business labels, and construct the instruction fine tuning training samples. In a specific embodiment, the input part may include a service image sample and a prompt message, where the prompt message includes a service tag carried by the service image sample and a selected condition matched with the service tag. The server can determine prompt information of the instruction fine-tuning training sample according to the service label and the selected condition. For example, as shown in fig. 9, the hint information of the instruction fine tuning training samples corresponding to fig. 4 may be a question determined based on the business label and the selected condition. In a specific embodiment, as shown in fig. 9, the output portion may include a matching reason and a traffic label.
Further, in the case that the business image sample is a negative sample of the business label, the server may instead determine the reason why the business image sample does not match the selected condition, and determine the output portion based on that mismatch reason and the business label to construct the instruction fine-tuning training sample. For example, the mismatch reason of a negative sample of "flaunting wealth and money worship" may be: "The man in this picture wears glasses and a mask, a blue shirt and jeans, and carries a backpack; he looks casual, and the picture does not involve flaunting wealth."
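As a sketch only, one way such a training sample could be assembled is shown below; the field names and the helper itself are illustrative assumptions, not the application's actual data format:

def build_instruction_sample(image_path, label, rule_text, reason, positive=True):
    # Input part: the business image sample plus prompt information built
    # from the business label and the selected condition
    prompt = (
        "Now you are an ecological auditor. "
        f"The rules for '{label}' are:\n{rule_text}\n"
        f"Is this picture related to, or does it hit, '{label}'?"
    )
    # Output part: the (mis)match reason together with the business label
    verdict = "YES" if positive else "NO"
    return {"image": image_path, "input": prompt,
            "output": f"{verdict}. {reason}"}

sample = build_instruction_sample(
    "sample.jpg",
    "flaunting wealth and money worship",
    "a) displaying a large number of banknotes; b) ...",
    "A large number of banknotes are displayed.",
)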
Step S710, using a training dataset comprising the instruction fine-tuning dataset and an image-text dialogue general dataset, performing instruction fine-tuning training on the pre-trained image-text dialogue model to obtain the image auditing model.

The instruction fine-tuning dataset comprises the instruction fine-tuning training samples corresponding to the business image samples. The image-text general datasets may include dialogue datasets, image-captioning datasets, and the like; for example, they may include the LLaVA dataset, the LAION dataset, image-text pair data, and the COCO (Common Objects in Context) dataset.
The pre-trained image-text dialogue model refers to a model that already has a certain image-text dialogue capability after pre-training. Its structure is not unique; it may be, for example, a VisualGLM-based large language model (Large Language Model, LLM). VisualGLM-6B is an open-source multimodal dialogue language model supporting images, Chinese and English; its language part is based on ChatGLM-6B with 6.2 billion parameters, and its image part bridges the visual model and the language model by training BLIP2-Qformer, giving the whole model 7.8 billion parameters. As shown in fig. 10, BLIP-2 (Bootstrapping Language-Image Pre-training) introduces a new visual-language pre-training paradigm that combines and leverages arbitrary pre-trained visual encoders and LLMs without pre-training the whole architecture end to end. This achieves state-of-the-art results on multiple visual-language tasks while significantly reducing the number of training parameters and the pre-training cost. As shown in fig. 10, in the case where the input includes a sunset picture and the prompt text "write romantic information for this picture", the model output may be "Love is like the sunset: it is hard to see it coming, but when it comes, it is so beautiful."
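For illustration, and assuming the publicly released THUDM/visualglm-6b checkpoint (the exact chat signature may differ between releases), such a pre-trained image-text dialogue model can be loaded and queried roughly as follows:

from transformers import AutoModel, AutoTokenizer

# Load the open-source VisualGLM-6B multimodal dialogue model
tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b",
                                          trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/visualglm-6b",
                                  trust_remote_code=True).half().cuda()

# Ask an audit-style question about an image
response, history = model.chat(tokenizer, "sample.jpg",
                               "Is this picture related to flaunting wealth?",
                               history=[])
print(response)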
Specifically, the server may mix the instruction fine-tuning dataset and the image-text general dataset according to a certain data ratio to obtain the training dataset. The data ratio can be set flexibly according to the specific business scenario, or determined from expert experience. The training dataset is then used to perform instruction fine-tuning training on the pre-trained image-text dialogue model to obtain the image auditing model, so that the resulting model has the image auditing capability while retaining the image-text dialogue capability. For example, during instruction fine-tuning training, a loss may be computed between the output information predicted by the model and the output part of the instruction fine-tuning training sample, and the training is complete once the loss function converges. The loss function may be, for example, a cross-entropy loss function.
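A minimal sketch of such ratio-based mixing is given below; the 30% ratio is purely an assumed placeholder, since the application leaves the ratio to the business scenario or expert experience:

import random

def mix_datasets(instruction_set, general_set, general_ratio=0.3, seed=0):
    # Add general image-text dialogue samples in proportion to the
    # instruction samples, so the model keeps its dialogue ability
    # while learning the auditing task.
    rng = random.Random(seed)
    n_general = min(len(general_set), int(len(instruction_set) * general_ratio))
    mixed = list(instruction_set) + rng.sample(list(general_set), n_general)
    rng.shuffle(mixed)
    return mixed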
The specific method of performing the instruction fine-tuning training is not unique and may include, for example, Finetune, LoRA, QLoRA and P-tuning. The LoRA method mainly injects trainable modules into the model. A pre-trained, converged model contains many dense layers that perform matrix multiplication; these layers are usually full-rank, but the actual amount of change during fine-tuning is small and appears as a low-rank change to the matrix multiplication. The purpose of injecting trainable layers is to let them learn this low-rank change of fine-tuning while the other parts of the model are frozen, which greatly reduces the model's number of training parameters. By default, VisualGLM adds LoRA fine-tuning with rank=10 at layers 0 and 14 of the ChatGLM model; the layer_range and lora_rank parameters can be adjusted according to the specific situation and data volume. The principle of QLoRA is basically the same as that of LoRA, with 4-bit quantization additionally applied to the linear layers of ChatGLM, so that fine-tuning requires only 9.8 GB of GPU memory.
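As an illustrative sketch of such an injection using the Hugging Face peft library — the target module name and the alpha/dropout values are assumptions, not VisualGLM's actual configuration:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=10,                                # rank of the injected low-rank update
    lora_alpha=32,                       # scaling factor (assumed value)
    target_modules=["query_key_value"],  # assumed attention projection name
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# base_model: the pre-trained language model; everything outside the
# injected low-rank adapters stays frozen.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()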
P-tuning v2 is, in short, an improvement over soft prompts. A soft prompt acts only at the embedding layer, which weakens interactivity in actual testing; with all model parameters frozen and only the inserted tokens being learned, the change is so small that the effect is sometimes unstable and worse than fine-tuning. P-tuning v2 does not target the embedding layer alone, but inserts continuous tokens into every layer, increasing the amount of change and the interactivity. How much P-tuning v2's soft prompts help depends fairly strongly on the model's parameter count: on models with more than 10B parameters its effect catches up with fine-tuning, but because a token is inserted into every layer, which increases the amount of change during model training, P-tuning v2 is better suited to smaller models.
According to the image auditing model training method above, an instruction fine-tuning training sample is constructed from the business condition set and the business image sample, so high-quality instruction fine-tuning training samples can be constructed automatically. A training dataset comprising multiple instruction fine-tuning training samples and an image-text dialogue general dataset is then used to perform instruction fine-tuning training on the pre-trained image-text dialogue model to obtain the image auditing model. The resulting image auditing model can adapt to different auditing rules, which greatly shortens its development time in a specific business scenario and reduces development and serving costs. Moreover, the resulting image auditing model can output the reason why an image matches a condition, which makes it convenient for the image material provider to make targeted rectifications and improves the processing efficiency of the subsequent review of auditing results. Therefore, adopting this method improves the working efficiency of the image auditing process.
In one embodiment, step S710 includes: determining a training dataset comprising the instruction fine-tuning dataset and an image-text general dataset; determining fine-tuning parameters matching the training dataset according to the number of samples of the training dataset; and performing instruction fine-tuning training on the pre-trained image-text dialogue model based on the training dataset and the fine-tuning parameters to obtain the image auditing model.
The fine-tuning parameters may include the rank size (Rank), the learning rate, the batch size, and so on. Specifically, the server may determine a training dataset comprising the instruction fine-tuning dataset and the image-text general dataset, and determine fine-tuning parameters matching the training dataset based on its number of samples. For example, the rank size may be positively correlated with the number of samples. The learning rate can be turned down appropriately so that the model learns the knowledge of the business scenario without losing the knowledge of the original large model; an optional range for the learning rate is 5e-5 to 5e-4, and within this range the learning rate may be positively correlated with the number of samples. The batch size may also be positively correlated with the number of samples: a larger batch size lets the machine process multiple pictures at the same time, improving efficiency, at the cost of occupying more GPU memory. In a specific embodiment, the fine-tuning parameters may include hyperparameters. As shown in fig. 11, the server may search a hyperparameter grid for the optimal parameter configuration, optimizing according to the number of samples of the training dataset, to determine the fine-tuning parameters. After the fine-tuning parameters are determined, the server quantizes the model parameters to reduce GPU memory occupation when the model is applied, and then performs instruction fine-tuning training on the pre-trained image-text dialogue model based on the training dataset and the quantized model parameters to obtain the image auditing model. In one possible implementation, the LoRA fine-tuning training parameters determined from the data scale of the training sample set may include: LoRA fine-tuning with rank=12 added at layers 0, 5, 10 and 14, a learning rate of 5e-5, and batch size = 20.
In the above embodiment, the fine tuning parameters are determined according to the data scale of the training sample set, so that the quality of the image auditing model obtained by training can be improved.
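Purely as an assumed illustration of this sample-count-driven selection — the thresholds and values below are invented for the sketch, except that each parameter grows with the sample count as described above:

def pick_finetune_params(n_samples):
    # Rank size, learning rate and batch size are all positively
    # correlated with the number of training samples.
    if n_samples < 1_000:
        return {"lora_rank": 8, "lr": 5e-5, "batch_size": 8}
    if n_samples < 10_000:
        return {"lora_rank": 12, "lr": 1e-4, "batch_size": 16}
    return {"lora_rank": 16, "lr": 5e-4, "batch_size": 32}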
In one embodiment, the image-text dialogue model comprises a pre-trained image encoder, a query transformer and a pre-trained language model, connected in sequence. In the case of this embodiment, performing instruction fine-tuning training on the pre-trained image-text dialogue model based on the training dataset and the fine-tuning parameters to obtain the image auditing model includes: freezing the pre-trained image encoder and the pre-trained language model, and performing instruction fine-tuning training on the query transformer with the fine-tuning parameters to obtain an updated transformer; and determining an image auditing model comprising the pre-trained image encoder, the updated transformer and the pre-trained language model.
Specifically, as shown in fig. 10, the image-text dialogue model may be a VisualGLM-based language model comprising a pre-trained image encoder, a pre-trained language model, and a query transformer connecting them. The modal gap between the visual and language models is bridged by adding a lightweight query transformer (Query Transformer, Q-Former) between the pre-trained image encoder and the pre-trained language model. Therefore, during instruction fine-tuning training, the server can freeze the pre-trained image encoder and the pre-trained language model, perform instruction fine-tuning training on the query transformer with the fine-tuning parameters to obtain the updated transformer, and thereby determine an image auditing model comprising the pre-trained image encoder, the updated transformer and the pre-trained language model.
In this embodiment, the pre-trained image encoder and the language model are frozen, and only the Q-Former connected to the pre-trained image encoder and the pre-trained language model is subjected to the instruction fine tuning training, so that the training task amount can be significantly reduced, and the efficiency is improved.
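A minimal PyTorch-style sketch of this selective training follows; the attribute names (image_encoder, qformer, language_model) are assumed for illustration and do not reflect the model's real module names:

import torch

def freeze_for_qformer_training(model):
    # Freeze the pre-trained image encoder and the pre-trained language model
    for p in model.image_encoder.parameters():
        p.requires_grad = False
    for p in model.language_model.parameters():
        p.requires_grad = False
    # Only the Q-Former bridging the two modalities is updated
    for p in model.qformer.parameters():
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

trainable = freeze_for_qformer_training(model)
optimizer = torch.optim.AdamW(trainable, lr=5e-5)  # assumed learning rate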
After the training of the large picture-auditing model is completed, the model can be evaluated in order to verify its validity and generalization. As shown in fig. 8, the specific ways of performing model evaluation may include supervised task evaluation, zero-shot task evaluation (tasks not involved in training, i.e., unsupervised tasks), and general comprehension capability evaluation. In the inference stage, the model takes a picture and a corresponding question as input, and processing that input yields the model's answer, i.e., the output. By matching the model's answers against the pre-labeled structured label data, using keyword matching, text similarity comparison and the like, the accuracy and recall rate of the model can be calculated. The accuracy rate represents the proportion of samples the model predicts correctly out of the total number of samples, and the recall rate represents the proportion of actual positive cases that the model predicts as positive. In order to evaluate the generalization of the model, as shown in fig. 12, the application also evaluates some business data that did not participate in training (i.e., the zero-shot scenario) and compares the effect of the application with that of the native VisualGLM. As can be seen from fig. 12:
(1) The image auditing model achieves good performance indicators in supervised business scenarios, and some categories (for example, the business scenario corresponding to label C) can reach an accuracy above 95%. On business the model has never seen, for example the business scenario corresponding to label F, the large picture-auditing model still achieves a certain effect, which demonstrates that the generalization of the model is very good and that no bias or catastrophic forgetting occurred during fine-tuning. Label C may be "flaunting wealth and money worship", and label F may be "two-dimensional code in picture".
(2) Compared with the native VisualGLM, the image auditing model trained by the application achieves overwhelming advantages on the indicators of various business data. The native VisualGLM cannot play its role well on some specific auditing business, whereas the application obtains excellent performance after instruction fine-tuning on the structured data.
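An assumed sketch of the keyword-matching evaluation described above is shown below; the single-keyword rule is deliberately simplistic, and a real evaluation would also fold in text similarity comparison:

def evaluate(answers, labels, keyword="YES"):
    # Turn each free-text answer into a binary prediction by keyword
    # matching against the pre-labeled structured data.
    preds = [keyword in a for a in answers]
    accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)
    true_pos = sum(p and l for p, l in zip(preds, labels))
    recall = true_pos / max(1, sum(labels))  # share of real positives found
    return accuracy, recall

acc, rec = evaluate(["YES. Banknotes shown.", "NO."], [True, False])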
In practical applications, auditing rules are complex and changeable. An ecological security auditing business, for example, covers more than 30 primary labels, more than 50 secondary labels and more than 100 tertiary labels. The rules of these labels are very complex, and different business parties have different auditing requirements, so developing them one by one increases the difficulty of algorithm development. Traditional model development requires training multiple small models to deal with different auditing rules separately, which greatly increases algorithm development and serving costs and hinders the improvement of the related business. The application can automatically adapt to multiple auditing rules and flexibly fit different services, thereby reducing algorithm development and serving costs; at the same time, the trained image auditing model can provide a fast and excellent content auditing service. Further, as shown in fig. 13, an online demo of the picture auditing model may be developed using the Gradio tool for real-time interactive demonstration. Specifically, a developer can enter text prompt information in the text box, submit image prompt information in the image box, and click submit, after which the model outputs an answer.
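Such a demo could be wired up roughly as sketched below, assuming an audit(image, question) wrapper around the trained model's chat interface:

import gradio as gr

def audit(image_path, question):
    # Hypothetical wrapper: run the trained image auditing model
    answer, _ = model.chat(tokenizer, image_path, question, history=[])
    return answer

demo = gr.Interface(
    fn=audit,
    inputs=[gr.Image(type="filepath"), gr.Textbox(label="Text prompt")],
    outputs=gr.Textbox(label="Audit result"),
    title="Image Auditing Model Demo",
)
demo.launch()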
The image auditing model obtained by this training can match various image auditing rules with a single model and performs well on different auditing rules. It can therefore replace the per-rule small image auditing models of the traditional technology (one per label matching condition), reducing algorithm development costs while maintaining a good auditing effect. In practical applications, auditing rules keep changing over time; with the scheme of the application, new auditing rules can be flexibly adapted to without retraining. Compared with the traditional pipeline of data collection, model training and model deployment, the application can perform the relevant auditing simply by modifying the changed auditing rule information, which greatly shortens the algorithm development cycle. In addition, outputting the matching reason of the image to be audited and the target condition alongside the matched target label makes it convenient for the image material provider to make targeted rectifications, and improves the processing efficiency of the subsequent review of auditing results.
In one embodiment, the application further provides a scenario applying the image auditing method and the image auditing model training method. The scenario may be, for example, an image material auditing scenario of a content interaction platform. Before auditing image materials, the content server can construct an instruction fine-tuning dataset based on the service condition set and the service image samples carrying service tags, and use the instruction fine-tuning dataset to perform instruction fine-tuning training on the pre-trained image-text dialogue model to obtain the image auditing model. The image auditing model is then applied to audit image materials. During the auditing of image materials, the content server may obtain the image material to be audited and an audit condition set including the tag matching conditions of each of a plurality of candidate tags. The audit condition set may include the candidate matching conditions of the training process, as well as newly added matching conditions that were not trained on. Based on the image elements contained in the image material to be audited, semantic analysis is performed on the image material to obtain its semantic information; a target condition matching the semantic information is determined from the tag matching conditions; context encoding processing is performed on the text characterizing the target condition according to the semantic information to obtain the matching reason of the image material to be audited and the target condition; and the auditing result of the image material to be audited is determined based on the matching reason and the target label corresponding to the target condition among the candidate labels.
In one embodiment, the application further provides another scenario applying the image auditing method and the image auditing model training method. The scenario may be, for example, a pre-annotation data screening scenario. In the early stage of algorithm development, screening violating data out of a large amount of normal data can increase the concentration of data to be annotated, reduce the data annotation cost, and shorten the data annotation cycle. Specifically, before annotating the images to be annotated, the server can construct an instruction fine-tuning dataset based on the service condition set and the service image samples carrying service tags, and use the instruction fine-tuning dataset to perform instruction fine-tuning training on the pre-trained image-text dialogue model to obtain the image auditing model. The image auditing model is then applied to annotate the images to be annotated. During annotation, the server may obtain the image to be annotated and an audit condition set including the tag matching conditions of each of a plurality of candidate tags. The audit condition set may include the candidate matching conditions of the training process, as well as newly added matching conditions that were not trained on. Then, semantic analysis is performed on the image to be annotated based on the image elements it contains, to obtain its semantic information; a target condition matching the semantic information is determined from the tag matching conditions; context encoding processing is performed on the text characterizing the target condition according to the semantic information, to obtain the matching reason of the image to be annotated and the target condition; and the auditing result of the image to be annotated is determined based on the matching reason and the target label corresponding to the target condition among the candidate labels, with the target label serving as the annotation information of the image to be annotated, thereby realizing the annotation.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides an image auditing device for realizing the image auditing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the image review device or devices provided below may refer to the limitation of the image review method hereinabove, and will not be repeated herein.
In one embodiment, as shown in FIG. 14, there is provided an image review device 1400 comprising: an image acquisition module 1402 to be audited, a semantic analysis module 1404, a target condition determination module 1406, an encoding module 1408, and an audit result determination module 1410, wherein:
an image to be audited acquisition module 1402, configured to acquire an audit condition set and an image to be audited; the auditing condition set comprises a plurality of candidate labels and respective label matching conditions of each candidate label; the label matching condition refers to a condition which is met by the image conforming to the candidate label;
the semantic analysis module 1404 is configured to perform semantic analysis on the image to be inspected based on the image elements included in the image to be inspected, to obtain semantic information of the image to be inspected;
a target condition determining module 1406 for determining a target condition matching the semantic information from the tag matching conditions;
the encoding module 1408 is configured to perform context encoding processing on the text representing the target condition according to the semantic information, so as to obtain a matching reason between the image to be checked and the target condition;
the auditing result determining module 1410 is configured to determine an auditing result of the image to be audited based on the target tag corresponding to the target condition in each candidate tag and the matching reason.
In one embodiment, the image to be reviewed carries text information. In the case of this embodiment, the semantic analysis module 1404 includes: the image semantic analysis unit is used for carrying out image semantic analysis on the image to be checked based on the image elements contained in the image to be checked to obtain the image semantic of the image to be checked; the text semantic analysis unit is used for carrying out text semantic analysis on the image to be checked based on the text information carried by the image to be checked to obtain the text semantic of the image to be checked; the semantic fusion unit is used for fusing the image semantic and the text semantic and determining semantic information of the image to be checked.
In one embodiment, the image semantic analysis unit is specifically configured to: splitting an image to be checked into a plurality of image blocks; based on the image elements contained in each image block, respectively carrying out image semantic analysis on each image block to obtain the respective image block semantics of each image block; image semantics including the semantics of each image block are determined.
In one embodiment, the semantic fusion unit is specifically configured to: performing correlation analysis on the text semantics and each image block semantics respectively, and determining associated image blocks of the text semantics from the image blocks; the semantic information of the image to be checked is determined based on the number of the associated image blocks and the image block semantics of the associated image blocks.
In one embodiment, the target condition determination module 1406 is specifically configured to: acquiring respective characteristic words of the tag matching conditions; respectively carrying out similarity comparison on the semantic information and each characteristic word, and determining a target word with the highest similarity with the semantic information; and determining the tag matching condition of the target vocabulary as a target condition matched with the semantic information.
In one embodiment, the encoding module 1408 includes: the text to be processed determining unit is used for determining a text to be processed, which is similar to the semantic represented by the semantic information, from the text representing the target condition; and the encoding unit is used for carrying out context encoding processing on the text to be processed to obtain the matching reason of the image to be checked and the target condition.
In one embodiment, the encoding unit includes: the word segmentation component is used for carrying out word segmentation conversion processing on the text to be processed to obtain a plurality of word characteristics; the coding component is used for sequentially carrying out mask self-coding processing on the text to be processed according to the positions of the words characterized by each word characteristic in the text to be processed, and determining the context characteristics of the text to be processed; and the splicing component is used for splicing the text to be processed and the context information obtained by decoding the context characteristics, and determining the spliced text as the matching reason of the image to be checked and the target condition. Wherein the contextual characteristics include at least one of a contextual characteristic of a first word and a contextual characteristic of a last word in the text to be processed.
In one embodiment, the context features include the following feature of the last word in the text to be processed. In the case of this embodiment, the encoding component is specifically configured to: for each word feature, determine the following positions of the word characterized by that word feature in the text to be processed as the mask positions corresponding to that word feature; sequentially perform mask self-encoding processing on the text to be processed according to the mask positions corresponding to each word feature, to obtain the respective following features of each word feature; and determine, from these following features, the following feature of the last word in the text to be processed. Here, the expected following information characterized by the following feature of each word feature other than the last, and the actual following text of the word characterized by that word feature in the text to be processed, satisfy the text similarity condition.
The modules in the image auditing device may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
Based on the same inventive concept, the embodiment of the application also provides an image auditing model training device for realizing the image auditing model training method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the image review model training device or devices provided below may be referred to the limitation of the image review model training method hereinabove, and will not be described herein.
In one embodiment, as shown in FIG. 15, there is provided an image review model training apparatus 1500 comprising: an image sample acquisition module 1502, a condition matching module 1504, a match reason determination module 1506, a training sample construction module 1508, and an instruction fine training module 1510, wherein:
an image sample acquiring module 1502, configured to acquire a service condition set and a service image sample carrying a service tag; the service condition set comprises candidate matching conditions of each of a plurality of candidate service tags;
a condition matching module 1504, configured to determine a selected condition matching the service tag from the candidate matching conditions;
a matching reason determining module 1506, configured to determine a matching reason of the service image sample and the selected condition;
A training sample construction module 1508 for determining an input portion based on the business image samples and the selected conditions, and determining an output portion based on the matching reason and the business labels, constructing an instruction fine tuning training sample;
an instruction fine-tuning training module 1510, configured to perform instruction fine-tuning training on the pre-trained image-text dialogue model using a training dataset comprising an instruction fine-tuning dataset and an image-text dialogue general dataset, to obtain an image auditing model; the instruction fine-tuning dataset comprises the instruction fine-tuning training samples corresponding to the plurality of business image samples.
In one embodiment, the instruction fine-tuning training module 1510 includes: a training dataset determining unit, configured to determine a training dataset comprising an instruction fine-tuning dataset and an image-text general dataset; a fine-tuning parameter determining unit, configured to determine fine-tuning parameters matching the training dataset according to the number of samples of the training dataset; and an instruction fine-tuning training unit, configured to perform instruction fine-tuning training on the pre-trained image-text dialogue model based on the training dataset and the fine-tuning parameters, to obtain an image auditing model.
In one embodiment, the image-text dialogue model comprises a pre-trained image encoder, a query transformer and a pre-trained language model, connected in sequence. In the case of this embodiment, the instruction fine-tuning training unit is specifically configured to: freeze the pre-trained image encoder and the pre-trained language model, and perform instruction fine-tuning training on the query transformer with the fine-tuning parameters to obtain an updated transformer; and determine an image auditing model comprising the pre-trained image encoder, the updated transformer and the pre-trained language model.
The modules in the image auditing model training apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 16. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data involved in the image review or image review model training process. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement an image review method or an image review model training method.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 17. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and an external device. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by the processor to implement an image auditing method or an image auditing model training method. The display unit of the computer device is used for forming a visual picture and may be a display screen, a projection device or a virtual reality imaging device; the display screen may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, a key, a trackball or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, mouse or the like.
It will be appreciated by those skilled in the art that the structures shown in fig. 16 or 17 are merely block diagrams of portions of structures associated with the present inventive arrangements and are not limiting of the computer device to which the present inventive arrangements may be implemented, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory having a computer program stored therein and a processor that implements the steps of the method described above when the computer program is executed.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, implements the steps of the above method.
In an embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, implements the steps of the above method.
It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) involved in the present application are information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions. Moreover, the user can choose not to authorize the use of their information and related data, and can refuse or conveniently reject push information and the like.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-volatile computer-readable storage medium which, when executed, may include the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory may include random access memory (Random Access Memory, RAM), external cache memory, and the like. By way of illustration, and not limitation, RAM can take many forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided in the present application may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, data processing logic units based on quantum computing, and the like, without being limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (16)

1. An image review method, the method comprising:
acquiring an audit condition set and an image to be audited; the auditing condition set comprises a plurality of candidate labels and respective label matching conditions of each candidate label; the label matching condition refers to a condition that an image conforming to the candidate label satisfies;
based on the image elements contained in the image to be checked, carrying out semantic analysis on the image to be checked to obtain semantic information of the image to be checked;
Determining target conditions matched with the semantic information from the tag matching conditions;
performing context coding processing on the text representing the target condition according to the semantic information to obtain a matching reason of the image to be checked and the target condition;
and determining an auditing result of the image to be audited based on the target label corresponding to the target condition in each candidate label and the matching reason.
2. The method of claim 1, wherein the image to be reviewed carries text information; the semantic analysis is performed on the image to be checked based on the image elements contained in the image to be checked to obtain semantic information of the image to be checked, and the method comprises the following steps:
based on the image elements contained in the image to be checked, performing image semantic analysis on the image to be checked to obtain the image semantic of the image to be checked;
based on text information carried by the image to be checked, carrying out text semantic analysis on the image to be checked to obtain text semantics of the image to be checked;
and fusing the image semantics and the text semantics, and determining semantic information of the image to be checked.
3. The method according to claim 2, wherein the performing image semantic analysis on the to-be-inspected image based on the image elements included in the to-be-inspected image to obtain the image semantic of the to-be-inspected image includes:
dividing the image to be checked into a plurality of image blocks;
based on the image elements contained in each image block, respectively carrying out image semantic analysis on each image block to obtain the respective image block semantics of each image block;
image semantics including each of the image block semantics are determined.
4. A method according to claim 3, wherein said fusing the image semantics and the text semantics to determine semantic information of the image to be reviewed comprises:
performing correlation analysis on the text semantics and each image block semantics respectively, and determining associated image blocks of the text semantics from the image blocks;
and determining semantic information of the image to be checked based on the number of the associated image blocks and the image block semantics of the associated image blocks.
5. The method of claim 1, wherein said determining a target condition matching the semantic information from each of the tag matching conditions comprises:
Acquiring respective characteristic words of the tag matching conditions;
respectively carrying out similarity comparison on the semantic information and each characteristic word, and determining a target word with the highest similarity with the semantic information;
and determining the tag matching condition of the target vocabulary as a target condition matched with the semantic information.
6. The method according to any one of claims 1 to 5, wherein the performing, according to the semantic information, a context encoding process on the text that characterizes the target condition, to obtain a matching reason of the image to be checked and the target condition, includes:
determining a text to be processed similar to the semantic represented by the semantic information from the text representing the target condition;
and carrying out context coding processing on the text to be processed to obtain the matching reason of the image to be checked and the target condition.
7. The method of claim 6, wherein the performing the context encoding process on the text to be processed to obtain a matching reason for the image to be checked and the target condition includes:
performing word segmentation and conversion processing on the text to be processed to obtain a plurality of word features;
sequentially performing mask self-encoding processing on the text to be processed according to the positions, in the text to be processed, of the words respectively characterized by the word features, and determining the context features of the text to be processed, the context features comprising at least one of a preceding-text feature of the first word and a following-text feature of the last word in the text to be processed;
and splicing the text to be processed with the context information obtained by decoding the context features, and determining the spliced text as the matching reason of the image to be checked and the target condition.
8. The method of claim 7, wherein the context features comprise the following-text feature of the last word in the text to be processed;
and the sequentially performing mask self-encoding processing on the text to be processed according to the positions, in the text to be processed, of the words respectively characterized by the word features to determine the context features of the text to be processed comprises:
for each word feature, determining the position following the word characterized by that word feature in the text to be processed as the mask position corresponding to the word feature;
sequentially performing mask self-encoding processing on the text to be processed according to the mask positions corresponding to the word features, to obtain a following-text feature for each word feature, wherein, for each word feature other than the last, the predicted following-text information characterized by its following-text feature and the actual following text, in the text to be processed, of the word it characterizes satisfy a text similarity condition;
and determining, from the obtained following-text features, the following-text feature of the last word in the text to be processed.
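Loosely illustrating claims 7 and 8 with an off-the-shelf masked language model standing in for the mask self-encoding step: mask the position after each word in turn, keep the last word's predicted following token, and splice it onto the text. The patent names no model; the BERT checkpoint, the whitespace word split, and the splice format are all assumptions of this sketch.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def following_features(text_to_process: str) -> list:
    # One mask self-encoding pass per word feature: the mask position
    # is the slot right after the word that feature characterizes.
    words = text_to_process.split()
    feats = []
    for i in range(1, len(words) + 1):
        masked = " ".join(words[:i]) + " " + fill.tokenizer.mask_token
        feats.append(fill(masked, top_k=1)[0]["token_str"])
    return feats

def matching_reason(text_to_process: str) -> str:
    # Keep the following-text feature of the last word and splice its
    # decoded context information onto the text to be processed.
    return text_to_process + " " + following_features(text_to_process)[-1]
```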
9. A method for training an image auditing model, the method comprising:
acquiring a business condition set and a business image sample carrying a business tag; the business condition set comprises a candidate matching condition for each of a plurality of candidate business tags;
determining, from the candidate matching conditions, a selected condition matched with the business tag;
determining a matching reason of the business image sample and the selected condition;
determining an input part based on the business image sample and the selected condition, determining an output part based on the matching reason and the business tag, and constructing an instruction fine-tuning training sample;
and performing instruction fine-tuning training on a pre-trained image-text dialogue model using a training data set comprising an instruction fine-tuning data set and an image-text dialogue general data set, to obtain an image auditing model; the instruction fine-tuning data set comprises the instruction fine-tuning training samples corresponding to a plurality of business image samples.
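A schematic of claim 9's sample construction, assuming a chat-style instruction format: the input part pairs the business image with the selected condition, and the output part pairs the matching reason with the business tag. Field names and prompt wording are invented for illustration.

```python
def build_instruction_sample(image_path: str, selected_condition: str,
                             matching_reason: str, business_tag: str) -> dict:
    # Input part: business image sample plus the selected condition.
    # Output part: matching reason plus the business tag.
    return {
        "image": image_path,
        "instruction": ("Audit this image against the following "
                        f"condition: {selected_condition}"),
        "response": f"Reason: {matching_reason} Label: {business_tag}",
    }
```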
10. The method of claim 9, wherein the performing instruction fine-tuning training on the pre-trained image-text dialogue model using the training data set comprising the instruction fine-tuning data set and the image-text dialogue general data set to obtain the image auditing model comprises:
determining a training data set comprising the instruction fine-tuning data set and the image-text dialogue general data set;
determining fine-tuning parameters matched with the training data set according to the number of samples in the training data set;
and performing instruction fine-tuning training on the pre-trained image-text dialogue model based on the training data set and the fine-tuning parameters, to obtain the image auditing model.
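Claim 10 says only that the fine-tuning parameters are matched to the number of samples; the thresholds and values below are pure assumptions sketching what such a schedule could look like.

```python
def fine_tuning_params(num_samples: int) -> dict:
    # Smaller corpora get more epochs and a gentler learning rate;
    # every number here is illustrative, not taken from the patent.
    if num_samples < 1_000:
        return {"epochs": 10, "learning_rate": 1e-5, "batch_size": 8}
    if num_samples < 100_000:
        return {"epochs": 3, "learning_rate": 2e-5, "batch_size": 32}
    return {"epochs": 1, "learning_rate": 3e-5, "batch_size": 64}
```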
11. The method of claim 10, wherein the image-text dialogue model comprises a pre-trained image encoder, a query transformer, and a pre-trained language model connected in sequence;
and the performing instruction fine-tuning training on the pre-trained image-text dialogue model based on the training data set and the fine-tuning parameters to obtain the image auditing model comprises:
freezing the pre-trained image encoder and the pre-trained language model, and performing instruction fine-tuning training on the query transformer with the fine-tuning parameters to obtain an updated query transformer;
and determining an image auditing model comprising the pre-trained image encoder, the updated query transformer, and the pre-trained language model.
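A PyTorch-flavoured sketch of the training setup in claim 11, which resembles BLIP-2-style tuning: the image encoder and language model stay frozen while only the query transformer is updated. Module names and the optimizer choice are placeholders; the patent fixes no concrete architecture.

```python
import torch

def prepare_instruction_tuning(image_encoder: torch.nn.Module,
                               query_transformer: torch.nn.Module,
                               language_model: torch.nn.Module,
                               lr: float = 1e-5) -> torch.optim.Optimizer:
    for p in image_encoder.parameters():
        p.requires_grad = False   # freeze the pre-trained image encoder
    for p in language_model.parameters():
        p.requires_grad = False   # freeze the pre-trained language model
    # Only the query transformer receives gradient updates during
    # instruction fine-tuning training.
    return torch.optim.AdamW(query_transformer.parameters(), lr=lr)
```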
12. An image auditing device, the device comprising:
the acquisition module is used for acquiring an image to be checked and an auditing condition set; the auditing condition set comprises a plurality of candidate tags and a respective tag matching condition for each candidate tag; the tag matching condition refers to a condition satisfied by an image that conforms to the candidate tag;
the semantic analysis module is used for carrying out semantic analysis on the image to be checked based on the image elements contained in the image to be checked to obtain semantic information of the image to be checked;
the target condition determining module is used for determining a target condition matched with the semantic information from the tag matching conditions;
the encoding module is used for carrying out context encoding processing on the text representing the target condition according to the semantic information to obtain the matching reason of the image to be checked and the target condition;
and the auditing result determining module is used for determining the auditing result of the image to be checked based on the target tag corresponding to the target condition among the candidate tags and the matching reason.
13. An image auditing model training apparatus, the apparatus comprising:
the image sample acquisition module is used for acquiring a business condition set and a business image sample carrying a business tag; the business condition set comprises a candidate matching condition for each of a plurality of candidate business tags;
the condition matching module is used for determining, from the candidate matching conditions, a selected condition matched with the business tag;
the matching reason determining module is used for determining the matching reason of the business image sample and the selected condition;
the training sample construction module is used for determining an input part based on the business image sample and the selected condition, determining an output part based on the matching reason and the business tag, and constructing an instruction fine-tuning training sample;
and the instruction fine-tuning training module is used for performing instruction fine-tuning training on the pre-trained image-text dialogue model using a training data set comprising an instruction fine-tuning data set and an image-text dialogue general data set, to obtain an image auditing model; the instruction fine-tuning data set comprises instruction fine-tuning training samples corresponding to a plurality of business image samples.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 11.
15. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 11.
16. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 11.
CN202311169765.3A 2023-09-08 2023-09-08 Image auditing method, image auditing model training method, device and equipment Pending CN117197569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311169765.3A CN117197569A (en) 2023-09-08 2023-09-08 Image auditing method, image auditing model training method, device and equipment

Publications (1)

Publication Number Publication Date
CN117197569A true CN117197569A (en) 2023-12-08

Family

ID=88995644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311169765.3A Pending CN117197569A (en) 2023-09-08 2023-09-08 Image auditing method, image auditing model training method, device and equipment

Country Status (1)

Country Link
CN (1) CN117197569A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117524427A (en) * 2024-01-05 2024-02-06 莱凯医疗器械(北京)有限公司 Intelligent medical image analysis method
CN117524427B (en) * 2024-01-05 2024-04-02 莱凯医疗器械(北京)有限公司 Intelligent medical image analysis method

Similar Documents

Publication Publication Date Title
Dahouda et al. A deep-learned embedding technique for categorical features encoding
CN111881291A (en) Text emotion classification method and system
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN113627447B (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN111858940B (en) Multi-head attention-based legal case similarity calculation method and system
CN112100401B (en) Knowledge graph construction method, device, equipment and storage medium for science and technology services
CN114298121A (en) Multi-mode-based text generation method, model training method and device
CN114091466A (en) Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN115017911A (en) Cross-modal processing for vision and language
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN117197569A (en) Image auditing method, image auditing model training method, device and equipment
CN116432019A (en) Data processing method and related equipment
CN116304748A (en) Text similarity calculation method, system, equipment and medium
CN114648031A (en) Text aspect level emotion recognition method based on bidirectional LSTM and multi-head attention mechanism
Liu et al. Convolutional neural networks-based locating relevant buggy code files for bug reports affected by data imbalance
Pande et al. Development and deployment of a generative model-based framework for text to photorealistic image generation
CN116541492A (en) Data processing method and related equipment
CN116304745A (en) Text topic matching method and system based on deep semantic information
CN114661951A (en) Video processing method and device, computer equipment and storage medium
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Bensalah et al. Combining word and character embeddings for Arabic chatbots
Cui et al. Spatial–temporal transformer for end-to-end sign language recognition
CN117236338B (en) Named entity recognition model of dense entity text and training method thereof
Ermatita et al. Sentiment Analysis of COVID-19 using Multimodal Fusion Neural Networks.
CN116977701A (en) Video classification model training method, video classification method and device

Legal Events

Date Code Title Description
PB01 Publication