CN116935110A - Image processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116935110A
CN116935110A
Authority
CN
China
Prior art keywords
target
attention
content
pixel
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310870414.9A
Other languages
Chinese (zh)
Inventor
潘子琦
兰钧
孟昌华
王维强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310870414.9A
Publication of CN116935110A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The specification discloses an image processing method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: generating an initial attention map corresponding to a target image based on an attention neural network for a target content entity, wherein a pixel in the initial attention map reflects, through the magnitude of its value under a target pixel parameter, the probability of that pixel being feature-extracted by the attention neural network; setting the pixels in the initial attention map whose target pixel parameter reaches a preset value standard to a target gray level, obtaining a processed attention map corresponding to the initial attention map; determining a content detection area based on the pixels of the target gray level in the processed attention map; and performing content detection for the target content entity on the corresponding content detection area in the target image.

Description

Image processing method and device, electronic equipment and storage medium
Technical Field
This document belongs to the technical field of artificial intelligence, and particularly relates to an image processing method and apparatus, an electronic device, and a storage medium.
Background
With the development of artificial intelligence technology, machine-based image processing applications are becoming increasingly popular. Content detection is a common image processing technique: as the name implies, it detects the content in an image by means of artificial intelligence models.
At this stage, if a model is to accurately detect the content in an image, it must be trained to recognize content positions using sample images in which those positions are finely labeled. Such training requires a large number of sample images, which results in a heavy labeling workload and gives content detection applications a high entry threshold.
Therefore, how to free the content detection model from its reliance on sample image training for identifying content positions is the technical problem to be solved by this application.
Disclosure of Invention
The embodiments of this specification provide an image processing method and apparatus, an electronic device, and a storage medium, which enable a content detection model to determine the position of content in an image without relying on sample image training.
To this end, the embodiments of this specification are implemented as follows:
In a first aspect, an image processing method is provided, comprising:
generating an initial attention map corresponding to a target image based on an attention neural network for a target content entity, wherein a pixel in the initial attention map reflects, through the magnitude of its value under a target pixel parameter, the probability of the pixel being feature-extracted by the attention neural network;
setting the pixels in the initial attention map whose target pixel parameter reaches a preset value standard to a target gray level, obtaining a processed attention map corresponding to the initial attention map;
determining a content detection area based on the pixels of the target gray level in the processed attention map;
and performing content detection for the target content entity on the corresponding content detection area in the target image.
In a second aspect, an image processing apparatus is provided, comprising:
an attention map generation module, configured to generate an initial attention map corresponding to a target image based on an attention neural network for a target content entity, wherein a pixel in the initial attention map reflects, through the magnitude of its value under a target pixel parameter, the probability of the pixel being feature-extracted by the attention neural network;
an attention map processing module, configured to set the pixels in the initial attention map whose target pixel parameter reaches a preset value standard to a target gray level, obtaining a processed attention map corresponding to the initial attention map;
a detection area determination module, configured to determine a content detection area based on the pixels of the target gray level in the processed attention map;
and a detection execution module, configured to perform content detection for the target content entity on the corresponding content detection area in the target image.
In a third aspect, an electronic device is provided, comprising: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to:
generate an initial attention map corresponding to a target image based on an attention neural network for a target content entity, wherein a pixel in the initial attention map reflects, through the magnitude of its value under a target pixel parameter, the probability of the pixel being feature-extracted by the attention neural network;
set the pixels in the initial attention map whose target pixel parameter reaches a preset value standard to a target gray level, obtaining a processed attention map corresponding to the initial attention map;
determine a content detection area based on the pixels of the target gray level in the processed attention map;
and perform content detection for the target content entity on the corresponding content detection area in the target image.
In a fourth aspect, a computer-readable storage medium is provided for storing computer-executable instructions that, when executed by a processor, perform the following operations:
generating an initial attention map corresponding to a target image based on an attention neural network for a target content entity, wherein a pixel in the initial attention map reflects, through the magnitude of its value under a target pixel parameter, the probability of the pixel being feature-extracted by the attention neural network;
setting the pixels in the initial attention map whose target pixel parameter reaches a preset value standard to a target gray level, obtaining a processed attention map corresponding to the initial attention map;
determining a content detection area based on the pixels of the target gray level in the processed attention map;
and performing content detection for the target content entity on the corresponding content detection area in the target image.
When content detection for a target content entity is performed on a target image, the attention neural network of the target content entity is used to generate an initial attention map corresponding to the target image, in which a pixel reflects, through the magnitude of its value under the target pixel parameter, the probability of that pixel being feature-extracted by the attention neural network. Since pixels that the attention neural network feature-extracts with high probability are more likely to belong to the target content entity, it suffices to set the pixels in the initial attention map whose target pixel parameter reaches the preset value standard to the target gray level, converting the map into a processed attention map; an accurate content detection area can then be determined from the pixels of the target gray level in the processed attention map. Finally, content detection is completed simply by having the content detection model attempt to identify the target content entity in the corresponding content detection area of the target image. The whole scheme requires no training of a content detection model on sample images with finely labeled content positions, which greatly improves the practicality of content detection applications.
Drawings
The accompanying drawings, which are included to provide a further understanding of this specification, illustrate exemplary embodiments of this specification and, together with their description, serve to explain the specification without unduly limiting it. In the drawings:
fig. 1 is a flowchart of an image processing method according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a first application of the image processing method of the embodiment of the present specification.
Fig. 3 is a schematic diagram of a second application of the image processing method of the embodiment of the present specification.
Fig. 4 is a schematic diagram of a third application of the image processing method of the embodiment of the present specification.
Fig. 5 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present specification.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
For clarity of the purposes, technical solutions, and advantages of this document, the technical solutions of this specification will be described clearly and completely below with reference to specific embodiments of this specification and the corresponding drawings. Apparently, the described embodiments are only some, rather than all, of the embodiments of this document. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this document without creative effort shall fall within the protection scope of this document.
As mentioned above, for a model to accurately detect content in an image at present, it must be trained to recognize content positions using sample images in which those positions are finely labeled. Such training requires a large number of sample images, which results in a heavy labeling workload and gives content detection applications a high entry threshold.
Therefore, this specification aims to propose a brand-new content detection scheme in which the model can identify the positions of content in an image without relying on sample image training.
In one aspect, the embodiments of this specification provide an image processing method, which may be performed by the corresponding apparatus described below. Fig. 1 is a flowchart of the image processing method, which comprises the following steps:
s102, generating an initial attention diagram corresponding to the target image based on the attention neural network aiming at the target content entity, wherein the pixel in the initial attention diagram reflects the probability of the pixel to be subjected to feature extraction by the attention neural network through the value size under the target pixel parameter.
In this embodiment, the attention neural network of the target content entity is pre-constructed. The step can input the data of the target image into the attention neural network, namely the attention value of the pixels in the target image can be determined by the attention neural network; then, the magnitude of the attention value is reflected by the target pixel parameter, and an initial attention map corresponding to the target image is generated.
The target pixel parameter is here exemplified as gray scale. The present embodiment may characterize the attention value in gray scale. That is, the larger the gradation, the larger the attention value; the smaller the gray scale, the smaller the attention value. Because the attention value of the pixel reflects the probability that the pixel is extracted by the attention neural network, for the attention neural network aiming at the target content entity, the probability that the pixel in the region of the target content entity is obviously extracted by the feature is larger, and the larger the corresponding gray value is, the closer to white is; and the probability that the pixels not in the target content entity area are extracted by the features is smaller, the corresponding gray values are smaller and are more black.
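To make this mapping concrete, the following is a minimal sketch of converting raw attention values into such a grayscale map (assuming the attention values are already available as a 2-D NumPy array; the function name is illustrative):

```python
import numpy as np

def attention_to_grayscale(attn: np.ndarray) -> np.ndarray:
    """Normalize a 2-D attention map to 8-bit gray: 0 = black (low attention),
    255 = white (high attention)."""
    attn = attn.astype(np.float32)
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)  # scale to [0, 1]
    return np.round(attn * 255).astype(np.uint8)
```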
For example, as shown in fig. 2, assume the target image presents a moon. The target image is fed into the attention neural network for the moon, and in the generated initial attention map, the pixels closer to white are those more easily feature-extracted by the attention neural network. It can be seen that the initial attention map in fig. 2 presents the general outline of the moon.
In practical applications, the attention neural network for the target content entity in this embodiment may come from an existing model, for example a text-to-image model from AIGC (AI-Generated Content) technology.
Take the open-source Stable Diffusion model as an example. The current Stable Diffusion model mainly offers the following two functions.
One function: generating a corresponding authored image according to the descriptive prompt of text guide information.
The other function: modifying an existing image based on the descriptive prompt of text guide information.
For the latter function, the inputs of the Stable Diffusion model are both the text guide information and an image. The general principle of the Stable Diffusion model is to encode the text guide information and the image separately, then fuse and decode the two encodings according to the cross-modal cross-attention mechanism between text and image, so as to create the content corresponding to the text guide information on the basis of the input image.
For this embodiment, it is only necessary to import the text guide information describing the generation of an image of the target content entity, together with the target image, into the text-to-image model; the attention neural network of the Stable Diffusion model can then generate the initial attention map of the target image corresponding to the target content entity according to the cross-modal cross-attention between the text identifier corresponding to the target content entity in the text guide information and the target image.
For example, as shown in fig. 2, this embodiment may input the natural language "a photo of Moon celestial body" and the target image into the Stable Diffusion model, and extract the initial attention map generated by the attention neural network of the Stable Diffusion model according to the cross-modal cross-attention between the moon text identifier "Moon" and the target image.
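The specification does not prescribe a concrete mechanism for harvesting these maps. As one possible sketch, the Hugging Face diffusers library allows a custom attention processor to record cross-attention probabilities while the pipeline runs; the hook below is an assumption for illustration only (processor signatures differ across diffusers versions, and the model id, prompt, and the target_image variable are placeholders):

```python
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.models.attention_processor import AttnProcessor

class StoreCrossAttn(AttnProcessor):
    """Records text-to-image cross-attention probabilities as they are computed."""
    def __init__(self, store):
        super().__init__()
        self.store = store

    def __call__(self, attn, hidden_states, encoder_hidden_states=None, **kwargs):
        if encoder_hidden_states is not None:  # cross-attention layers only
            q = attn.head_to_batch_dim(attn.to_q(hidden_states))
            k = attn.head_to_batch_dim(attn.to_k(encoder_hidden_states))
            # shape: (batch * heads, image_positions, text_tokens)
            self.store.append(attn.get_attention_scores(q, k).detach().cpu())
        return super().__call__(attn, hidden_states, encoder_hidden_states, **kwargs)

maps = []
pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.unet.set_attn_processor(StoreCrossAttn(maps))
pipe(prompt="a photo of Moon celestial body", image=target_image, strength=0.3)
# Average `maps` over heads and layers, take the token column aligned with
# "Moon", and reshape the image_positions axis back to the spatial grid to
# obtain the initial attention map of the moon.
```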
It should be noted that fig. 2 uses gray scale to characterize attention only so that the human eye can see the differences in the probability of pixels in the initial attention map being feature-extracted. In practice the machine needs no visual distinction, so the target pixel parameter may also be brightness, color, and so on, which is not specifically limited here.
S104, setting the pixels in the initial attention map whose target pixel parameter reaches the preset value standard to the target gray level, and obtaining the processed attention map corresponding to the initial attention map.
For ease of understanding, a binary image is taken as the example of the processed attention map. For the initial attention map shown in fig. 2, this step may set the pixels whose gray level (probability of being feature-extracted) is below the preset value standard to the lowest gray value 0, that is, pixels with a low probability of belonging to the "moon" are uniformly set to black; meanwhile, the pixels whose gray level reaches the preset value standard are set to the highest gray value 255, that is, pixels with a high probability of belonging to the "moon" are uniformly set to white. The processed attention map shown in fig. 2 is thus obtained.
As can be seen from fig. 2, the processed attention map presents the outline of the content entity "moon" quite clearly.
Similarly, fig. 2 uses a binary image as the example of the processed attention map only so that the human eye can see the distinction between pixels belonging to the "moon" and other pixels. In practical applications the machine needs no such naked-eye distinction, so it is only necessary to set the pixels whose target pixel parameter reaches the preset value standard to the target gray level.
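A minimal sketch of this binarization step follows, using OpenCV; the threshold of 127 and the file name are illustrative assumptions, since the specification only speaks of a preset value standard:

```python
import cv2

# Load the initial attention map as grayscale and binarize it: pixels whose
# gray level exceeds the preset standard become white (255), the rest black (0).
initial_map = cv2.imread("initial_attention_map.png", cv2.IMREAD_GRAYSCALE)
_, processed_map = cv2.threshold(initial_map, 127, 255, cv2.THRESH_BINARY)
```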
S106, determining a content detection area based on the pixels of the target gray level in the processed attention map.
Again taking fig. 2 as an example, the processed attention map already presents the outline of the "moon" clearly, so this step only needs to determine the pixel region of the target gray level in the processed attention map as the content detection area.
Here, assuming the content detection area is a rectangular frame: the minimum abscissa of the frame is the minimum abscissa among the pixels of the target gray level in the processed attention map, and the maximum abscissa of the frame is the maximum abscissa among those pixels; similarly, the minimum ordinate of the frame is the minimum ordinate among the pixels of the target gray level in the processed attention map, and the maximum ordinate of the frame is the maximum ordinate among those pixels.
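A minimal sketch of that frame computation (NumPy; assumes the processed attention map is a binary array like the one produced in the previous step):

```python
import numpy as np

def bounding_frame(processed_map: np.ndarray, target_gray: int = 255):
    """Smallest axis-aligned frame enclosing all target-gray pixels."""
    ys, xs = np.nonzero(processed_map == target_gray)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```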
In addition, in this embodiment, if there are no fewer than two pixel regions of the target gray level in the processed attention map, each such pixel region may be regarded as one content detection area.
Furthermore, given that the attention neural network has a certain error, small-area pixel regions of the target gray level may appear in the processed attention map; these can be treated as noise and filtered out. That is, this embodiment removes the pixel regions of the target gray level that do not reach a preset size standard in the processed attention map, and determines the remaining pixel regions of the target gray level as content detection areas, as sketched below.
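One way to realize this size filter is connected-component analysis. The sketch below uses OpenCV, and the minimum area of 100 pixels is an illustrative stand-in for the preset size standard:

```python
import cv2
import numpy as np

def filter_small_regions(processed_map: np.ndarray, min_area: int = 100) -> np.ndarray:
    """Remove target-gray regions whose area is below the preset size standard."""
    mask = (processed_map > 0).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    kept = np.zeros_like(processed_map)
    for i in range(1, n):  # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            kept[labels == i] = 255  # each surviving region is one detection area
    return kept
```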
S108, performing content detection for the target content entity on the corresponding content detection area in the target image.
Again taking fig. 2 as an example, after the processed attention map of the content entity "moon" is generated, recognition of the "moon" may be performed on the content detection area outlined by the processed attention map.
From the above it can be seen that when the method of the embodiments of this specification performs content detection for a target content entity on a target image, the attention neural network of the target content entity is used to generate an initial attention map corresponding to the target image, in which a pixel reflects, through the magnitude of its value under the target pixel parameter, the probability of that pixel being feature-extracted by the attention neural network. Since pixels that the attention neural network feature-extracts with high probability are more likely to belong to the target content entity, it suffices to set the pixels in the initial attention map whose target pixel parameter reaches the preset value standard to the target gray level, converting the map into a processed attention map; an accurate content detection area can then be determined from the pixels of the target gray level in the processed attention map. Finally, content detection is completed simply by having the content detection model attempt to identify the target content entity in the corresponding content detection area of the target image. The whole scheme requires no training of a content detection model on sample images with finely labeled content positions, which greatly improves the practicality of content detection applications.
Further, the method of this embodiment can be applied to the scenario of auditing images for offending content.
Taking the case where the target content entity belongs to an offending content classification as an example, the application scenario of checking whether the target image shows offending content is introduced below.
This application scenario uses content detection technology to identify whether a target content entity belonging to an offending content classification appears in the target image.
Here, assume the picture presented by the target image is target entity content (contraband) placed on a desk. The flow is as follows:
stage one, generating initial attention diagram of target image based on Stable Diffusion model
The stage firstly generates a natural language 'a photo of Illegalitems on table' of a Stable Diffusion model according to a picture presented by a target image.
Then, the natural language "a photo of Illegalitems on table" and the target image are imported into the Stable Diffuse model, which is guided to create another image about "target content entity placed on desk".
For this application scenario, the image authored by the Stable Diffusion model is not itself of interest; only the attention maps of the attention neural network produced during the computation need to be acquired.
As mentioned above, the Stable Diffusion model performs feature extraction based on the cross-modal cross-attention between text identifiers and the image. Thus, as shown in fig. 3, the Stable Diffusion model generates an initial attention map corresponding to each text identifier in "a photo of Illegal items on table", as well as initial attention maps for some special identifier tokens the Stable Diffusion model generates to delimit the natural language (the initial attention maps of these identifier tokens are not considered here).
Based on the capability of the current Stable Diffusion model, the initial attention maps of different text identifiers present patterns related to their textual meanings. Since the initial attention maps generated in fig. 3 represent probability visualizations of pixels being feature-extracted, the initial attention map of the text identifier "table" presents the outline of the table, while this stage only needs to extract the initial attention map of the offending-item text identifier "Illegal items".
Stage two: creating the processed attention map of the target image to determine the content detection area.
Referring to fig. 4, this stage performs binary-image conversion on the initial attention map of the text identifier "Illegal items": the pixels whose attention reaches a certain threshold in the initial attention map of "Illegal items" are set to white, and the remaining pixels are set to black, thus obtaining a processed attention map that clearly presents the outline of the "Illegal items".
After that, the content detection area is constructed from the white pixels in the processed attention map of "Illegal items".
For example, a frame may be constructed based on the maximum and minimum coordinate values of the white pixels in the processed attention map, and the area enclosed by the frame is the content detection area. Alternatively, the white pixel region in the processed attention map may directly serve as the content detection area, so that the content detection area follows the shape of the "Illegal items".
Stage three: identifying the target content entity in the corresponding content detection area in the target image.
After the content detection area is determined, this stage may map it onto the target image. The image of the content detection area is then extracted from the target image and input into a classifier for identifying the target content entity, yielding the content detection result of whether contraband appears in the target image.
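A sketch of this mapping-and-classification step is given below. It assumes the attention map has a lower resolution than the target image (usually the case for cross-attention grids), and classifier stands in for any model able to recognize the target content entity; both the scaling helper and the classifier interface are illustrative assumptions:

```python
import numpy as np

def scale_frame(frame, map_hw, image_hw):
    """Rescale a frame from attention-map coordinates to target-image coordinates."""
    (mh, mw), (ih, iw) = map_hw, image_hw
    x0, y0, x1, y1 = frame
    sx, sy = iw / mw, ih / mh
    return int(x0 * sx), int(y0 * sy), int(x1 * sx), int(y1 * sy)

def detect_entity(target_image: np.ndarray, frame, classifier) -> bool:
    """Crop the content detection area from the target image and classify it."""
    x0, y0, x1, y1 = frame
    crop = target_image[y0:y1 + 1, x0:x1 + 1]
    return bool(classifier(crop))  # True if the target content entity is recognized
```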
The above introduces the examination of whether the target image shows offending content. For this application scenario, after the content detection result corresponding to the target image indicates that the target content entity is contained, a matching violation processing operation is performed on the target image.
For example, if the target image has been published on a network platform, the violation processing operation includes at least one of the following:
deleting the metadata of the target image from the network platform;
banning the network platform account that published the target image;
downgrading the permissions of the network platform account that published the target image.
For another example, if the target image is carried in a target service request, the violation processing operation includes:
refusing to accept the target service request.
On the other hand, the embodiments of this specification provide an image processing apparatus corresponding to the method shown in fig. 1. Fig. 5 is a schematic structural diagram of the image processing apparatus, which includes:
the attention profile generation module 510 generates an initial attention profile corresponding to the target image based on the attention neural network for the target content entity, wherein a pixel in the initial attention profile reflects a probability of the pixel being feature extracted by the attention neural network by a value size under a target pixel parameter.
The attention attempt processing module 520 sets a pixel in the initial attention attempt, where the target pixel parameter reaches a preset value standard, as a target gray level, so as to obtain a processed attention attempt corresponding to the initial attention attempt.
The detection area determining module 530 determines a content detection area based on the pixels of the target gray in the post-processing attention map.
The detection execution module 540 performs content detection for the target content entity on the corresponding content detection area in the target image.
When the apparatus of the embodiments of this specification performs content detection for a target content entity on a target image, the attention neural network of the target content entity is used to generate an initial attention map corresponding to the target image, in which a pixel reflects, through the magnitude of its value under the target pixel parameter, the probability of that pixel being feature-extracted by the attention neural network. Since pixels that the attention neural network feature-extracts with high probability are more likely to belong to the target content entity, it suffices to set the pixels in the initial attention map whose target pixel parameter reaches the preset value standard to the target gray level, converting the map into a processed attention map; an accurate content detection area can then be determined from the pixels of the target gray level in the processed attention map. Finally, content detection is completed simply by having the content detection model attempt to identify the target content entity in the corresponding content detection area of the target image. The whole scheme requires no training of a content detection model on sample images with finely labeled content positions, which greatly improves the practicality of content detection applications.
Optionally, the attention neural network comes from a text-to-image model; and the attention map generation module 510 generating the initial attention map of the target image corresponding to the target content entity based on the attention neural network for the target content entity includes: importing the text guide information describing the generation of an image of the target content entity, together with the target image, into the text-to-image model, so that the attention neural network generates the initial attention map of the target image corresponding to the target content entity according to the cross-modal cross-attention between the text identifier corresponding to the target content entity in the text guide information and the target image.
Optionally, the detection area determination module 530 determining the content detection area based on the pixels of the target gray level in the processed attention map includes: determining the pixel region of the target gray level in the processed attention map as the content detection area.
Optionally, if there are no fewer than two pixel regions of the target gray level in the processed attention map, the detection area determination module 530 determining the content detection area from the pixel regions of the target gray level includes: removing the pixel regions of the target gray level that do not reach the preset size standard in the processed attention map, and determining the remaining pixel regions of the target gray level in the processed attention map as content detection areas.
Optionally, the target content entity belongs to an offending content classification; the apparatus of this embodiment further includes:
a violation processing module, configured to perform a matching violation processing operation on the target image if the content detection result corresponding to the target image indicates that the target content entity is contained.
Optionally, the target image is published on a network platform, and the violation processing operation includes at least one of the following:
deleting the metadata of the target image from the network platform;
banning the network platform account that published the target image;
downgrading the permissions of the network platform account that published the target image.
Optionally, the target image is carried in a target service request, and the violation processing operation includes:
refusing to accept the target service request.
It should be understood that the image processing apparatus of the embodiments of this specification may serve as the execution body of the method shown in fig. 1 and can therefore implement the corresponding steps and functions of that method; repeated details are not described here.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. Referring to fig. 6, at the hardware level, the electronic device includes a processor and, optionally, an internal bus, a network interface, and a memory. The memory may include a volatile memory such as a random-access memory (RAM), and may further include a non-volatile memory such as at least one disk memory. Of course, the electronic device may also include the hardware required for other services.
The processor, the network interface, and the memory may be interconnected by the internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be classified into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bi-directional arrow is shown in fig. 6, but this does not mean there is only one bus or one type of bus.
The memory is used to store a program. Specifically, the program may include program code comprising computer operation instructions. The memory may include volatile memory and non-volatile storage, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile storage into the memory and runs it, forming the image processing apparatus at the logical level. The processor executes the program stored in the memory and is specifically configured to perform the following operations:
generating an initial attention map corresponding to the target image based on the attention neural network for the target content entity, wherein a pixel in the initial attention map reflects, through the magnitude of its value under the target pixel parameter, the probability of the pixel being feature-extracted by the attention neural network;
setting the pixels in the initial attention map whose target pixel parameter reaches the preset value standard to the target gray level, obtaining the processed attention map corresponding to the initial attention map;
determining the content detection area based on the pixels of the target gray level in the processed attention map;
and performing content detection for the target content entity on the corresponding content detection area in the target image.
When the electronic device of the embodiments of this specification performs content detection for a target content entity on a target image, the attention neural network of the target content entity is used to generate an initial attention map corresponding to the target image, in which a pixel reflects, through the magnitude of its value under the target pixel parameter, the probability of that pixel being feature-extracted by the attention neural network. Since pixels that the attention neural network feature-extracts with high probability are more likely to belong to the target content entity, it suffices to set the pixels in the initial attention map whose target pixel parameter reaches the preset value standard to the target gray level, converting the map into a processed attention map; an accurate content detection area can then be determined from the pixels of the target gray level in the processed attention map. Finally, content detection is completed simply by having the content detection model attempt to identify the target content entity in the corresponding content detection area of the target image. The whole scheme requires no training of a content detection model on sample images with finely labeled content positions, which greatly improves the practicality of content detection applications.
The method of the embodiment shown in fig. 1 of this specification may be applied in, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in one or more embodiments of this specification may be implemented or executed by such a processor. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with one or more embodiments of this specification may be directly embodied as being executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
The electronic device may also perform the method described in fig. 1, which is not described in detail herein.
Of course, in addition to the software implementation, the electronic device of this specification does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution body of the above processing flow is not limited to logic units, but may also be hardware or logic devices.
The present specification embodiment also proposes a computer-readable storage medium storing one or more programs.
The one or more programs include instructions which, when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the method of the embodiment shown in fig. 1, and specifically to perform the following operations:
generating an initial attention map corresponding to the target image based on the attention neural network for the target content entity, wherein a pixel in the initial attention map reflects, through the magnitude of its value under the target pixel parameter, the probability of the pixel being feature-extracted by the attention neural network;
setting the pixels in the initial attention map whose target pixel parameter reaches the preset value standard to the target gray level, obtaining the processed attention map corresponding to the initial attention map;
determining the content detection area based on the pixels of the target gray level in the processed attention map;
and performing content detection for the target content entity on the corresponding content detection area in the target image.
In summary, the above is merely a preferred embodiment of this specification and is not intended to limit its protection scope. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of one or more embodiments of this specification shall be included in the protection scope of one or more embodiments of this specification.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.

Claims (10)

1. An image processing method, comprising:
generating an initial attention map corresponding to a target image based on an attention neural network for a target content entity, wherein a pixel in the initial attention map reflects, through the magnitude of its value under a target pixel parameter, the probability of the pixel being feature-extracted by the attention neural network;
setting the pixels in the initial attention map whose target pixel parameter reaches a preset value standard to a target gray level, obtaining a processed attention map corresponding to the initial attention map;
determining a content detection area based on the pixels of the target gray level in the processed attention map;
and performing content detection for the target content entity on the corresponding content detection area in the target image.
2. The method according to claim 1,
wherein the attention neural network comes from a text-to-image model;
and generating the initial attention map of the target image corresponding to the target content entity based on the attention neural network for the target content entity comprises:
importing text guide information describing the generation of an image of the target content entity, together with the target image, into the text-to-image model, so that the attention neural network generates the initial attention map of the target image corresponding to the target content entity according to the cross-modal cross-attention between the text identifier corresponding to the target content entity in the text guide information and the target image.
3. The method according to claim 1,
wherein determining the content detection area based on the pixels of the target gray level in the processed attention map comprises:
determining the pixel region of the target gray level in the processed attention map as the content detection area.
4. The method according to claim 3,
wherein, if there are no fewer than two pixel regions of the target gray level in the processed attention map, determining the content detection area from the pixel regions of the target gray level in the processed attention map comprises:
removing the pixel regions of the target gray level that do not reach a preset size standard in the processed attention map, and determining the remaining pixel regions of the target gray level in the processed attention map as content detection areas.
5. The method according to any one of claims 1 to 4,
wherein the target content entity belongs to an offending content classification; and the method further comprises:
if the content detection result corresponding to the target image indicates that the target content entity is contained, performing a matching violation processing operation on the target image.
6. The method according to claim 5,
wherein the target image is published on a network platform, and the violation processing operation comprises at least one of the following:
deleting the metadata of the target image from the network platform;
banning the network platform account that published the target image;
downgrading the permissions of the network platform account that published the target image.
7. The method according to claim 6,
wherein the target image is carried in a target service request, and the violation processing operation comprises:
refusing to accept the target service request.
8. An image processing apparatus comprising:
an attention map generation module, configured to generate an initial attention map corresponding to a target image based on an attention neural network for a target content entity, wherein a pixel in the initial attention map reflects, through the magnitude of its value under a target pixel parameter, the probability of the pixel being feature-extracted by the attention neural network;
an attention map processing module, configured to set the pixels in the initial attention map whose target pixel parameter reaches a preset value standard to a target gray level, obtaining a processed attention map corresponding to the initial attention map;
a detection area determination module, configured to determine a content detection area based on the pixels of the target gray level in the processed attention map;
and a detection execution module, configured to perform content detection for the target content entity on the corresponding content detection area in the target image.
9. An electronic device, comprising: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to:
generate an initial attention map corresponding to a target image based on an attention neural network for a target content entity, wherein a pixel in the initial attention map reflects, through the magnitude of its value under a target pixel parameter, the probability of the pixel being feature-extracted by the attention neural network;
set the pixels in the initial attention map whose target pixel parameter reaches a preset value standard to a target gray level, obtaining a processed attention map corresponding to the initial attention map;
determine a content detection area based on the pixels of the target gray level in the processed attention map;
and perform content detection for the target content entity on the corresponding content detection area in the target image.
10. A computer-readable storage medium for storing computer-executable instructions that when executed by a processor perform the operations of:
generating an initial attention map corresponding to a target image based on an attention neural network for a target content entity, wherein a pixel in the initial attention map reflects, through the magnitude of its value under a target pixel parameter, the probability of the pixel being feature-extracted by the attention neural network;
setting the pixels in the initial attention map whose target pixel parameter reaches a preset value standard to a target gray level, obtaining a processed attention map corresponding to the initial attention map;
determining a content detection area based on the pixels of the target gray level in the processed attention map;
and performing content detection for the target content entity on the corresponding content detection area in the target image.
CN202310870414.9A 2023-07-14 2023-07-14 Image processing method and device, electronic equipment and storage medium Pending CN116935110A (en)

Priority Applications (1)

Application Number: CN202310870414.9A
Publication: CN116935110A (en)
Priority Date: 2023-07-14
Filing Date: 2023-07-14
Title: Image processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number: CN202310870414.9A
Publication: CN116935110A (en)
Priority Date: 2023-07-14
Filing Date: 2023-07-14
Title: Image processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number: CN116935110A
Publication Date: 2023-10-24

Family

ID=88393589

Family Applications (1)

Application Number: CN202310870414.9A
Publication: CN116935110A (en)
Priority Date: 2023-07-14
Filing Date: 2023-07-14
Title: Image processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116935110A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination