CN116994021A - Image detection method, device, computer readable medium and electronic equipment


Info

Publication number: CN116994021A
Application number: CN202211447191.7A
Authority: CN (China)
Prior art keywords: image, text, model, sample, feature
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 许剑清
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202211447191.7A; published as CN116994021A

Classifications

    • G06V10/764 — Image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects
    • G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V10/40 — Extraction of image or video features
    • G06V10/761 — Image or video pattern matching; proximity, similarity or dissimilarity measures
    • G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/82 — Image or video recognition or understanding using neural networks

Abstract

The application discloses an image detection method, an image detection device, a computer readable medium and electronic equipment, wherein the method comprises the following steps: extracting image features of an image to be detected through a first image model to obtain image features to be detected, the first image model being obtained by training with first training data, and the first training data comprising a first sample image which has no sample text label and belongs to the same service scene as the detection target, and a second sample image which has a sample text label; matching the image features to be detected with target text features to obtain a matching result, the target text features being text features related to the detection target; and determining, according to the matching result, whether the image to be detected is an image containing the detection target. The embodiment of the application can be applied to various scenes such as cloud technology, artificial intelligence, intelligent traffic and assisted driving. The technical scheme of the application can improve the image detection accuracy of the service scene where the detection target is located.

Description

Image detection method, device, computer readable medium and electronic equipment
Technical Field
The application belongs to the technical field of image processing, and particularly relates to an image detection method, an image detection device, a computer readable medium and electronic equipment.
Background
Pictures are a widely used form of content distribution on the Internet, and in some cases the content contained in pictures needs to be detected and identified. Manual review can detect and identify a small number of pictures but cannot cope with the massive picture data generated on the Internet; at present, a pre-trained picture classification model is used in most cases to identify picture content. This detection approach can only output a result of either picture compliance or picture non-compliance, the detection result is limited, and the accuracy of the picture classification model still needs to be improved.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the application, and may therefore include information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The application aims to provide an image detection method, an image detection device, a computer readable medium and electronic equipment, so as to alleviate the problem of low image detection accuracy in the related art.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided an image detection method including:
Extracting image features of the image to be detected through the first image model to obtain the image features to be detected; the first image model is obtained through training of first training data, and the first training data comprises a first sample image which does not have a sample text label and belongs to the same service scene as a detection target and a second sample image which has the sample text label;
matching the image features to be detected with target text features to obtain a matching result, wherein the target text features are text features related to the detection target; the target text features are obtained by text feature extraction through a first text model, and the first text model is obtained by joint training with the first image model on the basis of the first training data;
and determining whether the image to be detected is an image containing the detection target according to the matching result.
According to an aspect of an embodiment of the present application, there is provided an image detection apparatus including:
the image feature extraction module is used for extracting image features of the image to be detected through the first image model to obtain the image features to be detected; the first image model is obtained through training of first training data, and the first training data comprises a first sample image which does not have a sample text label and belongs to the same service scene as a detection target and a second sample image which has the sample text label;
The matching module is used for matching the image features to be detected with target text features to obtain a matching result, wherein the target text features are text features related to the detection target; the target text features are obtained by text feature extraction through a first text model, and the first text model is obtained by joint training with the first image model on the basis of the first training data;
and the image detection module is used for determining whether the image to be detected is an image containing the detection target according to the matching result.
In one embodiment of the application, the apparatus further comprises:
the second training data acquisition module is used for acquiring second training data, wherein the second training data comprises a second sample image and a sample text label corresponding to the second sample image;
the second model training module is used for training a preset network model through the second training data to obtain a second model; the preset network model comprises a preset text network model and a preset image network model, and the second model comprises a second text model and a second image model;
the first training data acquisition module is used for acquiring first training data, wherein the first training data comprises the second training data and the first sample image;
The first model training module is used for training the second model through the first training data to obtain a first model; the first model includes a first text model and a first image model.
In one embodiment of the present application, the second model training module includes:
the second text feature extraction unit is used for extracting text features of sample texts in the second training data through the preset text network model to obtain second sample text features;
the second image feature extraction unit is used for extracting image features of a second sample image in the second training data through the preset image network model to obtain second sample image features;
a second model loss calculation unit, configured to calculate a feature matrix according to a transposed feature of one of the second sample text feature and the second sample image feature and the other of the second sample text feature and the second sample image feature, and calculate a second model loss according to the feature matrix;
and the second model parameter updating unit is used for updating the model parameters of the preset text network model and the preset image network model according to the second model loss, and continuing training based on the preset text network model and the preset image network model after parameter updating until a second model convergence condition is reached.
In one embodiment of the present application, the second model loss calculation unit is specifically configured to:
calculating a second image feature matrix according to the transposed features of the second sample image features and the second sample text features, and calculating to obtain a second image loss according to the second image feature matrix;
calculating a second text feature matrix according to the transposed features of the second sample text features and the second sample image features, and calculating a second text loss according to the second text feature matrix;
and obtaining the second model loss according to the sum of the second image loss and the second text loss.
In one embodiment of the present application, the first model training module includes:
the first text feature extraction unit is used for extracting text features of sample texts in the first training data through the second text model to obtain first text sample features;
the first image feature extraction unit is used for extracting image features of the sample images in the first training data through the second image model to obtain first sample image features; the sample image includes the second sample image and the first sample image;
A first model loss calculation unit for calculating a first model loss from the first sample image feature and the first sample text feature;
and the first model parameter updating unit is used for updating the model parameters of the second text model and the second image model according to the first model loss, and continuing training based on the second text model and the second image model after parameter updating until a first model convergence condition is reached.
In one embodiment of the present application, the first model loss calculation unit is specifically configured to:
calculating a first image loss from a transposed feature of the first sample image feature and the first sample text feature;
calculating a first text loss from the transposed feature of the first sample text feature and the first sample image feature;
calculating a third image loss according to the distance between the first sample image features;
and obtaining the first model loss according to the sum of the first image loss, the first text loss and the third image loss.
In one embodiment of the present application, the first training data acquisition module includes:
the data enhancement unit is used for carrying out image data enhancement processing on the first sample image to obtain an enhanced sample image;
And the first training data acquisition unit is used for taking the first sample image, the enhanced sample image and the second training data as first training data.
In one embodiment of the application, the image data enhancement processing includes at least one of:
random clipping, random occlusion, gaussian blur, rotation, addition of noise, and edge gradient extraction. In one embodiment of the application, the apparatus further comprises:
a text description information acquisition module for acquiring a plurality of text description information related to the detection target;
and the text feature extraction module is used for extracting text features of the text description information through the first text model to obtain a plurality of target text features.
In one embodiment of the present application, the target text features are plural, and the matching result includes a similarity between the image feature to be detected and each of the target text features; the image detection module is specifically used for:
when at least one similarity larger than a preset similarity threshold exists in the matching result, determining that the image to be detected is an image containing the detection target;
And when the similarity larger than a preset similarity threshold does not exist in the matching result, determining that the image to be detected is not the image containing the detection target.
According to an aspect of the embodiments of the present application, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements the image detection method as in the above technical solution.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein execution of the executable instructions by the processor causes the electronic device to perform the image detection method as in the above technical solution.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the image detection method as in the above technical solution.
In the technical scheme provided by the embodiment of the application, the image features of the image to be detected are extracted through the first image model, and the image features to be detected are then matched with the target text features extracted by the first text model, so as to determine whether the image to be detected is an image containing the detection target; in this way the image detection accuracy of the service scene where the detection target is located can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied.
Fig. 2 schematically illustrates a schematic diagram of an image detection method according to an embodiment of the present application.
Fig. 3 schematically shows a flowchart of an image detection method according to an embodiment of the present application.
Fig. 4 schematically illustrates a schematic diagram of an image detection method according to an embodiment of the present application.
Fig. 5 schematically shows a block diagram of an apparatus for training a preset network model according to an embodiment of the present application.
Fig. 6 schematically shows a block diagram of an apparatus for training a second model according to an embodiment of the application.
Fig. 7 schematically shows a flowchart of an image detection method according to an embodiment of the present application.
Fig. 8 schematically shows a flowchart of an image detection method according to an embodiment of the present application.
Fig. 9 schematically shows a block diagram of an image detection apparatus provided by an embodiment of the present application.
Fig. 10 schematically shows a block diagram of a computer system suitable for use in implementing embodiments of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
The technical scheme of the application relates to computer vision technology. Computer Vision (CV) is the science of studying how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognition and measurement on targets, and further performs graphics processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
The scheme provided by the embodiment of the application relates to an image detection method of a computer vision technology, and is specifically described by the following embodiments:
fig. 1 schematically shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied.
As shown in fig. 1, system architecture 100 may include a terminal device 110, a network 120, and a server 130. Terminal device 110 may include a cell phone, a computer, an intelligent voice interaction device, an intelligent home appliance, a vehicle terminal, an aircraft, and the like. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. Network 120 may be a communication medium of various connection types capable of providing a communication link between terminal device 110 and server 130, and may be, for example, a wired communication link or a wireless communication link.
The system architecture in embodiments of the present application may have any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 130 may be a server group composed of a plurality of server devices. In addition, the technical solution provided in the embodiment of the present application may be applied to the terminal device 110, or may be applied to the server 130, or may be implemented by the terminal device 110 and the server 130 together, which is not limited in particular.
For example, the technical scheme of the present application is implemented by the server 130. First, the server 130 extracts image features of an image to be detected through a first image model to obtain the image features to be detected, wherein the first image model is obtained through training with first training data, and the first training data comprises a first sample image which has no sample text label and belongs to the same service scene as the detection target, and a second sample image which has a sample text label; the image to be detected may be an image uploaded to the server 130 by the terminal device 110 through the network 120. Then, the server 130 matches the image features to be detected with target text features to obtain a matching result, wherein the target text features are text features related to the detection target; the target text features are obtained by text feature extraction through a first text model, and the first text model is obtained by joint training with the first image model on the basis of the first training data. Finally, the server 130 determines whether the image to be detected is an image containing the detection target according to the matching result; when the matching result shows a high similarity between the target text features and the image features to be detected, the image represented by the image features to be detected and the image described by the target text features may belong to the same type of image, so it can be determined that the image to be detected is an image containing the detection target.
For example, taking a content sharing platform as an example, the content sharing platform needs to perform validity detection on image content before publishing the image, and in the embodiment of the present application, the content sharing platform may set a detection target as an illegal object in the image. The content sharing platform can perform joint training on the image model and the text model in advance according to the first training data to obtain a trained first image model and a trained first text model; the first training data comprises a first sample image which does not have a sample text label and belongs to the same service scene as the detection target, and a second sample image which has the sample text label. Before an image is released, the content sharing platform uses a first text model to extract text characteristics of text description information of a detection target, and therefore the text characteristics corresponding to the detection target are obtained. In the process of releasing the image, the content sharing platform uses the first image model to extract image features of the image to be detected to obtain the image features to be detected, then matches the image features to be detected with target text features, and further determines whether the image to be detected is an image containing a detection target according to a matching result.
In another exemplary embodiment, taking a content sharing platform as an example, the content sharing platform needs to perform validity detection on the image content before publishing the image, and in this embodiment of the present application, the content sharing platform may set a detection target as an illegal object in the image. The content sharing platform can obtain a plurality of text description information of the detection target in advance, namely, describes illegal objects from a plurality of aspects or angles; and then, respectively extracting the characteristics of the text description information to obtain a plurality of target text characteristics, wherein the target text characteristics are characteristic representations of illegal objects. When the content sharing platform issues the image, the image to be issued is the image to be detected, and the image feature extraction is carried out on the image to be detected to obtain the image feature to be detected. And finally, the content sharing platform respectively matches the image features to be detected with the target text features to obtain a plurality of matching results, and determines whether the image to be detected is legal or not according to the plurality of matching results. When one matching result in the plurality of matching results indicates that the feature of the image to be detected is similar to the corresponding target text feature, the content of the image to be detected can be described as an illegal object described by the corresponding target text feature, so that the image to be detected is illegal, and the illegal type of the image to be detected can be determined to be the illegal type described by the corresponding target text feature. Therefore, image content detection before image release in the content sharing platform is realized.
The image detection method provided by the application is described in detail below with reference to the specific embodiments. The embodiment of the application can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent transportation, auxiliary driving and the like.
Fig. 2 schematically illustrates a schematic diagram of an image detection method according to an embodiment of the present application, which may be implemented by a server or a terminal device, such as the terminal device 110 or the server 130 shown in fig. 1. As shown in fig. 2, the image detection method provided in the embodiment of the present application includes steps 210 to 230, which are specifically as follows:
step 210, extracting image features of an image to be detected through a first image model to obtain the image features to be detected; the first image model is obtained through training of first training data, and the first training data comprises a first sample image which does not have a sample text label and belongs to the same service scene as the detection target and a second sample image which has the sample text label.
Specifically, the image to be detected is an image whose content currently needs to be detected, for example, an image to be published on a content sharing platform. Image feature extraction is performed by the first image model, which may be implemented by a convolutional neural network (Convolutional Neural Network, CNN) or a Transformer-type network structure. For example, the first image model first performs convolution calculation on the image to be detected to obtain convolution features, and then performs nonlinear activation calculation and pooling calculation on the convolution features to obtain the image features to be detected, which can also be regarded as a vector.
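By way of illustration only, and not as the implementation claimed in this application, the following Python sketch shows a convolutional feature extractor of the kind just described (convolution, nonlinear activation, pooling, and projection to a feature vector); all layer sizes and the embedding dimension are assumed values.

import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    def __init__(self, embed_dim: int = 512):  # embed_dim is an assumed value
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),   # convolution calculation
            nn.ReLU(inplace=True),                                   # nonlinear activation
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),                                 # pooling calculation
        )
        self.proj = nn.Linear(128, embed_dim)                        # project to a feature vector

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        x = self.backbone(image).flatten(1)          # (batch, 128)
        x = self.proj(x)                             # (batch, embed_dim)
        return nn.functional.normalize(x, dim=-1)    # unit-length image feature to be detected

# Usage: features = ImageEncoder()(torch.randn(1, 3, 224, 224))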
In the embodiment of the application, the first image model is obtained through training with first training data, and the first training data comprises a first sample image and a second sample image, wherein the first sample image does not have a corresponding sample text label but belongs to the same service scene as the detection target, and the second sample image has a corresponding sample text label but is not restricted to any specific service scene. The detection target is an object or event that needs to be detected in images of a service scene, for example: animals, flowers, trees, buildings, characters, a running dog, a car accident, and the like. A service scene refers to a scene in which image content needs to be detected, for example, an image sharing scene of a certain content sharing platform, or an image transmission scene of a chat room. Different service scenes may have different detection targets, one service scene may have multiple detection targets, and the same detection target may of course correspond to multiple service scenes.
The second sample image, which is not restricted to a specific service scene, increases the scene universality of the first image model, so that the first image model can be widely applied to image detection in various service scenes. The first sample image, which belongs to the same service scene as the detection target, makes the first image model better adapted to the service scene where the detection target is located, which amounts to improving the image detection accuracy of the first image model in the service scene where the model is deployed. Therefore, the first image model in the embodiment of the application not only has universality but can also improve the image detection accuracy of the service scene where the detection target is located.
Step 220, matching the image features to be detected with target text features to obtain a matching result, wherein the target text features are text features related to the detection target; the target text features are obtained by extracting text features through a first text model, and the first text model is obtained by carrying out combined training on the basis of first training data and first image model data.
Specifically, the target text features are text features related to the detection target extracted by the first text model, that is, the target text features represent the relevant characteristics of the detection target from the text side. Matching the image features to be detected with the target text features amounts to matching the unknown object contained in the image to be detected against the known detection target, so that the matching result indicates whether the unknown object contained in the image to be detected is the same as the known detection target, and it can thus be determined whether the image to be detected contains the detection target.
The first text model is also obtained through training with the first training data, and it is obtained by joint training with the first image model; that is, the text model and the image model are jointly trained with the first training data to obtain the first text model and the first image model. Because the first text model and the first image model are jointly trained, the first text model can refer to the corresponding image features when extracting text features, making the extracted text features more accurate, and the first image model can refer to the text features when extracting image features, making the extracted image features more accurate. It can be seen that the joint training of the first image model and the first text model can improve the accuracy of image detection.
In one embodiment of the present application, the matching between the image feature to be detected and the target text feature may be achieved by calculating the similarity between the image feature to be detected and the target text feature. For example, cosine similarity between the image feature to be detected and the target text feature is used as a matching result between the two.
Step 230, determining whether the image to be detected is an image containing a detection target according to the matching result.
Specifically, when the matching result shows that the degree of matching between the image features to be detected and the target text features is high, for example, when the similarity between them is greater than a preset threshold, the object represented by the image features to be detected and the detection target represented by the target text features can be considered the same, and the image to be detected is determined to be an image containing the detection target. Conversely, when the matching result shows that the degree of matching between the image features to be detected and the target text features is low, for example, when the similarity between them is smaller than the preset threshold, it is determined that the image to be detected is not an image containing the detection target.
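As a minimal sketch of steps 220 and 230, assuming both features are vectors of the same dimension, the matching and threshold decision could be expressed as follows; the threshold value is illustrative only.

import torch
import torch.nn.functional as F

def contains_target(image_feature: torch.Tensor,
                    text_feature: torch.Tensor,
                    threshold: float = 0.3) -> bool:  # threshold is an assumed value
    # Cosine similarity between the image feature to be detected and the target text feature.
    similarity = F.cosine_similarity(image_feature, text_feature, dim=-1)
    # The image is judged to contain the detection target when the similarity
    # exceeds the preset similarity threshold.
    return bool(similarity.item() > threshold)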
In the technical scheme provided by the embodiment of the application, the image features of the image to be detected are extracted through the first image model, and the image features to be detected are then matched with the target text features extracted by the first text model, so as to determine whether the image to be detected is an image containing the detection target; in this way the image detection accuracy of the service scene where the detection target is located can be improved.
Therefore, the method and the device realize the detection of the image content in a mode of combining text description information and image characteristics, not only can detect whether the image contains a detection target, but also can detect the specific condition of the detection target contained in the image, can output detailed detection results, and improve the precision and accuracy of image detection.
Fig. 3 schematically illustrates a schematic diagram of an image detection method according to an embodiment of the present application, which may be implemented by a server or a terminal device, such as the terminal device 110 or the server 130 shown in fig. 1. As shown in fig. 3, the image detection method provided in the embodiment of the present application includes steps 310 to 350, which are specifically as follows:
step 310, acquiring a plurality of text description information related to the detection target.
Specifically, the text description information of the detection target refers to text information used for describing the detection target, and one detection target may have multiple pieces of text description information describing it in different scenes or environments. For example, taking a dog as the detection target, the corresponding text description information may include: a dog under a blue sky and white clouds, a dog in a forest, a husky running, a puppy with a cat, and the like. It can be seen that each piece of text description information represents a scene or environment in which the detection target is located.
In one embodiment of the present application, when obtaining text description information of a detection target, the corresponding text description information may be obtained by using the detection target as a keyword and then generating a sentence according to the keyword. For example, a text corpus is constructed in advance, when text description information of a detection target is acquired, text data which can form sentences with the detection target is searched in the text corpus according to the detection target, and then the text data and the detection target are spliced to obtain the text description information of the detection target.
In one embodiment of the present application, when the detection target is more complex, the detection target may be split first to obtain a plurality of keywords, and then sentence generation is performed based on the plurality of keywords to obtain a plurality of text description information. Specifically, target text data describing a detection target is first acquired; then, word segmentation processing is carried out on the target text data to obtain a plurality of keywords; next, a part of keywords are selected from the plurality of keywords to form keyword groups, and then text description information is generated according to each keyword group.
For example, the target text data is "cat on road beside building", and the keywords such as "on", "building", "beside", "road", "upper", "cat" and the like are obtained through word segmentation, and then part of the keywords are selected to form a keyword group, for example, the "building", "road", "cat" is used as one keyword group, and text description information is generated based on the keyword group, and the text description information can be text information containing all the keywords in the keyword group selected from a corpus.
In one embodiment of the present application, the target text data may contain some words of low importance, for example words such as "on" in the above example; the importance of a word can be measured by calculating its TF-IDF (Term Frequency-Inverse Document Frequency). When generating a keyword group, keywords of lower importance can be removed and keywords of higher importance selected to form the keyword group. Optionally, after the text description information is generated from the keywords, the similarity between the generated text description information and the target text data can be calculated; when the similarity is greater than a threshold, the generated text description information and the target text data have similar meanings, so the generated text description information can be used as the finally required text description information.
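The keyword-importance step can be sketched as follows; this uses scikit-learn's TfidfVectorizer purely for illustration (the corpus, tokenization, and top-k value are assumptions, not elements prescribed by the application).

from sklearn.feature_extraction.text import TfidfVectorizer

def important_keywords(target_text: str, corpus: list[str], top_k: int = 3) -> list[str]:
    # Fit TF-IDF on the corpus plus the target text (whitespace-tokenized words assumed).
    vectorizer = TfidfVectorizer()
    vectorizer.fit(corpus + [target_text])
    scores = vectorizer.transform([target_text]).toarray()[0]
    vocab = vectorizer.get_feature_names_out()
    # Keep the top_k words with the highest TF-IDF weight as the keyword group.
    ranked = sorted(zip(vocab, scores), key=lambda kv: kv[1], reverse=True)
    return [word for word, score in ranked[:top_k] if score > 0]

# e.g. important_keywords("cat on road beside building", corpus=[...]) might keep "cat", "building", "road"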
Through the method, more text description information of the detection target can be obtained, so that the text description of the detection target is finer and more comprehensive.
And 320, respectively extracting text features of the text description information through the first text model to obtain a plurality of target text features.
Specifically, the first text model is a pre-trained model for extracting text features, text feature extraction is performed on text description information through the first text model, and the obtained features are marked as target text features. When the text feature extraction is performed, the text description information can be converted into text vectors, and then the feature extraction is performed on the text vectors through the first text model to obtain target text features.
In one embodiment of the present application, the text vector acquisition process includes: firstly, performing word segmentation on the text description information to obtain a plurality of words, arranged according to the position of each word in the text description information. Then, each word is encoded according to a preset encoding rule, and the encoded data of a word is its word vector; for example, each word is converted into a one-hot code according to a one-hot encoding rule to obtain the word vector. A position vector of each word is also generated according to the position of the word in the text description information. Then, the word vector and its corresponding position vector are superposed to obtain the feature vector corresponding to the word. Finally, the feature vectors corresponding to the words are combined into a vector matrix, and this vector matrix is the text vector.
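A minimal sketch of this text-vector construction is given below, assuming a small fixed vocabulary and a simple sinusoidal position vector; the position-encoding scheme is an assumption, since the application only states that word vectors and position vectors are superposed.

import numpy as np

def text_to_vector_matrix(words: list[str], vocab: list[str]) -> np.ndarray:
    # One row per word: one-hot vocabulary vector plus a position vector (assumed scheme).
    vocab_size = len(vocab)
    matrix = np.zeros((len(words), vocab_size), dtype=np.float32)
    for pos, word in enumerate(words):
        one_hot = np.zeros(vocab_size, dtype=np.float32)
        if word in vocab:
            one_hot[vocab.index(word)] = 1.0
        # Position vector: a simple sinusoidal encoding of the word position (illustrative only).
        position = np.sin(pos / (10000 ** (np.arange(vocab_size) / vocab_size)))
        matrix[pos] = one_hot + position   # superpose the word vector and its position vector
    return matrix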
In one embodiment of the present application, the process of feature extraction of the text vector by the first text model includes: firstly, feature extraction is performed on the text vector through a fully connected network to obtain fully connected features; for example, a convolution operation with a kernel of set size and stride is performed on the vector matrix representing the text vector to obtain the fully connected features, which are in fact a matrix formed by a plurality of values and can therefore be recorded as a fully connected matrix. Then, activation calculation is performed on the fully connected features through a nonlinear activation function to obtain activation features; for example, the activation calculation is performed by a ReLU (Rectified Linear Unit) function. Finally, the activation features are pooled to obtain the target text features; for example, max pooling is applied to the activation features, the maximum value in each feature is extracted, and these maximum values are concatenated to obtain the target text feature, which can also be regarded as a vector.
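Correspondingly, the fully connected, ReLU, and max-pooling steps of the first text model could be sketched as follows; the dimensions are assumed values and this is not the claimed implementation.

import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size: int = 1000, embed_dim: int = 512):  # assumed sizes
        super().__init__()
        self.fc = nn.Linear(vocab_size, embed_dim)   # fully connected feature
        self.act = nn.ReLU()                          # nonlinear activation

    def forward(self, text_vectors: torch.Tensor) -> torch.Tensor:
        # text_vectors: (num_words, vocab_size) vector matrix built from the text description.
        features = self.act(self.fc(text_vectors))       # (num_words, embed_dim)
        pooled, _ = features.max(dim=0)                   # max pooling over the words
        return nn.functional.normalize(pooled, dim=-1)    # target text feature vector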
And 330, extracting image features of the image to be detected through the first image model to obtain the image features to be detected.
And 340, respectively matching the image features to be detected with a plurality of target text features to obtain a matching result.
Specifically, the image feature to be detected is respectively matched with a plurality of target text features, so as to determine whether the target text features belonging to the same type as the image feature to be detected exist in the plurality of target text features for representing the detection target, wherein the image feature to be detected and the target text features belong to the same type, which means that the object represented by the image feature to be detected and the object represented by the target text features belong to the same type, for example, the image feature to be detected represents a dog, and the target text features also represent a dog, and then the image feature to be detected and the target text features are considered to belong to the same type.
In one embodiment of the application, the matching between the image to be detected and the target text feature can be achieved by calculating the similarity between the image feature to be detected and the target text feature. Specifically, the image feature to be detected is a vector, and is marked as a second vector; the target text feature is also a vector, and is marked as a first vector; then the match between the image feature to be detected and the target text feature is actually a calculation of the similarity between the second vector and the first vector. For example, a cosine similarity between the second vector and the first vector is calculated, and the cosine similarity is used as a matching result between the image feature to be detected and the target text feature. For another example, a euclidean distance between the second vector and the first vector is calculated, and the euclidean distance is used as a matching result between the image feature to be detected and the target text feature.
Step 350, determining whether the image to be detected is an image containing a detection target according to the matching result.
Specifically, the matching result indicates the similarity degree of the feature of the image to be detected and the feature of the target text, and when the matching result shows that the feature of the image to be detected and the feature of the target text are higher in similarity degree, the image represented by the feature of the image to be detected and the image described by the feature of the target text belong to the same type of image, so that the image to be detected can be determined to be the image containing the detection target. When the matching result shows that the similarity degree of the image feature to be detected and any target text feature is low, the image represented by the image feature to be detected is greatly different from the image described by any target text feature, and therefore it can be determined that the image to be detected is not the image containing the detection target.
In one embodiment of the present application, since the objects described by the plurality of target text features are all detection targets, the image to be detected is considered to be an image containing the detection targets as long as there is a high similarity between one target text feature and the feature of the image to be detected. Furthermore, according to the text description information corresponding to the target text features with higher similarity to the features of the image to be detected, the scene or environment corresponding to the detection target in the image to be detected can be further determined, so that the image detection has a more detailed and specific detection result.
According to the technical scheme of the embodiment of the application, in some service scenes the image containing the detection target is assumed to be an illegal image; in that case, as long as the similarity between one target text feature and the image features to be detected is high, the image to be detected can be determined to be an illegal image. Meanwhile, according to the text description information corresponding to the target text feature with the higher similarity to the image features to be detected, the specific violation type to which the image to be detected belongs can be further determined, which facilitates subsequent modification and adjustment of the image to be detected.
In one embodiment of the present application, when the matching result between the image to be detected and the target text feature is represented by the similarity, then when at least one similarity greater than a preset similarity threshold exists in the plurality of matching results, determining that the image to be detected is an image containing the detection target; and when the similarity larger than the preset similarity threshold value does not exist in the plurality of matching results, determining that the image to be detected is not the image containing the detection target.
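Putting steps 340 and 350 together, a sketch of the multi-description decision rule might look like the following; the similarity threshold and return format are assumptions for illustration.

import torch
import torch.nn.functional as F

def detect(image_feature: torch.Tensor,
           target_text_features: list[torch.Tensor],
           descriptions: list[str],
           threshold: float = 0.3):  # assumed threshold
    # Match the image feature to be detected against every target text feature.
    similarities = [F.cosine_similarity(image_feature, t, dim=-1).item()
                    for t in target_text_features]
    best = max(range(len(similarities)), key=lambda i: similarities[i])
    if similarities[best] > threshold:
        # At least one similarity exceeds the threshold: the image contains the detection
        # target, and the matched description indicates the scene or violation type.
        return True, descriptions[best]
    return False, None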
According to the technical scheme provided by the embodiment of the application, the text description information of the detection targets is subjected to text feature extraction to obtain the target text features, then the image feature extraction is performed on the image to be detected to obtain the image feature to be detected, and the image feature to be detected is respectively matched with each target text feature, so that whether the image to be detected is an image containing the detection targets or not is determined, the image content is detected in a mode of combining the text description information and the image features, whether the image contains the detection targets or not can be detected, the specific condition of the detection targets contained in the image can be detected, detailed detection results can be output, and the accuracy and the precision of image detection are improved.
Fig. 4 schematically shows a flowchart of an image detection method according to an embodiment of the present application, which is a further refinement of the above embodiment. As shown in fig. 4, the method includes steps 410 to 490, specifically as follows:
step 410, acquiring second training data, where the second training data includes a second sample image and a sample text label corresponding to the second sample image.
Specifically, the text description information corresponding to the second sample image is recorded as a sample text label, and the second sample image and the sample text label are labels of each other; that is, the sample text label can be regarded as the text label of the second sample image, and the second sample image can also be regarded as the image label of the sample text label. The second training data may be composed of published image-text training data.
Step 420, training the preset network model through second training data to obtain a second model; the preset network model comprises a preset text network model and a preset image network model, and the second model comprises a second text model and a second image model.
Specifically, the second training data is input into the preset network model, and after relevant processing such as feature extraction and the like is performed on the second training data by the preset network model, a prediction label corresponding to the second training data is obtained, and then the prediction label of the second training data is compared with a sample label of the second training data to determine whether the two are identical. If the predicted label of the second training data is the same as the sample label of the second training data, the preset network model is indicated to be capable of classifying the second training data more accurately, and at the moment, the preset network model can be considered to be trained to obtain the second model. If the predicted label of the second training data is different from the sample label of the second training data, the preset network model is required to learn the characteristic information contained in the second training data continuously so as to classify the second training data better, and model training can be performed continuously after model parameters are updated according to the difference between the predicted label and the sample label.
In the embodiment of the application, the preset network model comprises a preset text network model and a preset image network model, and the second model obtained through training comprises a second text model and a second image model, wherein the second text model is obtained through training of the preset text network model, and the second image model is obtained through training of the preset image network model.
In one embodiment of the present application, the training the preset network model through the second training data includes: extracting text features of the sample text labels in the second training data through a preset text network model to obtain second sample text label features; extracting image features of a second sample image in the second training data through a preset image network model to obtain second sample image features; calculating a second model loss according to the second sample text label feature and the second sample image feature; and updating model parameters of the preset text network model and the preset image network model according to the second model loss, and continuing training based on the preset text network model and the preset image network model after parameter updating until a second model convergence condition is reached.
Specifically, feature extraction is performed on a second sample image and a sample text label through a preset image network model and a preset text network model respectively, then features extracted by the two models are compared to calculate a loss function, and then model parameters of the two models are updated respectively according to the loss function to continue training until convergence conditions are reached. When the second training data is input to the preset network model, the second training data may be divided into a plurality of batches (batches), and then the second training data is input to the preset network model for training in batches.
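Dividing the second training data into batches can be sketched with a standard data loader, for example as below; the tensor shapes and batch size are assumed values.

import torch
from torch.utils.data import DataLoader, TensorDataset

# second_images: (M, 3, H, W) tensor; second_text_vectors: (M, vocab_size) tensor (assumed shapes).
def make_batches(second_images: torch.Tensor, second_text_vectors: torch.Tensor, batch_size: int = 64):
    dataset = TensorDataset(second_images, second_text_vectors)
    # Each yielded batch pairs N second sample images with their sample text labels.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)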
In the training process, first, a second sample image in second training data is input into a preset image network model, and image feature extraction is carried out on the second sample image through the preset image network model to obtain second sample image features. And simultaneously, inputting the sample text labels in the second training data into a preset text network model, and extracting text features of the sample text labels through the preset text network model to obtain second sample text label features.
The second sample image features are then compared with the second sample text label features to calculate the loss function. Taking the second sample image features as the reference, the second sample text label features are compared with the second sample image features to obtain a second image loss, where the second image loss L_{p2t} is calculated as shown in the following formula (1):

L_{p2t} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(x_i^{\top} y_i\right)}{\sum_{j=1}^{N}\exp\left(x_i^{\top} y_j\right)}    (1)

wherein i denotes the index of the second sample image; N denotes the total number of second sample images in one model training iteration (for example, when the second training data is divided into a plurality of batches, N may be the number of second samples contained in one batch); x_i denotes the second sample image feature obtained by extracting features of the second sample image i through the preset image network model, and x_i^{\top} denotes the transposed feature of the second sample image feature; y_i denotes the second sample text label feature obtained by extracting features of the sample text label corresponding to the second sample image i through the preset text network model. In general, the second sample image feature x_i and the second sample text label feature y_i are both column vectors; by converting the second sample image feature x_i into its transposed feature and multiplying it with the second sample text label feature, the product yields the second image feature matrix x_i^{\top} y_j, from which the loss can then be calculated.
Similarly, taking the second sample text label features as the reference, the second sample image features are compared with the second sample text label features to obtain a second text loss, where the second text loss L_{t2p} is calculated as follows:

L_{t2p} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(y_i^{\top} x_i\right)}{\sum_{j=1}^{N}\exp\left(y_i^{\top} x_j\right)}    (2)

wherein y_i^{\top} denotes the transposed feature of the second sample text label feature, and y_i^{\top} x_j denotes the second text feature matrix; the remaining parameters have the same meanings as in formula (1).
Adding the second image loss and the second text loss to obtain the total loss of the training process, namely the second model loss, wherein the calculation method is shown in the following formula (3):
L_con = L_p2t + L_t2p    (3)
wherein L_con represents the second model loss.
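As an illustrative sketch only (not a statement of the claimed method), the losses of formulas (1)–(3) could be computed as follows, assuming PyTorch, L2-normalized feature vectors stacked into matrices, and a cross-entropy formulation of the alignment objective; the function name and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def second_model_loss(image_feats, text_feats):
    """Contrastive loss sketch for one batch of N (image, text label) pairs.

    image_feats: (N, D) second sample image features x_i
    text_feats:  (N, D) second sample text label features y_i
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Second image feature matrix: entry (i, j) is x_i^T y_j; symmetrically for text.
    logits_p2t = image_feats @ text_feats.t()            # used for formula (1)
    logits_t2p = text_feats @ image_feats.t()            # used for formula (2)

    # Matching image/text pairs lie on the diagonal, so the targets are 0..N-1.
    targets = torch.arange(image_feats.size(0), device=image_feats.device)

    loss_p2t = F.cross_entropy(logits_p2t, targets)      # second image loss
    loss_t2p = F.cross_entropy(logits_t2p, targets)      # second text loss
    return loss_p2t + loss_t2p                            # formula (3): L_con
```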
After the second model loss is obtained, it is judged whether the second model convergence condition is reached. When the second model convergence condition is not met, the model parameters of the preset text network model and the preset image network model are updated according to the second model loss, and training continues based on the preset text network model and the preset image network model after parameter updating. For example, when the second training data is divided into a plurality of batches, the second model loss is obtained and the model parameters are updated according to the second training data of the j-th batch, and then the parameter-updated model continues to be trained with the second training data of the (j+1)-th batch.
In the embodiment of the application, the model parameters can be updated in a gradient descent manner, for example, using algorithms such as stochastic gradient descent, stochastic gradient descent with momentum, Adam (Adaptive Moment Estimation), or AdaGrad (Adaptive Gradient).
In the embodiment of the present application, the second model convergence condition may be that the number of training iterations of the model reaches a set number, or that the second model loss is smaller than a preset threshold.
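For illustration, the parameter update and the two convergence checks described above might be combined as in the following sketch, which reuses the second_model_loss sketch given earlier; the optimizer choice (Adam), learning rate, iteration limit, and loss threshold are assumptions.

```python
import torch

def train_preset_models(image_model, text_model, batches, max_iters=10000, loss_threshold=0.01):
    # Any gradient-descent variant may be used; Adam is only one example.
    params = list(image_model.parameters()) + list(text_model.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)

    for step, (batch_images, batch_token_ids) in enumerate(batches):
        image_feats = image_model(batch_images)          # second sample image features
        text_feats = text_model(batch_token_ids)         # second sample text label features
        loss = second_model_loss(image_feats, text_feats)  # formula (3)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Second model convergence condition: iteration count or loss below threshold.
        if step + 1 >= max_iters or loss.item() < loss_threshold:
            break
    return image_model, text_model
```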
Illustratively, fig. 5 schematically shows a block diagram of an apparatus for training a preset network model according to an embodiment of the present application. As shown in fig. 5, the apparatus includes: a training data preparation module 510, an image network element module 520, a text network element module 530, an alignment objective function calculation module 540, a judgment module 550, and an alignment objective function optimization module 560.
The training data preparation module 510 is configured to obtain second training data, where the second training data includes a second sample image and a sample text label corresponding to the second sample image. And, the training data preparation module 510 is further configured to divide the second training data into a plurality of batches, and input the second training data to the image network element module 520 and the text network element module 530 in batches.
The image network unit module 520 is configured to perform image feature extraction on the second sample image to obtain the second sample image features. The image network element module 520 may employ a convolutional neural network or a Transformer-type network structure, and the specific calculation process may include operations such as convolution calculation, nonlinear activation function calculation, and pooling calculation.
The function of the text network element module 530 is to extract text features from the sample text labels to obtain the second sample text label features. The text network element module 530 may employ a Transformer-type network structure, and the specific calculation process may include operations such as fully connected calculation, nonlinear activation function calculation, and pooling calculation.
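By way of a non-authoritative sketch, the image and text network unit modules could be backed by off-the-shelf encoders such as the following; the specific backbones (a ResNet-18 convolutional network and a small Transformer encoder), dimensions, vocabulary size, and a recent torchvision/PyTorch version are all assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class ImageNetworkUnit(nn.Module):
    """Convolution + nonlinear activation + pooling, ending in a feature vector."""
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)
        self.backbone = backbone

    def forward(self, images):                  # images: (N, 3, H, W)
        return self.backbone(images)            # (N, feat_dim)

class TextNetworkUnit(nn.Module):
    """Transformer-type encoder over token ids, pooled to a feature vector."""
    def __init__(self, vocab_size=30000, feat_dim=512, num_layers=4, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(feat_dim, feat_dim)   # fully connected calculation

    def forward(self, token_ids):               # token_ids: (N, L)
        hidden = self.encoder(self.embed(token_ids))
        pooled = hidden.mean(dim=1)              # pooling calculation
        return self.proj(pooled)
```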
The alignment objective function calculation module 540 is configured to calculate the second model loss. First, the second image loss is calculated according to the transposed features of the second sample image features and the second sample text label features, as shown in formula (1). Meanwhile, the second text loss is calculated according to the transposed features of the second sample text label features and the second sample image features, as shown in formula (2). The second image loss and the second text loss are then added to obtain the second model loss, as shown in formula (3).
The determining module 550 is configured to determine whether a termination model training condition is satisfied, i.e., determine whether a current model training reaches a second model convergence condition, e.g., determine whether the number of iterations reaches a preset number of iterations, or determine whether a second model loss is less than a preset threshold. And when the termination condition is met, model training is finished, and a trained second model is obtained.
The alignment objective function optimization module 560 updates the model parameters in a gradient descent manner and performs training optimization on the whole network.
Step 430, acquiring first training data, where the first training data includes second training data and a first sample image belonging to the same service scene as the detection target.
Specifically, the first training data is obtained by adding a first sample image on the basis of the second training data, wherein the first sample image is an image under a service scene where a detection target is located, that is, the first sample image is obtained from the service scene needing to perform image content detection and added into the first training data. In this case, the first training data includes a second sample image and a first sample image, the second sample image having a corresponding sample text label, but the first sample image does not need to be text labeled, i.e., the first sample image does not have a corresponding sample text label.
In one embodiment of the present application, after the first sample image in the service scene is acquired, the image data enhancement processing may be further performed on the first sample image, so as to obtain an enhanced sample image, and then the enhanced sample image, the first sample image and the second training data are taken together as the first training data.
The image data enhancement process outputs a new image by making certain modifications to an original image. Image data enhancement processing can increase the number of images and reduce the image acquisition requirement on the service scene where the detection target is located. In the embodiment of the application, the image data enhancement processing can be performed through operations such as random clipping, random occlusion, Gaussian blur, rotation, noise addition, and edge gradient extraction. Random clipping crops a randomly selected area of the image as an enhanced image. Random occlusion randomly occludes a region in the image to obtain an enhanced image. Gaussian blur smooths the image using a Gaussian distribution to obtain an enhanced image. Adding noise adds noise information to the image to form an enhanced image, for example, adding randomly distributed noise points to the image. Edge gradient extraction extracts the contour edges of objects in the image through gradient values; points with large gradient changes in the image are usually contour edge points, so the contour can be made clearer, forming an enhanced image.
Alternatively, a plurality of forms of image data enhancement processing may be performed on one first sample image, so that a plurality of enhanced sample images may be obtained from a single first sample image. For example, cropping a partial region of the first sample image yields one enhanced sample image, and occluding a partial region of the first sample image yields another enhanced sample image.
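A possible implementation of several of the enhancement operations described above, producing multiple enhanced sample images from a single first sample image, is sketched below; the torchvision transform classes and parameters are assumptions, and edge gradient extraction is omitted for brevity.

```python
import torch
import torchvision.transforms as T

# Each entry yields a different enhanced sample image from the same input tensor.
augmentations = [
    T.RandomResizedCrop(224),                             # random clipping
    T.RandomErasing(p=1.0),                               # random occlusion
    T.GaussianBlur(kernel_size=5),                        # Gaussian blur
    T.RandomRotation(degrees=30),                         # rotation
    T.Lambda(lambda x: x + 0.05 * torch.randn_like(x)),   # add noise
]

def enhance(first_sample_image):
    """first_sample_image: a (3, H, W) tensor; returns a list of enhanced sample images."""
    return [aug(first_sample_image) for aug in augmentations]
```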
Step 440, training the second model through the first training data to obtain a first model; the first model includes a first text model and a first image model.
Specifically, the second model includes the trained second text model and second image model. Because the second training data is sample data that does not distinguish between service scenes, the second model obtained by training is equivalent to a model applicable to general service scenes, and its image detection accuracy still needs to be improved in some specific service scenes. In the application, the second model is further trained with the first training data; since the first training data adds the first sample image under the service scene where the detection target is located, the second model can learn the image characteristics of that specific service scene, so the obtained first model can improve the accuracy of image detection in the service scene where the detection target is located and improve the robustness of the model to the deployment scene.
In one embodiment of the application, the training process of the second model comprises: extracting text features of the sample text labels in the first training data through the second text model to obtain first sample text features; extracting image features of a sample image in the first training data through the second image model to obtain first sample image features, the sample image including the second sample image and the first sample image; calculating a first model loss according to the first sample text features and the first sample image features; and updating model parameters of the second text model and the second image model according to the first model loss, and continuing training based on the second text model and the second image model after parameter updating until the first model convergence condition is reached.
Specifically, the image data in the first training data is collectively recorded as a sample image, and the sample image includes a second sample image and a first sample image, and if the image data enhancement processing is performed, the sample image further includes an enhanced sample image. Since the first sample image and the enhanced sample image do not have corresponding text description information, the text data in the first training data only includes the sample text label corresponding to the second sample image. The first training data still consists of image data and text data, and then the training process of the second model still adopts a text and image training mode respectively, namely, text feature extraction is carried out on a sample text label in the first training data through the second text model, so as to obtain first sample text features; and simultaneously, extracting image features of the sample images in the first training data through the second image model to obtain first sample image features.
After feature extraction, a first model loss is calculated from the first sample text features and the first sample image features, so that the model parameters can be updated based on the first model loss. Because part of the sample images do not have corresponding text labels, the embodiment of the application calculates the first model loss by means of a self-supervised contrastive learning objective function. Here, self-supervision means that model training is performed with sample images that have no text labels, that is, the model learns from unlabeled sample images; contrastive learning means that the sample image features and sample text label features extracted by the model are compared and learned so as to identify the sample image features and sample text label features belonging to the same detection target; and the self-supervised contrastive learning objective function refers to calculating the contrastive objective function between sample image features and sample text label features on the basis of unlabeled sample images. The first model loss includes three parts: the first image loss, the first text loss, and the third image loss. The first image loss and the first text loss are calculated in the same manner as the second image loss and the second text loss in the calculation of the second model loss; reference may be made to the related descriptions of the foregoing formulas (1) and (2), which are not repeated here.
For the third image loss, the embodiment of the present application calculates the third image loss L_{i,j} from the distances between the first sample image features, as shown in the following formula (4):

L_{i,j} = -log( exp(sim(z_i, z_j)/τ) / Σ_{k=1..M, k≠i} exp(sim(z_i, z_k)/τ) )    (4)
wherein i, j and k each denote a sample image number, and z_i, z_j, z_k denote the first sample image features of the correspondingly numbered sample images; the sim function computes the cosine distance between two features, which in some embodiments may also be replaced by a Euclidean distance; τ is an adjustment parameter, for which an empirical value is typically used, for example 2; and M denotes the total number of image features participating in the calculation. The distance between two features represents the similarity between them: taking the cosine distance as an example, the higher the similarity between two features, the smaller the cosine distance between them; conversely, the lower the similarity between two features, the greater the cosine distance between them. By calculating the cosine distance between features, the model can identify sample images containing the same detection target without text labels, thereby realizing self-supervised contrastive learning.
In the embodiment of the present application, assuming that the first training data includes P first sample images and N second sample images, M = P + N. In some cases, the calculation of the third image loss may be performed using only the first sample image features corresponding to the first sample images without text labels, in which case M = P. In some cases, if image data enhancement processing is performed on the P first sample images before model training so that each first sample image corresponds to one enhanced sample image, then M = 2P + N or M = 2P. In some cases, iterative features may be used to calculate the third image loss, that is, the first sample image features obtained in the previous model training iteration are also used as image feature data when the third image loss is calculated in the current iteration. For example, if the number of image features generated in the first iteration is A_1, then M = A_1; if the number of image features generated in the second iteration is A_2, then M = A_1 + A_2; and if the number of image features generated in the i-th iteration is A_i, then M = A_{i-1} + A_i. Through this image loss calculation mode, the model can learn across adjacent iterations, so that the calculation of the image loss is more accurate, and the model convergence speed and the image feature extraction accuracy can be improved.
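A sketch of the third image loss of formula (4), under the assumption that sim is implemented as cosine similarity with a temperature-style softmax over the M participating image features, might look like this; the pairing index (e.g., an image paired with its enhanced copy or with another image of the same target) is an assumption.

```python
import torch
import torch.nn.functional as F

def third_image_loss(z, pos_index, tau=2.0):
    """z: (M, D) first sample image features; pos_index[i] = j pairs z_i with its positive z_j.

    tau is the adjustment parameter from formula (4); 2.0 is only the empirical
    example value mentioned above.
    """
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                      # sim(z_i, z_k) / tau for all pairs (cosine)
    # Exclude k == i from the denominator, as in formula (4).
    sim.fill_diagonal_(float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    rows = torch.arange(z.size(0), device=z.device)
    return -log_prob[rows, pos_index].mean()
```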
After the three losses are calculated, they are added to obtain the first model loss L_total, as shown in the following formula (5):
L_total = L_{i,j} + L_con = L_{i,j} + L_p2t + L_t2p    (5)
and finally updating model parameters of the second model in a gradient descent mode based on the first model loss until reaching a first model convergence condition, wherein the convergence condition can be that the iteration times reach set times or the first model loss is smaller than a set value. The obtained first model comprises a first text model and a first image model, wherein the first text model can be used for extracting text characteristics of text description information of a detection target in a subsequent step, and the first image model can be used for extracting image characteristics of an image to be detected in the subsequent step.
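Putting the pieces together, the first model loss of formula (5) might be assembled as in the sketch below, reusing the earlier loss sketches; the assumption that the image-text terms use only the second sample images (which have text labels) while the self-supervised term uses all sample images is an illustrative choice, not a statement of the claimed method.

```python
def first_model_loss(image_feats_labeled, text_feats, image_feats_all, pos_index, tau=2.0):
    """image_feats_labeled: features of second sample images (which have text labels).
    text_feats:            corresponding first sample text features.
    image_feats_all:       features of all sample images (second, first, enhanced).
    pos_index:             positive pairing for the self-supervised term.
    """
    l_con = second_model_loss(image_feats_labeled, text_feats)  # L_p2t + L_t2p
    l_ij = third_image_loss(image_feats_all, pos_index, tau)    # formula (4)
    return l_ij + l_con                                          # formula (5): L_total
```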
Illustratively, FIG. 6 schematically shows a block diagram of an apparatus for training a second model provided by an embodiment of the present application. As shown in fig. 6, the apparatus includes: the training data preparation module 610, the image data enhancement module 620, the image network element module 630, the text network element module 640, the alignment objective function calculation module 650, the self-supervised contrast learning module 660, the total loss function calculation module 670, the judgment module 680, and the total objective function optimization module 690.
The training data preparation module 610 is configured to obtain first training data, where the first training data includes second training data and a first sample image in a service scenario where a detection target is located. The training data preparation module 610 may also divide the first training data into a plurality of batches, which are input to the image network element module 630 and the text network element module 640.
The image data enhancing module 620 is configured to perform image data enhancing processing on the first sample image, and input the obtained enhanced sample image to the image network element module 630. The image data enhancement processing method mainly comprises the following steps: random clipping, random shielding, gaussian blur, rotation, noise addition, edge gradient extraction and the like.
The functions of the image network element module 630, the text network element module 640 and the alignment objective function calculation module 650 are the same as those of the corresponding modules in fig. 5, and are not described again here.
The self-supervision contrast learning module 660 is configured to calculate the third image loss according to the distance between the image features of the first samples, and may refer to the calculation process corresponding to equation (4).
In one embodiment of the present application, after the image network unit module 630 outputs the image features, the image features may further be mapped in a nonlinear manner, and the self-supervised contrast learning module 660 then calculates the third image loss from the features obtained by the nonlinear mapping. The nonlinear mapping can increase the convergence rate of the model.
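The nonlinear mapping mentioned here is often realized as a small projection head; the following sketch assumes a two-layer MLP with ReLU applied to the image features before the third image loss is computed, which is only one possible choice.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Nonlinear mapping applied to image features before self-supervised contrast learning."""
    def __init__(self, in_dim=512, hidden_dim=512, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, image_feats):
        return self.net(image_feats)
```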
The total loss function calculation module 670 is configured to calculate the first model loss, and reference may be made to the calculation process corresponding to formula (5).
The judging module 680 is configured to judge whether the current training meets the termination condition for model training, where the termination condition is the first model convergence condition, for example, whether the number of iterations reaches a preset number or whether the first model loss is less than a preset threshold. When the termination condition is met, the trained first model is obtained.
The total objective function optimization module 690 updates the model parameters in a gradient descent manner and performs training optimization on the whole network.
Step 450, obtaining a plurality of text description information related to the detection target.
And 460, respectively extracting text features of the text description information through the first text model to obtain a plurality of target text features.
Specifically, the text feature extraction of the text description information can be realized through the first text model obtained through training in the previous steps.
In one embodiment of the application, the text feature extraction of the text description information is an offline feature extraction, that is, in the image detection process, the first text model is not required to be deployed on the line for real-time text feature extraction, so that a plurality of target text features can be extracted in advance to construct a registered text set, and then the plurality of target text features can be rapidly acquired when the image to be detected is detected in real time.
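Building the registered text set offline could be as simple as the following sketch; the tokenizer interface, the saved file format, and the model call signature are assumptions.

```python
import torch

@torch.no_grad()
def build_text_registry(first_text_model, tokenizer, descriptions, path="text_registry.pt"):
    """descriptions: a list of text description strings related to the detection target."""
    first_text_model.eval()
    feats = []
    for text in descriptions:
        token_ids = tokenizer(text)                       # assumed to return a (1, L) id tensor
        feats.append(first_text_model(token_ids).squeeze(0))
    registry = torch.stack(feats)                         # (K, D) target text features
    torch.save(registry, path)                            # precomputed offline, loaded at serving time
    return registry
```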
And 470, extracting image features of the image to be detected through the first image model to obtain the image features to be detected.
Specifically, the first image model is deployed on the line of the service scene for real-time image detection, so that the image characteristics of the image to be detected can be extracted in real time. Compared with the deployment of the whole first model on the line in the service scene, the embodiment of the application only deploys the first image model on the line, so that the size of the deployment model can be reduced, and the image auditing speed can be improved.
And 480, respectively matching the image features to be detected with a plurality of target text features to obtain a matching result.
Step 490, determining whether the image to be detected is an image containing a detection target according to the matching result.
According to the embodiment of the application, a model does not need to be customized and trained for each image detection task; the model can be adapted to a scene simply by collecting image data in the corresponding scene, which facilitates lightweight engineering deployment. Meanwhile, the embodiment of the application retains the image-text multi-modal comparison mode, which ensures that the model does not overfit the data, and missed-recall data can be recalled by adding text labels after deployment.
Fig. 7 schematically illustrates a flowchart of an image detection method according to an embodiment of the present application. As shown in fig. 7, first, text data in the text registry is input to the text network element module 710, where the text data in the text registry is a plurality of text description information of the detection target, and the text network element module 710 is a text feature extraction network trained in the foregoing steps, that is, a first text model. The first text network obtains a plurality of target text features in an off-line computing mode, and the target text features are text registry features.
In the image detection stage, the image to be detected is input to the image network unit module 720, and the image network unit module 720 is the image feature extraction network trained in the previous step, namely the first image model. And extracting image features of the image to be detected by the first image model to obtain the image features to be detected.
Then, the similarity between the image feature to be detected and each target text feature is calculated, and similarity data is input to the threshold comparison module 730. The threshold value comparison module compares the similarity corresponding to the text features of each target with a preset similarity threshold value, and when at least one similarity larger than the preset similarity threshold value exists, the image to be detected is determined to be the image containing the detection target; when there is no similarity greater than the preset similarity threshold, it is determined that the image to be detected is not an image containing the detection target.
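The real-time matching and threshold comparison described above might be implemented as in the sketch below; cosine similarity and the example threshold value are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def detect(first_image_model, image, text_registry, sim_threshold=0.3):
    """Returns True if the image to be detected is judged to contain the detection target."""
    feat = first_image_model(image.unsqueeze(0)).squeeze(0)        # image feature to be detected
    sims = F.cosine_similarity(feat.unsqueeze(0), text_registry)   # one score per target text feature
    # The image contains the target if at least one similarity exceeds the threshold.
    return bool((sims > sim_threshold).any())
```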
Illustratively, FIG. 8 schematically illustrates a flow chart of an image detection method provided by one embodiment of the present application. As shown in fig. 8, the method includes a network module training phase 810 and a network module deployment phase 820.
In the network module training stage 810, the multi-modal network units are first trained for the general scene: the model is trained with the image-text data of an open-source general scene to maintain the capability of the model in the general scene, that is, the preset network model is trained with the second training data to obtain the second model, where the second model includes the second text model and the second image model. Model adaptation is then performed for the deployment scene: the unlabeled data of the deployed service scene is adapted in a self-supervised contrastive learning manner to obtain adaptation capability for the corresponding service scene, and no text labels are required as supervision signals when using the deployment service scene data; that is, the second model is trained with the first training data to obtain the first model, where the first sample image without a sample text label in the service scene is added to the first training data.
In the network module deployment stage 820, a text registry is first constructed: text feature extraction is performed on a plurality of text description information of the detection target through the first text model to obtain a plurality of target text features, and this process is implemented in an offline computing manner. The image network is then exported for deployment, that is, the first image model is deployed to the online service scene, image feature extraction is performed on the image to be detected through the first image model to obtain the image features to be detected, and finally whether the image to be detected contains the detection target is determined according to the similarity between the image features to be detected and each target text feature.
In the technical scheme provided by the embodiment of the application, the second model is obtained by model training with the second training data and has strong universality, but it generally occupies large memory resources, and in some service scenes with higher requirements on image detection speed an oversized model can lead to long image detection time. The application further trains the second model with the first training data, which includes the first sample image in the service scene, to obtain the first model, thereby obtaining an image detection model adaptive to the service scene and improving image detection precision in the specific service scene. Meanwhile, when the model is deployed, offline text feature extraction is performed through the first text model, and only the first image model is deployed to the online service scene, so that the size of the deployment model is effectively reduced, achieving lightweight model deployment. In addition, the target text features can be obtained offline in advance through the first text model, so that in the real-time image detection process only the image features of the image to be detected need to be extracted through the first image model, further improving the speed and efficiency of image detection.
It should be noted that although the steps of the methods of the present application are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
The following describes an embodiment of the apparatus of the present application, which can be used to perform the image detection method in the above-described embodiment of the present application. Fig. 9 schematically shows a block diagram of an image detection apparatus provided by an embodiment of the present application. As shown in fig. 9, the image detection apparatus includes:
the image feature extraction module 910 is configured to extract image features of an image to be detected through a first image model, so as to obtain features of the image to be detected; the first image model is obtained through training of first training data, and the first training data comprises a first sample image which does not have a sample text label and belongs to the same service scene as a detection target and a second sample image which has the sample text label;
the matching module 920 is configured to match the image feature to be detected with a target text feature, so as to obtain a matching result, where the target text feature is a text feature related to the detection target; the target text features are obtained by extracting text features through a first text model, and the first text model is obtained by carrying out joint training on the basis of the first training data and the first image model data;
And an image detection module 930, configured to determine whether the image to be detected is an image including the detection target according to the matching result.
In one embodiment of the application, the apparatus further comprises:
the second training data acquisition module is used for acquiring second training data, wherein the second training data comprises a second sample image and a sample text label corresponding to the second sample image;
the second model training module is used for training a preset network model through the second training data to obtain a second model; the preset network model comprises a preset text network model and a preset image network model, and the second model comprises a second text model and a second image model;
the first training data acquisition module is used for acquiring first training data, wherein the first training data comprises the second training data and the first sample image;
the first model training module is used for training the second model through the first training data to obtain a first model; the first model includes a first text model and a first image model.
In one embodiment of the present application, the second model training module includes:
The second text feature extraction unit is used for extracting text features of sample texts in the second training data through the preset text network model to obtain second sample text features;
the second image feature extraction unit is used for extracting image features of a second sample image in the second training data through the preset image network model to obtain second sample image features;
a second model loss calculation unit, configured to calculate a feature matrix according to a transposed feature of one of the second sample text feature and the second sample image feature and the other of the second sample text feature and the second sample image feature, and calculate a second model loss according to the feature matrix;
and the second model parameter updating unit is used for updating the model parameters of the preset text network model and the preset image network model according to the second model loss, and continuing training based on the preset text network model and the preset image network model after parameter updating until a second model convergence condition is reached.
In one embodiment of the present application, the second model loss calculation unit is specifically configured to:
Calculating a second image feature matrix according to the transposed features of the second sample image features and the second sample text features, and calculating to obtain a second image loss according to the second image feature matrix;
calculating a second text feature matrix according to the transposed features of the second sample text features and the second sample image features, and calculating a second text loss according to the second text feature matrix;
and obtaining the second model loss according to the sum of the second image loss and the second text loss.
In one embodiment of the present application, the first model training module includes:
the first text feature extraction unit is used for extracting text features of sample texts in the first training data through the second text model to obtain first sample text features;
the first image feature extraction unit is used for extracting image features of the sample images in the first training data through the second image model to obtain first sample image features; the sample image includes the second sample image and the first sample image;
a first model loss calculation unit for calculating a first model loss from the first sample image feature and the first sample text feature;
And the first model parameter updating unit is used for updating the model parameters of the second text model and the second image model according to the first model loss, and continuing training based on the second text model and the second image model after parameter updating until a first model convergence condition is reached.
In one embodiment of the present application, the first model loss calculation unit is specifically configured to:
calculating a first image loss from a transposed feature of the first sample image feature and the first sample text feature;
calculating a first text loss from the transposed feature of the first sample text feature and the first sample image feature;
calculating a third image loss according to cosine distances between the image features of the first samples;
and obtaining the first model loss according to the sum of the first image loss, the first text loss and the third image loss.
In one embodiment of the present application, the first training data acquisition module includes:
the data enhancement unit is used for carrying out image data enhancement processing on the first sample image to obtain an enhanced sample image;
and the first training data acquisition unit is used for taking the first sample image, the enhanced sample image and the second training data as first training data.
In one embodiment of the application, the image data enhancement processing includes at least one of:
random clipping, random occlusion, gaussian blur, rotation, addition of noise, and edge gradient extraction.
In one embodiment of the application, the apparatus further comprises:
a text description information acquisition module for acquiring a plurality of text description information related to the detection target;
and the text feature extraction module is used for extracting text features of the text description information through the first text model to obtain a plurality of target text features.
In one embodiment of the present application, the target text features are plural, and the matching result includes a similarity between the image feature to be detected and each of the target text features; the image detection module 930 is specifically configured to:
when at least one similarity larger than a preset similarity threshold exists in the matching result, determining that the image to be detected is an image containing the detection target;
and when the similarity larger than a preset similarity threshold does not exist in the matching result, determining that the image to be detected is not the image containing the detection target.
Specific details of the image detection device provided in each embodiment of the present application have been described in the corresponding method embodiments, and are not described herein.
Fig. 10 schematically shows a block diagram of a computer system of an electronic device for implementing an embodiment of the application.
It should be noted that, the computer system 1000 of the electronic device shown in fig. 10 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 10, the computer system 1000 includes a central processing unit 1001 (Central Processing Unit, CPU) which can execute various appropriate actions and processes according to a program stored in a Read-Only Memory 1002 (ROM) or a program loaded from a storage section 1008 into a random access Memory 1003 (Random Access Memory, RAM). In the random access memory 1003, various programs and data necessary for the system operation are also stored. The cpu 1001, the rom 1002, and the ram 1003 are connected to each other via a bus 1004. An Input/Output interface 1005 (i.e., an I/O interface) is also connected to bus 1004.
The following components are connected to the input/output interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output portion 1007 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and a speaker; a storage portion 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a local area network card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The drive 1010 is also connected to the input/output interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in the drive 1010, so that a computer program read out therefrom is installed as needed in the storage section 1008.
In particular, the processes described in the various method flowcharts may be implemented as computer software programs according to embodiments of the application. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1009, and/or installed from the removable medium 1011. The computer programs, when executed by the central processor 1001, perform the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (13)

1. An image detection method, comprising:
extracting image features of the image to be detected through the first image model to obtain the image features to be detected; the first image model is obtained through training of first training data, and the first training data comprises a first sample image which does not have a sample text label and belongs to the same service scene as a detection target and a second sample image which has the sample text label;
matching the image features to be detected with target text features to obtain a matching result, wherein the target text features are text features related to the detection target; the target text features are obtained by extracting text features through a first text model, and the first text model is obtained by carrying out joint training with the first image model on the basis of the first training data;
and determining whether the image to be detected is an image containing the detection target according to the matching result.
2. The image detection method according to claim 1, wherein before matching the image feature to be detected with a target text feature, the method further comprises:
acquiring second training data, wherein the second training data comprises the second sample image and a sample text label corresponding to the second sample image;
Training a preset network model through the second training data to obtain a second model; the preset network model comprises a preset text network model and a preset image network model, and the second model comprises a second text model and a second image model;
acquiring first training data, wherein the first training data comprises the second training data and the first sample image;
training the second model through the first training data to obtain a first model; the first model includes the first text model and the first image model.
3. The image detection method according to claim 2, wherein training a preset network model by the second training data includes:
extracting text features from the sample text labels in the second training data through the preset text network model to obtain second sample text features;
extracting image features of a second sample image in the second training data through the preset image network model to obtain second sample image features;
calculating a feature matrix according to the transposed feature of one of the second sample text feature and the second sample image feature and the other of the second sample text feature and the second sample image feature, and calculating to obtain a second model loss according to the feature matrix;
updating model parameters of the preset text network model and the preset image network model according to the second model loss, and continuing training based on the preset text network model and the preset image network model after parameter updating until a second model convergence condition is reached.
4. The image detection method according to claim 3, wherein calculating a feature matrix from a transposed feature of one of the second sample text feature and the second sample image feature and the other of the second sample text feature and the second sample image feature, and calculating a second model loss from the feature matrix, comprises:
calculating a second image feature matrix according to the transposed features of the second sample image features and the second sample text features, and calculating to obtain a second image loss according to the second image feature matrix;
calculating a second text feature matrix according to the transposed features of the second sample text features and the second sample image features, and calculating a second text loss according to the second text feature matrix;
and obtaining the second model loss according to the sum of the second image loss and the second text loss.
5. The image detection method according to claim 2, wherein training the second model with the first training data includes:
extracting text features from the sample text labels in the first training data through the second text model to obtain first sample text features;
extracting image features of sample images in the first training data through the second image model to obtain first sample image features; the sample image includes the first sample image and the second sample image;
calculating a first model loss from the first sample image feature and the first sample text feature;
and respectively updating model parameters of the second text model and the second image model according to the first model loss, and continuing training based on the second text model and the second image model after parameter updating until a first model convergence condition is reached.
6. The image detection method of claim 5, wherein calculating a first model loss from the first sample image feature and the first sample text feature comprises:
calculating a first image loss from a transposed feature of the first sample image feature and the first sample text feature;
calculating a first text loss from the transposed feature of the first sample text feature and the first sample image feature;
calculating a third image loss according to the distance between the first sample image features;
and obtaining the first model loss according to the sum of the first image loss, the first text loss and the third image loss.
7. The image detection method according to claim 2, wherein acquiring the first training data includes:
performing image data enhancement processing on the first sample image to obtain an enhanced sample image;
the first sample image, the enhanced sample image and the second training data are taken as first training data.
8. The image detection method according to claim 7, wherein the image data enhancement processing includes at least one of:
random clipping, random occlusion, gaussian blur, rotation, addition of noise, and edge gradient extraction.
9. The image detection method according to any one of claims 1-8, wherein before matching the image feature to be detected with a target text feature, the method further comprises:
Acquiring a plurality of text description information related to the detection target;
and respectively extracting text features of the text description information through the first text model to obtain a plurality of target text features.
10. The image detection method according to any one of claims 1 to 8, wherein the target text features are plural, and the matching result includes a similarity between an image feature to be detected and each of the target text features; determining whether the image to be detected is an image containing the detection target according to the matching result comprises the following steps:
when at least one similarity larger than a preset similarity threshold exists in the matching result, determining that the image to be detected is an image containing the detection target;
and when the similarity larger than a preset similarity threshold does not exist in the matching result, determining that the image to be detected is not the image containing the detection target.
11. An image detection apparatus, comprising:
the image feature extraction module is used for extracting image features of the image to be detected through the first image model to obtain the image features to be detected; the first image model is obtained through training of first training data, and the first training data comprises a first sample image which does not have a sample text label and belongs to the same service scene as a detection target and a second sample image which has the sample text label;
The matching module is used for matching the image feature to be detected with a target text feature to obtain a matching result, wherein the target text feature is a text feature related to the detection target; the target text features are obtained by extracting text features through a first text model, and the first text model is obtained by carrying out joint training with the first image model on the basis of the first training data;
and the image detection module is used for determining whether the image to be detected is an image containing the detection target according to the matching result.
12. A computer readable medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the image detection method according to any one of claims 1 to 10.
13. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein execution of the executable instructions by the processor causes the electronic device to perform the image detection method of any one of claims 1 to 10.
CN202211447191.7A 2022-11-18 2022-11-18 Image detection method, device, computer readable medium and electronic equipment Pending CN116994021A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211447191.7A CN116994021A (en) 2022-11-18 2022-11-18 Image detection method, device, computer readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211447191.7A CN116994021A (en) 2022-11-18 2022-11-18 Image detection method, device, computer readable medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116994021A true CN116994021A (en) 2023-11-03

Family

ID=88532771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211447191.7A Pending CN116994021A (en) 2022-11-18 2022-11-18 Image detection method, device, computer readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116994021A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117271819A (en) * 2023-11-17 2023-12-22 上海闪马智能科技有限公司 Image data processing method and device, storage medium and electronic device
CN117271819B (en) * 2023-11-17 2024-03-01 上海闪马智能科技有限公司 Image data processing method and device, storage medium and electronic device
CN117591901A (en) * 2024-01-17 2024-02-23 合肥中科类脑智能技术有限公司 Insulator breakage detection method and device, storage medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication