CN116977707A - Image recognition method, device, electronic equipment and storage medium

Info

Publication number: CN116977707A
Application number: CN202310681800.3A
Authority: CN (China)
Prior art keywords: target, image, label, negative, feature extractor
Legal status: Pending
Inventors: 郭太安, 何肃南
Current assignee: Tencent Technology Shenzhen Co Ltd
Original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by: Tencent Technology Shenzhen Co Ltd
Other languages: Chinese (zh)
Classification: Image Analysis

Abstract

The application discloses an image recognition method, an image recognition device, an electronic device and a storage medium. The embodiments of the application relate to technical fields such as artificial intelligence, machine learning and cloud technology. The method comprises the following steps: inputting an image to be recognized into a target image feature extractor for encoding to obtain an image encoding result, wherein the target image feature extractor is obtained by training with a target sample image, target positive concept labels and target negative concept labels, and the target positive concept labels and the target negative concept labels are extracted from text description information; and determining an image recognition result of the image to be recognized according to the image encoding result and the text features of each of a plurality of target labels. Because the target positive concept labels and the target negative concept labels are highly accurate and the target negative concept labels are numerous, the target image feature extractor obtained by training has a better recognition effect, which improves the accuracy of the image recognition result determined for the image to be recognized according to the target image feature extractor.

Description

Image recognition method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image recognition method, an image recognition device, an electronic device, and a storage medium.
Background
With the development of artificial intelligence technology, images are classified in more and more ways, for example by multi-label classification.
At present, the labels of sample images can be determined by manual annotation; an initial model is then trained on the sample images and their labels to obtain a trained image classification model, and the image to be recognized can then be recognized by the image classification model to obtain a plurality of corresponding labels, which constitute the image recognition result of the image to be recognized. However, the accuracy of labels annotated on sample images in this way is low, so the recognition effect of the trained image classification model is poor and the accuracy of the image recognition result is low.
Disclosure of Invention
In view of the above, the embodiments of the present application provide an image recognition method, an image recognition device, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present application provides an image recognition method, including: acquiring text features of an image to be identified and a plurality of target labels; inputting the image to be identified into a target image feature extractor for encoding to obtain an image encoding result; the target image feature extractor is obtained through training of a target sample image, a target positive conceptual label and a target negative conceptual label; the target positive conceptual label is extracted from text description information corresponding to the target sample image, and the target negative conceptual label is a negative conceptual label which is obtained from a negative conceptual label warehouse of the sample image and is different from the target positive conceptual label; negative concept labels in the negative concept label warehouse are extracted from text description information corresponding to the comparison sample image; the control sample image is different from the target sample image; and determining an image recognition result of the image to be recognized according to the image coding result and the text characteristics of each of the plurality of target tags.
In a second aspect, an embodiment of the present application provides an image recognition apparatus, including: the acquisition module is used for acquiring the images to be identified and the text characteristics of each of the plurality of target tags; the coding module is used for inputting the image to be identified into the target image feature extractor for coding to obtain an image coding result; the target image feature extractor is obtained through training of a target sample image, a target positive conceptual label and a target negative conceptual label; the target positive conceptual label is extracted from text description information corresponding to the target sample image, and the target negative conceptual label is a negative conceptual label which is obtained from a negative conceptual label warehouse of the sample image and is different from the target positive conceptual label; negative concept labels in the negative concept label warehouse are extracted from text description information corresponding to the comparison sample image; the control sample image is different from the target sample image; and the image recognition module is used for determining an image recognition result of the image to be recognized according to the image coding result and the text characteristics of each of the plurality of target tags.
Optionally, the target image feature extractor is trained with a plurality of batches of sample images; the control sample image includes a first control sample image and a second control sample image; the device also comprises a training module, wherein the training module is used for acquiring images which are different from the target sample images from the sample images of the historical batches and used as first control sample images; the historical batch refers to the batch before the current batch where the target sample image is located; acquiring sample images except for a target sample image from a plurality of sample images of the current batch as a second control sample image; and extracting negative concept labels from the text description information corresponding to the first control sample image and the text description information corresponding to the second control sample image to form a negative concept label warehouse of the sample image.
Optionally, the training module is further configured to extract a label from the text description information corresponding to the first control sample image and the text description information corresponding to the second control sample image according to a requirement of the number of labels in the label repository, so as to form a negative conceptual label repository of the sample image; and acquiring negative concept labels with different target quantity from the negative concept label warehouse of the sample image as target negative concept labels.
Optionally, the training module is further configured to input the target sample image into the initial image feature extractor to obtain a first encoding result; inputting the target positive concept label and the target negative concept label into a pre-trained text feature extractor to obtain a second coding result; inputting the target sample image into a momentum model to obtain a third coding result, wherein the momentum model has the same structure as the initial image feature extractor, and the parameters of the momentum model are obtained according to the parameters of the initial image feature extractor and are different from the parameters of the initial image feature extractor; and training the initial image feature extractor according to the first encoding result, the second encoding result and the third encoding result to obtain the target image feature extractor.
Optionally, the target image feature extractor is obtained by multiple iterative training; the training module is also used for determining the parameters of the momentum model in the iterative training process according to the parameters of the momentum model in the previous iterative training process of the iterative training process and the parameters of the initial image feature extractor in the iterative training process of the iterative training process.
Optionally, the training module is further configured to input the target sample image into a pre-trained image feature extractor to obtain a fourth encoding result, where the pre-trained image feature extractor is a teacher model for performing knowledge distillation on the initial image feature extractor, and the pre-trained image feature extractor and the pre-trained text feature extractor belong to the same image-text pre-training model; and train the initial image feature extractor according to the first encoding result, the second encoding result, the third encoding result and the fourth encoding result to obtain the target image feature extractor.
Optionally, the training module is further configured to determine a first loss value according to the first encoding result and the second encoding result, where the first loss value characterizes the prediction accuracy of the initial image feature extractor on the target positive concept tag and the target negative concept tag; according to the second coding result and the third coding result, pseudo positive labels and pseudo negative labels are obtained from a label set, wherein the label set comprises target positive conceptual labels and target negative conceptual labels; determining a second loss value according to the pseudo positive label, the pseudo negative label, the first coding result and the second coding result, wherein the second loss value indicates the prediction accuracy of the initial image feature extractor on the pseudo positive conceptual label and the pseudo negative conceptual label; determining a third loss value according to the first coding result and the fourth coding result, wherein the third loss value represents the matching degree between the first coding result and the fourth coding result; and training the initial image feature extractor according to the first loss value, the second loss value and the third loss value to obtain the target image feature extractor.
Optionally, the training module is further configured to determine, according to the first encoding result and the second encoding result, a respective first probability of each target concept label, where the first probability of the target concept label refers to a probability that the target sample image belongs to the target concept label, and the target concept label includes a target positive concept label and a target negative concept label; a first penalty value is determined based on the respective first probabilities for each of the target concept tags.
Optionally, the training module is further configured to determine a similarity between the second encoding result and the third encoding result; and obtaining the pseudo positive labels and the pseudo negative labels from the label set according to the similarity between the second coding result and the third coding result.
Optionally, the training module is further configured to determine, according to the first encoding result and the second encoding result, a respective first probability of each target concept label, where the first probability of a target concept label refers to the probability that the target sample image belongs to that target concept label, and the target concept labels include the target positive concept labels and the target negative concept labels; acquire the first probability of the target concept label matched with a pseudo positive label as the probability of the pseudo positive label, and acquire the first probability of the target concept label matched with a pseudo negative label as the probability of the pseudo negative label; and determine a second loss value according to the respective probability of each pseudo positive label and the respective probability of each pseudo negative label.
Optionally, the image encoding result comprises a visual global representation of the image to be identified and a plurality of visual local representations; the image recognition module is also used for determining the similarity between the text characteristic of each target label and each visual local representation so as to obtain the respective similarity of a plurality of visual local representations under each target label; determining the similarity between the text features of each target label and the visual global representation as the similarity of the visual global representation under each target label; and determining an image recognition result of the image to be recognized according to the respective similarity of the visual local representations under each target label and the similarity of the visual global representation under each target label.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory; one or more programs are stored in the memory and configured to be executed by the processor to implement the methods described above.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having program code stored therein, wherein the program code, when executed by a processor, performs the method described above.
In a fifth aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer readable storage medium and executes the computer instructions to cause the electronic device to perform the method described above.
According to the image recognition method, the image recognition device, the electronic device and the storage medium, the target positive concept labels are extracted from the text description information corresponding to the target sample image, and the target negative concept labels are labels obtained from a negative concept label warehouse that differ from the target positive concept labels. Since the target positive concept labels and the target negative concept labels are not manually annotated, their accuracy is high, and the number of determined target negative concept labels is large, so that the target image feature extractor is trained sufficiently on the target sample image, the target positive concept labels and the target negative concept labels, and the resulting target image feature extractor has a good recognition effect. The accuracy of the image recognition result of the image to be recognized, obtained according to the target image feature extractor, is thereby improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a schematic diagram of an application scenario to which an embodiment of the present application is applicable;
FIG. 2 is a flow chart illustrating a training process of a target image feature extractor in an embodiment of the present application;
FIG. 3 is a schematic diagram of an image of a target sample in an embodiment of the application;
FIG. 4 is a flow chart illustrating a training process of yet another target image feature extractor in an embodiment of the present application;
FIG. 5 is a flow chart illustrating a training process of yet another target image feature extractor in an embodiment of the present application;
FIG. 6 is a flow chart illustrating a training process of yet another target image feature extractor in an embodiment of the present application;
FIG. 7 shows a schematic diagram of yet another target sample image in an embodiment of the application;
FIG. 8 is a schematic diagram of still another target sample image in accordance with an embodiment of the application;
FIG. 9 is a schematic diagram of still another target sample image in accordance with an embodiment of the application;
FIG. 10 is a schematic diagram of still another target sample image in accordance with an embodiment of the application;
FIG. 11 is a schematic diagram of a label extraction process in an embodiment of the application;
FIG. 12 is a schematic diagram of a training process for a target image feature extractor in an embodiment of the present application;
FIG. 13 is a schematic diagram of an image to be identified in an embodiment of the application;
FIG. 14 is a schematic diagram of yet another image to be identified in an embodiment of the application;
FIG. 15 is a schematic view of yet another image to be identified in an embodiment of the application;
FIG. 16 is a schematic diagram of yet another image to be identified in an embodiment of the application;
FIG. 17 is a schematic diagram of yet another image to be identified in an embodiment of the application;
FIG. 18 is a schematic diagram of yet another image to be identified in an embodiment of the application;
FIG. 19 is a block diagram of an image recognition device according to an embodiment of the present application;
fig. 20 shows a block diagram of an electronic device for performing an image recognition method according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the application, are within the scope of the application in accordance with embodiments of the present application.
In the following description, the terms "first", "second", and the like are merely used to distinguish between similar objects and do not represent a particular ordering of the objects, it being understood that the "first", "second", or the like may be interchanged with one another, if permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
It should be noted that: references herein to "a plurality" means two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., a and/or B may represent: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
In addition, in the embodiments of the present application, data such as the user's occupation, education, age, life stage (e.g., whether married), workplace, favorite products and albums may only be collected with the user's permission or consent, and the collection, use, processing and storage of such data must comply with the regulations of the region where they take place.
Currently, in actual use scenarios of image multi-label recognition algorithms, it is often necessary to recognize unknown labels that never appeared in the training set, namely multi-label zero-shot learning (ML-ZSL). By migrating the semantic information shared between different labels through text label embeddings (i.e., feature representations), recognition capability can be transferred from known labels to unknown labels.
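As an illustrative sketch of this embedding-based transfer (written in Python; the encoder outputs, shapes and threshold value are assumptions, not details from the patent), an image feature can be scored against the text features of arbitrary label names, including labels never seen in training:

import torch
import torch.nn.functional as F

def zero_shot_multilabel(image_feat: torch.Tensor,
                         label_feats: torch.Tensor,
                         threshold: float = 0.25):
    """Score an image feature (D,) against label text features (L, D).

    Because labels are represented by text embeddings rather than fixed
    classifier weights, label_feats may contain labels that never
    appeared in the training set (the ML-ZSL setting).
    """
    image_feat = F.normalize(image_feat, dim=-1)
    label_feats = F.normalize(label_feats, dim=-1)
    scores = label_feats @ image_feat        # (L,) cosine similarities
    predicted = (scores > threshold).nonzero(as_tuple=True)[0]
    return predicted, scores

Because the classifier is just a similarity against text features, adding a new label only requires encoding its name.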
For example, methods such as MKT (Multi-modal Knowledge Transfer, a multi-label open-set recognition framework) migrate multi-modal knowledge from an image-text pre-training model by means of knowledge distillation and align images with labels during training, achieving good results in multi-label zero-shot recognition.
However, the above methods rely on a large-scale annotated label dataset during training, while in actual business the accuracy of manual annotation is low, resulting in poor recognition by the model. At the same time, the cost of manual annotation is high, so model training efficiency is low.
Therefore, in the image recognition method provided by the application, the target positive concept labels are extracted from the text description information corresponding to the target sample image, and the target negative concept labels are labels obtained from a negative concept label warehouse that differ from the target positive concept labels. The target positive concept labels and the target negative concept labels are not manually annotated, so their accuracy is higher, and the number of determined target negative concept labels is larger; consequently, the target image feature extractor is trained more sufficiently on the target sample image, the target positive concept labels and the target negative concept labels, the resulting target image feature extractor recognizes better, and the accuracy of the image recognition result of the image to be recognized, obtained according to the target image feature extractor, is improved. Meanwhile, since the target positive concept labels and the target negative concept labels need not be obtained by manually annotating the target sample image, they are acquired more efficiently, which further improves the training efficiency of the target image feature extractor and reduces the training cost.
The application discloses an image recognition method, an image recognition device, electronic equipment and a storage medium, and relates to artificial intelligence machine learning, cloud technology and the like.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technology mainly includes directions such as computer vision, speech processing, natural language processing and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching-based learning.
Cloud technology (Cloud technology) refers to a hosting technology for integrating hardware, software, network and other series resources in a wide area network or a local area network to realize calculation, storage, processing and sharing of data.
Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like based on the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, picture websites and other portal websites, require large amounts of computing and storage resources. As the internet industry develops, each article may in the future have its own identification mark, which needs to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data require strong back-end system support, which can only be realized through cloud computing.
Cloud storage (cloud storage) is a new concept that extends and develops in the concept of cloud computing, and a distributed cloud storage system (hereinafter referred to as a storage system for short) refers to a storage system that integrates a large number of storage devices (storage devices are also referred to as storage nodes) of various types in a network to work cooperatively through application software or application interfaces through functions such as cluster application, grid technology, and a distributed storage file system, so as to provide data storage and service access functions for the outside.
At present, the storage method of the storage system is as follows: when creating logical volumes, each logical volume is allocated a physical storage space, which may be a disk composition of a certain storage device or of several storage devices. The client stores data on a certain logical volume, that is, the data is stored on a file system, the file system divides the data into a plurality of parts, each part is an object, the object not only contains the data but also contains additional information such as a data Identification (ID) and the like, the file system writes each object into a physical storage space of the logical volume, and the file system records storage position information of each object, so that when the client requests to access the data, the file system can enable the client to access the data according to the storage position information of each object.
The process by which the storage system allocates physical storage space for the logical volume specifically includes: dividing physical storage space into stripes in advance according to an estimate of the capacity of the objects to be stored in the logical volume (which often has a large margin relative to the capacity of the objects actually stored) and the RAID (Redundant Array of Independent Disks) group; a logical volume can then be understood as a stripe, whereby physical storage space is allocated to the logical volume.
As shown in fig. 1, an application scenario to which the embodiments of the present application are applicable includes a terminal 20 and a server 10, where the terminal 20 and the server 10 are connected through a wired or wireless network. The terminal 20 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart home appliance, a vehicle-mounted terminal, an aircraft, a wearable device, a virtual reality device, or another terminal device capable of page presentation, or a device running applications that can invoke page presentation (e.g., instant messaging applications, shopping applications, search applications, game applications, forum applications, map and traffic applications, etc.).
The server 10 may be an independent physical server, a server cluster or distributed system composed of a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data and artificial intelligence platforms. The server 10 may be used to provide services for applications running on the terminal 20.
The terminal 20 may send an image to be identified to the server 10, and the server 10 may obtain an image encoding result through a built-in target image feature extractor according to the image to be identified, and the server 10 determines an image recognition result of the image to be identified according to the image encoding result and text features of each of the built-in target tags, and feeds back the image recognition result of the image to be identified to the terminal 20.
The image to be recognized refers to an image on which recognition is to be performed, and it may be a color image. The target labels may be set based on demand. A target label may denote an object in the image; for example, target labels may include person, dog, cat, stadium, container, etc. A target label may also denote a category of object in the image; for example, a target label may be cheerleader, athlete, lawyer, etc.
In one embodiment, the server may incorporate a pre-trained text feature extractor, through which the text features of the target labels are extracted and saved. The pre-trained text feature extractor may be the text modality part of a model such as ALIGN, ALBEF, or ViLT.
In some possible implementations, the server 10 may train the initial image feature extractor according to the target sample image, the target positive concept labels, the target negative concept labels, the pre-trained text feature extractor, the pre-trained image feature extractor and the momentum model, obtain the target image feature extractor, and store the target image feature extractor locally at the server 10. The pre-trained image feature extractor may be the image modality part of a model such as ALIGN, ALBEF, or ViLT; the momentum model has the same structure as the initial image feature extractor and is configured with parameters derived from those of the initial image feature extractor.
In another embodiment, the server 10 may transmit the target image feature extractor to the terminal 20 after acquiring the target image feature extractor, and the terminal 20 stores the target image feature extractor. The terminal 20 may obtain an image encoding result through a built-in target image feature extractor according to the obtained image to be identified, and then, the terminal 20 determines an image identification result of the image to be identified according to the image encoding result and text features of each of the built-in target tags.
Alternatively, the server 10 may store the obtained target image feature extractor in a cloud storage system, and the terminal 20 obtains the target image feature extractor from the cloud storage system when performing the image recognition method of the present application.
For convenience of description, in the following embodiments, description will be made as an example in which image recognition is performed by an electronic device.
Referring to fig. 2, fig. 2 is a flowchart illustrating a training process of a target image feature extractor according to an embodiment of the present application, where the method may be applied to an electronic device, and the electronic device may be the terminal 20 or the server 10 in fig. 1, and the method includes:
s110, acquiring a target sample image and a control sample image corresponding to the target sample image.
The target sample image may refer to an image for training the initial image feature extractor, and the target sample image may be a color image, for example, a color image of a soccer team playing a ball at a soccer field. The control sample image may be an image different from the target sample image, and the control sample image may be a color image, for example, the target sample image is a color image of a football team playing a football on a football field, and the control sample image is a color image of a cheering team dancing on a basketball field.
In this embodiment, the target sample image may include a plurality of different target sample images, each corresponding to a respective control sample image, and the corresponding control sample image may also be different for different target sample images.
For example, the target sample image includes a target sample image 1 of a football team playing a ball at a football field and a target sample image 2 of a cheering team dancing at a basketball field, the control sample image of the target sample image 1 may include a photographed image of a dog, a photographed image of a building, and a photographed image of a table tennis match, and the control sample image of the target sample image 2 may include a photographed image of a cat, a photographed image of a ship, and a photographed image of a table tennis match.
It will be appreciated that when the target sample image comprises a plurality of different target sample images, other target sample images than the target sample image may serve as control sample images for the target sample image.
S120, extracting labels from text description information corresponding to the target sample image to serve as target positive conceptual labels of the target sample image, extracting labels from text description information of a control sample image corresponding to the target sample image to obtain a negative conceptual label warehouse of the sample image, and selecting target negative conceptual labels of the target sample image from the negative conceptual label warehouse of the sample image.
Both the target sample image and the control sample image may correspond to text description information, i.e., information describing the content of the image. For example, for the target sample image shown as a in fig. 3, the text description information may be "Real Madrid and Valencia play a football match at the national stadium; the match is tense, the spectators cheer and shout, and in the end Real Madrid collectively falters and loses the match." As another example, the text description information of a target sample image may be "At the dock of city d3, a container crane loads containers carrying Chilean cherries onto a ship, and the ship carries the containers to the fruit market of city d3."
Labels can be extracted from the text description information of the target sample image by a preset algorithm and used as the target positive concept labels of the target sample image. Likewise, labels can be extracted by a preset algorithm from the text description information of the control sample images corresponding to the target sample image and aggregated to obtain the negative concept label warehouse of the sample image. The preset algorithm may be TF-IDF, TextRank, or a similar algorithm.
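As a minimal sketch of such a preset algorithm (a TF-IDF variant; the corpus, the English stop-word handling and the top-k value are illustrative assumptions rather than the patent's exact procedure):

from sklearn.feature_extraction.text import TfidfVectorizer

def extract_labels(descriptions, top_k=6):
    """Take the top_k highest-scoring TF-IDF terms of each text
    description as its candidate labels."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(descriptions)   # (num_docs, vocab_size)
    vocab = vectorizer.get_feature_names_out()
    all_labels = []
    for row in tfidf:
        scores = row.toarray().ravel()
        top = scores.argsort()[::-1][:top_k]
        all_labels.append([vocab[i] for i in top if scores[i] > 0])
    return all_labels

captions = [
    "Real Madrid and Valencia play a football match at the national stadium",
    "a container crane loads containers of Chilean cherries onto a ship at the dock",
]
print(extract_labels(captions))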
For example, take a in fig. 3 as the target sample image t1 and b in fig. 3 as the control sample image t2 corresponding to it. From the text description information "Real Madrid and Valencia play a football match at the national stadium; the match is tense, the spectators cheer and shout, and in the end Real Madrid collectively falters and loses the match." the labels "football", "Real Madrid", "Valencia", "stadium", "football match" and "spectators" are extracted as the target positive concept labels of the target sample image t1. From the text description information of the control sample image t2 shown as b in fig. 3, "At the dock of city d3, a container crane loads containers carrying Chilean cherries onto a ship, and the ship carries the containers to the fruit market of city d3.", the labels "dock", "container crane", "cherry", "container", "ship" and "fruit market" are extracted and aggregated as the negative concept label warehouse of the sample image.
As an implementation manner, the labels can be extracted and summarized in the text information of the control sample image corresponding to the target sample image to obtain an initial negative conceptual label warehouse of the sample image, and then the labels which are the same as the target positive conceptual labels of the target sample image in the initial negative conceptual label warehouse of the sample image are deleted to obtain the negative conceptual label warehouse of the sample image.
In another possible implementation manner, the target sample image corresponds to a plurality of comparison sample images, the labels are extracted and summarized in text description information of each comparison sample image corresponding to the target sample image, the labels are used as a label set corresponding to the sample image, a plurality of label sets generated by the sample image corresponding to the plurality of comparison sample images are combined to obtain an initial negative concept label warehouse of the sample image, and labels identical to the target positive concept labels of the target sample image in the initial negative concept label warehouse of the sample image are deleted to obtain the negative concept label warehouse of the sample image.
After the negative concept label warehouse of the sample image is obtained, all negative concept labels in it may be used as the target negative concept labels of the target sample image, or a preset number of negative concept labels may be sampled (e.g., randomly) from the warehouse as the target negative concept labels of the target sample image. The preset number may be set based on demand and may be the same as the number of target positive concept labels.
In this embodiment, the object indicated by the negative conceptual label corresponding to the extracted target sample image is not included in the target sample image, or the correlation between the object indicated by the negative conceptual label corresponding to the extracted target sample image and each object in the target sample image is low. That is, in the present application, after the negative concept label is extracted, semantic analysis may be performed on the negative concept label and the positive concept label, and the negative concept label similar to the positive concept label in the negative concept label is deleted, and the remaining negative concept labels are summarized to be used for constructing a negative concept label warehouse.
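A minimal sketch of this construction, assuming a generic text_encoder callable that maps a list of label strings to an (n, D) torch tensor; the similarity threshold is likewise an assumption:

import torch
import torch.nn.functional as F

def build_negative_bank(positive_labels, control_labels, text_encoder,
                        sim_threshold=0.8):
    """Aggregate labels from control sample images, remove those equal
    to a positive label, then remove those semantically too close to
    any positive label."""
    positives = set(positive_labels)
    candidates = sorted(set(control_labels) - positives)
    if not candidates:
        return []
    pos = F.normalize(text_encoder(sorted(positives)), dim=-1)   # (P, D)
    neg = F.normalize(text_encoder(candidates), dim=-1)          # (Q, D)
    max_sim = (neg @ pos.T).max(dim=1).values                    # (Q,)
    return [label for label, s in zip(candidates, max_sim)
            if s.item() < sim_threshold]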
S130, inputting the target sample image into an initial image feature extractor to obtain a first coding result; inputting the target positive concept label and the target negative concept label into a pre-trained text feature extractor to obtain a second coding result; and inputting the target sample image into a momentum model to obtain a third coding result.
The initial image feature extractor may be the parameter-initialized backbone network of a Vision Transformer model (ViT, a network structure based on the attention mechanism); the target sample image is input into the initial image feature extractor, and the result it outputs is the first encoding result. Meanwhile, the target positive concept labels and the target negative concept labels of the target sample image can be input into a pre-trained text feature extractor (for example, the text modality part of the CLIP model; CLIP (Connecting Text and Images) is a large-scale image-text pre-training model trained contrastively on 400 million image-text pairs crawled from the web) to obtain the second encoding result output by the pre-trained text feature extractor.
For example, the target sample image is represented as $x \in \mathbb{R}^{C \times H \times W}$, where C is the number of channels (C is 3 when the target sample image is a color image), H is the height of the target sample image, and W is its width. The target sample image is input into the initial image feature extractor to obtain the first encoding result, which comprises a visual global representation $e_{cls} \in \mathbb{R}^{D_e}$ of the target sample image and visual local representations $\{e_j\}_{j=1}^{N}$ with $e_j \in \mathbb{R}^{D_e}$, where $D_e$ is the dimension of the embedding (feature representation) output by the initial image feature extractor, N is the total number of blocks into which the target sample image is divided when the initial image feature extractor processes it, and $e_j$ is the visual local representation of the j-th block.
Meanwhile, the target positive concept labels and the target negative concept labels of the target sample image are input into the pre-trained text feature extractor for processing to obtain the second encoding result, expressed as $\{z_i\}_{i=1}^{L}$, where L is the total number of target positive and target negative concept labels of the target sample image and $z_i$ is the encoding of the i-th label (the i-th label being either a target positive concept label or a target negative concept label).
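For orientation, the tensors involved can be sketched as follows (a ViT-style encoder is assumed and all sizes are illustrative):

import torch

C, H, W = 3, 224, 224      # channels, height, width of the target sample image
D_e, N, L = 512, 196, 40   # embedding dim, number of image blocks, number of labels

x = torch.randn(C, H, W)         # target sample image x ∈ R^{C×H×W}
e_cls = torch.randn(D_e)         # visual global representation (first encoding result)
e_local = torch.randn(N, D_e)    # e_j: visual local representation per block
z = torch.randn(L, D_e)          # z_i: text feature of the i-th concept label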
In this embodiment, the momentum model has the same structure as the initial image feature extractor, and its parameters are derived from, but different from, the parameters of the initial image feature extractor; the momentum model is reconfigured from these parameters across the different iterative training processes. Specifically, for each iterative training process, target parameters are determined according to the parameters of the momentum model in the previous iterative training process and the parameters of the initial image feature extractor in the current iterative training process; a copy of the initial image feature extractor is then configured with the target parameters to obtain the momentum model for that iterative training process.
The above process of determining the target parameters may be expressed as formula one:

$$\theta'_t = m \cdot \theta'_{t-1} + (1 - m) \cdot \theta_t \qquad \text{(Formula 1)}$$

where $\theta'_{t-1}$ denotes the parameters of the momentum model in the previous iterative training process, $\theta_t$ denotes the parameters of the initial image feature extractor in the current iterative training process, $\theta'_t$ denotes the target parameters (the momentum model parameters for the current iterative training process), and m is a weighting coefficient that can be set based on the actual scenario and requirements, which is not limited in this application.
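Formula one is the familiar exponential-moving-average update used by momentum encoders. A minimal PyTorch sketch, assuming a typical coefficient value:

import torch

@torch.no_grad()
def momentum_update(momentum_model: torch.nn.Module,
                    image_feature_extractor: torch.nn.Module,
                    m: float = 0.999):
    """Apply θ'_t = m·θ'_{t-1} + (1 - m)·θ_t parameter-wise."""
    for p_mom, p_student in zip(momentum_model.parameters(),
                                image_feature_extractor.parameters()):
        p_mom.data.mul_(m).add_(p_student.data, alpha=1.0 - m)

Applied after each optimizer step, this keeps the momentum model a slowly moving average of the initial image feature extractor, so the third encoding result changes smoothly between iterations.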
The target sample image is input into the momentum model, and the encoding result output by the momentum model is taken as the third encoding result. Since the momentum model has the same structure as the initial image feature extractor, the process of obtaining the third encoding result is similar to that of obtaining the first encoding result and is not repeated. The third encoding result may include a visual global representation $e'_{cls} \in \mathbb{R}^{D_e}$ of the target sample image and visual local representations $\{e'_j\}_{j=1}^{N}$, where $e'_j$ is the visual local representation of the j-th block after the target sample image is divided.
And S140, training the initial image feature extractor according to the first coding result, the second coding result and the third coding result to obtain the target image feature extractor.
After the first encoding result, the second encoding result and the third encoding result are obtained, a loss value of the initial image feature extractor for the target sample image can be determined according to the first encoding result, the second encoding result and the third encoding result, and training is performed on the initial image feature extractor through the loss value of the initial image feature extractor for the target sample image to obtain the target image feature extractor.
The training process of the initial image feature extractor comprises a plurality of iterative training processes. Each iterative training process determines a loss value of the initial image feature extractor for the target sample image and then trains the initial image feature extractor with that loss value, until training ends, to obtain the target image feature extractor. Training may end when the number of iterative training rounds reaches a set number.
As an embodiment, S140 may further include: inputting the target sample image into a pre-trained image feature extractor to obtain a fourth encoding result, where the pre-trained image feature extractor is a teacher model for performing knowledge distillation on the initial image feature extractor, and the pre-trained image feature extractor and the pre-trained text feature extractor belong to the same image-text pre-training model; and training the initial image feature extractor according to the first encoding result, the second encoding result, the third encoding result and the fourth encoding result to obtain the target image feature extractor.
The target sample image is input into the pre-trained image feature extractor, and the result it outputs is the fourth encoding result, which likewise comprises a visual global representation $e''_{cls} \in \mathbb{R}^{D_e}$ of the target sample image and visual local representations $\{e''_j\}_{j=1}^{N}$, where $e''_j$ is the visual local representation of the j-th block after the target sample image is divided.
The pre-trained image feature extractor has the same structure as the initial image feature extractor. In its pre-training, the pre-training samples may cover a variety of scenes and object categories; for example, they may include images of objects such as fruits, animals, buildings and people, and images of scenes such as football matches, basketball matches and dancing. The pre-trained image feature extractor therefore has a certain recognition capability and can serve as the teacher model for performing knowledge distillation on the initial image feature extractor.
Knowledge distillation (Knowledge Distillation, KD) is a classical model compression method. Its core idea is to improve the performance of a lightweight student model, without changing its structure, by guiding it to "imitate" a teacher model (or a multi-model ensemble) that performs better and has a more complex structure.
In this embodiment, the pre-trained image feature extractor and the pre-trained text feature extractor belong to the same image-text pre-training model. For example, the image-text pre-training model may be the CLIP model, i.e., the pre-trained text feature extractor is the text modality part of the CLIP model and the pre-trained image feature extractor is the image modality part of the CLIP model.
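As a sketch of obtaining the two modality parts from the open-source CLIP release (the clip Python package, the ViT-B/32 checkpoint and the file path are tooling assumptions, not requirements of the patent):

import torch
import clip                    # OpenAI's open-source CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    # Text modality part: the pre-trained text feature extractor.
    tokens = clip.tokenize(["football", "container crane"]).to(device)
    text_feats = model.encode_text(tokens)           # (2, 512)

    # Image modality part: the pre-trained image feature extractor (teacher).
    image = preprocess(Image.open("sample.jpg")).unsqueeze(0).to(device)
    image_feats = model.encode_image(image)          # (1, 512)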
The loss value of the initial image feature extractor for the target sample image can be determined according to the first encoding result, the second encoding result, the third encoding result and the fourth encoding result, and the initial image feature extractor is trained through the loss value of the initial image feature extractor for the target sample image, so that the target image feature extractor is obtained.
Here too, the training process of the initial image feature extractor comprises a plurality of iterative training processes. Each iterative training process determines a loss value of the initial image feature extractor for the target sample image and then trains the initial image feature extractor with that loss value, until training ends, to obtain the target image feature extractor. Training may end when the number of iterative training rounds reaches a set number.
In this embodiment, the target image feature extractor is obtained by training with the target sample image, the target positive concept labels and the target negative concept labels. Because the target positive concept labels are extracted from the text description information corresponding to the target sample image and the target negative concept labels are labels obtained from the negative concept label warehouse that differ from the target positive concept labels, the target sample image does not need to be manually annotated, which improves the acquisition efficiency of the target positive and negative concept labels, further improves the training efficiency of the target image feature extractor, and reduces the training cost.
Referring to fig. 4, fig. 4 is a flowchart illustrating a training process of a target image feature extractor according to another embodiment of the present application, where the method may be applied to an electronic device, and the electronic device may be the terminal 20 or the server 10 in fig. 1, and the method includes:
s210, acquiring a target sample image and a target positive conceptual label corresponding to the target sample image.
The description of S210 refers to the descriptions of S110 to S120 above, and will not be repeated here.
S220, acquiring images different from the target sample images from the sample images of the historical batch, and taking the images as first control sample images; sample images other than the target sample image are acquired from the plurality of sample images of the current lot as second control sample images.
The target image feature extractor corresponds to a plurality of iterative training processes, each iterative training process can correspond to a plurality of batches of sample images, each batch of sample images comprises a plurality of sample images, and each target sample image corresponds to a respective control sample image; the samples of each batch in each iterative training process can be iterated according to a certain sequence (the sequence can be set based on requirements, the application is not limited), the sample iteration of each batch is completed, and the iterative training process is finished. The historical lot refers to the lot preceding the current lot in which the target sample image is located.
It will be appreciated that the historical batches may be batches during the same iterative training process or batches during different iterative training processes.
After the first control sample image and the second control sample image of the target sample image are determined, the first control sample image and the second control sample image of the target sample image are summarized to be used as the control sample images corresponding to the target sample image.
For example, suppose the target image feature extractor corresponds to 2 iterative training processes, each iterative training process corresponds to 5 batches of samples, and each batch corresponds to 100 sample images, giving 10 batches in total. If the current batch is the 8th of the 10 batches, then for one target sample image in the 8th batch, 696 images different from the target sample image are obtained from the 700 sample images of the previous 7 batches as first control sample images, and the other 99 sample images of the 8th batch are obtained as second control sample images, so the control sample images corresponding to this target sample image are determined to comprise 795 images. All sample images of the 8th batch are traversed in this way to obtain the control sample images corresponding to each sample image of the 8th batch.
S230, extracting labels from text description information corresponding to the first comparison sample image and text description information corresponding to the second comparison sample image according to the label number requirement in the label warehouse to form a negative concept label warehouse of the sample image.
The requirement for the number of tags in the tag warehouse may be to choose a fixed number of tags to build the tag warehouse, e.g. the fixed number may be 100. Wherein, selecting may refer to randomly sampling a fixed number of labels, or sampling a fixed number of labels in a near-to-far order according to the lot to which the control sample image belongs. The batch distance refers to the time distance between the batch in which the sample image is located and the current batch.
And determining a control sample image of each target sample image according to the method, extracting and summarizing labels from text description information of the control sample image of each target sample image, and obtaining an initial negative conceptual label warehouse of the sample image. And then selecting a fixed number of labels from the initial negative concept label warehouse of the sample image according to the number of labels in the label warehouse, and summarizing the selected fixed number of labels to serve as the negative concept label warehouse of the sample image.
For example, the number of tags in the tag warehouse requires 30 tags to be randomly selected. The comparison sample image corresponding to the target sample image a1 comprises 20, the text description information of the target sample image a1 is subjected to label extraction to obtain labels of 'people', 'grassland' and 'lake water', the three labels are used as target positive conceptual labels corresponding to the target sample image a1, the text description information of the 20 comparison sample images is subjected to label extraction to obtain 40 labels, the 40 labels are summarized to obtain an initial negative conceptual label warehouse corresponding to the sample image, 30 labels are randomly sampled in the initial negative conceptual label warehouse of the sample image, and the collected 30 labels are summarized to obtain the negative conceptual label warehouse corresponding to the sample image.
In the present application, the negative concept label repository is also called the Concept Bank. The Concept Bank is constructed across the control sample images of each batch of multiple iterative training processes, which lowers the difficulty of acquiring negative concept labels and increases both their number and the efficiency of their acquisition.
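The following Python sketch shows one way such a Concept Bank could be assembled; `extract_labels` is a toy stand-in for the patent's label-extraction step, and uniform random sampling is assumed:

```python
import random

def extract_labels(caption):
    # Toy stand-in for label extraction from text description information;
    # a real system would use NLP-based keyphrase extraction.
    return [w.strip(".,") for w in caption.split() if len(w) > 3]

def build_concept_bank(control_images, positive_labels, bank_size=100, seed=0):
    """Pool labels from the control images' captions, drop the target image's
    own positive labels (negatives must differ from them), then keep a fixed
    number of labels as the Concept Bank."""
    pool = set()
    for _, caption in control_images:
        pool.update(extract_labels(caption))
    pool = sorted(pool - set(positive_labels))
    return random.Random(seed).sample(pool, min(bank_size, len(pool)))

def sample_target_negatives(concept_bank, target_number=20, seed=0):
    # Random draw of the target number of negative concept labels (S240 below).
    return random.Random(seed).sample(concept_bank,
                                      min(target_number, len(concept_bank)))
```

The near-to-far sampling variant would simply sort the pooled labels by the batch distance of the control image they came from before truncating to the fixed number.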
S240, acquiring a target number of negative concept labels, different from the target positive concept labels, from the negative concept label warehouse of the sample image as the target negative concept labels; training the initial image feature extractor according to the target sample image, the target positive concept label and the target negative concept label to obtain the target image feature extractor.
The target number may be set based on demand, for example, 20. The target number of negative concept labels may be obtained by random sampling. The training process of the target image feature extractor is described in S130 to S140 above and is not repeated here.
In this embodiment, for each sample image of each batch in each iterative training process, historical sample images and the other sample images of the batch are obtained directly and used as that sample image's control sample images. This lowers the difficulty of obtaining control sample images and enlarges their number, thereby increasing the amount of training data and improving the encoding effect of the trained target image feature extractor.
Meanwhile, since the negative concept labels in the negative concept label warehouse are extracted from the text description information of the control sample images, expanding the control sample images simultaneously builds and expands the label system, reducing the difficulty of acquiring and mining labels.
Referring to fig. 5, fig. 5 is a flowchart illustrating a training process of a target image feature extractor according to another embodiment of the present application, where the method may be applied to an electronic device, and the electronic device may be the terminal 20 or the server 10 in fig. 1, and the method includes:
S310, acquiring a target sample image, a target positive concept label of the target sample image, and a target negative concept label of the target sample image.
S320, inputting the target sample image into an initial image feature extractor to obtain a first coding result; inputting the target positive concept label and the target negative concept label into a pre-trained text feature extractor to obtain a second coding result; inputting the target sample image into a momentum model to obtain a third coding result, wherein the momentum model is obtained based on an initial image feature extractor; inputting the target sample image into a pre-trained image feature extractor to obtain a fourth coding result.
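Schematically, S320 amounts to four forward passes; in this illustrative PyTorch sketch the encoder names are placeholders, and freezing the three auxiliary encoders is an assumption:

```python
import torch

def encode_all(student, text_encoder, momentum_model, teacher, image, label_tokens):
    # One S320 pass: only the student (initial image feature extractor) keeps
    # gradients; the other three encoders act as frozen feature providers here.
    first = student(image)                   # e_cls and e_1..e_N
    with torch.no_grad():
        second = text_encoder(label_tokens)  # z_i for each concept label
        third = momentum_model(image)        # e'_cls and e'_1..e'_N
        fourth = teacher(image)              # e''_cls and e''_1..e''_N
    return first, second, third, fourth

# toy usage with identical stand-in encoders
net = torch.nn.Linear(64, 32)
image, tokens = torch.randn(1, 64), torch.randn(40, 64)
e, z, e_m, e_t = encode_all(net, net, net, net, image, tokens)
```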
The description of S310 refers to the descriptions of S110 to S130 above, and will not be repeated here.
S330, determining a first loss value according to the first coding result and the second coding result.
Wherein the first penalty value characterizes a predictive accuracy of the initial image feature extractor on the target positive conceptual label and the target negative conceptual label. A difference between the first encoding result and the second encoding result may be determined as the first loss value based on the first encoding result and the second encoding result.
Specifically, according to the first encoding result and the second encoding result, determining a respective first probability of each target concept label, where the first probability of the target concept label refers to a probability that the target sample image belongs to the target concept label, and the target concept label includes a target positive concept label and a target negative concept label; a first penalty value is determined based on the respective first probabilities for each of the target concept tags.
With reference to the description of S130 above, the first encoding result includes a visual global representation e_cls of the target sample image and visual local representations e_1, e_2, …, e_N; the second encoding result includes the text feature z_i of the i-th label, where the i-th label is a target positive concept label or a target negative concept label.

For the i-th label, taken as a target concept label, the first probability s_i of that target concept label is calculated according to formula two:

s_i = ⟨z_i, e_cls⟩ + TopK([⟨z_i, e_1⟩, ⟨z_i, e_2⟩, …, ⟨z_i, e_N⟩])    (Formula 2)
where TopK(·) denotes top-K average pooling, ⟨·,·⟩ denotes the vector inner product, and K may be set based on demand and the actual scenario, e.g., K = 20.
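As an illustrative NumPy rendering of formula two (the array shapes are assumptions of this sketch):

```python
import numpy as np

def label_score(z_i, e_cls, e_locals, k=20):
    """Formula two: the global inner product plus top-K average pooling over
    the inner products with the N visual local representations."""
    local = e_locals @ z_i              # <z_i, e_j> for j = 1..N
    k = min(k, local.shape[0])
    topk = np.sort(local)[-k:].mean()   # top-K average pooling
    return float(e_cls @ z_i + topk)
```

The same routine yields the second probability s′_i of formula four further below when called with the momentum model's representations e′_cls and e′_1, …, e′_N.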
The first probabilities of the target positive concept labels and the target negative concept labels are obtained according to formula two, and from them a first loss value L_1 is calculated according to formula three, where y_a denotes the set of target positive concept labels corresponding to the target sample image a in one iterative training process, s_n denotes the first probability of the target negative concept label n, and s_p denotes the first probability of the target positive concept label p.
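The expression of formula three itself does not survive in the text above. A common pairwise ranking-loss form consistent with these definitions, offered only as an assumption rather than a quotation of the patent, would be:

L_1 = log(1 + Σ_{p ∈ y_a} Σ_{n ∉ y_a} exp(s_n − s_p))

which decreases as each positive label's first probability s_p is ranked above each negative label's s_n.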
S340, according to the second coding result and the third coding result, a pseudo positive label and a pseudo negative label are obtained from the label set; and determining a second loss value according to the pseudo positive label, the pseudo negative label, the first coding result and the second coding result.
Wherein the set of labels includes a target positive conceptual label and a target negative conceptual label, and the second loss value indicates a predictive accuracy of the pseudo positive conceptual label and the pseudo negative conceptual label by the initial image feature extractor.
As an embodiment, the similarity between the second encoding result and the third encoding result may be determined, and the pseudo positive label and the pseudo negative label may be obtained from the label set according to the similarity between the second encoding result and the third encoding result.
The third encoding result may include a visual global representation of the target sample image and a plurality of visual local representations of the target sample image; the second encoding result may include a respective text feature for each target concept tag; the similarity between the text feature of each target concept label and each visual local representation of the target sample image can be determined to obtain respective similarities of a plurality of visual local representations of the target sample image under each target concept label; determining the similarity between the text features of each target concept label and the visual global representation as the similarity of the visual global representation of the target sample image under each target concept label; and acquiring respective similarity of multiple visual local representations of the target sample image under each target concept label and similarity of visual global representations of the target sample image under each target concept label, and taking the similarity as similarity between the second coding result and the third coding result.
Then, a respective second probability for each target concept tag may be determined based on the respective similarity of the plurality of sample visual local representations under each target concept tag and the similarity of the sample visual global representations under each target concept tag.
In this embodiment, a vector inner product between the text feature of each target concept label and each visual local representation of the target sample image may be determined as respective similarities of the plurality of visual local representations of the target sample image under each target concept label, so as to obtain respective local vector products of the plurality of visual local representations of the target sample image under each target concept label as respective similarities of the plurality of visual local representations of the target sample image under each target concept label; meanwhile, the vector inner product between the text feature of each target concept label and the visual global representation of the target sample image is determined and used as the global vector product of each target concept label, and the global vector product of each target concept label is used as the similarity of the visual global representation of the target sample image under each target concept label.
In addition, the method can perform average pooling processing on the local vector products of the multiple visual local representations of the target sample image under each target concept label to obtain the pooling processing result of each target concept label; and calculating the sum of the global vector product and the pooling processing result of each target concept label as the second probability of each target concept label. Wherein the average pooling process may be top-K average pooling.
With reference to the description of S130 above, the third encoding result includes a visual global representation e′_cls of the target sample image and visual local representations e′_1, e′_2, …, e′_N. For the i-th label, taken as a target concept label, the second probability s′_i of that target concept label is calculated according to formula four:

s′_i = ⟨z_i, e′_cls⟩ + TopK([⟨z_i, e′_1⟩, ⟨z_i, e′_2⟩, …, ⟨z_i, e′_N⟩])    (Formula 4)
And then, according to each target concept label and the respective second probability of each target concept label, screening target concept labels with the second probability reaching a probability threshold value from the label set as pseudo positive labels, and taking target concept labels with the second probability not reaching the probability threshold value in the label set as pseudo negative labels, wherein the probability threshold value can be a value set based on requirements without limitation.
It should be noted that, the pseudo positive label may include a target positive conceptual label in the target conceptual labels, or may include a target negative conceptual label; accordingly, the pseudo negative labels may include target positive conceptual labels in the target conceptual labels, and may also include target negative conceptual labels.
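An illustrative sketch of the pseudo-label screening; the probability threshold and helper names are assumptions:

```python
import numpy as np

def pseudo_labels(momentum_scores, labels, threshold=0.5):
    """Split the label set into pseudo positive / pseudo negative labels by
    comparing the second probabilities s'_i against a probability threshold."""
    pseudo_pos = [lab for lab, s in zip(labels, momentum_scores) if s >= threshold]
    pseudo_neg = [lab for lab, s in zip(labels, momentum_scores) if s < threshold]
    return pseudo_pos, pseudo_neg

# momentum_scores would be formula four applied per label, e.g. with the
# label_score sketch above called on e'_cls and the e' local representations
labels = ["people", "grassland", "lake water", "forest fire"]
scores = np.array([0.9, 0.7, 0.2, 0.1])
pos, neg = pseudo_labels(scores, labels)  # -> (["people", "grassland"], ...)
```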
After the pseudo positive label and the pseudo negative label are obtained, the second loss value can be determined according to the pseudo positive label, the pseudo negative label, the first coding result and the second coding result.
Specifically, the first probability of the target concept label matched with a pseudo positive label can be taken as the probability of that pseudo positive label, and the first probability of the target concept label matched with a pseudo negative label as the probability of that pseudo negative label; a second loss value is then determined from the respective probabilities of the pseudo positive labels and the pseudo negative labels. The first probability of a target concept label refers to the probability that the target sample image belongs to that label.
The target conceptual label matching the pseudo positive label may refer to the same target conceptual label as the pseudo positive label, and the target conceptual label matching the pseudo negative label may refer to the same target conceptual label as the pseudo negative label.
For example, suppose the target concept labels comprise 10 target positive concept labels (b1-b10) and 10 target negative concept labels (c1-c10), and screening by the respective second probabilities yields b1-b5 and c1-c5 as pseudo positive labels and b6-b10 and c6-c10 as pseudo negative labels. The first probabilities of b1-b5 and of c1-c5 are then taken as the probabilities of the pseudo positive labels b1-b5 and c1-c5, and the first probabilities of b6-b10 and of c6-c10 as the probabilities of the pseudo negative labels b6-b10 and c6-c10.
After determining the respective probabilities of the pseudo positive labels and the pseudo negative labels, a second loss value L_2 may be determined from them according to formula five, where y′_a denotes the set of pseudo positive labels corresponding to the target sample image a in one iterative training process, s_n denotes the probability of the pseudo negative label n, and s_p denotes the probability of the pseudo positive label p.
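Under the same assumption as for formula three, formula five would take the analogous form over the pseudo label sets:

L_2 = log(1 + Σ_{p ∈ y′_a} Σ_{n ∉ y′_a} exp(s_n − s_p))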
S350, determining a third loss value according to the first coding result and the fourth coding result.
Wherein the third loss value characterizes a degree of matching between the first encoding result and the fourth encoding result. The difference between the first encoding result and the fourth encoding result may be determined as the third loss value based on the first encoding result and the fourth encoding result.
Specifically, a difference between the first encoding result and the fourth encoding result may be calculated as an intermediate result; and performing norm operation on the intermediate result to obtain a third loss value.
With reference to the description of S130 above, the fourth encoding result may include a visual global representation e″_cls of the target sample image and visual local representations e″_1, e″_2, …, e″_N. A third loss value L_3 is determined according to formula six:

L_3 = ‖e″_cls − e_cls‖₁    (Formula 6)

where ‖·‖₁ is the one-norm operation corresponding to the target sample image a in one iterative training process, and (e″_cls − e_cls) is the corresponding intermediate result.
S360, training the initial image feature extractor according to the first loss value, the second loss value and the third loss value to obtain the target image feature extractor.
After obtaining the first loss value, the second loss value, and the third loss value, a final loss value may be determined according to the first loss value, the second loss value, and the third loss value, and the initial image feature extractor may be trained according to the final loss value to obtain the target image feature extractor.
The final loss value L can be calculated by formula seven, where λ and μ are weight coefficients that can be set based on demand, e.g., μ = 0.5 and λ = 0.5.
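The expression of formula seven likewise is not reproduced in the text above; one natural reading, given that λ and μ are described as weight coefficients (an assumption), is a weighted sum of the three loss values:

L = L_1 + λ·L_2 + μ·L_3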
For each iterative training process, a final loss value is obtained from each target sample image and its corresponding target positive concept labels and target negative concept labels, and the initial image feature extractor is updated iteratively with this final loss value until training ends, yielding the target image feature extractor.
In this embodiment, the first loss value, the second loss value and the third loss value are introduced into the final loss value, and feature distillation is achieved through the second loss value corresponding to the momentum model and the third loss value corresponding to the pre-trained image feature extractor, so that noise interference is effectively resisted in the model training process, and meanwhile, the model effect is improved, so that the coding accuracy of the target image feature extractor obtained by training is higher, and the coding effect is better.
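Putting S330-S360 together, a schematic PyTorch training step might look as follows; the weighted-sum form of the final loss and the EMA-style momentum update (standing in for the patent's formula one, not reproduced here) are assumptions:

```python
import torch

def train_step(student, momentum_model, optimizer,
               loss1, loss2, loss3, lam=0.5, mu=0.5, ema=0.999):
    # Final loss value: weighted sum of the three losses (assumed form).
    loss = loss1 + lam * loss2 + mu * loss3
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # EMA-style refresh of the momentum model from the student's parameters;
    # the coefficient 0.999 and the update rule itself are assumptions.
    with torch.no_grad():
        for p_m, p_s in zip(momentum_model.parameters(), student.parameters()):
            p_m.mul_(ema).add_(p_s, alpha=1.0 - ema)
    return float(loss.detach())
```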
Referring to fig. 6, fig. 6 is a flowchart illustrating an image recognition method according to an embodiment of the present application, where the method may be applied to an electronic device, and the electronic device may be the terminal 20 or the server 10 in fig. 1, and the method includes:
s410, acquiring text features of each of the image to be identified and the target labels.
In this embodiment, the electronic device may be built in with a pre-trained text feature extractor, and after a plurality of target tags are acquired, the plurality of target tags may be input into the pre-trained text feature extractor to obtain text features of each of the plurality of target tags.
It should be noted that a target label may be one of the target positive concept labels and target negative concept labels corresponding to the target sample images of the multiple iterative training processes described above, or it may not appear among them at all; that is, a target label may be a new label that was never involved in the training process.
S420, inputting the image to be identified into a target image feature extractor for encoding, and obtaining an image encoding result.
The training process of the target image feature extractor is described with reference to the above embodiments, and will not be described in detail here.
The image to be identified is input into the target image feature extractor to obtain an image encoding result, which may include a visual global representation e_cls of the image to be identified and visual local representations f_1, f_2, …, f_N, where N is the total number of blocks into which the image to be identified is divided when the target image feature extractor processes it, D_e is the dimension of the encoding output by the target image feature extractor, and f_j is the visual local representation of the j-th block of the image to be identified.
It will be appreciated that the initial image feature extractor and the target image feature extractor are identical in structure and differ only in parameters; therefore, the N and D_e involved in the resulting image encoding result are the same as in the initial image feature extractor.
S430, determining an image recognition result of the image to be recognized according to the image coding result and the text characteristics of each of the plurality of target labels.
After the image encoding result and the text features of each of the plurality of target labels are obtained, the respective score of each target label can be calculated from its text feature and the image encoding result, and a plurality of target labels are determined, in order of score from high to low, as the image recognition result of the image to be recognized.
As one embodiment, determining the similarity between the text feature of each target label and each visual local representation to obtain respective similarities of the plurality of visual local representations under each target label; determining the similarity between the text features of each target label and the visual global representation as the similarity of the visual global representation under each target label; and determining an image recognition result of the image to be recognized according to the respective similarity of the visual local representations under each target label and the similarity of the visual global representation under each target label.
Specifically, the vector inner product between the text feature of each target label and each visual local representation can be determined and used as the respective similarity of the multiple visual local representations under each target label, so that the respective local vector product of the multiple visual local representations under each target label is obtained and used as the respective similarity of the multiple visual local representations under each target label; determining the inner vector product between the text feature of each target tag and the visual global representation as the global vector product of each target tag, and taking the global vector product of each target tag as the similarity of the visual global representation under each target tag; then, carrying out average pooling treatment on the local vector products of the visual local representations under each target label to obtain the pooling treatment result of each target label; calculating the sum of the global vector product and the pooling processing result of each target label as the prediction probability of each target label; and determining an image recognition result of the image to be recognized according to the respective prediction probabilities of the target tags. Wherein the average pooling process may be top-K average pooling.
For the i-th target label, the corresponding text feature can be expressed as g_i, and M denotes the total number of target labels. The prediction probability v_i of the i-th target label is determined according to formula eight:

v_i = ⟨g_i, e_cls⟩ + TopK([⟨g_i, f_1⟩, ⟨g_i, f_2⟩, …, ⟨g_i, f_N⟩])    (Formula 8)

where ⟨g_i, e_cls⟩ is the global vector product of the i-th target label, ⟨g_i, f_j⟩ is the local vector product of the i-th target label with the visual local representation f_j, and TopK([⟨g_i, f_1⟩, …, ⟨g_i, f_N⟩]) is the pooling result of the i-th target label.
After the prediction probabilities of the target labels are obtained, the target labels can be screened from high to low according to the prediction probabilities to be used as image recognition results; a plurality of target tags whose predicted probability exceeds a target probability value, which may be set based on demand, for example, a target probability value of 0.7, may also be screened as image recognition results.
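For completeness, an illustrative NumPy rendering of the formula-eight scoring with a top-5 cut-off (the array shapes and helper name are assumptions):

```python
import numpy as np

def predict_labels(g, e_cls, f_locals, k=20, top=5):
    """g: (M, D_e) text features of the M target labels;
    e_cls: (D_e,) visual global representation of the image;
    f_locals: (N, D_e) visual local representations of the N image blocks."""
    local = g @ f_locals.T                                 # <g_i, f_j> for all i, j
    k = min(k, local.shape[1])
    pooled = np.sort(local, axis=1)[:, -k:].mean(axis=1)   # top-K average pooling
    v = g @ e_cls + pooled                                 # formula eight, per label
    order = np.argsort(-v)[:top]                           # screen labels high to low
    return order, v[order]

rng = np.random.default_rng(0)
idx, scores = predict_labels(rng.normal(size=(100, 32)),
                             rng.normal(size=32),
                             rng.normal(size=(49, 32)))
```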
In this embodiment, the target positive concept label is extracted from the text description information corresponding to the target sample image, and the target negative concept label is a label, different from the target positive concept label, obtained from the negative concept label warehouse. Neither is manually annotated, so the accuracy of both is higher, and the number of target negative concept labels obtained is larger. Training the target image feature extractor with the target sample image, the target positive concept label and the target negative concept label is therefore more sufficient, the recognition effect of the resulting extractor is better, and the accuracy of the image recognition result of the image to be recognized obtained with it is improved.
Meanwhile, the target image feature extractor can be used in new-label scenarios, realizing zero-shot multi-label recognition in such scenarios with a strong recognition effect.
To explain the technical solution of the present application more clearly, the image recognition method is described below in conjunction with an exemplary scenario, in which the number of iterations is 50 and there are 100 target labels. The second iterative training process involves four target sample images.
The four target sample images are shown in figs. 7-10. The text description information of fig. 7 is "A pupil's summer-holiday diary at the animal kingdom: a little girl feeds an alpaca at zero distance." The text description information of fig. 8 is "The forest fire in area A has flared into open flame again; under the unfavorable conditions of high mountains, dense forests, steep slopes, deep gullies and spreading smoke, armed police officers still fight on the front line of the fire scene." The text description information of fig. 9 is "On June 28, area B suffered flood disasters: the floodwater rose sharply, large areas and towns along the C river were flooded, many roads were washed out and houses were soaked. Multiple departments carried out personnel evacuation and flood-prevention emergency work. The picture shows fire rescue workers urgently transferring trapped people." The text description information of fig. 10 is "The 2018 F national light industry award ceremony and the China national light industry strong-enterprise peak forum were held in city E on the 25th in 2019; the picture shows participating guests going on stage to receive awards."
As shown in fig. 11, the labels extracted for fig. 7 are "pupil", "animal", "girl", "alpaca" and "feeding"; the labels extracted for fig. 8 are "forest fire", "open fire", "high mountains and dense forests", "steep slopes and deep gullies", "smoke diffusion", "armed police officer" and "fire scene"; the labels extracted for fig. 9 are "flood disaster", "river", "flooded", "house", "soaked", "personnel evacuation", "flood prevention" and "fire fighter"; and the labels extracted for fig. 10 are "prize", "peak forum", "guest" and "prize awarding".
When fig. 7 is taken as the target sample image, as shown in fig. 11, the labels "pupil", "animal", "girl", "alpaca" and "feeding" of fig. 7 are taken as the target positive concept labels, and the labels corresponding to figs. 8-10 are pooled into the negative concept label warehouse.
As shown in fig. 12, after the target sample image, its target positive concept labels and the negative concept label warehouse are obtained, the target sample image is input into the initial image feature extractor to obtain a first encoding result, into the pre-trained image feature extractor to obtain a fourth encoding result, and into the momentum model to obtain a third encoding result. The momentum model is configured according to the target parameters of the second iterative training process, which are determined from the parameters of the initial image feature extractor in the second iterative process and its parameters in the first iterative training process (for example, in the manner of formula one). Meanwhile, the negative concept label warehouse can be randomly sampled to obtain the target negative concept labels, and the target positive concept labels and target negative concept labels are input into the pre-trained text feature extractor to obtain a second encoding result.
A target distribution is determined from the first encoding result and the second encoding result; it comprises the first probabilities of each target positive concept label and each target negative concept label, and a first loss value (the first ranking loss value in fig. 12) is determined from it. A pseudo-label distribution is determined from the first, second and third encoding results; it comprises the probabilities of each pseudo positive label and each pseudo negative label, and a second loss value (the second ranking loss value in fig. 12) is determined from it. A third loss value is determined from the first encoding result and the fourth encoding result.
And determining a final loss value through the first loss value, the second loss value and the third loss value, and training the initial image feature extractor according to the final loss value to obtain the initial image feature extractor after the second iteration training.
Iterative training continues in the same way as the second iterative training process until the number of iterations reaches 50; training then ends, the initial image feature extractor after the last iteration is taken as the target image feature extractor, and the target image feature extractor together with the pre-trained text feature extractor forms the target classification model.
2. Identifying the image to be identified.
For news media scenes, acquiring images to be identified as shown in fig. 13-15; for each image to be identified, respectively encoding 100 target labels through a pre-trained text feature extractor in a target classification model to obtain respective text features of the 100 target labels, inputting the image to be identified into the target image feature extractor to obtain an image encoding result output by the target image feature extractor, determining respective prediction probability of each target label according to the image encoding result, and screening 5 target labels from high to low according to the prediction probability as an image identification result.
The image recognition result of the image to be recognized shown in fig. 13 is: vegetables, supermarket, vegetable market, fruits and cabbage, where cabbage is a mispredicted label. The image recognition result of the image to be recognized shown in fig. 14 is: live wire, concert, singing, musical instrument and stage, where live wire is a mispredicted label. The image recognition result of the image to be recognized shown in fig. 15 is: football stadium, football, sports field and athlete, with no mispredicted labels.
For a general scene, acquiring an image to be identified as shown in fig. 16-18; for each image to be identified, respectively encoding 100 target labels through a pre-trained text feature extractor in a target classification model to obtain respective text features of the 100 target labels, inputting the image to be identified into the target image feature extractor to obtain an image encoding result output by the target image feature extractor, determining respective prediction probability of each target label according to the image encoding result, and screening 5 target labels from high to low according to the prediction probability as an image identification result.
The image recognition result of the image to be recognized shown in fig. 16 is: modern dance, performance misdirection, interactive dance, dancing and dance, where interactive dance is a mispredicted label and dance is a label that is difficult to identify. The image recognition result of the image to be recognized shown in fig. 17 is: hospital, medical staff, obstetric ultrasound examination, medical equipment and medical apparatus, where obstetric ultrasound examination is a mispredicted label. The image recognition result of the image to be recognized shown in fig. 18 is: remote-control vessel, drifting, newly inflated inflatable boat, yacht and rubber boat, which include mispredicted labels.
Therefore, the target classification model predicts well on images of news media scenes and achieves excellent accuracy on general scenes. To demonstrate its classification effect, the target classification model is compared with an existing model; the comparison result is shown in Table 1:
Here the simplified target classification model is the variant without the pre-trained image feature extractor, i.e., the target image feature extractor trained only with the first loss value and the second loss value; K is the K in top-K average pooling; mAP (mean Average Precision) is an index used to measure whether a model performs well; and F1 Score is an index used in statistics to measure the accuracy of a binary classification model.
As can be seen from Table 1, compared with the existing CLIP model, both the target classification model and the simplified target classification model of the present application achieve higher mAP and F1 scores, indicating better classification effect and generalization. Meanwhile, the mAP and F1 scores of the target classification model are higher than those of the simplified model, showing that introducing the pre-trained image feature extractor as a teacher network effectively improves the recognition effect of the resulting target classification model.
Referring to fig. 19, fig. 19 shows a block diagram of an image recognition apparatus according to an embodiment of the application, where the apparatus 1900 includes:
an obtaining module 1910, configured to obtain an image to be identified and text features of each of a plurality of target tags;
the encoding module 1920 is used for inputting the image to be identified into the target image feature extractor for encoding, and obtaining an image encoding result; the target image feature extractor is obtained through training of a target sample image, a target positive conceptual label and a target negative conceptual label; the target positive conceptual label is extracted from text description information corresponding to the target sample image, and the target negative conceptual label is a negative conceptual label which is obtained from a negative conceptual label warehouse of the sample image and is different from the target positive conceptual label; negative concept labels in the negative concept label warehouse are extracted from text description information corresponding to the comparison sample image; the control sample image is different from the target sample image;
The image recognition module 1930 is configured to determine an image recognition result of the image to be recognized according to the image encoding result and the text features of each of the plurality of target tags.
Optionally, the target image feature extractor is trained with a plurality of batches of sample images; the control sample image includes a first control sample image and a second control sample image; the device also comprises a training module, wherein the training module is used for acquiring images which are different from the target sample images from the sample images of the historical batches and used as first control sample images; the historical batch refers to the batch before the current batch where the target sample image is located; acquiring sample images except for a target sample image from a plurality of sample images of the current batch as a second control sample image; and extracting negative concept labels from the text description information corresponding to the first control sample image and the text description information corresponding to the second control sample image to form a negative concept label warehouse of the sample image.
Optionally, the training module is further configured to extract labels from the text description information corresponding to the first control sample image and the text description information corresponding to the second control sample image according to the tag-number requirement of the tag warehouse, to form the negative concept label warehouse of the sample image; and to acquire a target number of negative concept labels, different from the target positive concept labels, from the negative concept label warehouse of the sample image as the target negative concept labels.
Optionally, the training module is further configured to input the target sample image into the initial image feature extractor to obtain a first encoding result; inputting the target positive concept label and the target negative concept label into a pre-trained text feature extractor to obtain a second coding result; inputting the target sample image into a momentum model to obtain a third coding result, wherein the momentum model has the same structure as the initial image feature extractor, and the parameters of the momentum model are obtained according to the parameters of the initial image feature extractor and are different from the parameters of the initial image feature extractor; and training the initial image feature extractor according to the first encoding result, the second encoding result and the third encoding result to obtain the target image feature extractor.
Optionally, the target image feature extractor is obtained by multiple iterative training; the training module is also used for determining the parameters of the momentum model in the iterative training process according to the parameters of the momentum model in the previous iterative training process of the iterative training process and the parameters of the initial image feature extractor in the iterative training process of the iterative training process.
Optionally, the training module is further configured to input the target sample image into a pre-trained image feature extractor to obtain a fourth encoding result, where the pre-trained image feature extractor is a teacher model for performing knowledge distillation on the initial feature extractor, and the pre-trained image feature extractor and the pre-trained text feature extractor belong to the same graphic pre-training model; and training the initial image feature extractor according to the first coding result, the second coding result, the third coding result and the fourth coding result to obtain the target image feature extractor.
Optionally, the training module is further configured to determine a first loss value according to the first encoding result and the second encoding result, where the first loss value characterizes the prediction accuracy of the initial image feature extractor on the target positive concept tag and the target negative concept tag; according to the second coding result and the third coding result, pseudo positive labels and pseudo negative labels are obtained from a label set, wherein the label set comprises target positive conceptual labels and target negative conceptual labels; determining a second loss value according to the pseudo positive label, the pseudo negative label, the first coding result and the second coding result, wherein the second loss value indicates the prediction accuracy of the initial image feature extractor on the pseudo positive conceptual label and the pseudo negative conceptual label; determining a third loss value according to the first coding result and the fourth coding result, wherein the third loss value represents the matching degree between the first coding result and the fourth coding result; and training the initial image feature extractor according to the first loss value, the second loss value and the third loss value to obtain the target image feature extractor.
Optionally, the training module is further configured to determine, according to the first encoding result and the second encoding result, a respective first probability of each target concept label, where the first probability of the target concept label refers to a probability that the target sample image belongs to the target concept label, and the target concept label includes a target positive concept label and a target negative concept label; a first penalty value is determined based on the respective first probabilities for each of the target concept tags.
Optionally, the training module is further configured to determine a similarity between the second encoding result and the third encoding result; and obtaining the pseudo positive labels and the pseudo negative labels from the label set according to the similarity between the second coding result and the third coding result.
Optionally, the training module is further configured to determine, according to the first encoding result and the second encoding result, a respective first probability of each target concept label, where the first probability of a target concept label refers to the probability that the target sample image belongs to that label, and the target concept labels include the target positive concept labels and the target negative concept labels; to acquire the first probability of the target concept label matched with a pseudo positive label as the probability of that pseudo positive label, and the first probability of the target concept label matched with a pseudo negative label as the probability of that pseudo negative label; and to determine a second loss value according to the respective probabilities of the pseudo positive labels and the pseudo negative labels.
Optionally, the image encoding result comprises a visual global representation of the image to be identified and a plurality of visual local representations; the image recognition module 1930 is further configured to determine a similarity between the text feature of each target tag and each visual local representation, so as to obtain respective similarities of the visual local representations under each target tag; determining the similarity between the text features of each target label and the visual global representation as the similarity of the visual global representation under each target label; and determining an image recognition result of the image to be recognized according to the respective similarity of the visual local representations under each target label and the similarity of the visual global representation under each target label.
It should be noted that, in the present application, the device embodiment and the foregoing method embodiment correspond to each other, and specific principles in the device embodiment may refer to the content in the foregoing method embodiment, which is not described herein again.
Fig. 20 shows a block diagram of an electronic device for performing an image recognition method according to an embodiment of the present application. The electronic device may be the terminal 20 or the server 10 in fig. 1, and it should be noted that, the computer system 1200 of the electronic device shown in fig. 20 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 20, the computer system 1200 includes a central processing unit (Central Processing Unit, CPU) 1201 which can perform various appropriate actions and processes, such as performing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a random access Memory (Random Access Memory, RAM) 1203. In the RAM 1203, various programs and data required for the system operation are also stored. The CPU1201, ROM1202, and RAM 1203 are connected to each other through a bus 1204. An Input/Output (I/O) interface 1205 is also connected to bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output portion 1207 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and a speaker, etc.; a storage section 1208 including a hard disk or the like; and a communication section 1209 including a network interface card such as a LAN (Local Area Network ) card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the internet. The drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 1210 as needed, so that a computer program read out therefrom is installed into the storage section 1208 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1209, and/or installed from the removable media 1211. When executed by a Central Processing Unit (CPU) 1201, performs the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As another aspect, the present application also provides a computer-readable storage medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer readable storage medium carries computer readable instructions which, when executed by a processor, implement the method of any of the above embodiments.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the electronic device to perform the method of any of the embodiments described above.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a usb disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause an electronic device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical schemes described in the foregoing embodiments can be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (14)

1. An image recognition method, the method comprising:
acquiring text features of an image to be identified and a plurality of target labels;
inputting the image to be identified into a target image feature extractor for encoding to obtain an image encoding result; the target image feature extractor is obtained through training of a target sample image, a target positive concept tag and a target negative concept tag; the target positive conceptual label is extracted from text description information corresponding to the target sample image, and the target negative conceptual label is a negative conceptual label which is obtained from a negative conceptual label warehouse of the sample image and is different from the target positive conceptual label; negative concept labels in the negative concept label warehouse are extracted from text description information corresponding to the comparison sample image; the control sample image is different from the target sample image;
And determining an image recognition result of the image to be recognized according to the image coding result and the text characteristics of each of the plurality of target tags.
2. The method of claim 1, wherein the target image feature extractor is trained with a plurality of batches of sample images; the control sample image comprises a first control sample image and a second control sample image; the method further comprises the steps of:
acquiring an image different from the target sample image from the sample images of the historical batch as the first control sample image; the historical batch refers to a batch before the current batch where the target sample image is located;
acquiring sample images except the target sample image from the plurality of sample images of the current batch as the second control sample image;
and extracting negative conceptual labels from the text description information corresponding to the first comparison sample image and the text description information corresponding to the second comparison sample image to form a negative conceptual label warehouse of the sample image.
3. The method of claim 2, wherein extracting negative concept labels from text description information corresponding to the first control sample image and text description information corresponding to the second control sample image to form a negative concept label repository for the sample image comprises:
Extracting labels from text description information corresponding to the first control sample image and text description information corresponding to the second control sample image according to the label number requirement in a label warehouse to form a negative concept label warehouse of the sample image;
the method further comprises the steps of:
and acquiring a target number of negative concept labels, different from the target positive concept labels, from the negative concept label warehouse of the sample image as the target negative concept labels.
4. The method according to claim 1, wherein the target image feature extractor acquisition method includes:
inputting the target sample image into an initial image feature extractor to obtain a first coding result;
inputting the target positive concept label and the target negative concept label into a pre-trained text feature extractor to obtain a second coding result;
inputting the target sample image into a momentum model to obtain a third coding result, wherein the momentum model has the same structure as the initial image feature extractor, and the parameters of the momentum model are obtained according to the parameters of the initial image feature extractor and are different from the parameters of the initial image feature extractor;
And training the initial image feature extractor according to the first encoding result, the second encoding result and the third encoding result to obtain a target image feature extractor.
5. The method of claim 4, wherein the target image feature extractor is obtained through a plurality of rounds of iterative training; and before the inputting the target sample image into the momentum model, the method further comprises:
determining the parameters of the momentum model in a current round of iterative training according to the parameters of the momentum model in the preceding round of iterative training and the parameters of the initial image feature extractor in the current round of iterative training.
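(Editorial illustration.) Claim 5's per-iteration rule reads like an exponential moving average, as used for MoCo-style momentum encoders; the momentum coefficient 0.995 is an assumption.

```python
import torch

@torch.no_grad()
def momentum_update(momentum_encoder, image_encoder, m=0.995):
    # theta_momentum <- m * theta_momentum + (1 - m) * theta_initial,
    # i.e., this round's momentum parameters depend on last round's
    # momentum parameters and this round's extractor parameters.
    for p_m, p in zip(momentum_encoder.parameters(),
                      image_encoder.parameters()):
        p_m.mul_(m).add_(p, alpha=1.0 - m)
```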
6. The method of claim 4, wherein the training the initial image feature extractor according to the first encoding result, the second encoding result, and the third encoding result to obtain the target image feature extractor comprises:
inputting the target sample image into a pre-trained image feature extractor to obtain a fourth encoding result, wherein the pre-trained image feature extractor is a teacher model for performing knowledge distillation on the initial image feature extractor, and the pre-trained image feature extractor and the pre-trained text feature extractor belong to the same image-text pre-training model; and
training the initial image feature extractor according to the first encoding result, the second encoding result, the third encoding result, and the fourth encoding result to obtain the target image feature extractor.
7. The method of claim 6, wherein the training the initial image feature extractor according to the first encoding result, the second encoding result, the third encoding result, and the fourth encoding result to obtain the target image feature extractor comprises:
determining a first loss value according to the first encoding result and the second encoding result, wherein the first loss value indicates the prediction accuracy of the initial image feature extractor on the target positive concept label and the target negative concept label;
obtaining a pseudo positive label and a pseudo negative label from a label set according to the second encoding result and the third encoding result, wherein the label set comprises the target positive concept label and the target negative concept label;
determining a second loss value according to the pseudo positive label, the pseudo negative label, the first encoding result, and the second encoding result, wherein the second loss value indicates the prediction accuracy of the initial image feature extractor on the pseudo positive label and the pseudo negative label;
determining a third loss value according to the first encoding result and the fourth encoding result, wherein the third loss value indicates the degree of matching between the first encoding result and the fourth encoding result; and
training the initial image feature extractor according to the first loss value, the second loss value, and the third loss value to obtain the target image feature extractor.
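(Editorial illustration.) One way the three loss values of claim 7 could be combined, assuming binary cross-entropy for the two label-prediction terms and a cosine term for the teacher match; the weights, the loss forms, and the pseudo targets (derived as in claims 9-10) are assumptions.

```python
import torch.nn.functional as F

def combined_loss(first, second, fourth, label_targets, pseudo_targets,
                  w1=1.0, w2=1.0, w3=1.0):
    # Per-label logits: image features (first) vs. label text features (second).
    logits = F.normalize(first, dim=-1) @ F.normalize(second, dim=-1).t()
    # First loss: accuracy on the target positive/negative concept labels.
    loss1 = F.binary_cross_entropy_with_logits(logits, label_targets)
    # Second loss: accuracy on the pseudo positive/negative labels.
    loss2 = F.binary_cross_entropy_with_logits(logits, pseudo_targets)
    # Third loss: degree of matching between the student (first) and
    # teacher (fourth) encodings of the same target sample image.
    loss3 = 1.0 - F.cosine_similarity(first, fourth, dim=-1).mean()
    return w1 * loss1 + w2 * loss2 + w3 * loss3
```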
8. The method of claim 7, wherein the determining a first loss value according to the first encoding result and the second encoding result comprises:
determining a first probability of each target concept label according to the first encoding result and the second encoding result, wherein the first probability of a target concept label is the probability that the target sample image belongs to the target concept label, and the target concept labels comprise the target positive concept label and the target negative concept label; and
determining the first loss value according to the first probability of each target concept label.
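(Editorial illustration.) Claim 8's first probability can be read as an independent per-label score; a temperature-scaled sigmoid over cosine similarity is one plausible choice, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def first_probabilities(first, second, temperature=0.07):
    # Cosine similarity between each image encoding (first) and each
    # target concept label's text encoding (second), squashed into the
    # probability that the sample image belongs to that label.
    sim = F.normalize(first, dim=-1) @ F.normalize(second, dim=-1).t()
    return torch.sigmoid(sim / temperature)          # (batch, num_labels)
```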
9. The method of claim 7, wherein the obtaining a pseudo positive label and a pseudo negative label from a label set according to the second encoding result and the third encoding result comprises:
determining the similarity between the second encoding result and the third encoding result; and
obtaining the pseudo positive label and the pseudo negative label from the label set according to the similarity between the second encoding result and the third encoding result.
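(Editorial illustration.) A sketch of claim 9, assuming the similarity is cosine similarity between the momentum encoding (third) and each label's text encoding (second), with fixed confidence thresholds; both threshold values are assumptions.

```python
import torch.nn.functional as F

def select_pseudo_labels(second, third, hi=0.8, lo=0.2):
    # Similarity of the momentum image encoding to every label in the
    # label set (target positive + target negative concept labels).
    sim = F.normalize(third, dim=-1) @ F.normalize(second, dim=-1).t()  # (B, L)
    pseudo_positive = sim >= hi   # confidently present labels
    pseudo_negative = sim <= lo   # confidently absent labels
    return pseudo_positive, pseudo_negative
```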
10. The method of claim 7, wherein the determining a second loss value according to the pseudo positive label, the pseudo negative label, the first encoding result, and the second encoding result comprises:
determining a first probability of each target concept label according to the first encoding result and the second encoding result, wherein the first probability of a target concept label is the probability that the target sample image belongs to the target concept label, and the target concept labels comprise the target positive concept label and the target negative concept label;
acquiring the first probability of the target concept label matched with the pseudo positive label as the probability of the pseudo positive label, and acquiring the first probability of the target concept label matched with the pseudo negative label as the probability of the pseudo negative label; and
determining the second loss value according to the probability of each pseudo positive label and the probability of each pseudo negative label.
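(Editorial illustration.) Claim 10's second loss then scores only the labels that received a pseudo assignment, reusing the first probabilities; the masked binary cross-entropy form is an assumption.

```python
import torch.nn.functional as F

def second_loss(probs, pseudo_positive, pseudo_negative):
    # probs: first probabilities, (batch, num_labels), values in [0, 1].
    mask = (pseudo_positive | pseudo_negative).float()
    target = pseudo_positive.float()
    bce = F.binary_cross_entropy(probs, target, reduction="none")
    # Average only over positions that carry a pseudo label.
    return (bce * mask).sum() / mask.sum().clamp(min=1.0)
```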
11. The method according to any one of claims 1-10, wherein the image encoding result comprises a visual global representation and a plurality of visual local representations of the image to be recognized; and the determining an image recognition result of the image to be recognized according to the image encoding result and the respective text features of the plurality of target tags comprises:
determining the similarity between the text feature of each target tag and each visual local representation to obtain the similarity of each of the plurality of visual local representations under each target tag;
determining the similarity between the text feature of each target tag and the visual global representation as the similarity of the visual global representation under each target tag; and
determining the image recognition result of the image to be recognized according to the similarities of the plurality of visual local representations under each target tag and the similarity of the visual global representation under each target tag.
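(Editorial illustration.) A sketch of claim 11's aggregation, assuming the local similarities are reduced by a max over patches and mixed with the global similarity by a fixed weight; both the max-pooling and the mixing weight are assumptions.

```python
import torch.nn.functional as F

def recognition_scores(tag_text_features, global_feat, local_feats, alpha=0.5):
    """tag_text_features: (L, d); global_feat: (d,); local_feats: (N, d)."""
    txt = F.normalize(tag_text_features, dim=-1)
    g = F.normalize(global_feat, dim=-1)
    loc = F.normalize(local_feats, dim=-1)
    global_sim = txt @ g                             # (L,) per-tag global similarity
    local_sim = (txt @ loc.t()).max(dim=1).values    # best-matching local patch per tag
    return alpha * global_sim + (1 - alpha) * local_sim
```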
12. An image recognition apparatus, the apparatus comprising:
an acquisition module, configured to acquire the image to be recognized and the respective text features of a plurality of target tags;
an encoding module, configured to input the image to be recognized into a target image feature extractor for encoding to obtain an image encoding result, wherein the target image feature extractor is obtained through training with a target sample image, a target positive concept label, and a target negative concept label; the target positive concept label is extracted from text description information corresponding to the target sample image; the target negative concept label is a negative concept label that is obtained from a negative concept label repository of the sample image and is different from the target positive concept label; the negative concept labels in the negative concept label repository are extracted from text description information corresponding to a control sample image; and the control sample image is different from the target sample image; and
an image recognition module, configured to determine an image recognition result of the image to be recognized according to the image encoding result and the respective text features of the plurality of target tags.
13. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of any of claims 1-11.
14. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a program code, which is callable by a processor for performing the method according to any one of claims 1-11.
CN202310681800.3A 2023-06-08 2023-06-08 Image recognition method, device, electronic equipment and storage medium Pending CN116977707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310681800.3A CN116977707A (en) 2023-06-08 2023-06-08 Image recognition method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310681800.3A CN116977707A (en) 2023-06-08 2023-06-08 Image recognition method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116977707A true CN116977707A (en) 2023-10-31

Family

ID=88480522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310681800.3A Pending CN116977707A (en) 2023-06-08 2023-06-08 Image recognition method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116977707A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118072261A (en) * 2024-04-25 2024-05-24 杭州华是智能设备有限公司 Ship detection method and system based on polymorphic supervision text guidance
CN118072261B (en) * 2024-04-25 2024-06-28 杭州华是智能设备有限公司 Ship detection method and system based on polymorphic supervision text guidance

Legal Events

Date Code Title Description
PB01 Publication