WO2023231355A1 - Procédé et appareil de reconnaissance d'image - Google Patents

Procédé et appareil de reconnaissance d'image Download PDF

Info

Publication number
WO2023231355A1
WO2023231355A1 PCT/CN2022/137039
Authority
WO
WIPO (PCT)
Prior art keywords
image
sample
feature vector
feature
vector
Prior art date
Application number
PCT/CN2022/137039
Other languages
English (en)
Chinese (zh)
Inventor
杨振宇
李剑平
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院 filed Critical 深圳先进技术研究院
Publication of WO2023231355A1 publication Critical patent/WO2023231355A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present application relates to the field of image processing technology. Specifically, the present application relates to an image recognition method and device.
  • Image recognition is an important research topic in the field of computer vision and has been widely used in many fields. For example, image recognition of plankton in the marine environment enables long-term, continuous in-situ observation of plankton.
  • Image recognition usually uses a training set to train a convolutional neural network model, and then predicts the category of the image to be recognized based on that model to obtain the target category of the image to be recognized.
  • However, the training set needs to be continuously updated, which in turn requires the convolutional neural network model to be retrained frequently in order to maintain the recognition performance of image recognition based on that model.
  • Each embodiment of the present application provides an image recognition method, device, electronic device, and storage medium, which can solve the problems of low recognition accuracy, instability, and poor generalization performance in related technologies.
  • the technical solutions are as follows:
  • According to one aspect, an image recognition method includes: obtaining an image to be recognized; performing image feature extraction on the image to be recognized to obtain a first feature vector; searching, in a retrieval library that stores sample images and their corresponding sample categories, for a sample image whose second feature vector has a similarity to the first feature vector that satisfies a similarity condition, where the second feature vector represents the image features of the sample image; and determining the target category of the image to be recognized according to the sample category corresponding to the found sample image.
  • According to another aspect, an image recognition device includes: an image acquisition module, used to acquire an image to be recognized; a feature extraction module, used to extract image features from the image to be recognized to obtain a first feature vector; an image search module, used to search, in a retrieval library that stores sample images and their corresponding sample categories, for a sample image whose second feature vector has a similarity to the first feature vector that satisfies a similarity condition, where the second feature vector represents the image features of the sample image; and an image recognition module, used to determine the target category of the image to be recognized based on the sample category corresponding to the found sample image.
  • the feature extraction module includes: an extractor unit, configured to convert the image to be recognized into the first feature vector using a feature extractor that has completed model training.
  • the device further includes: a model training module, configured to perform model training on a basic model according to the image pairs in the training set to obtain the feature extractor, where the basic model includes a first training branch and a second training branch, the first training branch and the second training branch respectively include a feature extraction layer and a dimensionality reduction layer;
  • In some embodiments, the model training module includes: an image traversal unit, used to traverse the image pairs in the training set, where the image pairs include positive sample pairs and negative sample pairs, the two sample images in a positive sample pair belong to the same sample category, and the two sample images in a negative sample pair belong to different sample categories;
  • the traversal includes: inputting the two sample images in the image pair into the first training branch and the second training branch respectively for processing, and calculating a model loss value according to the processing results obtained by the first training branch and the second training branch;
  • and a convergence unit, used to obtain the feature extractor from the converged feature extraction layer in the basic model if the model loss value satisfies the convergence condition.
  • In some embodiments, the device further includes an image pair building module, which includes: an amplification unit, configured to perform at least two different image data enhancement operations on each sample image in the training set, thereby amplifying each sample image into at least a first enhanced image and a second enhanced image; and a pairing unit, configured to perform image pairing processing on the first and second enhanced images amplified from the sample images in the training set to obtain the image pairs.
  • In some embodiments, the image search module includes: a similarity calculation unit, configured to calculate, for each second feature vector in a feature vector set, the similarity between that second feature vector and the first feature vector, where the feature vector set is constructed from the second feature vectors of the sample images in the retrieval library; and an image search unit, used to select the sample image whose second feature vector has the highest similarity to the first feature vector as the sample image found from the retrieval library.
  • In some embodiments, the device further includes a set building module, configured to build the feature vector set from the second feature vectors of the sample images in the retrieval library; the set building module includes: a vector adding unit, used to extract image features from each sample image in the retrieval library, obtain the second feature vector of each sample image, and add it to the feature vector set; a vector traversal unit, used to traverse the second feature vectors in the feature vector set, take the traversed second feature vector as a first vector, and calculate the similarity between the first vector and each remaining second feature vector in the feature vector set to obtain first similarities; and a vector deletion unit, used to delete second feature vectors with high redundancy from the feature vector set based on the first similarities, where redundancy indicates the number of similar second feature vectors in the feature vector set.
  • In some embodiments, the vector deletion unit includes: a vector determination subunit, configured to take each second feature vector whose first similarity to the first vector is greater than a first set threshold as a second vector; a similarity calculation subunit, used to calculate the similarity between each second vector and the remaining second feature vectors in the feature vector set to obtain second similarities; and a redundancy calculation subunit, used to determine the redundancy of the first vector according to the number of second feature vectors whose first similarity to the first vector is greater than the first set threshold, and to determine the redundancy of each second vector according to the number of second feature vectors whose second similarity to that second vector is greater than a second set threshold.
  • In some embodiments, the image recognition module includes: an image recognition unit, configured to use the sample category corresponding to the found sample image as the target category of the image to be recognized.
  • In some embodiments, the device further includes: a new category correction module, configured to correct the target category of the image to be recognized in response to a category correction instruction; and a new category adding module, configured to add the image to be recognized and its corrected target category to the retrieval library in response to a category adding instruction when the corrected target category is a new category.
  • According to another aspect, an electronic device includes: at least one processor, at least one memory, and at least one communication bus, wherein a computer program is stored in the memory and the processor reads the computer program from the memory through the communication bus; when the computer program is executed by the processor, the image recognition method described above is implemented.
  • According to another aspect, a storage medium has a computer program stored thereon; when the computer program is executed by a processor, the image recognition method described above is implemented.
  • According to another aspect, a computer program product includes a computer program stored in a storage medium; a processor of an electronic device reads the computer program from the storage medium and executes it, such that the electronic device implements the image recognition method described above.
  • Figure 1 is a schematic diagram of an implementation environment involved in this application.
  • Figure 2 is a flow chart of an image recognition method according to an exemplary embodiment.
  • Figure 3 is a schematic diagram showing that the image to be recognized is an ROI image according to an exemplary embodiment.
  • Figure 4 is a schematic structural diagram of a basic model according to an exemplary embodiment.
  • Figure 5 is a schematic structural diagram of a feature extraction layer according to an exemplary embodiment.
  • Figure 6 is a method flow chart of the model training process of the feature extraction layer according to an exemplary embodiment.
  • Figure 7 is a schematic diagram of an image pairing process according to an exemplary embodiment.
  • Figure 8a is a method flow chart of a construction process of a feature vector set according to an exemplary embodiment.
  • Figure 8b is a method flow chart of one embodiment of step 550 involved in the embodiment corresponding to Figure 8a.
  • Figure 9 is a flow chart of another image recognition method according to an exemplary embodiment.
  • Figure 10 is a schematic diagram of an image recognition framework based on image retrieval according to an exemplary embodiment.
  • Figure 11 is a structural block diagram of an image recognition device according to an exemplary embodiment.
  • Figure 12 is a hardware structure diagram of an electronic device according to an exemplary embodiment.
  • Figure 13 is a structural block diagram of an electronic device according to an exemplary embodiment.
  • As noted above, the training set needs to be continuously updated, which in turn requires the convolutional neural network model to be retrained frequently in order to maintain the recognition performance of image recognition based on that model.
  • On the one hand, updating the training set relies on a large amount of manual annotation and manual correction; on the other hand, a training set constructed from images sampled at limited spatial and temporal scales and resolutions can hardly reflect fully and faithfully the plankton in the real marine environment. Both factors inevitably affect the recognition accuracy of image recognition and cannot meet the needs of real-time observation of plankton in the marine environment.
  • The image recognition method provided by this application can effectively improve recognition accuracy and robustness and fully ensure generalization performance. The method is suitable for an image recognition device, which can be deployed in electronic equipment configured with the von Neumann architecture; for example, the electronic equipment can be a desktop computer, a laptop computer, a server, etc.
  • Figure 1 is a schematic diagram of an implementation environment involved in an image recognition method. It should be noted that this implementation environment is only an example adapted to the present application and cannot be considered to impose any limitation on the scope of the present application.
  • the implementation environment includes a collection terminal 110 and a server terminal 130.
  • the collection terminal 110 can also be considered as an image collection device, including but not limited to a camera, a still camera, a camcorder and other electronic devices with a shooting function.
  • the collection terminal 110 is an underwater camera.
  • the server 130 can be an electronic device such as a desktop computer, a laptop computer, a server, etc., or it can be a computer cluster composed of multiple servers, or even a cloud computing center composed of multiple servers.
  • the server 130 is used to provide background services.
  • the background services include but are not limited to image recognition services and so on.
  • a network communication connection is established in advance between the server 130 and the collection terminal 110 through wired or wireless means, and data transmission between the server 130 and the collection terminal 110 is implemented through the network communication connection.
  • the transmitted data includes but is not limited to: images to be recognized, etc.
  • the collection terminal 110 captures and collects the image to be recognized, and uploads the image to be recognized to the server 130 to request the server 130 to provide image recognition services.
  • After receiving the image to be recognized, the image recognition service is called to search, in the retrieval library that stores sample images and their corresponding sample categories, for sample images similar to the image to be recognized, and then to determine the target category of the image to be recognized based on the sample category corresponding to the found sample image. This realizes an image recognition solution in which image retrieval replaces image classification, thereby solving the problems of low recognition accuracy, poor robustness, and poor generalization performance in related technologies.
  • the electronic equipment can be the server 130 in the implementation environment shown in Figure 1.
  • the method may include the following steps:
  • Step 310 Obtain the image to be recognized.
  • the image to be recognized is generated by photographing and collecting the environment containing the target object by the image acquisition device in the implementation environment shown in Figure 1 .
  • the target object refers to an object in the shooting environment.
  • the target object may be an underwater creature, and specifically the underwater creature may be a plankton in a marine environment.
  • the shooting can be a single shooting or a continuous shooting.
  • a video can be obtained, and the image to be recognized can be any number of frames in the video.
  • multiple photos can be obtained, and the image to be recognized can be any number of photos among the multiple photos.
  • The image to be recognized in this embodiment may refer to a dynamic image, such as multiple frames in a video or multiple photos, or to a static image, such as a single frame in a video or a single photo.
  • the image recognition in this embodiment can be performed on dynamic images or on static images, which is not limited here.
  • The image to be recognized may be captured and collected in real time by the image acquisition device, or it may have been captured in a historical time period and pre-stored in the electronic device. For the electronic device, after the image acquisition device captures the image to be recognized, the image can be processed in real time or stored for later processing; for example, it can be processed when the CPU usage of the electronic device is low, or according to the instructions of the staff. Therefore, the image recognition in this embodiment can be based on an image to be recognized obtained in real time or obtained in a historical time period, which is not specifically limited here.
  • In some embodiments, the image to be recognized is an ROI (region of interest) image; that is, in the image to be recognized, the target object is located in the region of interest, which can also be understood as the target object having passed in front of the sensor, and the region of interest is significantly different from the background region.
  • In Figure 3, the target object is plankton, located in the region of interest (gray-white area), which is significantly different from the background region (black area).
  • Step 330 Extract image features from the image to be recognized to obtain a first feature vector.
  • The first feature vector is used to represent the image features of the image to be recognized; it can be considered an accurate description of those image features. It should be understood that different images to be recognized yield different extracted image features and, correspondingly, different first feature vectors.
  • In some embodiments, image feature extraction can be implemented with feature extraction algorithms such as histogram of oriented gradients (HOG) features, local binary pattern (LBP) features, and Haar-like features.
  • In some embodiments, image feature extraction is implemented through convolution kernels. It should be noted that different numbers and sizes of convolution kernels yield first feature vectors of different lengths, reflecting the image to be recognized at different scales.
  • the image features are extracted through a feature extractor.
  • the feature extractor that has completed model training is used to convert the image to be recognized into a first feature vector.
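  • As a minimal illustration (not the patent's exact network), the following Python sketch uses a generic ResNet-style backbone with its classification head removed to stand in for the trained feature extractor; the 224×224 input size follows the text, while the backbone choice and preprocessing details are assumptions:

```python
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Stand-in backbone playing the role of the trained feature extractor
# (the patent's extractor is a 50-layer ResNeXt-based network with SE
# attention; a plain ResNet-50 is used here only for illustration).
backbone = models.resnet50(weights=None)
extractor = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC layer
extractor.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),   # the unified input size mentioned in the text
    T.ToTensor(),
])

def extract_first_feature_vector(image_path: str) -> torch.Tensor:
    """Convert the image to be recognized into its first feature vector."""
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)            # (1, 3, 224, 224)
    with torch.no_grad():
        vec = extractor(batch).flatten(1).squeeze(0)  # e.g. a length-2048 vector
    return vec / vec.norm()                           # normalize for cosine similarity
```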
  • Step 350 In the retrieval database used to store sample images and their corresponding sample categories, search for sample images whose similarity between the second feature vector and the first feature vector satisfies the similarity condition.
  • the retrieval database essentially establishes a correspondence between sample images and their corresponding sample categories.
  • In this way, given a sample image, its corresponding sample category can be quickly determined, which then serves as the basis for image retrieval.
  • In some embodiments, a sample image refers to an image labeled with a sample category, that is, an image carrying a label indicating its sample category.
  • The essence of image retrieval is to measure the similarity between the image to be recognized and the sample images in the retrieval library.
  • Image recognition based on image retrieval does not directly obtain the target category of the image to be recognized; instead, it compares the image to be recognized with the sample images in the retrieval library and obtains the target category indirectly: it first obtains the sample category corresponding to the sample image whose similarity to the image to be recognized satisfies the similarity condition, and from that obtains the target category of the image to be recognized.
  • the comparison of the similarity between the image to be recognized and the sample image in the retrieval database is achieved by calculating the similarity between the first feature vector and the second feature vector.
  • the first feature vector is used to represent the image features of the image to be recognized
  • the second feature vector is used to represent the image features of the sample images in the retrieval database.
  • the similarity calculation scheme includes but is not limited to: cosine similarity, Euclidean distance, Manhattan distance, Jaccard similarity coefficient, Pearson correlation coefficient, etc.
  • In calculation formula (1), Similarity(x, y) represents the similarity between x and y, with a value range of [0, 1]; x represents the first feature vector of the image to be recognized, and y represents the second feature vector of the sample image. It should be understood that the closer the similarity is to 1, the closer the first and second feature vectors are, that is, the more similar the image to be recognized is to the sample image.
  • In some embodiments, the image to be recognized is not limited to a static image, such as a photo or a single frame, but can also be a dynamic image. If the image to be recognized is a dynamic image, such as multiple photos or multiple frames, calculation formula (1) can be combined with calculation formula (2) to calculate multiple similarities at the same time.
  • In calculation formula (2), V represents the similarity result matrix, Q represents the matrix of first feature vectors of the image to be recognized, and G represents the matrix of second feature vectors of the sample images in the retrieval library. The value in the j-th column of the i-th row of V represents the similarity between the first feature vector of the i-th photo or frame in the image to be recognized and the second feature vector of the j-th sample image in the retrieval library.
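  • Calculation formulas (1) and (2) are not reproduced verbatim in this text; the sketch below assumes cosine similarity over L2-normalized vectors as a stand-in, under which the batched form of formula (2) reduces to a single matrix product:

```python
import torch
import torch.nn.functional as F

def similarity(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Scalar similarity between one first feature vector x and one second
    feature vector y (a stand-in for formula (1); cosine similarity assumed)."""
    return F.cosine_similarity(x.unsqueeze(0), y.unsqueeze(0)).squeeze(0)

def similarity_matrix(Q: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """Batched form (a stand-in for formula (2)): row i of V holds the
    similarities between the i-th image to be recognized and every sample
    image in the retrieval library."""
    Qn = F.normalize(Q, dim=1)   # (num_queries, d)
    Gn = F.normalize(G, dim=1)   # (num_samples, d)
    return Qn @ Gn.T             # V: (num_queries, num_samples)
```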
  • In some embodiments, the similarity condition refers to the highest similarity; therefore, the sample image whose second feature vector has the highest similarity to the first feature vector is taken as the sample image found from the retrieval library.
  • In some embodiments, the second feature vectors are pre-calculated and stored in the storage area of the electronic device. In this way, when performing image recognition on different images to be recognized, the pre-calculated second feature vectors can be read directly from the storage area, avoiding repeated extraction of second feature vectors in each image recognition process and thereby further improving the recognition efficiency of image recognition.
  • In some embodiments, the second feature vectors are stored in the storage area of the electronic device in the form of a LUT (look-up table); during the image recognition process, the LUT can be loaded directly into the memory of the electronic device, avoiding repeated extraction of the second feature vectors in each image recognition process.
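  • A minimal sketch of such a LUT: the second feature vectors are extracted once, persisted, and loaded directly at recognition time (the file name and the extractor interface are illustrative assumptions):

```python
import numpy as np

def build_lut(sample_images, extract):
    """Pre-compute the LUT of second feature vectors.
    `extract` maps an image to its feature vector (np.ndarray)."""
    vectors = np.stack([extract(img) for img in sample_images])
    np.save("retrieval_lut.npy", vectors)   # persist to the storage area

def load_lut() -> np.ndarray:
    """Load the LUT once into memory for use during recognition."""
    return np.load("retrieval_lut.npy")
```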
  • the above process is especially suitable for image recognition of out-of-distribution samples.
  • For out-of-distribution samples, an image recognition scheme based on image classification not only degrades classification accuracy but also leads to inaccurate abundance quantification.
  • the image recognition solution based on image retrieval can more accurately exclude out-of-distribution samples through similarity calculation, thereby effectively ensuring the recognition accuracy of image recognition.
  • Step 370 Determine the target category of the image to be recognized based on the sample category corresponding to the found sample image.
  • the sample category corresponding to the found sample image is the recognition result obtained by image recognition of the image to be recognized, that is, the target category of the image to be recognized.
  • However, the target category of the image to be recognized may be a new category, that is, one that does not belong to any sample category in the retrieval library; in other words, the target category of the image to be recognized is an unknown category. In this case, the target category cannot actually be obtained correctly from the sample category corresponding to the found sample image.
  • a decision condition is proposed to reject the recognition of unknown categories, thereby avoiding recognition errors.
  • the decision condition refers to that the similarity between the image to be recognized and the found sample image is greater than the similarity threshold.
  • The category decision process based on this decision condition specifically refers to: if the similarity between the second feature vector of the found sample image and the first feature vector of the image to be recognized is greater than the similarity threshold, the sample category corresponding to the found sample image is used as the target category of the image to be recognized; otherwise, the target category of the image to be recognized is determined to be a new category.
  • the decision-making condition may also be related to the weight configured for the found sample image, which is not specifically limited here.
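  • A minimal sketch of this category decision process, assuming cosine similarity and an illustrative similarity threshold (the text does not fix a value):

```python
import torch
import torch.nn.functional as F

def decide_category(first_vec: torch.Tensor,
                    lut_vectors: torch.Tensor,
                    lut_categories: list,
                    sim_threshold: float = 0.8) -> str:
    """Apply the rejection rule: accept the most similar sample image's
    category only if its similarity exceeds the threshold.
    sim_threshold = 0.8 is an assumption, not a value from the text."""
    sims = F.normalize(lut_vectors, dim=1) @ F.normalize(first_vec, dim=0)
    best = int(sims.argmax())
    if float(sims[best]) > sim_threshold:
        return lut_categories[best]   # sample category of the found sample image
    return "new category"             # reject recognition of an unknown category
```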
  • Through the above process, an image recognition solution in which image retrieval replaces image classification is implemented. Since the recognition accuracy of image retrieval depends on the sample images in the retrieval library and their corresponding sample categories, rather than on frequent changes to the training set and retraining of a convolutional neural network model as in image classification, recognition accuracy can be fully ensured while minimizing manual participation, effectively solving the problems of low recognition accuracy, poor robustness, and poor generalization performance in related technologies.
  • Figure 4 shows a schematic structural diagram of the basic model in one embodiment.
  • the basic model includes a first training branch and a second training branch.
  • The first training branch and the second training branch each include a feature extraction layer and a dimensionality reduction layer.
  • the feature extraction layer can be considered as a feature extractor that has not completed model training and is used to extract image features;
  • The dimensionality reduction layer consists of two fully connected layers and is used to further reduce the dimensionality of the feature vector produced by the feature extraction layer, for example, converting a feature vector of length 2048 into a feature vector of length 128.
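  • A minimal sketch of such a dimensionality reduction layer; only the 2048 → 128 mapping and the use of two fully connected layers are stated in the text, so the hidden width and activation are assumptions:

```python
import torch.nn as nn

# Two fully connected layers mapping the length-2048 feature vector to
# length 128; the 512-wide hidden layer and ReLU are illustrative choices.
projection_head = nn.Sequential(
    nn.Linear(2048, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 128),
)
```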
  • Figure 5 shows a schematic structural diagram of the feature extraction layer in one embodiment.
  • the feature extraction layer is a convolutional neural network model with a structural depth of 50 layers and does not include a fully connected layer.
  • As shown in Figure 5, the feature extraction layer is composed of convolution layers (Conv), pooling layers (Pool), and activation function layers (ReLU).
  • The feature extraction layer is also based on the ResNeXt module and introduces the SE (Squeeze-and-Excitation) attention module.
  • The feature vector obtained by this feature extraction layer has strong abstract expression ability, and with the assistance of the attention mechanism it can focus on the parts of the image that play a major role in recognition, such as the region of interest in an ROI image, thus fully ensuring that image features are extracted effectively.
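  • For illustration, a standard Squeeze-and-Excitation module of the kind the text introduces into the ResNeXt-based backbone (the reduction ratio r = 16 is a common default, not a value given in the text):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard Squeeze-and-Excitation channel attention (a sketch)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global spatial context
        self.fc = nn.Sequential(              # excitation: per-channel weights
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                          # reweight feature-map channels
```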
  • The model training process may include the following steps:
  • Step 410 Traverse the image pairs in the training set.
  • the image pairs include positive sample pairs and negative sample pairs.
  • the two sample images in the positive sample pair belong to the same sample category, and the two sample images in the negative sample pair belong to different sample categories.
  • Image data enhancement processing includes but is not limited to: random cropping, rotation, flipping, grayscale conversion, brightness adjustment, contrast adjustment, saturation adjustment, etc., which are not limited here.
  • the first enhanced image and the second enhanced image obtained by amplifying each sample image in the training set are subjected to image pairing processing to obtain an image pair.
  • the sample images in the training set include 701 and 702.
  • The first and second enhanced images amplified from sample image 701 are 7011 and 7012, respectively, and the first and second enhanced images amplified from sample image 702 are 7021 and 7022, respectively.
  • The constructed image pairs then include {7011, 7012}, {7011, 7021}, {7011, 7022}, {7012, 7021}, {7012, 7022}, and {7021, 7022}.
  • Among them, {7011, 7012} and {7021, 7022} are positive sample pairs, while {7011, 7021}, {7011, 7022}, {7012, 7021}, and {7012, 7022} are negative sample pairs.
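  • The pairing rule of this worked example can be sketched as follows; the image identifiers mirror those above and are purely illustrative:

```python
import itertools

def build_image_pairs(augmented):
    """Pair up enhanced images: same source image -> positive pair,
    different source images -> negative pair.
    `augmented` maps each sample image id to its two enhanced images."""
    views, labels = [], []
    for sample_id, (v1, v2) in augmented.items():
        views += [v1, v2]
        labels += [sample_id, sample_id]
    positive, negative = [], []
    for (a, la), (b, lb) in itertools.combinations(zip(views, labels), 2):
        (positive if la == lb else negative).append((a, b))
    return positive, negative

pos, neg = build_image_pairs({"701": ["7011", "7012"], "702": ["7021", "7022"]})
# pos == [("7011", "7012"), ("7021", "7022")]
# neg == [("7011", "7021"), ("7011", "7022"), ("7012", "7021"), ("7012", "7022")]
```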
  • the traversal process for image pairs in the training set can include the following steps:
  • Step 411 Input the two sample images in the image pair to the first training branch and the second training branch respectively for processing.
  • The processing at least includes: extracting image features through the feature extraction layer, and reducing the dimensionality of the feature vector through the dimensionality reduction layer, etc.
  • the sample images are pre-processed before being input to the first training branch or the second training branch.
  • Preprocessing includes but is not limited to: padding, scaling, normalization, etc. Because distortion is thereby avoided, this helps further improve recognition accuracy.
  • The purpose of preprocessing such as padding and scaling is to ensure a unified input size for the first training branch and the second training branch.
  • the unified input size is 224 ⁇ 224.
  • Normalization preprocessing means that, after encoding preprocessing, the sample image is normalized pixel by pixel according to calculation formula (3): I_Norm = (I − mean) / std.
  • In formula (3), I_Norm represents a pixel of the sample image after normalization, and I represents the pixel to be processed in the sample image; mean and std respectively represent the pixel mean and pixel standard deviation over all pixels of all sample images in the training set.
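  • A minimal preprocessing sketch combining padding, scaling to the unified 224×224 size, and the pixel-wise normalization of formula (3); the grayscale mode and the mean/std values are illustrative assumptions:

```python
import numpy as np
from PIL import Image

MEAN, STD = 0.46, 0.21   # illustrative; the real values come from the training set

def preprocess_sample(path: str) -> np.ndarray:
    """Pad/scale a sample image to 224x224, then apply formula (3):
    I_Norm = (I - mean) / std."""
    img = Image.open(path).convert("L")
    side = max(img.size)
    canvas = Image.new("L", (side, side))   # pad to a square to avoid distortion
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    arr = np.asarray(canvas.resize((224, 224)), dtype=np.float32) / 255.0
    return (arr - MEAN) / STD
```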
  • Step 413 Calculate the model loss value based on the processing results obtained by the first training branch and the second training branch.
  • The calculation formula (4) of the model loss value is the supervised contrastive loss:
  • L_sup = Σ_{i∈I} (−1/|P(i)|) Σ_{p∈P(i)} log [ exp(z_i · z_p / τ) / Σ_{a∈A(i)} exp(z_i · z_a / τ) ]
  • In formula (4), L_sup represents the model loss value; I represents the set of all sample images in the training set; P(i) represents the set of positive samples paired with the i-th sample image in the training set; A(i) represents the set of all sample images in the training set excluding the i-th sample image; z_i denotes the dimensionality-reduced feature of the i-th sample image; and τ is a temperature parameter.
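  • A sketch of this loss in Python, following the symbol definitions above (the temperature value τ = 0.1 is an assumption, as the text does not give one):

```python
import torch
import torch.nn.functional as F

def sup_con_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss over a batch.
    z: (N, d) outputs of the dimensionality reduction layer.
    labels: (N,) sample-category ids defining P(i)."""
    z = F.normalize(z, dim=1)                         # L2-normalize the features
    sim = z @ z.T / tau                               # scaled pairwise similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))   # exclude i itself from A(i)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask  # P(i)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss_per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts
    return loss_per_anchor[pos_mask.sum(dim=1) > 0].mean()
```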
  • If the model loss value satisfies the convergence condition, step 430 is executed; otherwise, step 415 is executed.
  • The convergence condition can refer to the model loss value reaching a minimum or falling below a loss value threshold, or to the number of iterations reaching an iteration threshold; this is not limited here and can be set flexibly according to the actual needs of the application scenario.
  • Step 415 Update the parameters of the basic model and return to step 410.
  • Step 430 The feature extractor is obtained by converging the feature extraction layer in the basic model.
  • Through the above process, the supervised contrastive learning training of the feature extraction layer is completed, so that the feature extractor draws the two sample images of a positive pair closer in the feature space and pushes the two sample images of a negative pair farther apart.
  • After model training, the dual training branches and the dimensionality reduction layers are discarded, and only one of the feature extraction layers in the dual training branches is retained as the feature extractor for subsequent image recognition.
  • Compared with the convolutional neural network model used in image classification, this greatly simplifies the model structure, further avoids relying on frequent changes to the training set to maintain recognition performance, and is more conducive to improving recognition accuracy.
  • the above method may further include the following steps: constructing a feature vector set from the second feature vector of the sample image in the retrieval library.
  • the feature vector set is a LUT.
  • In this way, the second feature vectors can be stored in the storage area of the electronic device as a LUT, avoiding repeated extraction of the second feature vectors in each image recognition process and thereby improving the recognition efficiency of image recognition.
  • The inventor also realized that as the number of sample images in the retrieval library increases, the number of pre-calculated second feature vectors in the LUT also increases. Since the similarity between the first feature vector and each second feature vector in the LUT must be calculated, the number of second feature vectors in the LUT affects the similarity calculation speed and thus the recognition efficiency of image recognition.
  • For this reason, a construction process for the feature vector set is proposed to realize LUT pruning, which not only reduces the size of the LUT, that is, the number of second feature vectors in it, but also maintains the diversity of the second feature vectors in the LUT as much as possible.
  • the construction process of the feature vector set may include the following steps:
  • Step 510 Perform image feature extraction on each sample image in the retrieval library, obtain the second feature vector of each sample image in the retrieval library, and add it to the feature vector set.
  • Step 530 Traverse the second feature vectors in the feature vector set, use the traversed second feature vector as the first vector, and calculate the similarity between the first vector and each remaining second feature vector in the feature vector set to obtain the first similarities.
  • Step 550 Based on the first similarity, delete the second feature vector with high redundancy from the feature vector set.
  • The redundancy of a second feature vector indicates the number of second feature vectors in the feature vector set that are similar to it. It should be understood that the higher the redundancy, the greater that number; the sample image corresponding to such a second feature vector can be considered redundant within the feature vector set, so the second feature vector can be deleted from the set.
  • the LUT pruning process can include the following steps:
  • Step 551 Use the second feature vector whose first similarity to the first vector is greater than the first set threshold as the second vector.
  • Step 553 Calculate the similarity between the second vector and the remaining second feature vectors in the feature vector set to obtain the second similarity.
  • Step 555 Determine the redundancy of the first vector based on the number of second feature vectors whose first similarity to the first vector is greater than the first set threshold, and determine the redundancy of each second vector based on the number of second feature vectors whose second similarity to that second vector is greater than the second set threshold.
  • The redundancy of the first vector indicates the number of second feature vectors in the feature vector set that are similar to the first vector, where similar means the first similarity is greater than the first set threshold.
  • The redundancy of a second vector indicates the number of second feature vectors in the feature vector set that are similar to that second vector, where similar means the second similarity is greater than the second set threshold.
  • Step 557 Based on the redundancy of the second feature vector, delete the corresponding second feature vector from the feature vector set.
  • If the redundancy of the first vector is the greatest, the first vector is deleted from the feature vector set; if the redundancy of a second vector is greater than the redundancy of the first vector, that second vector is deleted from the feature vector set.
  • For example, suppose the second feature vectors in the feature vector set are A, B, C, and D. Taking A as the first vector, the second feature vectors B, C, and D, whose first similarities to A are greater than the first set threshold, are used as the second vectors.
  • The similarities between the second vector B and the remaining second feature vectors A, C, and D are calculated to obtain second similarities of 0.91, 0.7, and 0.97; the similarities between the second vector C and the remaining second feature vectors A, B, and D give 0.95, 0.97, and 0.75; and the similarities between the second vector D and the remaining second feature vectors A, B, and C give 0.97, 0.75, and 0.77.
  • Assuming the first set threshold is 0.8 and the second set threshold is also 0.8: the number of second feature vectors whose first similarity to the first vector A is greater than 0.8 is 3 (B, C, D); the number of second feature vectors whose second similarity to the second vector B is greater than 0.8 is 2 (A, D); the number whose second similarity to the second vector C is greater than 0.8 is 2 (A, B); and the number whose second similarity to the second vector D is greater than 0.8 is 1 (A).
  • Therefore, the redundancy of the first vector A is 3, the redundancy of the second vector B is 2, the redundancy of the second vector C is 2, and the redundancy of the second vector D is 1.
  • Since its redundancy of 3 is the highest, the first vector A is deleted from the feature vector set.
  • the redundancy can also be expressed in other forms, such as a number-based normalization method, which is not specifically limited here.
  • the first set threshold and the second set threshold may be the same or different, and they may be flexibly adjusted according to the actual needs of the application scenario to balance recognition efficiency and recognition accuracy. For example, in application scenarios with high recognition efficiency requirements, set a smaller first set threshold.
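  • One possible reading of the pruning procedure as a Python sketch; it assumes L2-normalized vectors (so the dot product is the similarity) and uses the thresholds of the worked example above:

```python
import numpy as np

def prune_lut(vectors: np.ndarray, t1: float = 0.8, t2: float = 0.8) -> np.ndarray:
    """Redundancy-based LUT pruning: repeatedly delete the most redundant
    vector among a traversed first vector and its similar second vectors."""
    keep = list(range(len(vectors)))
    changed = True
    while changed:
        changed = False
        sims = vectors[keep] @ vectors[keep].T
        np.fill_diagonal(sims, 0.0)
        for i in range(len(keep)):
            neighbors = np.where(sims[i] > t1)[0]   # similar to the first vector
            if len(neighbors) == 0:
                continue
            red_i = len(neighbors)                  # redundancy of the first vector
            red_n = [int((sims[j] > t2).sum()) for j in neighbors]
            if red_i >= max(red_n):                 # first vector is most redundant
                drop = i
            else:                                   # otherwise drop the most
                drop = int(neighbors[int(np.argmax(red_n))])  # redundant second vector
            del keep[drop]
            changed = True
            break
    return vectors[keep]
```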
  • Taking the plankton application scenario as an example, assuming there are 200 plankton categories and each category contains 1,000 sample images, the retrieval library contains 200,000 sample images and the LUT contains up to 200,000 second feature vectors. With such a LUT on an NVIDIA RTX 3090 GPU, image recognition of one image to be recognized takes at most 5.8 ms, which fully meets the needs of real-time observation of plankton in the marine environment.
  • the above method may further include the following steps:
  • Step 610 In response to the category correction instruction, correct the target category of the image to be recognized.
  • Step 630 If the corrected target category of the image to be recognized is a new category, add the image to be recognized and its corrected target category to the retrieval library in response to the category adding instruction.
  • the new category means that the corrected target category of the image to be recognized is different from the sample category in the retrieval database.
  • a human-computer interaction interface is provided, thereby helping to promptly discover and correct image recognition deviations to fully ensure the recognition performance of image recognition.
  • FIG. 10 shows a schematic diagram of an image recognition framework based on image retrieval in one embodiment.
  • The image recognition framework includes: a query image module (query) 801 for obtaining the image to be recognized, a retrieval library (gallery) 802 for storing sample images and their corresponding sample categories, a module for performing image feature extraction, and a human-computer interaction interface.
  • the human-computer interaction interface includes a correction interface 807 and an adding interface 808.
  • the correction interface 807 is used to generate a category correction instruction to correct the target category of the image to be recognized;
  • the adding interface 808 is used to generate a category addition instruction to add the image to be recognized and its corrected target category to the retrieval library.
  • the electronic device is a smartphone that provides browsing of recognition results.
  • the smartphone displays a browsing page for browsing the recognition results, and the browsing page displays a correction interface and an adding interface.
  • the correction interface and the addition interface are essentially controls that can realize human-computer interaction.
  • the controls can be input boxes, selection boxes, buttons, switches, progress bars, etc.
  • If the user finds that the target category of the image to be recognized is a new category, the user can trigger the corresponding operation on the correction interface.
  • For the correction interface, if the corresponding operation triggered by the user is detected, a category correction instruction is generated to instruct the electronic device to correct the target category of the image to be recognized in response to the category correction instruction.
  • For example, the correction interface is an input box into which the user inputs the name of the new category, and the user's input operation is regarded as the corresponding operation triggered on the correction interface. Similarly, when the corrected target category of the image to be recognized is a new category, the user can trigger the corresponding operation on the adding interface.
  • For the adding interface, if the corresponding operation triggered by the user is detected, a category adding instruction is generated to instruct the electronic device to add the image to be recognized and its corrected target category to the retrieval library in response to the category adding instruction.
  • For example, the adding interface is a "confirm/cancel" button for the user to click, and the user's click operation is regarded as the corresponding operation triggered on the adding interface.
  • Depending on the input components configured on different electronic devices, the specific behavior of the operation triggered by the user will differ. For example, if the electronic device is a smartphone with a touch screen, the triggered operations may be gesture operations such as click, touch, and slide; if the electronic device is a laptop equipped with a mouse, the triggered operations may be mechanical operations such as clicking, double-clicking, and dragging; this is not specifically limited in this embodiment.
  • It is worth mentioning that the image recognition framework based on image retrieval has the characteristic that, by adding a new category to the retrieval library, the target category of an image to be recognized can immediately be recognized as that new category. Retraining is therefore no longer always necessary for this framework, which helps delay the need for retraining and reduces its frequency, providing more convenience and greater flexibility for image recognition.
  • an embodiment of the present application provides an image recognition device 900, including but not limited to: an image acquisition module 910, a feature extraction module 930, an image search module 950, and an image recognition module 970.
  • the image acquisition module 910 is used to acquire the image to be recognized.
  • the feature extraction module 930 is used to extract image features from the image to be recognized to obtain the first feature vector.
  • The image search module 950 is used to search, in the retrieval library that stores sample images and their corresponding sample categories, for the sample image whose second feature vector has a similarity to the first feature vector that satisfies the similarity condition, where the second feature vector represents the image features of the sample image.
  • the image recognition module 970 is used to determine the target category of the image to be recognized based on the sample category corresponding to the found sample image.
  • Figure 12 shows a schematic structural diagram of an electronic device according to an exemplary embodiment.
  • the electronic device is suitable for the server 130 in the implementation environment shown in FIG. 1 .
  • this electronic device is only an example adapted to the present application and cannot be considered to provide any limitation on the scope of use of the present application.
  • the electronic device is also not to be construed as being dependent on or required to have one or more components of the exemplary electronic device 2000 shown in FIG. 12 .
  • the hardware structure of the electronic device 2000 may vary greatly due to different configurations or performance.
  • As shown in Figure 12, the electronic device 2000 includes: a power supply 210, an interface 230, at least one memory 250, and at least one central processing unit (CPU) 270.
  • the power supply 210 is used to provide operating voltage for each hardware device on the electronic device 2000 .
  • The interface 230 includes at least one wired or wireless network interface for interacting with external devices, for example, to perform the interaction between the collection terminal 110 and the server 130 in the implementation environment shown in Figure 1.
  • The interface 230 may further include at least one serial-to-parallel conversion interface 233, at least one input-output interface 235, and at least one USB interface 237, etc., as shown in Figure 12; this is not intended to constitute a specific limitation here.
  • The memory 250, as a carrier for resource storage, can be a read-only memory, a random access memory, a magnetic disk, an optical disk, etc.; the resources stored on it include an operating system 251, application programs 253, and data 255, and the storage method can be short-term storage or permanent storage.
  • The operating system 251 is used to manage and control each hardware device and the application programs 253 on the electronic device 2000, so as to realize the operation and processing of the massive data 255 in the memory 250 by the central processing unit 270; it can be Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
  • The application program 253 is a computer program that performs at least one specific job based on the operating system 251; it may include at least one module (not shown in Figure 12), and each module may include a series of computer-readable instructions for the electronic device 2000.
  • the image recognition device can be regarded as an application program 253 deployed on the electronic device 2000.
  • the data 255 may be photos, pictures, etc. stored in a disk, or may be an image to be recognized, etc., stored in the memory 250 .
  • The central processing unit 270 may include one or more processors and is configured to communicate with the memory 250 through at least one communication bus to read the computer program stored in the memory 250, thereby realizing the operation and processing of the massive data 255 in the memory 250. For example, the image recognition method is completed by the central processing unit 270 reading a series of computer programs stored in the memory 250.
  • The present application can also be implemented through hardware circuits or through hardware circuits combined with software; therefore, implementing the present application is not limited to any specific hardware circuit, software, or combination of the two.
  • The electronic device 4000 may be, for example, a desktop computer, a notebook computer, etc.
  • the electronic device 4000 includes at least one processor 4001 , at least one communication bus 4002 and at least one memory 4003 .
  • the processor 4001 and the memory 4003 are connected, such as through a communication bus 4002.
  • Optionally, the electronic device 4000 may also include a transceiver 4004, which may be used for data interaction between the electronic device and other electronic devices, such as data transmission and/or data reception. It should be noted that, in practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
  • The processor 4001 can be a CPU (central processing unit), a general-purpose processor, a DSP (digital signal processor), an ASIC (application-specific integrated circuit), an FPGA (field-programmable gate array) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof; it may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with this disclosure.
  • the processor 4001 may also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, etc.
  • Communication bus 4002 may include a path that carries information between the above-mentioned components.
  • The communication bus 4002 may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, etc.
  • The communication bus 4002 can be divided into an address bus, a data bus, a control bus, etc. For ease of presentation, only one thick line is used in Figure 13, but this does not mean that there is only one bus or one type of bus.
  • The memory 4003 can be a ROM (read-only memory) or another type of static storage device that can store static information and instructions, a RAM (random access memory) or another type of dynamic storage device that can store information and instructions, an EEPROM (electrically erasable programmable read-only memory), a CD-ROM (compact disc read-only memory) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without limitation.
  • the computer program is stored in the memory 4003, and the processor 4001 reads the computer program stored in the memory 4003 through the communication bus 4002.
  • embodiments of the present application provide a storage medium, and a computer program is stored on the storage medium.
  • the computer program is executed by a processor, the image recognition method in the above embodiments is implemented.
  • An embodiment of the present application provides a computer program product.
  • the computer program product includes a computer program, and the computer program is stored in a storage medium.
  • the processor of the electronic device reads the computer program from the storage medium, and the processor executes the computer program, so that the electronic device performs the image recognition method in the above embodiments.
  • In summary, the image recognition framework based on image retrieval uses the powerful image feature representation brought by supervised contrastive learning, which clusters positive examples belonging to the same category in the feature space and distances negative examples belonging to different categories. This not only avoids relying on model retraining, but also effectively improves the recognition efficiency of image recognition and fully guarantees its recognition accuracy.
  • In addition, the retrieval library in the image recognition framework is suitable not only for expansion but also for user adjustment, which facilitates flexible, customized services for recognition tasks of different attributes and scopes.
  • the number of sample categories in the retrieval database should be expanded as much as possible so that the image recognition ability can take into account the diversity;
  • Alternatively, the sample categories in the retrieval library can be limited, that is, impossible sample categories can be excluded; this not only reduces the amount of similarity calculation, but also prevents the image to be recognized from being misidentified as an impossible sample category, indirectly ensuring the recognition performance of image recognition;
  • the size of the retrieval library can be further reduced to include only the sample categories of interest, as in the sketch after this list.
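
To make the retrieval flow in the bullets above concrete, the following is a minimal Python sketch, not the application's implementation: the encoder below is a fixed random projection standing in for the contrastively trained feature extractor, and the library contents, vector dimension, top-k majority vote, and names (encode, recognize, allowed, k) are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-in for the trained network: a fixed random projection whose
    # outputs are L2-normalized so that a dot product equals cosine similarity.
    PROJ = rng.standard_normal((32 * 32, 128))

    def encode(image):
        vec = image.ravel() @ PROJ
        return vec / np.linalg.norm(vec)

    # Retrieval library: sample images, their sample categories, and their
    # precomputed feature vectors (the "second feature vectors").
    library_images = [rng.random((32, 32)) for _ in range(100)]
    library_labels = [f"class_{i % 5}" for i in range(100)]
    library_feats = np.stack([encode(im) for im in library_images])

    def recognize(image, allowed=None, k=5):
        """Top-k retrieval with an optional restriction of the category set."""
        query = encode(image)                      # first feature vector
        idx = np.arange(len(library_labels))
        if allowed is not None:                    # exclude impossible categories
            idx = np.array([i for i in idx if library_labels[i] in allowed])
        sims = library_feats[idx] @ query          # cosine similarities
        top = idx[np.argsort(sims)[::-1][:k]]      # k most similar samples
        votes = [library_labels[i] for i in top]
        return max(set(votes), key=votes.count)    # majority vote -> target category

    print(recognize(rng.random((32, 32)), allowed={"class_0", "class_2"}))

Passing a restricted allowed set shrinks the similarity computation and rules out impossible categories, mirroring the trade-off between category diversity and a reduced retrieval library described above.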

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application, which belong to the technical field of image processing, relate to an image recognition method and apparatus. The method comprises: acquiring an image to be recognized; performing image feature extraction on the image to be recognized so as to obtain a first feature vector; searching a retrieval library, which is used to store sample images and the sample categories corresponding to the sample images, for a sample image for which the similarity between a second feature vector and the first feature vector satisfies a similarity condition, the second feature vector being used to represent an image feature of the sample image; and determining, according to the sample category corresponding to the found sample image, a target category of the image to be recognized. By means of the embodiments of the present application, the problems of low recognition accuracy, instability, and poor generalization performance in the prior art can be solved.
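
The "similarity condition" in the abstract is not tied to a particular measure in this text; one plausible reading is a threshold on the cosine similarity between the first and second feature vectors. The Python sketch below illustrates only that reading, and the 0.8 threshold is an invented value.

    import numpy as np

    def satisfies_condition(first, second, threshold=0.8):
        # Cosine similarity between the query's first feature vector and a
        # sample image's second feature vector; 0.8 is illustrative only.
        cos = float(first @ second) / (np.linalg.norm(first) * np.linalg.norm(second))
        return cos >= threshold
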
PCT/CN2022/137039 2022-06-01 2022-12-06 Image recognition method and apparatus WO2023231355A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210617217.1 2022-06-01
CN202210617217.1A CN117218356A (zh) Image recognition method and apparatus

Publications (1)

Publication Number Publication Date
WO2023231355A1 true WO2023231355A1 (fr) 2023-12-07

Family

ID=89026872

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/137039 WO2023231355A1 (fr) 2022-06-01 2022-12-06 Image recognition method and apparatus

Country Status (2)

Country Link
CN (1) CN117218356A (fr)
WO (1) WO2023231355A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118035496A (zh) * 2024-04-15 2024-05-14 腾讯科技(深圳)有限公司 Video recommendation method and apparatus, electronic device, and storage medium
CN118397420A (zh) * 2024-07-01 2024-07-26 中国计量大学 Image target recognition method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107690659A (zh) * 2016-12-27 2018-02-13 深圳前海达闼云端智能科技有限公司 Image recognition system and image recognition method
CN111898416A (zh) * 2020-06-17 2020-11-06 绍兴埃瓦科技有限公司 Video stream processing method and apparatus, computer device, and storage medium
CN112633297A (zh) * 2020-12-28 2021-04-09 浙江大华技术股份有限公司 Target object recognition method and apparatus, storage medium, and electronic apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "One article to understand Ranking Loss/Margin Loss/Triplet Loss", 10 August 2020 (2020-08-10), XP093115639, Retrieved from the Internet <URL:https://www.cvmart.net/community/detail/3108> *

Also Published As

Publication number Publication date
CN117218356A (zh) 2023-12-12

Similar Documents

Publication Publication Date Title
US11551333B2 (en) Image reconstruction method and device
WO2020228446A1 (fr) 2020-11-19 Model training method and apparatus, and terminal and storage medium
WO2022068196A1 (fr) 2022-04-07 Cross-modal data processing method and device, storage medium, and electronic device
WO2023231355A1 (fr) 2023-12-07 Image recognition method and apparatus
WO2019100724A1 (fr) 2019-05-31 Multi-label classification model training method and device
US20220375213A1 (en) Processing Apparatus and Method and Storage Medium
WO2017096753A1 (fr) 2017-06-15 Facial key point tracking method, terminal, and non-volatile computer-readable storage medium
CN113065636B (zh) 一种卷积神经网络的剪枝处理方法、数据处理方法及设备
US20220222918A1 (en) Image retrieval method and apparatus, storage medium, and device
WO2020186887A1 (fr) 2020-09-24 Target detection method, device and apparatus for continuous small-sample images
WO2021218470A1 (fr) 2021-11-04 Neural network optimization method and device
CN112052868A (zh) 模型训练方法、图像相似度度量方法、终端及存储介质
WO2023221713A1 (fr) 2023-11-23 Image encoder training method and apparatus, device, and medium
CN113205142A (zh) 一种基于增量学习的目标检测方法和装置
WO2023221790A1 (fr) 2023-11-23 Image encoder training method and apparatus, device, and medium
US20210081677A1 (en) Unsupervised Video Object Segmentation and Image Object Co-Segmentation Using Attentive Graph Neural Network Architectures
US20240312252A1 (en) Action recognition method and apparatus
WO2023273572A1 (fr) 2023-01-05 Feature extraction model construction method, target detection method, and related device
US20200151518A1 (en) Regularized multi-metric active learning system for image classification
CN113987119A (zh) 一种数据检索方法、跨模态数据匹配模型处理方法和装置
CN112529149A (zh) 一种数据处理方法及相关装置
WO2021051562A1 (fr) 2021-03-25 Facial feature point positioning method and apparatus, computer device, and storage medium
CN111488479B (zh) 超图构建方法、装置以及计算机系统和介质
US20230072445A1 (en) Self-supervised video representation learning by exploring spatiotemporal continuity
CN114821140A (zh) 基于曼哈顿距离的图像聚类方法、终端设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22944656

Country of ref document: EP

Kind code of ref document: A1