CN115393673A

CN115393673A - Training method of object recognition model, object recognition method and electronic equipment

Info

Publication number: CN115393673A
Application number: CN202211043065.5A
Authority: CN
Inventors: 何天宇; 金鑫; 沈旭; 黄建强; 余文杰
Original assignee: Alibaba Damo Institute Hangzhou Technology Co Ltd
Current assignee: Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Priority date: 2022-08-29
Filing date: 2022-08-29
Publication date: 2022-11-25

Abstract

The application provides a training method of an object recognition model, an object recognition method, an image recognition method, electronic equipment and a storage medium, wherein the training method of the object recognition model is to perform first-stage training on an initial model by using a first image sample set marked by a pseudo label to obtain a preliminarily trained model; the pseudo label is determined using the unlabeled first set of image samples; performing second-stage training on the preliminarily trained model by using a second image sample set marked by a soft label to obtain an object recognition model; the soft label is determined using the unlabeled first set of image samples and the unlabeled second set of image samples. The image recognition method provided by the embodiment of the application divides the training process of the object recognition model into two stages. In the second stage, the soft label can be obtained by directly utilizing the pseudo label in the first stage. And in the second stage, clustering calculation is not needed, so that the calculation expense caused by clustering calculation is saved.

Description

Training method of object recognition model, object recognition method and electronic equipment

Technical Field

The present application relates to the field of image recognition technologies, and in particular, to a training method for an object recognition model, an object recognition method, an image recognition method, an electronic device, and a storage medium.

Background

In the model training scenario, the cost of manual labeling is increasing. Therefore, how to train the model in an unsupervised manner becomes an increasingly popular research field.

The principle of unsupervised training is to learn by using unlabeled samples to find out structural knowledge in the training sample set, thereby solving the problem. Such as solving classification problems, solving prediction problems, etc.

Under the training scene by using a large-scale label-free sample, the pseudo label information is generally obtained in a clustering mode, so that the training of the model is completed. However, the clustering process has great demands for the video memory and the amount of computation of a Graphics Processing Unit (GPU), and taking an image sample acquired by roadside image acquisition equipment as an example, the clustering process may cause high resource consumption due to more contents in the image sample.

Disclosure of Invention

The embodiment of the application provides a training method of an object recognition model, an object recognition method, an image recognition method, electronic equipment and a storage medium.

In a first aspect, an embodiment of the present application provides an object identification method, which may include the following steps:

acquiring a candidate image;

the reference image and the candidate image including the target object are input to a pre-trained object recognition model, and the candidate image including the target object is recognized.

In a second aspect, an embodiment of the present application provides a method for training an object recognition model, where the method may include the following steps:

performing first-stage training on the initial model by using the first image sample set labeled by the pseudo label to obtain a preliminarily trained model; the pseudo label is determined using the unlabeled first set of image samples;

performing second-stage training on the preliminarily trained model by using a second image sample set labeled by a soft label to obtain an object recognition model; the soft label is determined using the unlabeled first set of image samples and the unlabeled second set of image samples.

In a third aspect, an embodiment of the present application provides an image recognition method, which applies augmented reality equipment and/or virtual reality equipment, and includes:

acquiring a candidate image;

inputting a reference image and a candidate image containing a target object into a pre-trained object recognition model, and recognizing the candidate image containing the target object;

the candidate image containing the target object is rendered onto a display of the augmented reality device and/or the virtual reality device.

In a fourth aspect, an embodiment of the present application provides an apparatus for object recognition, where the apparatus may include:

an acquisition module for acquiring a candidate image;

and the identification module is used for inputting the reference image and the candidate image containing the target object into a pre-trained object identification model and identifying the candidate image containing the target object.

In a fifth aspect, an embodiment of the present application provides an apparatus for training an object recognition model, where the apparatus may include:

the first-stage training module is used for carrying out first-stage training on the initial model by using the first image sample set labeled by the pseudo label to obtain a model after the initial training; the pseudo label is determined by using the unmarked first image sample set;

the second-stage training module is used for carrying out second-stage training on the preliminarily trained model by using the second image sample set labeled by the soft label to obtain an object recognition model; the soft label is determined using the unlabeled first set of image samples and the unlabeled second set of image samples.

In a sixth aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory, where the processor implements the method of any one of the above when executing the computer program.

In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the method described in any one of the foregoing methods.

Compared with the prior art, the method has the following advantages:

according to the embodiment of the application, compared with the method that clustering labeling is directly carried out on unlabeled sample characteristics by using a clustering technology to obtain a pseudo label so as to finish model training. According to the method and the device, a training process of the object recognition model can be divided into two stages, and in the first stage, clustering labeling is carried out on unlabeled sample characteristics to obtain the pseudo label. In the second stage, the soft label can be obtained by directly utilizing the clustering labeling result or the pseudo label in the first stage. In other words, clustering calculation is not needed in the second stage, and especially for the situation of image samples acquired by roadside image acquisition equipment, the calculation overhead caused by clustering calculation can be saved for complex image sample processing. Meanwhile, in the second stage, the model training is carried out by using the sample labeled by the soft label, so that the generalization of the model can be improved, and the recognition accuracy of the object recognition model is guaranteed.

The foregoing description is only an overview of the technical solutions of the present application, and the following detailed description of the present application is given to enable the technical means of the present application to be more clearly understood and to enable the above and other objects, features, and advantages of the present application to be more clearly understood.

Drawings

In the drawings, like reference characters designate like or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are not to be considered limiting of its scope.

Fig. 1 is a schematic view of a scene based on an object recognition method provided in the present application;

FIG. 2 is a flow chart of a method of object recognition according to an embodiment of the present application;

FIG. 3 is a flow chart of a method of training an object recognition model according to an embodiment of the present application;

FIG. 4 is a schematic diagram of determining a pseudo tag according to an embodiment of the present application;

FIG. 5 is a schematic diagram of determining a soft label according to an embodiment of the present application;

FIG. 6 is a schematic diagram of determining class centers according to an embodiment of the present application;

FIG. 7 is a schematic diagram of updating class centers according to an embodiment of the present application;

FIG. 8 is a block diagram of an apparatus for object recognition according to an embodiment of the present application;

FIG. 9 is a block diagram of an apparatus for training an object recognition model according to an embodiment of the present application; and

FIG. 10 is a block diagram of an electronic device used to implement embodiments of the present application.

Detailed Description

In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

In order to facilitate understanding of the technical solutions of the embodiments of the present application, the following description is made of related art of the embodiments of the present application. The following related arts as alternatives can be arbitrarily combined with the technical solutions of the embodiments of the present application, and all of them belong to the scope of the embodiments of the present application.

First, terms referred to in the present application will be explained.

Deep Learning (Deep Learning): deep learning refers to an algorithm set for solving various problems of images, texts, audios and videos and the like by applying various machine learning algorithms on a multilayer neural network. Illustratively, for example, object recognition in the image, summary generation of text, related information generation of audio and video, and the like. The core of deep learning is feature learning, which aims to obtain hierarchical feature information through a hierarchical network, thereby solving the important problem that features need to be designed manually in the past.

Artificial Neural Networks (ans): or simply Neural Networks or connection models (Connectionist models). The artificial neural network is an operation model abstracted and established from the information processing angle on the human brain neuron network, and different networks are formed according to different connection modes. The artificial neural network has a self-learning function. For example, in an image recognition scenario, the image sample and the corresponding label are input into an artificial neural network, and the network learns to recognize similar images through a self-learning function.

Meta Learning (Meta Learning): meta-learning enables the model to obtain the ability to adjust hyper-parameters, enabling the model to learn new tasks quickly based on obtaining known knowledge. Meta-learning solves the problem of learning (learning to learning). The difference between meta-learning and traditional machine learning is that the machine learning is to adjust parameters artificially first and then train a depth model under a specific task directly, and the meta-learning is to train a better hyper-parameter through other tasks first and then train the specific task to optimize the hyper-parameter.

Clustering (Clustering): in unsupervised learning, the labeled information of the training samples is unknown, and the aim is to reveal the intrinsic rules of the data through the learning of the unlabeled training samples, so as to provide a basis for further data analysis. Clustering can divide samples in a sample set into a plurality of subsets which are usually mutually disjoint, wherein each subset is called a cluster, namely, data samples in the same cluster are considered to have the same category, and training of tasks such as category is completed based on the category.

Fig. 1 is a schematic diagram of an exemplary application scenario for implementing the method of the embodiment of the present application. In fig. 1, an image containing a target object is queried in a candidate image set based on the image containing the target object. The executing body for executing the query process can be an electronic device such as a smart phone and a tablet computer. The target object may be a person, an animal, or the like, or may be clothing, an electronic product, a vehicle, or the like. The candidate image set may be an image captured in the video, an image in a network, or the like. Illustratively, the video may be a road traffic video. Alternatively, the video may be a video or movie work or the like taken by the user. In the scenario shown in fig. 1, the target object is a vehicle.

The specific query principle is briefly described as follows. In one aspect, an image containing a target object is input into a pre-trained object recognition model. The object recognition model can extract the characteristics of the target object and obtain the characteristic representation of the target object. On the other hand, the candidate image set may be input to a pre-trained object recognition model, so that the object recognition model may extract features of candidate objects included in each candidate image and obtain a feature representation of each candidate object. Finally, the object recognition model may determine candidate images with a higher probability of having the target object by comparing the similarity of the feature representation of the target object and the feature representations of the candidate objects.

In addition, the features of the images in the candidate images can also be obtained in a pre-stored manner. For example, in the case of updating the candidate image, feature extraction may be performed on the updated subsequent image. Thereby improving efficiency in a query scenario.

The scheme of the application can be suitable for inquiring lost people, animals and the like, and can also be suitable for inquiring escaping vehicles. Or the system can be suitable for searching works of actors, searching commodities of the same or similar types, and the like. For example, an image of an actor may be input into the object recognition model, and the movie and television work played by the actor may be queried in the movie and television database by using the object recognition model. For another example, a clothing image or an electronic product image that the user likes may be input to the object recognition model, and the shopping site may be searched for the same clothing or similar clothing using the object recognition model.

An embodiment of the present application provides an object identification method, and as shown in fig. 2, a flowchart of the object identification method according to the embodiment of the present application may include:

step S201: and acquiring a candidate image.

The candidate image may be an image captured in the road traffic video. Alternatively, the candidate image may be an image that is truncated from the video in the network. Still alternatively, the candidate image may be an image in a network. Such as an image that appears in an application. The candidate image may be a real-time image, a history image, an image existing in a web page, or the like. The candidate image may be a single image or a set of candidate images including a plurality of images.

Step S202: the reference image and the candidate image including the target object are input to a pre-trained object recognition model, and the candidate image including the target object is recognized.

The target object may be a person, an animal, a vehicle, a garment, an electronic product, or the like. The pre-trained object recognition model compares the candidate images with the reference image one by one to determine whether the candidate images contain the target object. Alternatively, the pre-trained object recognition model may determine the probability of the target object appearing in the candidate image. And finally, taking the determined candidate image with the target object as a target image, or taking the candidate image with the probability of the target object higher than the corresponding probability threshold value as the target image.

Take the example where the target object is an animal. A photograph of a lost pet may be used as a reference image. Road video images collected by collecting equipment within a certain distance range around a community where a pet owner lives are used as candidate images, and therefore finding of lost pets is achieved.

Take the example where the target object is a person. A photograph of the lost person may be used as the reference image. Road video images acquired by acquisition equipment within a certain distance range around a community where the lost people live are used as candidate images, so that the lost people are searched.

Take the example that the target object is a vehicle. The picture of the offending vehicle may be used as a reference image. Taking the road video image collected by the collecting device in the region, county, etc. where the accident location of the accident vehicle belongs as the candidate image, thereby realizing the investigation of the hit-and-run vehicle.

The target object is exemplified by clothing and electronic products. The photo of the clothing and the electronic product can be used as a reference image. And taking the commodity image under the corresponding vertical category of the shopping website as a candidate image, thereby realizing the query of the same or similar clothing and electronic products.

In a possible implementation manner, the method may further include the following steps:

determining relevant information of the target object by using a candidate image containing the target object; the related information includes at least one of identification information of the target object and location information of the target object.

The related information of the target object includes identification information of the target object, and/or position information of the target object. For the identification information, taking the target object as a vehicle as an example, the identification information of the target object may be a license plate number of the vehicle determined by image recognition. Taking the target object as an animal as an example, the mark of the target object may be a breed, a hair color, and the like of the animal determined by image recognition, such as an orange cat, a spotted dog, and the like. For the position, the installation position or the installation area of the acquisition device may be determined according to the number of the acquisition device that detects the target image. The determined mounting position or mounting area can then be used as the position information of the target object.

An embodiment of the present application provides a method for training an object recognition model, and as shown in fig. 3, a flowchart of the method for training an object recognition model according to an embodiment of the present application may include:

step S301: performing first-stage training on the initial model by using the first image sample set marked by the pseudo label to obtain a preliminarily trained model; the pseudo-label is determined using the unlabeled first set of image samples.

Pseudo labels are concepts in unsupervised training, and refer to labels that label unlabeled samples in an automated rather than manual manner. In one case, when determining the pseudo label, a small amount of labeled data may be used to perform initial training on the model to be trained, so as to obtain an initially trained model, and then the initially trained model may be used to predict label-free data, so as to obtain a predicted label. The predicted label can be used as a pseudo label without labeling data.

Or, in another case, the pseudo tag may be determined by predicting the label-free data by using an untrained model to obtain the features of the label-free data. And clustering the characteristics of the unmarked data by using a clustering algorithm to obtain a clustering result. The pseudo label can be obtained by using the clustering result. The so-called untrained model may be a generic feature extraction model.

And labeling the image samples in the first image sample set by using the pseudo labels. Taking the image sample as a human image sample as an example, the pseudo labels may be feature representations of different human in the human image sample. For example, pseudo tags may be used to characterize a person's gender, height, skin tone, hair color, clothing, and other information in multiple dimensions.

And inputting the first image sample set labeled by the pseudo label into an initial model by utilizing deep learning and meta-clustering learning technologies and combining an artificial neural network, wherein the initial model can obtain a label prediction result. There may be a difference between the tag prediction result and the pseudo tag, which may be embodied by a loss function. The effect of the loss function can be understood as: when the label prediction result obtained by forward propagation of the initial model to be trained is close to the pseudo label, the loss function takes a smaller value; conversely, the value of the loss function increases. The loss function is a function having parameters in the initial model as arguments.

And adjusting parameters in the initial model by using the difference. The difference is propagated backward in each layer of the initial model, and the parameters of each layer of the initial model are adjusted according to the difference until the output result of the initial model converges or reaches the expected effect.

A first stage of training for the initial model can thus be achieved. After the first stage training, the initial model may be trained to be a preliminarily trained model.

Step S302: performing second-stage training on the preliminarily trained model by using a second image sample set labeled by a soft label to obtain an object recognition model; the soft label is determined using the unlabeled first set of image samples and the unlabeled second set of image samples.

The soft label is also a concept in unsupervised training, and refers to a label corresponding to the labeled data after label dispersion. For example, the original label of the image sample is "cat," but there are also dogs and people in the image sample. If only the label "cat" is used for characterizing the image sample features, the model training result may be interfered by noise. By setting soft labels 'dog' and 'human', the prediction probability result of the model can be trained instead of the label of the original image.

The soft label may be determined using an unlabeled first set of image samples and an unlabeled second set of image samples. The unlabeled second image sample set can be input into the preliminarily trained model, and a feature extraction result for the unlabeled second image sample set is obtained. For example, the number of feature extraction results of the second image sample set is n. And obtaining the pseudo label by using the unmarked first image sample set, and obtaining the soft label corresponding to the feature extraction result according to the similarity between each feature extraction result and the pseudo label. Alternatively, as previously described, the feature clustering result may be obtained using the unlabeled first set of image samples. And then the class center of the clustering result can be obtained. According to the similarity between each feature extraction result and the class center, the soft label corresponding to the feature extraction result can be obtained.

And performing a second stage of training on the preliminarily trained model by using the second image sample set labeled by the soft label. When a specified condition is met, it may indicate that the second stage training is complete. The specified condition may be determined according to factors such as a training round, the number of samples participating in training, a training duration, and whether a loss function converges. For example, the second image sample set labeled with the soft label is used for performing parameter optimization on the model after the initial training, and the a-round parameter optimization can be determined to meet the specified conditions. The value a may be empirically determined, taking a positive integer. For another example, the number of samples participating in the second stage training reaches 50 ten thousand, or 100 ten thousand, etc., and it may be determined that the specified condition is satisfied. As another example, the second stage training duration may be up to 200 hours, or up to 1000 hours, etc., and it may be determined that the specified condition is met. Also for example, whether a specified loss function converges or not may be utilized as a specified condition.

Compared with the method that the clustering technology is directly utilized to cluster and label the unlabeled sample characteristics to obtain the pseudo label, the training is further completed. In the current implementation mode, the training process of the object recognition model can be divided into two stages, and in the first stage, clustering labeling is performed on unlabeled sample features to obtain pseudo labels. In the second stage, the clustering labeling result or the pseudo label in the first stage can be directly used to obtain the soft label. That is, no clustering calculation is required in the second stage, so that the calculation overhead caused by clustering calculation can be saved. Meanwhile, in the second stage, the model training is carried out by using the sample labeled by the soft label, so that the generalization of the model can be improved, and the recognition accuracy of the object recognition model is guaranteed.

With reference to fig. 4, in a possible implementation manner, the determining manner of the pseudo tag may include:

step S401: and extracting the features of the unmarked first image sample set by using the initial model to obtain a first feature extraction result.

Will be unlabeledThe first image sample set is input into the initial model, and a first feature extraction result can be obtained. The first feature extraction result is shown by X in FIG. 4 ₁ And (4) showing. In the example shown in fig. 4, a plurality of vehicles are included in the image sample. Based on this, the first feature extraction result may characterize the model, color, number of passengers in the vehicle, position of the vehicle in the image sample, etc. of the vehicle in the image sample.

Step S402: clustering the first feature extraction result to obtain at least one feature cluster; the feature clusters are used to characterize features of candidate objects, which are objects appearing in respective image samples of the first set of image samples that are not labeled.

The candidate objects included in each image sample in the sample set may be different, for example, the candidate object in the m-th image sample is a vehicle, the candidate object in the m + 1-th image sample is a vehicle and a pedestrian, and the candidate object in the m + 2-th image sample is a pedestrian, a pet, and a lane line. The purpose of clustering may be to cluster together features belonging to the same candidate object in each image sample. Wherein m is a positive integer.

For example, the Clustering process of the first feature extraction result may adopt a K-means algorithm (K-means), a Density-Based Spatial Clustering of Applications with Noise (DBScan), and the like.

Step S403: and determining the pseudo label according to the characteristic cluster.

There are various ways to determine the pseudo tag according to the feature cluster, for example, any one feature may be used as the pseudo tag. For example, a plurality of features may be randomly selected, and an operation such as and logic or logic may be performed between the plurality of features, and the pseudo tag may be obtained based on the operation result of the logic operation. Also for example, an average calculation may be performed on the plurality of features, and the pseudo tag may be obtained based on a calculation result of the average calculation. The representation of the pseudo tag may be a code.

With reference to fig. 5, in a possible implementation manner, the determining manner of the soft label includes:

step S501: and performing feature extraction on the unmarked second image sample set by using the model after the initial training to obtain a second feature extraction result.

In the unlabeled second image sample set, the image samples may have labels from 1 to k. k is a positive integer, i.e. k image samples may be comprised in the second set of image samples. It will be understood that a plurality of second image sample sets may also be constructed, and the image samples in each second image sample set may not be repeated with respect to each other. For example, in the first second set of image samples, the image samples may have labels from 1 to k. In the second set of image samples, the image samples may be numbered k +1 to 2k.

And inputting the unmarked second image sample set into the preliminarily trained model to obtain corresponding features. The corresponding feature is denoted X in fig. 5 ₂ . The preliminarily trained models may correspond to those in FIG. 5f _θ . For example, inputting the first image in the second image sample set to the model after the initial training, i features can be obtained. And inputting the second image in the second image sample set to the model after the initial training to obtain j features. i and j are positive integers, and the values of i and j may be the same or different.

After feature extraction is performed on each image sample in the second image sample set, all features of the second image sample can be obtained. That is, all the features of the second image sample may correspond to the second feature extraction result, and the number of the second feature extraction results may be plural.

Step S502: determining a soft label by utilizing the similarity of the second feature extraction result and the class center; the class center is determined using the unlabeled first set of image samples, and the class center is used to characterize the feature.

The nature of class centers is a representation of features. The class center may be determined using the unlabeled first set of image samples, and the number of class centers may be multiple. After the second feature extraction result is obtained, for the ith feature, the similarity with each class center can be calculated to obtain a similarity calculation result. Illustratively, the score of the similarity calculation result is proportional to the degree of similarity. And based on the similarity calculation result, selecting the class center with the highest score as the soft label of the ith feature. The similarity calculation can adopt Euclidean distance, cosine similarity and other modes. In the case that the soft label is determined, each image sample in the second set of image samples may be labeled with the soft label. The representation of the soft label may be a code. In fig. 5, the process of determining the class center is illustrated by a robot icon. The robot icon may represent a soft label determination robot. That is, the robot is determined by the soft tag, and the comparison process of the similarity is performed.

With reference to fig. 6, in a possible implementation manner, the determining manner of the class center includes:

step S601: and extracting the features of the unmarked first image sample set by using the initial model to obtain a first feature extraction result.

And inputting the unmarked first image sample set into the initial model to obtain a first feature extraction result. The first feature extraction result is shown by X in FIG. 6 ₁ And (4) showing. In the example shown in fig. 6, a plurality of vehicles are included in the image sample. Based on this, the first feature extraction result may characterize the model, color, number of passengers in the vehicle, position of the vehicle in the image sample, etc. of the vehicle in the image sample.

Step S602: clustering the first feature extraction result to obtain at least one feature cluster; the feature clusters are used to characterize features of candidate objects appearing in the unlabeled first set of image samples.

For example, the clustering process of the first feature extraction result may adopt a K-means algorithm, a density-based noise application spatial clustering algorithm, or the like.

Step S603: and respectively carrying out class center calculation on each feature cluster to obtain a corresponding class center calculation result.

Class centric computation may be implemented by a class centric algorithm. For example, an averaging algorithm can be directly adopted to obtain the class center of each feature cluster. Or, a weighted averaging algorithm can be adopted to obtain the class center of each feature cluster. For example, an averaging algorithm may be used to calculate the average value, and thus the distance of each feature in the cluster of features from the average value. And setting the weight of each feature cluster according to the distance, and thus obtaining the class center of each feature cluster by adopting a weighted averaging algorithm. Or, the feature whose distance from the average value is greater than the corresponding distance threshold value may be filtered and removed, and the remaining features may be used to perform the averaging calculation again to obtain the class center.

With reference to fig. 7, in a possible implementation manner, the method further includes a step of updating the class center:

the step of updating the class center comprises the following steps:

step S701: and performing feature extraction on the first image sample set labeled by the pseudo label by using the model after the initial training to obtain a third feature extraction result.

In FIG. 7f _θ’ An initial model may be represented. The initial model can be trained in the first stage to obtain a preliminarily trained model. And the third feature extraction result is obtained by performing feature extraction on the first image sample set labeled by the pseudo label by using the model after the initial training.

Step S702: and updating the class center by using the third feature extraction result.

Since the training in the first stage is performed based on the first image sample set labeled with the pseudo label, and the class center is obtained by using the feature cluster, a difference may exist between the third feature extraction result and the class center. Based on this, in the case where there is a difference, the class center may be updated with the third feature extraction result. For example, an average value of the third feature extraction result and the class center may be calculated, and the class center may be updated using the average value calculation result. Alternatively, the class center may be directly replaced with the third feature extraction result.

In one possible implementation, the specified condition is determined according to a convergence condition of a loss function of a specified type;

the loss function is obtained by calculation by using a model trained in the second stage;

the specified type of loss function includes at least one of a weighted ternary loss function and a consistency loss function.

And inputting the image samples in the second image sample set marked by the soft label into a model in the training process of the second stage to obtain a characteristic prediction result. By utilizing the convergence condition of the specified type loss function, whether the second-stage training can be finished or not can be measured based on the characteristic prediction result. For example, in the case that the loss function of the specified type does not converge, the optimization effect of the parameters in the model can be checked based on the convergence condition of the loss function, and the parameters are iteratively optimized. Under the condition that the loss function of the specified type is converged, the second stage training can be ended to obtain a trained object recognition model.

The specified type of penalty function may be a ternary penalty function. And (5) utilizing a ternary loss function to test the characteristic prediction result, and utilizing the test result to optimize parameters in the model. The principle of the ternary loss function is to use three image samples to construct a triplet. In a triplet, the first image sample and the second image sample (positive sample) are similar, while the first image sample and the third image sample (negative sample) are dissimilar. For example, the first image sample is an image containing a target object, the positive sample is a candidate image containing the target object, and the negative sample is a candidate image not containing the target object. The goal of the triplet-based loss function is: and detecting whether the distance between the similar samples is smaller than that between the dissimilar samples, and the difference meets the expectation. Based on this, the model trained in the second stage can be made to query the candidate images for the target image containing the target object.

Furthermore, a weighted ternary loss function can be adopted to test the feature prediction result. The weighted ternary loss function is to weight some or all of the samples. For example, a first image sample is selected, and positive and negative samples are selected based on feature similarity. Specifically, a similarity threshold may be set, and samples higher than the similarity threshold are positive samples and samples not higher than the similarity threshold are negative samples through the similarity comparison. Also, the weight of each positive sample and each negative sample may be set according to the similarity result.

In addition, the consistency loss function can be used for testing the characteristic prediction result, and the testing result is used for optimizing parameters in the model. The principle of the consistency loss function may be to perform processing such as random mosaic, random smearing, affine transformation, random rotation, or mirror image on an image including the target object, and then detect whether the model can use the processed image as a target image for querying the target image including the target object.

In one possible implementation, the number of samples in the second set of image samples is n times the number of samples in the first set of image samples; n is more than 1.

In current implementations, the number of samples in the first set of image samples may be much less than the number of samples in the second set of image samples. For example, n may take on values of 10, 50, or even 100, etc. Based on this, the amount of data that needs to be aggregated in the first stage training can be reduced.

The embodiment of the application provides an image recognition method, which can be applied to Augmented Reality (AR) equipment and/or Virtual Reality (VR) equipment, and the method can include the following steps:

s801: and acquiring a candidate image.

The candidate image may be an image captured in the road traffic video. Alternatively, the candidate image may be an image captured in a video in the network. Still alternatively, the candidate image may be an image in a network. For example, an image that appears in an application. The candidate image may be a real-time image, a history image, an image existing in a web page, or the like. The candidate image may be a single image or a candidate image set including a plurality of images.

S802: the reference image and the candidate image including the target object are input to a pre-trained object recognition model, and the candidate image including the target object is recognized.

The target object is exemplified by an animal. A photograph of a lost pet may be used as a reference image. Road video images collected by collecting equipment within a certain distance range around a community where a pet owner lives are used as candidate images, and therefore finding of lost pets is achieved.

Take the example that the target object is a vehicle. The picture of the offending vehicle may be used as a reference image. The video images of the roads collected by the collecting equipment in the region, county and the like where the accident site of the accident vehicle belongs are taken as candidate images, so that the investigation of the accident vehicle is realized.

The target object is exemplified by clothing and electronic products. The pictures of the clothes and the electronic products can be used as reference images. And taking the commodity image corresponding to the vertical type of the shopping website as a candidate image, thereby realizing the query of the same or similar clothing and electronic products.

S803: the candidate image containing the target object is rendered onto a display of the augmented reality device and/or the virtual reality device.

By means of the augmented reality device and/or the virtual reality device, marks or descriptions of the target objects can be added in the candidate images through rendering, so that a user can visually lock the target objects in the candidate images or diversified viewing experiences of the target objects are presented. For example, in the case where the target object is an offending vehicle, for a candidate image including the offending vehicle, the offending vehicle may be subjected to frame-selection rendering in the candidate image, and information of the location, license plate number, and the like of the offending vehicle may be displayed in the display frame. Further, rendering such as frame selection may be performed for a person, a pet, or the like. In addition, when the target object is content such as clothes and accessories, the clothes or accessories can be regenerated and combined with the designated user image, so that the effect of wearing by the designated user can be reflected. Based on this, the user can be given more diversified viewing experiences.

Corresponding to the application scenario and the method of the method provided by the embodiment of the application, the embodiment of the application further provides an object identification device. As shown in fig. 8, an apparatus for object recognition according to an embodiment of the present application may include:

an obtaining module 801, configured to obtain a candidate image;

the recognition module 802 is configured to input a reference image and a candidate image including a target object into a pre-trained object recognition model, and recognize the candidate image including the target object.

In a possible implementation manner, the apparatus for object recognition may further include a related information determining module. The related information determining module is used for determining related information of the target object by using a candidate image containing the target object; the related information includes at least one of identification information of the target object and location information of the target object.

Corresponding to the application scenario and the method of the method provided by the embodiment of the application, the embodiment of the application further provides a training device of the object recognition model. Fig. 9 shows an apparatus for training an object recognition model according to an embodiment of the present application, where the apparatus for training an object recognition model may include:

a first-stage training module 901, configured to perform first-stage training on an initial model by using a first image sample set labeled with a pseudo tag, so as to obtain a model after the initial training; the pseudo label is determined by using the unmarked first image sample set;

a second-stage training module 902, configured to perform second-stage training on the preliminarily trained model by using the second image sample set labeled with the soft label, to obtain an object recognition model; the soft label is determined using the unlabeled first set of image samples and the unlabeled second set of image samples.

In one possible implementation, the first-stage training module 901 may include:

the first feature extraction submodule is used for extracting the features of the unmarked first image sample set by using the initial model to obtain a first feature extraction result;

the clustering submodule is used for clustering the first feature extraction result to obtain at least one feature cluster; the feature cluster is used for characterizing the features of the candidate objects appearing in the image samples of the first image sample set which is not marked;

and the pseudo label determining submodule is used for determining the pseudo label according to the characteristic cluster.

In one possible implementation, the second stage training module 902 may include:

the second feature extraction submodule is used for extracting features of the second image sample set which is not marked by the model after the initial training to obtain a second feature extraction result;

the soft label determining sub-module is used for determining a soft label by utilizing the similarity between the second feature extraction result and the class center; the class center is determined using the unlabeled first set of image samples, and the class center is used to characterize the feature.

In a possible implementation manner, the method further comprises a class center determining submodule. The class center determination module may further include:

the clustering submodule is used for clustering the first feature extraction result to obtain at least one feature cluster; the feature cluster is used for characterizing the features of the candidate objects appearing in the unlabeled first image sample set;

and the class center calculation submodule is used for performing class center calculation on each feature cluster respectively to obtain a corresponding class center calculation result.

In a possible implementation manner, a class center updating module is further included. The class center updating module comprises:

the third feature extraction submodule is used for performing feature extraction on the first image sample set marked by the pseudo label by using the model after the initial training to obtain a third feature extraction result;

and the class center updating execution submodule is used for updating the class center by utilizing the third feature extraction result.

In one possible implementation, the specified condition is determined according to the convergence of a loss function of a specified type;

Corresponding to the application scenario and the method of the method provided by the embodiment of the application, the embodiment of the application further provides an image recognition device. The apparatus may be applied to an augmented reality device and/or a virtual reality device, and may include:

a candidate image acquisition module for acquiring a candidate image;

the target identification module is used for inputting a reference image and a candidate image containing a target object into a pre-trained object identification model and identifying the candidate image containing the target object;

a display control module to render a candidate image containing a target object onto a display of an augmented reality device and/or a virtual reality device.

The functions of each module in each device in the embodiment of the present application can be referred to the corresponding description in the above method, and have corresponding beneficial effects, which are not described herein again.

FIG. 10 is a block diagram of an electronic device used to implement embodiments of the present application. As shown in fig. 10, the electronic apparatus includes: a memory 1010 and a processor 1020, the memory 1010 having stored therein computer programs operable on the processor 1020. The processor 1020, when executing the computer program, implements the methods in the embodiments described above. The number of the memory 1010 and the processor 1020 may be one or more.

The electronic device further includes:

and a communication interface 1030, configured to communicate with an external device, and perform data interactive transmission.

If the memory 1010, the processor 1020, and the communication interface 1030 are implemented independently, the memory 1010, the processor 1020, and the communication interface 1030 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.

Optionally, in an implementation, if the memory 1010, the processor 1020, and the communication interface 1030 are integrated on a chip, the memory 1010, the processor 1020, and the communication interface 1030 may communicate with each other through an internal interface.

Embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the program is executed by a processor, the computer program implements the method provided in the embodiments of the present application.

The embodiment of the present application further provides a chip, where the chip includes a processor, and is configured to call and run an instruction stored in a memory from the memory, so that a communication device in which the chip is installed executes the method provided in the embodiment of the present application.

An embodiment of the present application further provides a chip, including: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the application.

It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like. It is noted that the processor may be a processor supporting an Advanced reduced instruction set machine (ARM) architecture.

Further, the memory may optionally include a read only memory and a random access memory. The memory may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile Memory may include a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can include Random Access Memory (RAM), which acts as external cache Memory. By way of example, and not limitation, many forms of RAM may be used. For example, static Random Access Memory (Static RAM, SRAM), dynamic Random Access Memory (DRAM), synchronous Dynamic Random Access Memory (SDRAM), double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), enhanced SDRAM (ESDRAM), SLDRAM (SLDRAM), and Direct Rambus RAM (DR RAM).

In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the present application are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of the feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.

Any process or method described in a flow diagram or otherwise herein may be understood as representing a module, segment, or portion of code, which includes one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.

The logic and/or steps described in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the method of the above embodiments may be implemented by hardware that is configured to be instructed to perform the relevant steps by a program, which may be stored in a computer-readable storage medium, and which, when executed, includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.

The above description is only an exemplary embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of various changes or substitutions within the technical scope of the present application, and these should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of object recognition, comprising:

acquiring a candidate image;

and inputting the reference image containing the target object and the candidate image into a pre-trained object recognition model, and recognizing the candidate image containing the target object.

2. The method of claim 1, further comprising:

determining relevant information of the target object by using the candidate image containing the target object; the related information includes at least one of identification information of the target object and location information of the target object.

3. A method for training an object recognition model, comprising:

performing first-stage training on the initial model by using the first image sample set labeled by the pseudo label to obtain a preliminarily trained model; the pseudo label is determined using an unlabeled first set of image samples;

4. The method of claim 3, wherein the determination of the pseudo tag comprises:

extracting the features of the unmarked first image sample set by using the initial model to obtain a first feature extraction result;

clustering the first feature extraction result to obtain at least one feature cluster; the feature cluster is used for characterizing features of a candidate object, wherein the candidate object is an object appearing in each image sample of the first unlabeled image sample set;

and determining a pseudo label according to the characteristic cluster.

5. The method of claim 3, wherein determining the soft label comprises:

performing feature extraction on the unmarked second image sample set by using the model after the preliminary training to obtain a second feature extraction result;

determining a soft label by utilizing the similarity between the second feature extraction result and the class center; the class center is determined using the unlabeled first set of image samples, the class center being used to characterize a feature.

6. The method of claim 5, wherein the determining of the class center comprises:

clustering the first feature extraction result to obtain at least one feature cluster; the feature cluster is used for characterizing features of a candidate object, wherein the candidate object is an object appearing in each image sample of the first set of unlabeled image samples;

and respectively carrying out class center calculation on each feature cluster to obtain a corresponding class center calculation result.

7. The method of claim 5 or 6, further comprising the step of updating the class center by:

the step of updating the class center includes:

performing feature extraction on the first image sample set labeled by the pseudo label by using the model after the initial training to obtain a third feature extraction result;

and updating the class center by using the third feature extraction result.

8. The method according to claim 3, wherein in the second stage training process, under the condition that a specified condition is met, an object recognition model is obtained;

the specified condition is determined according to the convergence condition of the loss function of the specified type;

the loss function is obtained by model calculation in the second stage training process;

9. The method of claim 3, wherein the number of samples in the second set of image samples is n times the number of samples in the first set of image samples; the n is more than 1.

10. An image recognition method is applied to an augmented reality device and/or a virtual reality device, and is characterized by comprising the following steps:

acquiring a candidate image;

inputting a reference image containing a target object and the candidate image into a pre-trained object recognition model, and recognizing the candidate image containing the target object;

rendering the candidate image containing the target object onto a display of the augmented reality device and/or virtual reality device.

11. An apparatus for object recognition, comprising:

an acquisition module for acquiring a candidate image;

and the identification module is used for inputting the reference image containing the target object and the candidate image into a pre-trained object identification model and identifying the candidate image containing the target object.

12. An apparatus for training an object recognition model, comprising:

the first-stage training module is used for carrying out first-stage training on the initial model by using the first image sample set labeled by the pseudo label to obtain a model after the initial training; the pseudo label is determined using an unlabeled first set of image samples;

the second-stage training module is used for performing second-stage training on the preliminarily trained model by using a second image sample set marked by a soft label to obtain an object recognition model; the soft label is determined using the unlabeled first set of image samples and the unlabeled second set of image samples.

13. An electronic device comprising a memory, a processor and a computer program stored on the memory, the processor implementing the method of any one of claims 1 to 9 when executing the computer program.

14. A computer-readable storage medium, having stored therein a computer program which, when executed by a processor, implements the method of any one of claims 1 to 9.