CN113837256A - Object recognition method, network training method and device, equipment and medium

Publication number: CN113837256A (granted as CN113837256B)
Application number: CN202111081370.9A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 余世杰 (Shijie Yu), 朱烽 (Feng Zhu), 赵瑞 (Rui Zhao), 乔宇 (Yu Qiao)
Assignee: Shenzhen Sensetime Technology Co., Ltd.
Related application: PCT/CN2022/077443 (published as WO2023040195A1)
Legal status: Active (granted)

Classifications

    • G06F18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22 — Pattern recognition: matching criteria, e.g. proximity measures
    • G06F18/24 — Pattern recognition: classification techniques
    • G06F18/25 — Pattern recognition: fusion techniques
    • G06N3/084 — Neural networks: learning by backpropagation, e.g. using gradient descent
    • Y02T10/40 — Climate change mitigation in road transport: engine management systems

Abstract

The embodiment of the application provides an object recognition method, a network training method and apparatus, a device and a medium. The method includes: determining, based on a first source domain sample image and a second source domain sample image respectively, a first alignment loss between a comprehensive network and a first domain network and a second alignment loss between the comprehensive network and a second domain network; determining a domain loss of the first domain network based on the first source domain sample image and the first alignment loss, and a domain loss of the second domain network based on the second source domain sample image and the second alignment loss; determining a cooperative loss between the two domain networks based on the second source domain sample image, the second alignment loss and the domain loss of the first domain network; determining network parameters of the comprehensive network based on the network parameters of the first and second domain networks; and adjusting the network parameters of the object recognition network based on the domain losses and the cooperative loss of the two domain networks, so that the loss output by the adjusted object recognition network satisfies a convergence condition.

Description

Object recognition method, network training method and device, equipment and medium
Technical Field
The embodiments of the application relate to the technical field of object recognition, and relate to, but are not limited to, an object recognition method, a network training method and apparatus, a device and a medium.
Background
Many pedestrian re-identification algorithms in the related art rest on the assumption that the training data and the test data come from the same domain. Because of domain shift, this assumption rarely holds in practical applications, so a trained pedestrian re-identification model often recognizes pedestrians with low accuracy in actual use.
Disclosure of Invention
The embodiment of the application provides a technical scheme for training an object recognition network.
The technical scheme of the embodiment of the application is realized as follows:
The embodiment of the application provides a training method for an object recognition network, where the object recognition network comprises a comprehensive network, a first domain network and a second domain network, and the method comprises the following steps:
respectively determining a first alignment loss between the comprehensive network and the first domain network and a second alignment loss between the comprehensive network and the second domain network based on a first source domain sample image and a second source domain sample image; wherein the first source domain sample image and the second source domain sample image do not overlap, the first source domain sample image is used for training the first domain network, and the second source domain sample image is used for training the second domain network;
determining a domain loss of the first domain network based on the first source domain sample image and the first alignment loss, and a domain loss of the second domain network based on the second source domain sample image and the second alignment loss;
determining a cooperative loss between the first domain network and the second domain network based on the second source domain sample image, the second alignment loss and the domain loss of the first domain network;
determining network parameters of the comprehensive network based on the network parameters of the first domain network and the second domain network;
and adjusting the network parameters of the object recognition network based on the domain loss of the first domain network, the domain loss of the second domain network and the cooperative loss between the first domain network and the second domain network, so that the loss output by the adjusted object recognition network satisfies a convergence condition.
In some embodiments, determining the first alignment loss between the comprehensive network and the first domain network based on the first source domain sample image comprises: performing feature extraction on the first source domain sample image based on the comprehensive network to obtain a first image feature; performing feature extraction and projective transformation on the first source domain sample image based on the first domain network to obtain a second image feature; and determining the first alignment loss between the comprehensive network and the first domain network based on the second image feature and the first image feature. Because the first alignment loss is derived from the feature-space distance between the two features, it allows the comprehensive network to guide the domain network during training, realizes the cooperation between the comprehensive network and the domain network, and improves the generalization performance of the network.
In some embodiments, the determining the first alignment loss between the comprehensive network and the first domain network based on the second image feature and the first image feature comprises: determining a feature distance between the second image feature and the first image feature in the feature space of the first source domain sample image; and determining the first alignment loss based on the feature distance and a true feature distance between features in the first source domain sample image. Through the alignment loss, the comprehensive network can use the correlation between features belonging to one image to give feedback to the training of the first domain network on the sample image, so that this correlation is taken into account during the training of the first domain network.
In some embodiments, the determining the domain loss of the first domain network based on the first source domain sample image and the first alignment loss comprises: training the first domain network with the first source domain sample image as meta-training data to obtain the internal loss of the first domain network; determining, based on the internal loss and the first alignment loss, a meta-training loss for adjusting the network parameters of the first domain network; and determining the meta-training loss of the first domain network as the domain loss. In this way, the meta-training loss of the first domain network in the meta-training stage is determined in the meta-learning manner and used as the domain loss for optimizing the network parameters of the first domain network, which improves the flexibility of object recognition by the first domain network.
In some embodiments, the training the first domain network with the first source domain sample image as meta-training data to obtain the internal loss of the first domain network comprises: determining an image distance loss between the first source domain sample image and a positive sample image based on the image features extracted by the first domain network and the image features of the positive sample image; classifying, in the first domain network, the image features extracted by the first domain network to obtain a first classification result; determining a classification loss of the image features extracted by the first domain network based on the first classification result and a truth classification label of the first source domain sample image; and determining the classification loss and the image distance loss as the internal loss of the first domain network. The determining the meta-training loss of the first domain network based on the internal loss and the first alignment loss comprises: fusing the first alignment loss, the image distance loss and the classification loss to obtain the meta-training loss of the first domain network. Fusing the internal loss of the first domain network with the first alignment loss into the meta-training loss that adjusts its network parameters accelerates the training of the first domain network and improves its flexibility.
In some embodiments, the first domain network comprises a feature extractor, a projection network and a classifier, and the performing feature extraction and projective transformation on the first source domain sample image to obtain a second image feature comprises: performing feature extraction on the first source domain sample image based on the feature extractor; and performing, based on the projection network, projective transformation on the image features extracted by the feature extractor to obtain the second image feature. Classifying, in the first domain network, the image features extracted by the first domain network to obtain a first classification result comprises: classifying the image features extracted by the feature extractor based on the classifier to obtain the first classification result. In this way, the spatial distance between the second image feature and the first image feature can be determined, so that the comprehensive network guides the training of the first domain network through the alignment loss.
In some embodiments, the determining the cooperative loss between the first domain network and the second domain network based on the second source domain sample image, the second alignment loss and the domain loss of the first domain network comprises: determining an adaptive parameter based on the network parameters of the feature extractor of the first domain network and the domain loss of the first domain network; and determining the cooperative loss between the first domain network and the second domain network based on the adaptive parameter, the second alignment loss and the second source domain sample image. The cooperative loss thus supports collaborative learning of the first and second domain networks across non-overlapping source domain sample images, improving the generalization capability of the whole network architecture.
In some embodiments, the determining the cooperative loss between the first domain network and the second domain network based on the adaptive parameter, the second alignment loss and the second source domain sample image comprises: taking the adaptive parameter as the network parameters of the feature extractor of the second domain network to obtain the updated second domain network; inputting the second source domain sample image as meta-test data into the feature extractor of the second domain network for feature extraction to obtain a third image feature; performing, based on the second domain network, projective transformation and feature classification on the third image feature to obtain a fourth image feature and a second classification result; determining a meta-test loss for the first domain network based on the fourth image feature, the second classification result and the second alignment loss; and determining the meta-test loss as the cooperative loss. In this meta-learning manner, the second domain network is tested in the meta-test stage on sample images of another source domain, the meta-test loss of the second domain network in that stage is determined, and the meta-test loss is fed back to the first domain network so that the network parameters are optimized together with the domain loss of the first domain network.
In some embodiments, the determining the network parameters of the comprehensive network based on the network parameters of the first domain network and the second domain network comprises: determining the historical network parameters of the comprehensive network from the previous iteration during the iterative training of the object recognition network to be trained; determining the set of predicted network parameters of the feature extractors of the first and second domain networks for the next iteration; and updating the network parameters of the comprehensive network for the next iteration based on the historical network parameters and the set of predicted network parameters. In this way, the knowledge learned by the domain networks can be aggregated into the comprehensive network, so that supervision information can be better fed back to each domain network, realizing the collaborative learning of the comprehensive network and each domain network.
In some embodiments, the adjusting the network parameters of the object recognition network based on the domain losses and the cooperative loss of the first domain network and the second domain network so that the loss output by the adjusted object recognition network satisfies the convergence condition comprises: determining a data volume of the first source domain sample image; determining, based on the data volume, the image features of the second source domain sample image extracted by the comprehensive network, and the second image features extracted by the feature extractor of the first domain network, a uniform loss for adjusting the distribution of image features; adjusting the uniform loss by a preset balance amount to obtain an adjusted uniform loss; fusing the adjusted uniform loss, the domain loss of the first domain network, the domain loss of the second domain network and the cooperative loss to obtain a total loss; and adjusting the network parameters of the object recognition network based on the total loss so that the loss output by the adjusted network satisfies the convergence condition. In one iteration, the total loss of the object recognition network is thus determined and the network parameters are optimized as a whole, yielding a comprehensive network that can be applied to feature extraction in the inference stage and facilitates object recognition; a sketch of the loss fusion follows.
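The embodiments above leave the exact form of the uniform loss open. The following minimal PyTorch sketch therefore treats it as a given scalar and only illustrates the fusion step: the balance-adjusted uniform loss, the domain losses and the cooperative loss are summed into the total loss. All names and the balance value are illustrative assumptions, not the patent's notation.

```python
import torch

def fuse_total_loss(domain_losses, cooperative_loss, uniform_loss, balance=0.1):
    # Total loss = sum of the domain losses + cooperative loss
    #              + uniform loss scaled by the preset balance amount.
    return sum(domain_losses) + cooperative_loss + balance * uniform_loss

# Illustrative scalar values only; in training these come from the networks.
total = fuse_total_loss(
    domain_losses=[torch.tensor(1.2), torch.tensor(0.9)],
    cooperative_loss=torch.tensor(0.4),
    uniform_loss=torch.tensor(2.0),
)
print(float(total))  # 2.7
```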
The embodiment of the application provides an object recognition method, which comprises: acquiring a first image including an object to be recognized; extracting, based on a comprehensive network, the features of the first image and of a second image in a preset image library respectively, to obtain the image features of the first image and the image features of the second image, where the comprehensive network is obtained by training with the above training method for an object recognition network; and re-identifying the object to be recognized in the preset image library based on the image features of the first image and the image features of the second image to obtain a recognition result.
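A minimal sketch of this inference flow, assuming the trained comprehensive network behaves as an ordinary PyTorch feature extractor and using cosine similarity as the re-identification ranking measure (the similarity measure is an assumption; the method above only requires comparing image features):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def re_identify(comprehensive_net, query_image, gallery_images, top_k=5):
    """Rank the preset image library (gallery) against the first image (query)
    by cosine similarity of the features the comprehensive network extracts.

    query_image: (3, H, W) tensor; gallery_images: (N, 3, H, W) tensor.
    Returns the indices of the top-k most similar gallery images.
    """
    comprehensive_net.eval()
    q = F.normalize(comprehensive_net(query_image.unsqueeze(0)), dim=1)  # (1, D)
    g = F.normalize(comprehensive_net(gallery_images), dim=1)            # (N, D)
    similarity = (q @ g.t()).squeeze(0)                                  # (N,)
    return similarity.topk(min(top_k, similarity.numel())).indices
```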
The embodiment of the application provides a training apparatus for an object recognition network, where the object recognition network comprises a comprehensive network, a first domain network and a second domain network, and the apparatus comprises:
a first determining module, configured to determine, based on a first source domain sample image and a second source domain sample image, a first alignment loss between the comprehensive network and the first domain network and a second alignment loss between the comprehensive network and the second domain network respectively; wherein the first source domain sample image and the second source domain sample image do not overlap, the first source domain sample image is used for training the first domain network, and the second source domain sample image is used for training the second domain network;
a second determining module, configured to determine a domain loss of the first domain network based on the first source domain sample image and the first alignment loss, and a domain loss of the second domain network based on the second source domain sample image and the second alignment loss;
a third determining module, configured to determine a cooperative loss between the first domain network and the second domain network based on the second source domain sample image, the second alignment loss and the domain loss of the first domain network;
a fourth determining module, configured to determine a network parameter of the comprehensive network based on the network parameters of the first domain network and the second domain network;
and a first adjusting module, configured to adjust the network parameters of the object recognition network based on the domain losses and the cooperative loss of the first domain network and the second domain network, so that the loss output by the adjusted object recognition network satisfies the convergence condition.
An embodiment of the present application provides an object recognition apparatus, the apparatus includes:
a first acquisition module, configured to acquire a first image including an object to be recognized;
a first extraction module, configured to extract, based on a comprehensive network, the features of the first image and of a second image in a preset image library respectively, to obtain the image features of the first image and the image features of the second image; wherein the comprehensive network is obtained by training with the above training method for an object recognition network;
and a first recognition module, configured to re-identify the object to be recognized in the preset image library based on the image features of the first image and the image features of the second image to obtain a recognition result.
The embodiment of the application provides a computer storage medium storing computer-executable instructions that, when executed, implement the steps of the method described above.
The embodiment of the application provides a computer device comprising a memory and a processor, where the memory stores computer-executable instructions, and the processor implements the steps of the method described above when running the computer-executable instructions on the memory.
The embodiment of the application provides an object recognition method, a network training method and apparatus, a device and a medium. Firstly, the first alignment loss between the comprehensive network and the first domain network and the second alignment loss between the comprehensive network and the second domain network are determined based on a first source domain sample image and a second source domain sample image respectively, where the two sample images do not overlap, the first source domain sample image is used for training the first domain network, and the second source domain sample image is used for training the second domain network; the alignment losses thus let the comprehensive network guide the training of the domain networks. Secondly, the domain loss of the first domain network is determined based on the first source domain sample image and the first alignment loss, the domain loss of the second domain network is determined based on the second source domain sample image and the second alignment loss, and the cooperative loss between the first and second domain networks is determined based on the second source domain sample image, the second alignment loss and the domain loss of the first domain network; the cooperative loss enables collaborative learning between networks of different domains. Thirdly, the network parameters of the comprehensive network are determined based on the network parameters of the first and second domain networks, so that the parameters of the multiple domain networks are aggregated into the comprehensive network to update its parameters. Finally, the network parameters of the object recognition network are adjusted based on the domain loss of the first domain network, the domain loss of the second domain network and the cooperative loss between them, so that the loss output by the adjusted object recognition network satisfies the convergence condition. Through the cooperative training among the multiple domain networks and the cooperation between the comprehensive network and each domain network, the object recognition network achieves high recognition accuracy in any data domain and improved generalization performance.
Drawings
Fig. 1 is a schematic flow chart illustrating an implementation of a training method for an object recognition network according to an embodiment of the present application;
fig. 2A is a schematic flowchart of another implementation of a training method for an object recognition network according to an embodiment of the present application;
fig. 2B is a schematic flow chart illustrating an implementation of the object identification method according to the embodiment of the present application;
fig. 3 is a schematic application scenario diagram of a training method for an object recognition network according to an embodiment of the present application;
fig. 4 is a schematic diagram of an implementation framework of a training method for an object recognition network according to an embodiment of the present application;
fig. 5 is a schematic diagram of a framework for collaborative learning between networks in different domains according to an embodiment of the present application;
fig. 6 is a schematic diagram of an implementation framework of a training method for an object recognition network according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a training apparatus of an object recognition network according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an object recognition apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, specific technical solutions of the present application will be described in further detail below with reference to the accompanying drawings. The following examples are intended to illustrate the present application but not to limit its scope.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first/second/third" merely distinguish similar objects and do not denote a particular order of importance; it is to be understood that "first/second/third" may be interchanged in a particular order or sequence where permissible, so that the embodiments of the present application described herein can be practiced in orders other than those illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms and expressions used in them are explained; the following interpretations apply to these terms and expressions.
1) Domain network: each domain network corresponds to one deep neural network model and is divided into three parts: a feature extractor, a classifier and a projection network. One domain network is responsible for learning the data of one domain; if the training set contains N domains, there are N domain networks.
2) Comprehensive expert: the comprehensive expert is responsible for aggregating the knowledge the domain networks have learned while guiding their learning. There is only one comprehensive expert, and it consists of a feature extractor. It should be noted that the feature extractors of the domain networks and of the comprehensive expert have identical structures (see the sketch after these definitions).
3) Collaborative learning between domain networks: collaborative learning between domain networks is mainly accomplished through the meta-test stage of the meta-learning method. When a certain domain network is trained, the classifiers and projection networks of other domain networks are used in the meta-test stage to assist the training of this domain network's feature extractor.
4) Collaborative learning between the comprehensive expert and the domain networks:
A. The comprehensive expert guides the learning of the domain networks. When a domain network is trained, the image data of the corresponding domain is input into both the comprehensive expert and that domain network, and the embodiment of the application requires the features extracted by the domain network to be as similar as possible to the features extracted by the comprehensive expert.
B. The comprehensive expert aggregates the knowledge learned by the domain networks. The main operation is to update the parameters of the comprehensive expert's feature extractor with an exponential moving average of the mean feature-extractor parameters of the N domain networks, as sketched below.
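To make these definitions concrete, the following PyTorch sketch shows one possible realization of a domain network (feature extractor, classifier and projection network) and of the comprehensive expert (a structurally identical bare feature extractor). The ResNet-50 backbone and the layer sizes are assumptions; the definitions above only fix that the extractor structures are identical.

```python
import torch.nn as nn
from torchvision.models import resnet50

def make_feature_extractor():
    # Backbone shared in structure by every domain network and the expert.
    backbone = resnet50(weights=None)
    backbone.fc = nn.Identity()  # expose the 2048-d pooled feature
    return backbone

class DomainNetwork(nn.Module):
    """One domain network: feature extractor, classifier, projection network."""

    def __init__(self, num_identities, feat_dim=2048):
        super().__init__()
        self.extractor = make_feature_extractor()
        self.classifier = nn.Linear(feat_dim, num_identities)
        # Projection network mapping domain features into the expert's space.
        self.projection = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, images):
        feats = self.extractor(images)
        return feats, self.projection(feats), self.classifier(feats)

# There is only one comprehensive expert; it is a bare feature extractor.
comprehensive_expert = make_feature_extractor()
```

Point B describes aggregation by exponential moving average over the feature-extractor parameters of the N domain networks; a minimal sketch (the momentum value is an assumption):

```python
import torch

@torch.no_grad()
def ema_update(expert, domain_networks, momentum=0.999):
    # Move each expert parameter toward the mean of the corresponding
    # feature-extractor parameters of the N domain networks.
    for name, p_expert in expert.named_parameters():
        mean = torch.stack([
            dict(net.extractor.named_parameters())[name]
            for net in domain_networks
        ]).mean(dim=0)
        p_expert.mul_(momentum).add_(mean, alpha=1.0 - momentum)
```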
An exemplary application of the object recognition device provided in the embodiments of the present application is described below, and the device provided in the embodiments of the present application may be implemented as various types of user terminals such as a notebook computer with an image capture function, a tablet computer, a desktop computer, a camera, a mobile device (e.g., a personal digital assistant, a dedicated messaging device, and a portable game device), and may also be implemented as a server. In the following, an exemplary application will be explained when the device is implemented as a terminal or a server.
The method can be applied to a computer device, and the functions realized by the method can be realized by a processor in the computer device calling program code; the program code can be stored in a computer storage medium, so the computer device comprises at least a processor and a storage medium.
An embodiment of the present application provides a method for training an object recognition network, where the architecture of the object recognition network includes a comprehensive network and at least two domain networks, for example a first domain network and a second domain network. The method is described with reference to the steps shown in fig. 1:
step S101, respectively determining a first alignment loss between the comprehensive network and the first domain network and a second alignment loss between the comprehensive network and the second domain network based on a first source domain sample image and a second source domain sample image.
In some embodiments, there is no overlap between the first source domain sample image used to train the first domain network and the second source domain sample image used to train the second domain network. Taking pedestrian re-identification as an example, among the at least two domain networks, the first source domain sample image corresponding to the first domain network may be a complete image of a pedestrian's face and body, the second source domain sample image corresponding to the second domain network may be an incomplete picture of a pedestrian (for example, a partially occluded pedestrian), a third source domain sample image corresponding to a third domain network may be images of multiple pedestrians with similar faces (for example, images of multiple pedestrians who are related), and so on. The first domain network stands for each of the at least two domain networks in turn. The first source domain sample images corresponding to the first domain network come from one source domain; the second source domain sample image and the first source domain sample image belong to different source domains. Each sample image contains one or more annotated objects and may have a complex or simple appearance. The first source domain sample image may be an image containing an annotated object acquired by any acquisition device in any scene, and the annotated object may be a pedestrian, an animal or any other object to be recognized. Each of the at least two domain networks has the same network architecture; for example, each domain network includes three parts: a feature extractor, a classifier and a projection network, with different network parameters across domain networks. The comprehensive network may be implemented as a feature extractor, which may be structurally identical to the feature extractors in the domain networks.
In some possible implementations, the first source domain sample image is first input into the first domain network for feature extraction and projective transformation to obtain a projectively transformed feature, and is simultaneously input into the comprehensive network for feature extraction to obtain an image feature; the first alignment loss between the comprehensive network and the first domain network is then obtained by determining the distance, in feature space, between the projectively transformed feature and the image feature extracted by the comprehensive network. Through the first alignment loss, the comprehensive network can guide the first domain network during training. Similarly, the second source domain sample image is input into the second domain network for feature extraction and projective transformation and into the comprehensive network for feature extraction, and the second alignment loss between the comprehensive network and the second domain network is obtained from the distance between the two resulting features in feature space.
Step S102, determining the domain loss of the first domain network and determining the domain loss of the second domain network based on the first source domain sample image and the first alignment loss, and the second source domain sample image and the second alignment loss, respectively.
In some embodiments, for each domain network in the object recognition network, the domain loss of that network is determined from the source domain sample image and the alignment loss corresponding to its domain. Determining the domain loss of the first domain network based on the first source domain sample image and the first alignment loss comprises: determining internal losses of the first domain network, such as the classification loss and the triplet loss, from the first source domain sample image; and combining the internal losses with the first alignment loss to obtain the domain loss of the first domain network. In some possible implementations, the first source domain sample image is input into the first domain network for feature extraction, and projective transformation and classification are performed on the extracted features; the internal loss of the first domain network is determined based on the prediction result output by the first domain network and the truth label of the object in the sample image; finally, the first alignment loss between the comprehensive network and the first domain network is fused with the internal loss to obtain the domain loss of the first domain network in the training process.
In some embodiments, the domain loss of the second domain network is determined analogously based on the second source domain sample image and the second alignment loss: internal losses of the second domain network, such as the classification loss and the triplet loss, are determined from the second source domain sample image and combined with the second alignment loss to obtain the domain loss of the second domain network. In some possible implementations, the second source domain sample image is input into the second domain network for feature extraction, projective transformation and classification are performed on the extracted features, the internal loss of the second domain network is determined based on the prediction result output by the second domain network and the truth label of the object in the sample image, and the second alignment loss between the comprehensive network and the second domain network is fused with this internal loss to obtain the domain loss of the second domain network in the training process. Applying the first alignment loss to the training of the first domain network and the second alignment loss to the training of the second domain network realizes the guidance and supervision of the comprehensive network over both domain networks during training.
Step S103, determining the cooperative loss between the first domain network and the second domain network based on the second source domain sample image, the second alignment loss and the domain loss of the first domain network.
In some embodiments, the second source domain sample image belongs to a different source domain than the first source domain sample image. Each of the two domain networks is thus trained on sample images of a specific source domain, and different domain networks learn sample images belonging to different domains during training; accordingly, the number of the at least two domain networks equals the number of source domains of the sample images.
The second domain network is any one of the multiple domain networks other than the first domain network. For example, if the at least two domain networks include 5 domain networks and the first domain network is the 1st of them, the second domain network is any one of the remaining 4. In some possible implementations, first, an adaptive parameter for realizing collaborative learning between the first domain network and the second domain network is determined based on the domain loss of the first domain network; then, the adaptive parameter is taken as the network parameters of the second domain network, and feature extraction, projective transformation and classification are performed on the second source domain sample image based on the adaptive parameter to obtain the internal loss of the second domain network; finally, the second alignment loss, with which the comprehensive network guides the training of the second domain network, is determined based on the second source domain sample image, and the internal loss and the second alignment loss are fused to obtain the cooperative loss between the first domain network and the second domain network. Based on the cooperative loss, the first and second domain networks can be trained across sample images of different source domains, giving them stronger generalization performance; a sketch of this mechanism follows.
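The description of the adaptive parameter matches a MAML-style inner step: the first domain network's extractor parameters are updated once by the gradient of its own domain loss, and the second domain network then runs with those parameters on meta-test data. A PyTorch sketch under that reading follows; the inner learning rate is an assumption, and the meta-test loss is simplified to a classification term plus the second alignment loss (the projection and image distance terms of step S103 are omitted for brevity).

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def inner_adapt(extractor, domain_loss_1, inner_lr=0.01):
    # Adaptive parameters: one gradient step on the first domain network's
    # feature extractor, driven by its domain loss. create_graph keeps the
    # step differentiable so the cooperative loss can flow back to it.
    named = [(n, p) for n, p in extractor.named_parameters() if p.requires_grad]
    names, params = zip(*named)
    grads = torch.autograd.grad(domain_loss_1, params, create_graph=True)
    return {n: p - inner_lr * g for n, p, g in zip(names, params, grads)}

def cooperative_loss(first_net, second_net, domain_loss_1,
                     test_images, test_labels, second_alignment_loss):
    # Meta-test: run the second domain network's extractor with the adaptive
    # parameters on second-source-domain images, then fuse the resulting
    # classification loss with the second alignment loss.
    adapted = inner_adapt(first_net.extractor, domain_loss_1)
    feats = functional_call(second_net.extractor, adapted, (test_images,))
    logits = second_net.classifier(feats)
    return F.cross_entropy(logits, test_labels) + second_alignment_loss
```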
Step S104, determining the network parameters of the comprehensive network based on the network parameters of the first domain network and the second domain network.
In some embodiments, the network parameters of the comprehensive network are determined from the network parameters of the at least two domain networks as follows: during the iterative training of the object recognition network to be trained, the network parameters of the feature extractors of all domain networks are acquired in each iteration, these parameters are processed with an exponential moving average, and the network parameters of the comprehensive network are updated accordingly to obtain its parameters for the current iteration. By aggregating the network parameters of the multiple domain networks into the comprehensive network, the comprehensive network can in turn guide the training of each domain network as feedback.
And step S105, adjusting the network parameters of the object recognition network based on the domain loss of the first domain network, the domain loss of the second domain network and the cooperative loss between the first domain network and the second domain network, so that the loss output by the adjusted object recognition network satisfies the convergence condition.
In some embodiments, in one iteration, in the meta-learning manner, the domain loss of each domain network serves as the meta-training loss of the meta-training stage and the cooperative loss between the domain networks serves as the meta-test loss of the meta-test stage; the meta-training loss and the meta-test loss are fused into a total loss; and the network parameters of the object recognition network to be trained are adjusted based on the total loss until the total loss satisfies the convergence condition, at which point the iteration stops and the training of the object recognition network is complete. During iteration, the network parameters of each domain network and of the comprehensive network are adjusted based on the total loss, so that the output domain losses and cooperative loss all satisfy the convergence condition. In some possible implementations, within one iteration, the meta-training part of the total loss is used to optimize the network parameters of the whole framework of the object recognition network to be trained, while the meta-test part is not used to optimize network parameters directly but to improve the behaviour of the whole framework: the meta-training loss learns from sample images of one source domain, and the meta-test loss learns across sample images of different source domains. Optimizing the whole network with the meta-training and meta-test losses combined yields a comprehensive network with stronger generalization performance; a sketch of one such iteration follows.
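Composing the stages above, one training iteration might look as follows. This is a sketch only: `compute_domain_loss` is an assumed helper returning a domain network's meta-training loss per step S102, `alignment_loss_fn` computes a domain network's alignment loss against the expert, and pairing each domain with the next one for the meta-test stage is an illustrative choice.

```python
import torch

def train_iteration(expert, domain_networks, batches, optimizer,
                    compute_domain_loss, alignment_loss_fn,
                    cooperative_loss, ema_update):
    total = torch.zeros(())
    for i, net in enumerate(domain_networks):
        # Meta-training stage: each network learns its own source domain.
        meta_train = compute_domain_loss(net, expert, batches[i])
        # Meta-test stage: test against a different source domain.
        j = (i + 1) % len(domain_networks)
        images_j, labels_j = batches[j]
        align_j = alignment_loss_fn(expert, domain_networks[j], images_j)
        meta_test = cooperative_loss(net, domain_networks[j], meta_train,
                                     images_j, labels_j, align_j)
        total = total + meta_train + meta_test
    optimizer.zero_grad()
    total.backward()   # the meta-test loss flows back through the inner step
    optimizer.step()
    ema_update(expert, domain_networks)  # aggregate into the comprehensive expert
    return float(total)
```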
In the embodiment of the application, the first alignment loss between the comprehensive network and the first domain network is determined based on the first source domain sample image, so that the alignment loss can be used to let the comprehensive network guide the training of the domain networks. The cooperative loss between networks of different domains enables collaborative learning between them; determining the network parameters of the comprehensive network based on the network parameters of the at least two domain networks aggregates the parameters of the multiple domain networks into the comprehensive network to update it. Through the cooperative training among the multiple domain networks and the guidance and supervision of each domain network by the comprehensive network, the resulting object recognition network achieves high recognition accuracy in any data domain and good generalization performance.
In some embodiments, the guidance of the domain network by the comprehensive network is realized by determining the feature-space distance between features; that is, the above step S101 may be realized by the steps shown in fig. 2A, another implementation flow diagram of the training method for an object recognition network provided in the embodiment of the present application, described below in conjunction with the steps shown in figs. 1 and 2A:
step S201, performing feature extraction on the first source domain sample image based on the comprehensive network to obtain a first image feature.
In some embodiments, the first source domain sample image is input into the comprehensive network, and the comprehensive network performs feature extraction on the sample image to obtain the first image feature. In some possible implementations, the comprehensive network may be any type of feature extraction network, such as a residual network, a convolutional neural network or a dense network; see the sketch below.
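For instance, with the torchvision library the comprehensive network could be built from either a residual or a dense backbone; both concrete choices below are illustrative, not prescribed by the embodiment.

```python
import torch.nn as nn
from torchvision.models import densenet121, resnet50

def make_comprehensive_network(kind="residual"):
    # The comprehensive network is a bare feature extractor; any
    # feature-extraction backbone works.
    if kind == "residual":
        net = resnet50(weights=None)
        net.fc = nn.Identity()          # 2048-d features
    else:
        net = densenet121(weights=None)
        net.classifier = nn.Identity()  # 1024-d features
    return net
```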
Step S202, based on the first domain network, performing feature extraction and projection transformation on the first source domain sample image to obtain a second image feature.
In some embodiments, the first source domain sample image is input into the first domain network at the same time as it is input into the comprehensive network. In the first domain network, feature extraction is first performed on the sample image, and projective transformation is then performed on the extracted image feature, realizing the coordinate transformation of the image feature across spatial levels and yielding the transformed second image feature, so that the second image feature and the first image feature lie in the same feature space.
In some possible implementations, the network architecture of each of the multiple domain networks is the same and includes a feature extractor, a projection network and a classifier. The feature extractor can have the same architecture as the feature extractor of the comprehensive network, with different network parameters; the projection network may be any network capable of projectively transforming features, such as a feed-forward neural network; the classifier can be any neural network capable of feature classification, such as a convolutional neural network or a residual neural network. In step S202, after the first source domain sample image is input to the first domain network, the second image feature may be obtained through the following steps S241 and S242 (not shown in the figure):
step S241, performing feature extraction on the first source domain sample image based on the feature extractor.
In some embodiments, the first source domain sample image is input to a feature extractor of the first domain network for feature extraction, resulting in extracted image features.
Step S242, based on the projection network, performing projective transformation on the image features extracted by the feature extractor to obtain the second image feature.
In some embodiments, the extracted image features are input into the projection network of the first domain network for projective transformation, yielding the second image feature in the same feature space as the first image feature. The feature extractor and projection network of the first domain network thus produce the second image feature from the corresponding sample image, so that the spatial distance between the second image feature and the first image feature can be determined and the comprehensive network can guide the training of the first domain network through the alignment loss.
Step S203, determining a first alignment loss between the comprehensive network and the first domain network based on the second image feature and the first image feature.
In some embodiments, the first alignment loss is calculated by computing the Euclidean distance between the second image feature and the first image feature and comparing this distance with the true-value distance between features of the same image; because the distance between features of one sample image should be as small as possible, the alignment loss can guide the domain network during training. Determining the first alignment loss between the comprehensive network and the first domain network from the feature-space distance between features thus realizes the guidance of the domain network by the comprehensive network during training, the cooperation between them, and improved generalization performance of the network.
The above steps S201 to S203 implement "determining the first alignment loss between the comprehensive network and the first domain network based on the first source domain sample image" in step S101. Determining the second alignment loss between the comprehensive network and the second domain network based on the second source domain sample image in step S101 is similar to steps S201 to S203: the comprehensive network first extracts features from the second source domain sample image to obtain an image feature; the second domain network then performs feature extraction on the second source domain sample image to obtain another image feature; finally, the second alignment loss between the comprehensive network and the second domain network is determined from the two image features.
In some possible implementations, the first alignment loss between the comprehensive network and the first domain network is determined based on the true-value distance between features of the same sample image and the predicted feature distance between the first image feature and the second image feature; that is, step S203 can be implemented by the following steps S231 and S232 (not shown in the figure):
step S231, determining a feature distance between the second image feature and the first image feature in the feature space of the first source domain sample image.
In some embodiments, in step S231, the feature space of the first source domain sample image refers to the feature space of the image feature obtained by the comprehensive network's feature extraction on the sample image. In this feature space, the Euclidean distance between the second image feature, obtained through projective transformation by the projection network of the first domain network, and the first image feature, obtained by the comprehensive network's feature extraction on the sample image, is determined and taken as the feature distance.
Step S232, determining the first alignment loss based on the feature distance and a true feature distance between features in the first source domain sample image.
In some embodiments, if the first image feature and the second image feature come from the same image, the feature distance between them should be small, i.e. the true value of the feature distance is small, indicating that the two image features are highly correlated; if they come from different images, the feature distance should be large, i.e. the two image features are uncorrelated. The first alignment loss is determined from the difference between the predicted feature distance and the true feature distance, which is set according to whether the first and second image features belong to the same sample image. Through the alignment loss, the comprehensive network can use the correlation between features of one image to give feedback to the training of the first domain network on the sample image, so that the correlation between features of the same sample image is taken into account during training; a sketch of this alignment loss follows.
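A minimal sketch of steps S231 and S232, assuming, as the passage suggests, that the true-value distance for features of the same image is zero, so the loss reduces to the mean Euclidean distance between the first image features (extracted by the comprehensive network) and the second image features (projected by the first domain network) for the same batch of sample images:

```python
import torch

def alignment_loss(first_image_features, second_image_features):
    # Both inputs are (B, D) tensors, row-aligned by sample image.
    # Mean Euclidean distance in the comprehensive network's feature space.
    diff = first_image_features - second_image_features
    return diff.pow(2).sum(dim=1).sqrt().mean()
```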
In other embodiments, the feature distance between the image feature of the second source domain sample image extracted by the comprehensive network and the image feature of the second source domain sample image extracted by the second domain network is determined in the feature space of the second source domain sample image, and the second alignment loss is determined based on this feature distance and the true feature distance between features in the second source domain sample image, thereby realizing the guidance of the second domain network by the comprehensive network during training.
In some embodiments, the multiple domain networks are trained collaboratively in the meta-learning manner, completing their training through a meta-training stage and a meta-test stage; that is, the above step S102 may be implemented through the following steps S121 to S123 (not shown in the figure):
step S121, training the first domain network by taking the first source domain sample image as meta-training data to obtain the internal loss of the first domain network.
In some embodiments, training the object recognition network to be trained includes training each domain network as well as the comprehensive network. The collaborative training of the first and second domain networks includes training the first domain network in the meta-training stage: the first source domain sample image is input into the feature extractor of the first domain network for feature extraction; the extracted features are then input into the classifier of the first domain network to determine the classification loss of the classifier; the image distance loss between the first source domain sample image and a positive sample image is determined; and the classification loss and the image distance loss are taken as the internal loss for optimizing the network parameters of the first domain network. For the multiple domain networks in the object recognition network, the internal loss of each domain network is determined within one iteration of training.
In some possible implementations, the domain loss of the first domain network is determined by combining the alignment loss between the first domain network and the comprehensive network with the internal loss of the first domain network itself; that is, step S102 may be implemented as follows:
the method comprises the following steps of firstly, determining image distance loss between a first source domain sample image and a positive sample image based on image features extracted by the first domain network and image features of the positive sample image.
In some embodiments, the image features extracted by the first domain network, the features of an anchor sample image in the sample image set, and the features of the positive sample image are determined; the image distance loss among the sample image, the anchor sample image and the positive sample image is then determined based on these features. The positive sample image and the anchor sample image belong to the same class of image; the image distance loss is used to minimize the distance between the sample image and the positive sample image and the anchor image, and to maximize the distance between the sample image and a negative sample image.
In a specific example, the image distance loss may be a triplet loss of the sample image.
In the second step, in the first domain network, the image features extracted by the first domain network are classified to obtain a first classification result.
In some embodiments, in the first domain network, after the feature extractor of the first domain network performs feature extraction, the category to which the extracted image features belong is predicted against the category labels annotated in the sample images, yielding the first classification result.
In some possible implementation manners, the first domain network further includes a classifier for classifying features, and based on the classifier, the image features extracted by the feature extractor are classified to obtain the classification result. In this way, the extracted image features are input into the classifier of the first domain network, and feature classification is carried out to obtain a feature classification result.
And thirdly, determining the classification loss of the image features extracted by the first domain network based on the first classification result and the truth-value classification label of the first source domain sample image.
In some embodiments, according to the classification result predicted by the classifier based on the extracted image feature and the true value classification label of the image feature, the difference between the predicted classification result and the true value classification label of the image feature is determined, and based on the difference, the classification loss of the image feature can be determined.
And fourthly, determining the classification loss and the image distance loss as the internal loss of the first domain network.
Through the first step to the fourth step, the first source domain sample image is input into the first domain network, and the classification loss and the triplet loss of the first domain network are respectively determined, so that they can be fused into the domain loss for optimizing the network parameters of the first domain network, which can accelerate the training of the first domain network.
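As an illustrative, non-limiting example, the internal loss described in the above steps can be sketched as follows in PyTorch-style code; the function and variable names, the batch-hard mining strategy, and the margin of 0.3 (the value used later in the text) are assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn.functional as F

def internal_loss(features, logits, labels, margin=0.3):
    """features: (B, D) embeddings; logits: (B, C) classifier outputs; labels: (B,)."""
    # Classification loss of the domain classifier (softmax cross-entropy).
    cls_loss = F.cross_entropy(logits, labels)

    # Pairwise Euclidean distances between all embeddings in the batch.
    dist = torch.cdist(features, features)                # (B, B)
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B) bool

    # Hard mining: farthest positive and nearest negative per anchor.
    pos_dist = (dist * same_id.float()).max(dim=1).values
    neg_dist = dist.masked_fill(same_id, float("inf")).min(dim=1).values
    tri_loss = F.relu(margin + pos_dist - neg_dist).mean()

    # Internal loss = classification loss + image distance (triplet) loss.
    return cls_loss + tri_loss
```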
Step S122, determining a meta-training loss for adjusting a network parameter of the first domain network based on the internal loss and the first alignment loss.
In some embodiments, the image distance loss, the classification loss, and the first alignment loss of the first-domain network are summed element-by-element to obtain a meta-training loss that optimizes the network parameters of the feature extractor, the projection network, and the classifier of the first-domain network.
In some possible implementations, the first alignment loss, the image distance loss, and the classification loss are fused to obtain a meta-training loss of the first domain network.
In some embodiments, the first alignment loss, the image distance loss, and the classification loss are summed element by element to arrive at the domain loss of the first domain network. Thus, the alignment loss between the first domain network and the comprehensive network is combined with the internal loss of the first domain network itself, and the combined loss serves as the domain loss of the first domain network, so that the first domain network can be trained based on the domain loss and can receive the guidance of the comprehensive network while learning from the sample images.
Step S123, determining the meta-training loss of the first domain network as the domain loss.
In the embodiment of the application, the meta-training loss of the first domain network in the meta-training stage is determined in a meta-learning manner, and the meta-training loss is used as the domain loss for optimizing the network parameters of the first domain network, so that the flexibility of object identification of the first domain network can be improved.
In the embodiment of the application, each field network in the object recognition network is trained in a meta-learning mode, and the meta-training loss of each field network in the meta-training stage is determined, so that the network parameters of each field network are optimized.
In some embodiments, the network parameters of the feature extractor of the second domain network are updated based on the network parameters of the feature extractor of the first domain network, and the cooperative loss of the second domain network in the meta-test stage is determined in combination with the second alignment loss between the second domain network and the comprehensive network; that is, the step S103 may be implemented through the following steps S131 and S132 (not shown in the figure):
step S131, based on the network parameters of the feature extractor of the first domain network and the domain loss of the first domain network, determining adaptive parameters.
In some embodiments, the network parameters of the feature extractor of the first domain network are taken as the network parameters requiring meta-learning; the adjustment parameter is multiplied by the gradient of the domain loss of the first domain network in the meta-training stage to obtain a product (consistent with equation (5) below); for example, the value of the adjustment parameter is set to 0.1. The product is then subtracted, element by element, from the network parameters of the feature extractor to obtain the adaptive parameters. In this way, the adaptive parameters are determined through the domain loss of the first domain network in the training phase and used as the network parameters of the second domain network, so that collaborative learning among different domain networks across a plurality of source domain sample images can be realized.
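As a hedged illustration, the adaptive parameters can be computed as a single gradient step on the feature extractor's parameters, in line with the MAML-style update of equation (5) later in the text; the function names are assumptions, and create_graph=True is used so that a loss computed with the adapted parameters can still be back-propagated to the original parameters.

```python
import torch

def adaptive_parameters(meta_train_loss, extractor_params, alpha=0.1):
    """Return theta' = theta - alpha * dL/dtheta, without modifying theta."""
    grads = torch.autograd.grad(
        meta_train_loss, extractor_params, create_graph=True
    )
    return [p - alpha * g for p, g in zip(extractor_params, grads)]
```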
Step S132, determining a coordination loss between the first domain network and the second domain network based on the adaptive parameter, the second alignment loss, and the second source domain sample image.
In some embodiments, the adaptive parameters are first taken as the network parameters of the feature extractor of the second domain network, and feature extraction is performed on the second source domain sample image; secondly, the extracted features are input into the classifier and the projection network of the second domain network respectively; thirdly, the classification loss of the second domain network is determined based on the output result of the classifier, and the second alignment loss between the second domain network and the comprehensive network is determined based on the output of the projection network; finally, the cooperative loss of the second domain network in the meta-test stage is determined from the second alignment loss, the classification loss, and the triplet loss of the second domain network. In this way, by adopting a meta-learning manner, the network parameters of the feature extractor of the first domain network are used as the network parameters of the feature extractor of the second domain network, and the cooperative loss of the second domain network in the meta-test stage is determined, so that collaborative learning between the first domain network and the second domain network on non-overlapping sample images of different source domains is realized based on the cooperative loss, and the generalization capability of the whole network architecture is improved.
In some possible implementations, the cooperative loss between the second domain network and the first domain network is determined by combining the domain loss of the first domain network, the alignment loss between the second domain network and the overall network, and the internal loss of the second domain network itself, that is, the step S132 may be implemented by:
and step one, the adaptive parameters are used as network parameters of a feature extractor of the second field network to obtain the updated second field network.
In some embodiments, the adaptation parameter determined based on the feature extractor of the first domain network is taken as the network parameter of the feature extractor of the second domain network, and the second domain network having the network parameter as the adaptation parameter is taken as the updated second domain network. The network parameters of the feature extractor in the updated second domain network are the adaptive parameters, and the network parameters of the projection network in the updated second domain network and the network parameters of the classifier remain unchanged, i.e. are the same as the network parameters of the projection network in the original second domain network and the network parameters of the classifier.
And secondly, inputting the second source domain sample image as meta-test data into a feature extractor of the second domain network for feature extraction to obtain third image features.
In some embodiments, in the meta-test stage, the second source domain sample image is used as meta-test data and is simultaneously input to a feature extractor of a second domain network and a comprehensive network, and feature extraction is performed on the second source domain sample image based on the updated feature extractor of the second domain network to obtain a third image feature; and simultaneously, carrying out feature extraction on the second source domain sample image through a comprehensive network to obtain image features.
And thirdly, respectively carrying out projection conversion and feature classification on the third image features based on the second domain network to obtain fourth image features and a second classification result.
In some embodiments, in the meta-test stage, the third image features extracted by the feature extractor are respectively input to a classifier and a projection network of the second domain network, the classifier outputs a second classification result, and the projection network performs projection conversion on the third image features, so that the obtained fourth image features and the image features extracted by the comprehensive network are in the same feature space. The second classification result is used for representing the confidence degree that the third image characteristic belongs to the class of the object to be recognized.
And fourthly, determining the meta-test loss of the first domain network based on the fourth image characteristic, the second classification result and the second alignment loss.
In some embodiments, the second alignment loss between the second domain network and the comprehensive network is determined by determining the feature distance between the fourth image features output by the projection network and the features of the second source domain sample image extracted by the comprehensive network, and based on that feature distance and the truth-value feature distance in the second source domain sample image. The image distance loss between the second source domain sample image and a positive sample image is determined from the image distance between the image features extracted by the feature extractor of the second domain network and the image features of the positive sample image. The classification loss of the classifier of the second domain network is determined according to the second classification result and the truth-value classification result of the image features in the second source domain sample image. Based on this, the meta-test loss is obtained by summing the second alignment loss, the image distance loss, and the classification loss element by element. Therefore, the adaptive parameters determined based on the domain loss of the first domain network are used as the network parameters of the feature extractor of the second domain network, the second alignment loss between the second domain network and the comprehensive network is combined with the internal loss of the second domain network, and the combined loss is used as the cooperative loss between the second domain network and the first domain network; in this way, the first domain network and the second domain network can learn collaboratively based on the sample images of a plurality of different source domains, and the cooperative loss in the meta-test stage is fed back to the first domain network, so that the network parameters of the first domain network are optimized in the meta-training stage.
And fifthly, determining the meta-test loss as the cooperative loss.
In the embodiment of the application, in a meta-learning manner, a sample image of another source domain is used in the meta-test stage to test the second domain network; the meta-test loss of the second domain network in the meta-test stage is determined and fed back to the first domain network, so that the network parameters are optimized based on the domain loss of the first domain network.
In some embodiments, the training of the entire object recognition network is implemented through multiple iterations, losses of multiple domain networks and a comprehensive network are determined in one iteration process, and network parameters of the comprehensive network are optimized based on an exponential moving average method, that is, the above step S104 can be implemented through the following steps S141 to S143 (not shown in the figure):
step S141, in the iterative training process of the object recognition network to be trained, determining the historical network parameters of the previous iterative training of the comprehensive network.
In some embodiments, the object recognition network to be trained is iteratively trained a plurality of times until the overall loss of the object recognition network satisfies the convergence condition, at which point the iterative process ends. In one iteration, for each domain network, the meta-training loss of the domain network is determined based on the alignment loss between the domain network and the comprehensive network, the meta-test loss is then determined based on the meta-training loss, and the iteration is completed based on the losses of all the domain networks. Thus, the network parameters in the previous iteration of training include the network parameters of all the domain networks and the network parameters of the comprehensive network, from which the historical network parameters of the comprehensive network are determined.
And step S142, determining a predicted network parameter set of the feature extractors in the first domain network and the second domain network in the next iterative training.
In some embodiments, in the next iteration training of the previous iteration training, the network parameters of the feature extractors of all the field networks are determined, and a prediction network parameter set is obtained.
And step S143, updating the network parameters of the comprehensive network in the next iterative training based on the historical network parameters and the predicted network parameter set.
In some embodiments, based on the exponential moving average method, the historical network parameters are weighted by a weighting coefficient in the exponential moving average method; the average value of the predicted network parameter set is determined and weighted by another weighting coefficient (the sum of this coefficient and the weighting coefficient of the historical network parameters is 1), and the two weighted results are summed element by element to obtain a summation result; the summation result is taken as the network parameters of the comprehensive network in the next iteration of training. Therefore, in one iteration of training, the network parameters of the comprehensive network in the previous iteration are combined with the network parameters of the feature extractors of all the domain networks in the next iteration, so that the comprehensive network can collect the knowledge learned by the networks of a plurality of domains, better feed supervision information back to the network of each domain, and realize the collaborative learning between the comprehensive network and the networks of each domain.
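The exponential-moving-average update described above can be sketched as follows; this is a minimal illustration assuming the comprehensive network's parameters and the domain extractors' parameters are stored as aligned lists of tensors, with ε = 0.999 as in the example later in the text.

```python
import torch

@torch.no_grad()
def ema_update(comprehensive_params, domain_extractor_params, epsilon=0.999):
    """domain_extractor_params: list over N domains of per-domain parameter lists."""
    for i, v in enumerate(comprehensive_params):
        # Average the i-th parameter tensor over the N domain extractors.
        mean_theta = torch.stack(
            [params[i] for params in domain_extractor_params]
        ).mean(dim=0)
        # v <- epsilon * v + (1 - epsilon) * mean(theta_n)
        v.mul_(epsilon).add_((1.0 - epsilon) * mean_theta)
```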
In some embodiments, in order to make the image features uniformly distributed in the feature space, a uniform loss is employed to make the feature distribution among sample images of different source domains more uniform; the uniform loss is introduced into the iterative training to determine the total loss, and the network parameters of the entire network are optimized by the total loss; that is, the above step S105 can be implemented by steps S151 to S155 (not shown in the figure):
step S151, determining a data amount of the first source domain sample image.
In some embodiments, a small batch of sample images, i.e., the first source domain sample images, is obtained by sampling in the first source domain, and the number of sampled images is the data amount of the first source domain sample images. For each domain network, sample images with the same data amount as the first source domain sample images are collected in the corresponding source domain.
Step S152, determining a uniform loss for adjusting the distribution of image features based on the data amount, the image features of the second source domain sample image extracted by the comprehensive network, and the second image features extracted by the feature extractor of the first domain network.
In some embodiments, based on the image features of the second source domain sample image extracted by the comprehensive network and the second image features extracted by the feature extractor of the first domain network, determining a spatial distance between the two image features, and taking the spatial distance as an exponent of a power operation with e as a base; and determining the corresponding index for each source domain sample image, and averaging the indexes of all the source domain sample images based on the data volume of the domain sample image to obtain the uniform loss.
And step S153, adjusting the uniform loss by adopting a preset balance amount to obtain an adjusted uniform loss.
In some embodiments, the uniform loss is adjusted using a predetermined balance amount (e.g., 0.1); for example, the balance amount is multiplied by the uniform loss to obtain the adjusted uniform loss.
Step S154, the adjusted uniform loss, the domain loss of the first domain network, the domain loss of the second domain network, and the cooperative loss are fused to obtain a total loss.
In some embodiments, the domain loss of each domain network and the cooperative loss between the two networks are weighted by a preset weighting coefficient (e.g., 0.5), and the adjusted uniform loss, the weighted domain losses, and the weighted cooperative loss are summed element by element to obtain the total loss.
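A small sketch of this fusion is given below, under the assumption that the losses are already computed as scalars; the coefficients follow the examples in the text, and the call signature is an assumption.

```python
def total_loss(domain_losses, coop_losses, unif_loss, weight=0.5, balance=0.1):
    # Weighted domain and cooperative losses plus the balance-scaled uniform loss.
    return (weight * sum(domain_losses)
            + weight * sum(coop_losses)
            + balance * unif_loss)
```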
Step S155, based on the total loss, adjusts the network parameters of the object recognition network so that the adjusted loss of the overall network output satisfies a convergence condition.
In some embodiments, in an iterative process, overall optimization is performed on network parameters in an object recognition network by determining total loss of the network, so that a comprehensive network which can be applied to an inference stage for feature extraction so as to perform object recognition is obtained.
In the embodiment of the application, the uniform loss of the balanced feature distribution is fused into the total loss in the iterative training process, so that the distribution uniformity of the features can be fully considered in the process of optimizing the network parameters based on the total loss, and the network performance of the object recognition network can be improved.
An embodiment of the present application provides an object identification method, where feature extraction is performed through a comprehensive network in the foregoing embodiment, and pedestrian re-identification is implemented based on extracted image features, as shown in fig. 2B, fig. 2B is a schematic diagram of an implementation flow of the object identification method provided in the embodiment of the present application, and the following description is performed in combination with the steps shown in fig. 2B:
in step S21, a first image including an object to be recognized is acquired.
In some embodiments, the object to be identified may be a pedestrian, an animal, or another object that needs to be identified in a preset database. The first image may be an image including a pedestrian, an animal, or another object captured in any scene. In the overall framework of object recognition, the first image carries preset restriction conditions of the object to be recognized; taking the case where the object to be recognized is a pedestrian and the pedestrian is re-identified in a preset image library as an example, the first image may include restriction conditions describing the identity information, facial features, body features, or the like of the pedestrian, and images satisfying these conditions are matched in the preset image library based on the restriction conditions.
Step S22, respectively extracting the features of the first image and the second image in a preset image library based on a comprehensive network to obtain the image features of the first image and the second image.
In some embodiments, the comprehensive network may be obtained through training by the training method of the object recognition network provided in the above embodiments. The comprehensive network may be a feature extractor, or a network that includes a feature extractor and implements object recognition. If the comprehensive network is a feature extractor, it is embedded, as a feature extraction module, into a network architecture for object recognition to form a complete object recognition network. The preset image library may be a static image library containing a large number of images; taking pedestrian re-identification as an example, the first image including the restriction conditions of the object to be identified and each image in the preset image library are input into the comprehensive network for feature extraction.
Step S23, re-identifying the object to be identified in the preset image library based on the image features of the first image and the second image to obtain an identification result.
In some embodiments, if the comprehensive network is a feature extractor, both the images in the preset image library and the first image containing the object to be identified are input into the object identification network; feature extraction is performed by the comprehensive network, and object identification is performed in the preset image library by the other modules of the network architecture based on the image features extracted by the comprehensive network, so as to identify target images whose similarity to the object to be identified is greater than a certain threshold. The identification information of the object marked in the target image is the identification result.
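For illustration, a minimal retrieval sketch of this inference stage is given below; the network name, the cosine-similarity measure, and the 0.8 threshold are assumptions, not values fixed by the present application.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def re_identify(comprehensive_net, query_img, gallery_imgs, threshold=0.8):
    # Embed the query and every gallery image with the comprehensive network.
    q = F.normalize(comprehensive_net(query_img.unsqueeze(0)), dim=1)  # (1, D)
    g = F.normalize(comprehensive_net(gallery_imgs), dim=1)            # (G, D)
    sims = (g @ q.t()).squeeze(1)                                      # (G,)
    # Return indices of gallery images whose similarity exceeds the threshold.
    return (sims > threshold).nonzero(as_tuple=True)[0]
```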
In the embodiment of the application, the network parameters of the comprehensive network are updated through the domain networks obtained by collaborative learning based on sample images of different source domains, so that the accuracy of feature extraction by the comprehensive network can be improved: the extracted image features are compact within each class and dispersed between classes. Object re-identification is then performed based on these image features, improving the accuracy of identification.
In the following, an exemplary application of the embodiment of the present application in an actual application scenario will be described, taking the implementation of pedestrian re-identification in a multi-domain network cooperation manner as an example.
Pedestrian re-identification is a key technology in an intelligent video acquisition system, and aims to find out a picture similar to a query picture from a large number of database pictures by measuring the similarity between a given query picture and the database pictures. With the rapid development of acquisition equipment, pedestrian data of the order of tens of millions is generated every day.
However, in the related art, many pedestrian re-recognition algorithms are based on one assumption: the data of both the training set and the test set come from one domain. Due to the problem of domain transfer, this assumption rarely holds in real life, so a trained pedestrian re-recognition model performs well on its original domain but achieves low pedestrian recognition accuracy in practical use. Thus, the pedestrian re-identification model faces a serious performance degradation problem when the training data and the test data come from different domains, as shown in fig. 3. Fig. 3 is a schematic view of an application scenario of the training method of the object recognition network according to the embodiment of the present application. As can be seen from fig. 3, the trained model 31, obtained by training on a data set 301, is used to extract features from the data set 301 and from a data set 302 respectively. By visualizing the extracted features, it can be seen that the data distributions of the two test sets differ greatly in the feature space: the features of the data set 301 are shown as features 303, where the data within each class are compact and the data between classes differ obviously; the features of the data set 302 are shown as features 304, whose distribution is disordered, showing the negative effects of the domain transfer problem.
Based on this, the embodiment of the present application provides a training method for an object recognition network, which is applied to a training process of a pedestrian re-recognition model, and the generalization capability of the obtained pedestrian re-recognition model is improved through the combined action of collaborative learning between the domain networks and the collaborative learning of a comprehensive expert and the domain networks, that is, the obtained model can be well applied to a real scene without much performance degradation. In this way, in the embodiment of the application, the performance of the pedestrian re-identification model on an unknown domain is improved by using the data sets from the multiple domains, and the pedestrian re-identification model trained based on the method can deal with the problem of domain transfer and has good generalization performance.
In the embodiment of the present application, data sets of N source domains are obtained for training the object recognition network, denoted as $\{D_s^k\}_{k=1}^{N}$, and M target domains are used for testing, denoted as $\{D_t^m\}_{m=1}^{M}$, where there is no overlap between the source domains and the target domains, i.e., $D_s \cap D_t = \varnothing$. The kth source domain, which contains $P_k$ images, is expressed as $D_s^k = \{(x_i^k, y_i^k)\}_{i=1}^{P_k}$, where $x_i^k$ is the ith image and $y_i^k$ is its corresponding label from the label space $Y_k$. In the embodiment of the application, the source domains of a Multi-Source Domain Generalization Person Re-Identification network (DG-ReID) do not share a label space, i.e., $Y_j \cap Y_k = \varnothing$ for any $j \neq k$. Overall, the goal of multi-source DG-ReID is to fully utilize the N source domains to train a more general model that achieves better performance on the M target domains.
The embodiment of the application provides a network architecture based on multi-task learning, namely a Multi-Domain Equal baseline (MDE). In the MDE setting, from the perspective of multi-task learning, each source domain has an independent classifier and shares a feature extractor network with the other domains. During training, all domains are treated equally. In some possible implementations, in each iteration, a small batch of data containing B images is sampled from each domain for training, denoted as $\{(x_i, y_i)\}_{i=1}^{B}$. The loss function is shown in equation (1):

$$L_{mde} = L_{id} + \lambda L_{tri} \qquad (1)$$

where λ balances $L_{id}$ and $L_{tri}$, which are the softmax classification loss (e.g., cross-entropy loss) and the triplet loss, as shown in equations (2) and (3):

$$L_{id} = -\frac{1}{B}\sum_{i=1}^{B} \log p\big(y_i \,\big|\, C(F(x_i;\theta);\phi_n)\big) \qquad (2)$$

$$L_{tri} = \frac{1}{B}\sum_{i=1}^{B} \Big[\, m + \big\|F(x_i;\theta) - F(x_i^{+};\theta)\big\| - \big\|F(x_i;\theta) - F(x_i^{-};\theta)\big\| \,\Big]_{+} \qquad (3)$$

where $F(\cdot;\theta)$ and $C(\cdot;\phi_n)$ represent the shared feature extractor and the nth domain classifier, $x_i^{+}$ and $x_i^{-}$ represent the farthest positive sample and the nearest negative sample of $x_i$, and m is the triplet distance margin, fixed to 0.3.
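A hedged sketch of the MDE structure follows: one shared feature extractor and one independent classifier per source domain. The module names and sizes are assumptions.

```python
import torch.nn as nn

class MDEBaseline(nn.Module):
    def __init__(self, extractor, feat_dim, num_classes_per_domain):
        super().__init__()
        self.extractor = extractor  # F(.; theta), shared by all domains
        self.classifiers = nn.ModuleList(
            [nn.Linear(feat_dim, c) for c in num_classes_per_domain]
        )  # C(.; phi_n): an independent classifier for each source domain

    def forward(self, images, domain_idx):
        feats = self.extractor(images)
        logits = self.classifiers[domain_idx](feats)
        return feats, logits  # feed into L_tri and L_id respectively
```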
Fig. 4 shows a network architecture for multi-task learning provided in an embodiment of the present application; fig. 4 is a schematic diagram of an implementation framework of the training method of the object recognition network provided in the embodiment of the present application. The framework takes a plurality of source domains 401 to 4n1 as inputs and includes a plurality of domain networks 41 to 4n in the multi-domain network learning phase, together with a comprehensive expert network 402 that collects the knowledge learned by the domain networks in the knowledge collection phase; wherein:
first, a plurality of source domains 401 to 4n1 are input into the comprehensive network 402, and a plurality of domain networks 41 to 4 n; then, the plurality of domain networks 41 to 4n learn from the respective corresponding domains, for example, the domain network 41 corresponds to the source domain 401, the domain network 42 corresponds to the source domain 421, and the domain network 4n corresponds to the source domain 4n 1; finally, the comprehensive network 402 aggregates the knowledge learned by the domain networks 42 to 4n from the plurality of domain networks 42 to 4 n. In the embodiment of the present application, each domain network is composed of three sub-networks: a feature extractor, a classifier and a projection network. Representing N domain networks as
Figure BDA00032641577700001815
Wherein, thetannnAnd representing model parameters corresponding to the nth domain network. The comprehensive network includes a V parameterized feature extractor, denoted F (.;) V. The feature extractors of the domain networks 41 to 4n and the comprehensive network 402 share the same network architecture.
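For illustration, one domain network's three sub-networks can be sketched as follows; the two-layer projection head is an assumption, since the present application does not fix its architecture here.

```python
import torch.nn as nn

class DomainNetwork(nn.Module):
    def __init__(self, extractor, feat_dim, num_classes):
        super().__init__()
        self.extractor = extractor                           # F(.; theta_n)
        self.classifier = nn.Linear(feat_dim, num_classes)   # C(.; phi_n)
        self.projector = nn.Sequential(                      # P(.; psi_n)
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, x):
        feats = self.extractor(x)
        return feats, self.classifier(feats), self.projector(feats)
```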
In the embodiment of the present application, model-agnostic meta-learning (MAML) is applied to the training of each domain network, so that not only can the generalization capability of the model be further improved, but the interaction between the domain networks can also be enhanced by using the classifiers and projection networks of the other domain networks in the meta-test stage.
The training method of the object recognition network provided by the embodiment of the application can be realized through the following processes:
the input of training the object recognition network comprises: the data of the N source domains $\{D_s^k\}_{k=1}^{N}$; the N domain networks, i.e., a total of N feature extractors $\{F(\cdot;\theta_n)\}_{n=1}^{N}$, N classifiers $\{C(\cdot;\phi_n)\}_{n=1}^{N}$, and N projection networks $\{P(\cdot;\psi_n)\}_{n=1}^{N}$; and 1 comprehensive network $F(\cdot;v)$. The output is: the network parameters of the comprehensive network.
In some embodiments, the training of the kth domain network is taken as an example to illustrate:
in the first step, a batch of samples is sampled from the domain $D_s^k$ as meta-training data, denoted as $(x_k, y_k)$.
Second, a domain $D_s^j$ ($j \neq k$) is randomly selected from the other domains, and a batch of samples is then sampled from it as meta-test data, denoted as $(x_j, y_j)$.
And thirdly, collaborative learning among the domain networks.
As shown in fig. 5, fig. 5 is a schematic diagram of a framework for collaborative learning between networks in different domains according to an embodiment of the present application, and includes the following processes:
first, in the meta-training phase 51, the meta-training data $D_s^k$ 501 from the source domain 500 ($D_s^k$ representing the kth source domain) is input into the feature extractor $F(\cdot;\theta_k)$ 52 of the kth domain network; the features extracted by $F(\cdot;\theta_k)$ 52 are input into the classifier $C(\cdot;\phi_k)$ 53 and the projection network $P(\cdot;\psi_k)$ 54, and the meta-training loss $L_{mtr}$ 503 (which can be understood as the domain loss of the kth domain network) is determined, as shown in equation (4):

$$L_{mtr}(x_k, y_k; \theta_k, \phi_k, \psi_k) = L_{id} + L_{tri} + L_{align} \qquad (4)$$
then, the adaptive parameter $\theta'_k$ is determined, as shown in equation (5):

$$\theta'_k = \theta_k - \alpha \nabla_{\theta_k} L_{mtr} \qquad (5)$$

where α is the step size and can be set to 0.1.
Finally, in the meta-test phase 505, the meta-test data $D_s^j$ 502 ($D_s^j$ representing the jth source domain) is input into the feature extractor $F(\cdot;\theta'_k)$ 55, and the features extracted by the feature extractor 55 are input into the classifier $C(\cdot;\phi_j)$ 56 and the projection network $P(\cdot;\psi_j)$ 57; the adaptive parameter $\theta'_k$ is used as the network parameter of the feature extractor for the jth domain network to determine the meta-test loss function $L_{mte}$ 504, which may be denoted as $L_{mte}(x_j, y_j; \theta'_k, \phi_j, \psi_j)$ (corresponding to the cooperative loss in the above embodiment). The meta-test loss may be determined based on the alignment loss between the comprehensive network and the jth domain network, the triplet loss corresponding to the feature extractor of the jth domain network, and the classification loss of the classifier.
Here, in the meta-test stage, the meta-test loss with respect to $(x_j, y_j)$ is determined under the condition of $\theta'_k$, where $\phi_j, \psi_j$ are the network parameters of the jth domain network. In the meta-test stage, the network parameters of the classifier and of the projection network in the second domain network are not optimized. $L_{mte}$ has the same form as $L_{mtr}$, but with different inputs and network parameters.
In some possible implementations, $\theta_k, \phi_k, \psi_k$ are optimized based on both $L_{mte}$ and $L_{mtr}$, i.e., the gradients of the meta-training loss and of the meta-test loss are jointly used to update the parameters of the kth domain network.
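A minimal sketch of evaluating the meta-test loss with the adapted parameters θ'_k is shown below; torch.func.functional_call (available in PyTorch 2.0 and later) is used here to run the extractor with substituted weights, and the composition of loss_fn (identity + triplet + alignment terms, as for $L_{mtr}$) is an assumption about how the pieces fit together.

```python
from torch.func import functional_call

def meta_test_loss(extractor, adapted_params, classifier_j, projector_j,
                   x_j, y_j, comprehensive_feats, loss_fn):
    """adapted_params: dict mapping parameter names to theta'_k tensors."""
    # Run the shared extractor architecture with the adapted weights theta'_k.
    feats = functional_call(extractor, adapted_params, (x_j,))
    logits = classifier_j(feats)      # C(.; phi_j) of the j-th domain network
    projected = projector_j(feats)    # P(.; psi_j) of the j-th domain network
    # Same form as L_mtr, evaluated on the meta-test batch (x_j, y_j).
    return loss_fn(feats, logits, y_j, projected, comprehensive_feats)
```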
And fourthly, performing collaborative learning between the comprehensive network and the domain networks.
As shown in fig. 6, the cooperative learning between the comprehensive network and the domain network includes the following processes:
first, the sample data $(x_k, y_k)$ 601 sampled in the domain $D_s^k$ is input into the comprehensive network $F(\cdot;v)$ 602 to extract features; it is simultaneously input into the feature extractor $F(\cdot;\theta_k)$ 603 of the kth domain network and transformed via the projection network $P(\cdot;\psi_k)$ 604 to obtain features in the feature space 605, and the first alignment loss $L_{align}$ is determined for the features in the feature space 605, as shown in equation (6):

$$L_{align} = \big\| P(F(x_k;\theta_k);\psi_k) - F(x_k;v) \big\| \qquad (6)$$
in the spatial feature 605, the feature representations of the same shape are from the same sample image, and the feature representations of the same color are extracted from the same network. I | · | | represents the euclidean distance between the two.
The network parameters of the comprehensive network are then updated, as shown in equation (7):

$$v^{(T)} = \epsilon\, v^{(T-1)} + (1-\epsilon)\, \frac{1}{N}\sum_{n=1}^{N} \theta_n^{(T)} \qquad (7)$$

where $v^{(T-1)}$ represents the parameters of the comprehensive network at iteration (T−1) (i.e., the previous iteration of training), $\theta_n^{(T)}$ is the parameter of the feature extractor of the nth domain network in the current iteration T, and ε may be set to 0.999 and adjusted per data set. These parameters are initialized as $v^{(0)} = \frac{1}{N}\sum_{n=1}^{N} \theta_n^{(0)}$.
In the embodiment of the present application, the first step to the fourth step are repeated to complete the training of the N domain networks, and then one iteration is completed. And then, continuing iteration until the object recognition network converges, and finally using the output comprehensive network for actual model deployment.
In the embodiment of the present application, in order to further make the features uniformly distributed in the feature space, a uniform loss is adopted to make the feature distribution between different domains more uniform; the uniform loss is shown in equation (8):

$$L_{unif} = \log \frac{1}{Q}\sum_{q=1}^{Q} e^{-t\,\| f - f_q^{-} \|^2} \qquad (8)$$

where f denotes the feature of a sample, $f_q^{-}$ denotes the feature of the qth negative sample sampled from $D_s^j$, Q denotes the size of the small batch of data of the jth domain (corresponding to the second source domain sample image in the above embodiment) and is equal to B, and t = 0.5 denotes a temperature coefficient. It follows that minimizing $L_{unif}$ is equivalent to maximizing the distance between a sample and the negative samples sampled from $D_s^j$. In some possible implementations, the samples of the same domain are closer to their negative samples than the samples of other domains are; ideally, however, a sample should be far from its negative samples no matter which domain they come from, so $L_{unif}$ is used to keep each sample away from negative samples from any domain.
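Following the reconstruction of equation (8) above, a sketch of the uniform loss is given below; the pairing of each sample's feature with one negative feature, and the sources of those features, are assumptions.

```python
import torch

def uniformity_loss(feats, neg_feats, t=0.5):
    """feats, neg_feats: (Q, D); row q of neg_feats is a negative of row q of feats."""
    sq_dist = (feats - neg_feats).pow(2).sum(dim=1)   # squared Euclidean distances
    # log-mean-exp of -t * distance^2: minimized by pushing negatives far away.
    return torch.log(torch.exp(-t * sq_dist).mean())
```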
Based on this, the overall loss function is shown in equation (9):

$$L = \frac{1}{2}\big( L_{mtr} + L_{mte} \big) + \gamma L_{unif} \qquad (9)$$

where γ = 0.1 is used to balance the influence of $L_{unif}$.
In the embodiment of the application, model training is realized based on multi-domain network cooperation, which improves the generalization of the network. In this way, each domain network is responsible for learning the knowledge of one domain and prevents network overfitting by cooperating with the other domain networks; in addition, the embodiment of the application provides a comprehensive network for collecting and summarizing the knowledge learned by the network of each domain, so that the comprehensive network knows the information of all domains, which can further improve the generalization performance of the network.
An embodiment of the present application provides a training apparatus for an object recognition network, where fig. 7 is a schematic structural composition diagram of the training apparatus for an object recognition network according to the embodiment of the present application, and as shown in fig. 7, the training apparatus 700 for an object recognition network includes:
a first determining module 701, configured to determine, based on a first source domain sample image and a second source domain sample image, a first alignment loss between the comprehensive network and a first domain network and a second alignment loss between the comprehensive network and a second domain network, respectively; wherein the first source domain sample image and the second source domain sample image are not overlapped, the first source domain sample image is used for training the first domain network, and the second source domain sample image is used for training the second domain network;
a second determining module 702, configured to determine a domain loss of the first domain network and a domain loss of the second domain network based on the first source domain sample image and the first alignment loss, and the second source domain sample image and the second alignment loss, respectively;
a third determining module 703, configured to determine a coordination loss between the first domain network and the second domain network based on the second source domain sample image, the second alignment loss, and a domain loss of the first domain network;
a fourth determining module 704, configured to determine a network parameter of the comprehensive network based on the network parameters of the first domain network and the second domain network;
a first adjusting module 705, configured to adjust a network parameter of the object recognition network based on a domain loss of the first domain network, a domain loss of the second domain network, and a cooperative loss between the first domain network and the second domain network, so that a loss of the adjusted overall network output satisfies a convergence condition.
In some embodiments, the first determining module 701 includes:
the first extraction submodule is used for extracting the characteristics of the first source domain sample image based on the comprehensive network to obtain first image characteristics;
the first transformation submodule is used for carrying out feature extraction and projection transformation on the first source domain sample image based on the first domain network to obtain second image features;
a first determination sub-module to determine a first alignment penalty between the comprehensive network and the first domain network based on the second image feature and the first image feature.
In some embodiments, the first determining sub-module includes:
a first determining unit, configured to determine a feature distance between the second image feature and the first image feature in a feature space of the first source domain sample image;
a second determination unit for determining the first alignment loss based on the feature distance and a true feature distance between features in the first source domain sample image.
In some embodiments, the second determining module 702 includes:
the first training submodule is used for training the first domain network by taking the first source domain sample image as meta-training data to obtain the internal loss of the first domain network;
a second determining sub-module, configured to determine a meta-training loss for adjusting a network parameter of the first domain network based on the internal loss and the first alignment loss;
a third determining submodule, configured to determine that the meta-training loss of the first domain network is the domain loss.
In some embodiments, the first training submodule comprises:
a third determining unit, configured to determine an image distance loss between the first source domain sample image and a positive sample image based on the image features extracted by the first domain network and the image features of the positive sample image;
the first classification unit is used for classifying the image features extracted by the first domain network in the first domain network to obtain a first classification result;
a fourth determining unit, configured to determine a classification loss of the image features extracted by the first domain network based on the first classification result and a true value classification label of the first source domain sample image;
a fifth determining unit configured to determine the classification loss and the image distance loss as an internal loss of the first-domain network;
the second determining submodule is further configured to:
and fusing the first alignment loss, the image distance loss and the classification loss to obtain the meta-training loss of the first field network.
In some embodiments, the first domain network comprises a feature extractor, a projection network, and a classifier; the first transformation submodule is further used for: performing feature extraction on the first source domain sample image based on the feature extractor; and performing projection transformation on the image features extracted by the feature extractor based on the projection network to obtain the second image features;
a first classification unit further to: and classifying the image features extracted by the feature extractor based on the classifier to obtain the first classification result.
In some embodiments, the third determining module 703 includes:
a fourth determining submodule, configured to determine an adaptive parameter based on the network parameter of the feature extractor of the first domain network and the domain loss of the first domain network;
a fifth determining sub-module for determining a synergy loss between the first domain network and the second domain network based on the adaptation parameter, the second alignment loss, and the second source domain sample image.
In some embodiments, the fifth determination submodule includes:
a first updating unit, configured to use the adaptive parameter as a network parameter of a feature extractor of the second-domain network to obtain the updated second-domain network;
the first extraction unit is used for inputting the second source domain sample image as meta-test data into a feature extractor of the second domain network for feature extraction to obtain a third image feature;
the second classification unit is used for respectively performing projection conversion and feature classification on the third image features based on the second domain network to obtain fourth image features and a second classification result;
a sixth determining unit, configured to determine a meta-test penalty of the first domain network based on the fourth image feature, the second classification result, and the second alignment penalty;
a seventh determining unit, configured to determine that the meta-test loss is the cooperative loss.
In some embodiments, the fourth determining module 704 includes:
a sixth determining submodule, configured to determine, in an iterative training process of the object recognition network to be trained, a historical network parameter of the comprehensive network in the last iterative training;
a seventh determining submodule, configured to determine a predicted network parameter set of the feature extractor in the first domain network and the second domain network in the next iterative training;
a first updating sub-module for updating the network parameters of the full network in the next iteration training based on the historical network parameters and the set of predicted network parameters.
In some embodiments, the first adjusting module 705 comprises:
an eighth determining submodule for determining a data amount of the first source domain sample image;
a ninth determining sub-module, configured to determine a uniform loss for adjusting distribution of image features based on the data amount, the image features of the second source domain sample image extracted by the comprehensive network, and the second image features extracted by the feature extractor of the first domain network;
the first adjusting submodule is used for adjusting the uniform loss by adopting a preset balance amount to obtain an adjusted uniform loss;
a first fusion submodule, configured to fuse the adjusted uniform loss, the domain loss of the first domain network, the domain loss of the second domain network, and the cooperative loss to obtain a total loss;
and the second adjusting submodule is used for adjusting the network parameters of the object identification network based on the total loss so as to enable the adjusted loss of the overall network output to meet the convergence condition.
An object recognition apparatus is provided in an embodiment of the present application, fig. 8 is a schematic structural component diagram of the object recognition apparatus in the embodiment of the present application, and as shown in fig. 8, the object recognition apparatus 800 includes:
a first obtaining module 801, configured to obtain a first image including an object to be identified;
a first extraction module 802, configured to perform feature extraction on the first image and a second image in a preset image library respectively based on a comprehensive network, so as to obtain an image feature of the first image and an image feature of the second image; the comprehensive network is obtained by training based on the training method of the object recognition network provided in the embodiment;
the first identification module 803 is configured to perform re-identification on the object to be identified in the preset image library based on the image feature of the first image and the image feature of the second image, so as to obtain an identification result.
It should be noted that the above description of the embodiment of the apparatus, similar to the above description of the embodiment of the method, has similar beneficial effects as the embodiment of the method. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that, in the embodiment of the present application, if the training method of the object recognition network is implemented in the form of a software functional module and is sold or used as a standalone product, it may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a terminal, a server, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a hard disk drive, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, an embodiment of the present application further provides a computer program product, where the computer program product includes computer-executable instructions, and after the computer-executable instructions are executed, the steps in the training method for an object recognition network provided in the embodiment of the present application can be implemented.
Accordingly, an embodiment of the present application further provides a computer storage medium, where computer-executable instructions are stored on the computer storage medium, and when the computer-executable instructions are executed by a processor, the steps of the method for training an object recognition network provided in the foregoing embodiment are implemented.
Accordingly, an embodiment of the present application provides a computer device, fig. 9 is a schematic structural diagram of the computer device in the embodiment of the present application, and as shown in fig. 9, the computer device 900 includes: a processor 901, at least one communication bus, a communication interface 902, at least one external communication interface, and a memory 903. Wherein communications interface 902 is configured to enable connectivity communications between the components. The communication interface 902 may include a display screen, and the external communication interface may include a standard wired interface and a wireless interface. The processor 901 is configured to execute an image processing program in a memory to implement the steps of the training method for the object recognition network provided in the foregoing embodiments.
The above descriptions of the embodiments of the training apparatus of the object recognition network, the computer device, and the storage medium are similar to the descriptions of the above method embodiments, and have similar technical descriptions and beneficial effects to the corresponding method embodiments; they are not repeated here for brevity. For technical details not disclosed in the embodiments of the training apparatus, the computer device, and the storage medium of the object recognition network of the present application, reference is made to the description of the method embodiments of the present application for understanding.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply an execution order; the execution order of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of features does not include only those features but may include other features not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, reference to a feature identified by the phrase "comprising an … …" does not exclude the presence of additional similar features in any process, method, article, or apparatus that comprises the feature.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit. Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code. The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. A training method of an object recognition network, wherein the object recognition network comprises a comprehensive network, a first domain network and a second domain network, the method comprising:
respectively determining a first alignment loss between the comprehensive network and a first domain network and a second alignment loss between the comprehensive network and a second domain network based on a first source domain sample image and a second source domain sample image; wherein the first source domain sample image and the second source domain sample image are not overlapped, the first source domain sample image is used for training the first domain network, and the second source domain sample image is used for training the second domain network;
determining a domain loss of the first domain network based on the first source domain sample image and the first alignment loss, and determining a domain loss of the second domain network based on the second source domain sample image and the second alignment loss;
determining a synergy loss between the first domain network and the second domain network based on the second source domain sample image, the second alignment loss, and a domain loss of the first domain network;
determining network parameters of the comprehensive network based on the network parameters of the first domain network and the second domain network;
and adjusting the network parameters of the object recognition network based on the domain loss of the first domain network, the domain loss of the second domain network and the synergy loss between the first domain network and the second domain network, so that the adjusted loss output by the comprehensive network satisfies a convergence condition.
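(Illustrative annotation, not part of the claims: the following PyTorch-style sketch shows one possible arrangement of the three networks recited in claim 1. All class names, attribute names, and dimensions — DomainExpert, ComprehensiveNet, feat_dim, and so on — are assumptions introduced for this example, not terms taken from the patent.)

    # Illustrative sketch only; names and dimensions are assumptions.
    import torch
    import torch.nn as nn

    class DomainExpert(nn.Module):
        # One expert per source domain: feature extractor + projection network + classifier.
        def __init__(self, in_dim=512, feat_dim=128, num_classes=100):
            super().__init__()
            self.extractor = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
            self.projector = nn.Linear(feat_dim, feat_dim)      # projective transformation
            self.classifier = nn.Linear(feat_dim, num_classes)  # identity classifier

    class ComprehensiveNet(nn.Module):
        # Shared network whose parameters are derived from the two domain experts.
        def __init__(self, in_dim=512, feat_dim=128):
            super().__init__()
            self.extractor = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())

    comprehensive = ComprehensiveNet()
    expert_1, expert_2 = DomainExpert(), DomainExpert()

Under this arrangement, each training step would compute the two alignment losses, the two domain losses, and the synergy loss in the order of the claim before a joint parameter update.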
2. The method of claim 1, wherein determining a first alignment loss between the comprehensive network and the first domain network based on a first source domain sample image comprises:
performing feature extraction on the first source domain sample image based on the comprehensive network to obtain a first image feature;
based on the first domain network, performing feature extraction and projective transformation on the first source domain sample image to obtain a second image feature;
determining the first alignment loss between the comprehensive network and the first domain network based on the second image feature and the first image feature.
3. The method of claim 2, wherein determining the first alignment loss between the comprehensive network and the first domain network based on the second image feature and the first image feature comprises:
determining a feature distance between the second image feature and the first image feature in a feature space of the first source domain sample image;
determining the first alignment loss based on the feature distance and a ground-truth feature distance between features in the first source domain sample image.
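(Illustrative annotation, not part of the claims: one way to realize the alignment loss of claim 3, assuming pairwise Euclidean distances in feature space and a smooth-L1 comparison against the ground-truth distances; both choices, and all names, are assumptions made for this sketch.)

    # Hedged sketch; the distance metric and the smooth-L1 comparison are assumptions.
    import torch
    import torch.nn.functional as F

    def alignment_loss(first_feats, second_feats, gt_dist):
        # first_feats:  (B, D) features from the comprehensive network
        # second_feats: (B, D) projected features from the domain network
        # gt_dist:      (B, B) ground-truth pairwise feature distances
        dist = torch.cdist(second_feats, first_feats)  # feature distances in feature space
        return F.smooth_l1_loss(dist, gt_dist)

    first = torch.randn(8, 128)
    second = torch.randn(8, 128)
    gt = torch.cdist(first, first).detach()  # stand-in for the ground-truth distances
    loss = alignment_loss(first, second, gt)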
4. The method of any of claims 1 to 3, wherein determining a domain loss for the first domain network based on the first source domain sample image and the first alignment loss comprises:
taking the first source domain sample image as meta-training data, and training the first domain network to obtain the internal loss of the first domain network;
determining a meta-training loss for adjusting network parameters of the first domain network based on the internal loss and the first alignment loss;
determining the meta-training loss of the first domain network as the domain loss.
5. The method of claim 4, wherein training the first domain network using the first source domain sample image as meta-training data to obtain the internal loss of the first domain network comprises:
determining an image distance loss between the first source domain sample image and a positive sample image based on the image features extracted by the first domain network and the image features of the positive sample image;
in the first domain network, classifying the image features extracted by the first domain network to obtain a first classification result;
determining a classification loss of the image features extracted by the first domain network based on the first classification result and a ground-truth classification label of the first source domain sample image;
determining the classification loss and the image distance loss as the internal loss of the first domain network;
wherein determining the meta-training loss of the first domain network based on the internal loss and the first alignment loss comprises:
fusing the first alignment loss, the image distance loss and the classification loss to obtain the meta-training loss of the first domain network.
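(Illustrative annotation, not part of the claims: a sketch of the loss fusion in claims 4 and 5. The claims only state that the three losses are fused; the weighted sum below, and the function and weight names, are assumptions.)

    # Hedged sketch; the weighted sum is one possible fusion rule.
    import torch
    import torch.nn.functional as F

    def meta_training_loss(anchor_feats, positive_feats, logits, labels,
                           first_align_loss, w_dist=1.0, w_cls=1.0, w_align=1.0):
        dist_loss = F.pairwise_distance(anchor_feats, positive_feats).mean()  # image distance loss
        cls_loss = F.cross_entropy(logits, labels)                            # classification loss
        return w_align * first_align_loss + w_dist * dist_loss + w_cls * cls_loss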
6. The method of any one of claims 2 to 5, wherein the first domain network comprises a feature extractor, a projection network and a classifier; and wherein performing feature extraction and projective transformation on the first source domain sample image to obtain the second image feature comprises:
performing feature extraction on the first source domain sample image based on the feature extractor;
based on the projection network, performing projective transformation on the image features extracted by the feature extractor to obtain the second image feature;
wherein classifying, in the first domain network, the image features extracted by the first domain network to obtain the first classification result comprises: classifying the image features extracted by the feature extractor based on the classifier to obtain the first classification result.
7. The method of any of claims 1 to 6, wherein determining a synergy loss between the first domain network and the second domain network based on the second source domain sample image, the second alignment loss, and a domain loss of the first domain network comprises:
determining an adaptation parameter based on a network parameter of a feature extractor of the first domain network and the domain loss of the first domain network;
determining a synergy loss between the first domain network and the second domain network based on the adaptation parameter, the second alignment loss, and the second source domain sample image.
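(Illustrative annotation, not part of the claims: claim 7 does not specify how the adaptation parameter is derived from the extractor's parameters and the domain loss; the MAML-style single inner gradient step below, and the inner_lr value, are assumptions.)

    # Hedged sketch; the single gradient step and inner_lr are assumptions.
    import torch

    def adaptation_parameters(extractor, domain_loss, inner_lr=0.01):
        params = list(extractor.parameters())
        grads = torch.autograd.grad(domain_loss, params, create_graph=True)
        # adapted copies of the extractor weights, still differentiable
        return [p - inner_lr * g for p, g in zip(params, grads)]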
8. The method of claim 7, wherein determining the synergy loss between the first domain network and the second domain network based on the adaptation parameter, the second alignment loss, and the second source domain sample image comprises:
taking the adaptation parameter as a network parameter of a feature extractor of the second domain network to obtain an updated second domain network;
inputting the second source domain sample image as meta-test data into a feature extractor of the second domain network for feature extraction to obtain a third image feature;
based on the second domain network, respectively performing projective transformation and feature classification on the third image feature to obtain a fourth image feature and a second classification result;
determining a meta-test loss of the first domain network based on the fourth image feature, the second classification result and the second alignment loss;
determining the meta-test loss as the synergy loss.
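(Illustrative annotation, not part of the claims: a sketch of the meta-test pass in claim 8. The in-place parameter copy is a simplification that drops second-order gradients — a fully differentiable meta-update would forward through the adapted weights functionally — and the additive combination of the three terms is likewise an assumption.)

    # Hedged sketch; the loss combination and the in-place copy are assumptions.
    import torch
    import torch.nn.functional as F

    def synergy_loss(expert_2, adapted_params, meta_test_imgs, positive_feats,
                     labels, second_align_loss):
        with torch.no_grad():  # take the adaptation parameters as the extractor's parameters
            for p, new in zip(expert_2.extractor.parameters(), adapted_params):
                p.copy_(new)
        third_feats = expert_2.extractor(meta_test_imgs)   # third image features
        fourth_feats = expert_2.projector(third_feats)     # projective transformation
        logits = expert_2.classifier(third_feats)          # second classification result
        dist_loss = F.pairwise_distance(fourth_feats, positive_feats).mean()
        return F.cross_entropy(logits, labels) + dist_loss + second_align_loss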
9. The method of any of claims 1 to 8, wherein determining the network parameters of the comprehensive network based on the network parameters of the first domain network and the second domain network comprises:
determining, during the iterative training of the object recognition network to be trained, historical network parameters of the comprehensive network from the previous iteration of training;
determining a predicted network parameter set of the feature extractors in the first domain network and the second domain network for the next iteration of training;
updating the network parameters of the comprehensive network in the next iteration of training based on the historical network parameters and the predicted network parameter set.
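(Illustrative annotation, not part of the claims: claim 9 leaves the update rule open; the exponential-moving-average over the mean of the experts' extractor parameters below, and the momentum value, are assumptions.)

    # Hedged sketch; assumes all extractors share the same architecture.
    import torch

    @torch.no_grad()
    def update_comprehensive(comprehensive, experts, momentum=0.9):
        expert_params = [list(e.extractor.parameters()) for e in experts]
        for i, hist in enumerate(comprehensive.extractor.parameters()):
            predicted = torch.stack([ps[i] for ps in expert_params]).mean(dim=0)
            # blend the historical parameters with the predicted parameter set
            hist.mul_(momentum).add_((1.0 - momentum) * predicted)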
10. The method according to any one of claims 1 to 9, wherein adjusting the network parameters of the object recognition network based on the domain loss of the first domain network, the domain loss of the second domain network and the synergy loss between the first domain network and the second domain network, so that the adjusted loss output by the comprehensive network satisfies the convergence condition, comprises:
determining a data volume of the first source domain sample image;
determining a uniform loss for adjusting image feature distribution based on the data volume, the image features of the second source domain sample image extracted by the comprehensive network, and the second image features extracted by the feature extractor of the first domain network;
adjusting the uniform loss by adopting a preset balance amount to obtain an adjusted uniform loss;
fusing the adjusted uniform loss, the domain loss of the first domain network, the domain loss of the second domain network and the synergy loss to obtain a total loss;
and adjusting the network parameters of the object recognition network based on the total loss, so that the adjusted loss output by the comprehensive network satisfies the convergence condition.
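(Illustrative annotation, not part of the claims: the claim does not define the uniform loss in closed form. The log-sum-exp uniformity term over cross-network feature distances, normalized by the data volume, and the simple additive fusion below are assumptions made only for this sketch.)

    # Hedged sketch; the uniformity term and the additive fusion are assumptions.
    import torch

    def total_loss(comp_feats, expert_feats, data_volume,
                   domain_loss_1, domain_loss_2, synergy, balance=0.1):
        d = torch.cdist(comp_feats, expert_feats)                    # cross-network distances
        uniform = torch.logsumexp(-d.flatten(), dim=0) / data_volume # uniform loss
        adjusted_uniform = balance * uniform                         # preset balance amount
        return adjusted_uniform + domain_loss_1 + domain_loss_2 + synergy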
11. An object recognition method, characterized in that the method comprises:
acquiring a first image including an object to be recognized;
respectively extracting the features of the first image and a second image in a preset image library based on a comprehensive network to obtain the image features of the first image and the image features of the second image; wherein the comprehensive network is trained based on the method of any one of claims 1 to 10;
and re-identifying the object to be identified in the preset image library based on the image characteristics of the first image and the image characteristics of the second image to obtain an identification result.
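(Illustrative annotation, not part of the claims: a sketch of the recognition step of claim 11, assuming cosine-similarity ranking of the query feature against the preset image library; the similarity rule and all names are assumptions.)

    # Hedged sketch; cosine similarity is one possible matching rule.
    import torch
    import torch.nn.functional as F

    def re_identify(comprehensive, query_imgs, gallery_imgs, top_k=5):
        q = F.normalize(comprehensive.extractor(query_imgs), dim=1)    # (1, D) query feature
        g = F.normalize(comprehensive.extractor(gallery_imgs), dim=1)  # (N, D) gallery features
        scores = (q @ g.t()).squeeze(0)                                # cosine similarities
        return scores.topk(min(top_k, scores.numel())).indices         # ranked gallery indices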
12. An apparatus for training an object recognition network, the object recognition network comprising a comprehensive network, a first domain network and a second domain network, the apparatus comprising:
a first determining module, configured to determine, based on a first source domain sample image and a second source domain sample image, a first alignment loss between the comprehensive network and a first domain network, and a second alignment loss between the comprehensive network and a second domain network, respectively; wherein the first source domain sample image and the second source domain sample image are not overlapped, the first source domain sample image is used for training the first domain network, and the second source domain sample image is used for training the second domain network;
a second determining module, configured to determine a domain loss of the first domain network based on the first source domain sample image and the first alignment loss, and to determine a domain loss of the second domain network based on the second source domain sample image and the second alignment loss;
a third determining module for determining a synergy loss between the first domain network and the second domain network based on the second source domain sample image, the second alignment loss, and a domain loss of the first domain network;
a fourth determining module, configured to determine a network parameter of the comprehensive network based on the network parameters of the first domain network and the second domain network;
and a first adjusting module, configured to adjust the network parameters of the object recognition network based on the domain loss of the first domain network, the domain loss of the second domain network and the synergy loss between the first domain network and the second domain network, so that the adjusted loss output by the comprehensive network satisfies a convergence condition.
13. An object recognition apparatus, characterized in that the apparatus comprises:
a first acquisition module, configured to acquire a first image comprising an object to be recognized;
a first extraction module, configured to respectively extract features of the first image and of a second image in a preset image library based on a comprehensive network to obtain image features of the first image and image features of the second image; wherein the comprehensive network is trained based on the method of any one of claims 1 to 10;
and a first identification module, configured to re-identify the object to be recognized in the preset image library based on the image features of the first image and the image features of the second image to obtain a recognition result.
14. A computer storage medium having stored thereon computer-executable instructions which, when executed, perform the method steps of any one of claims 1 to 10 or the method steps of claim 11.
15. An electronic device, comprising a memory having computer-executable instructions stored thereon and a processor, wherein the processor, when executing the computer-executable instructions on the memory, is capable of performing the method steps of any one of claims 1 to 10 or the method steps of claim 11.
16. A computer program product comprising computer-executable instructions which, when executed, are capable of implementing the method steps of any one of claims 1 to 10 or the method steps of claim 11.
CN202111081370.9A 2021-09-15 2021-09-15 Object recognition method, network training method and device, equipment and medium Active CN113837256B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111081370.9A CN113837256B (en) 2021-09-15 2021-09-15 Object recognition method, network training method and device, equipment and medium
PCT/CN2022/077443 WO2023040195A1 (en) 2021-09-15 2022-02-23 Object recognition method and apparatus, network training method and apparatus, device, medium, and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111081370.9A CN113837256B (en) 2021-09-15 2021-09-15 Object recognition method, network training method and device, equipment and medium

Publications (2)

Publication Number Publication Date
CN113837256A true CN113837256A (en) 2021-12-24
CN113837256B CN113837256B (en) 2023-04-07

Family

ID=78959464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111081370.9A Active CN113837256B (en) 2021-09-15 2021-09-15 Object recognition method, network training method and device, equipment and medium

Country Status (2)

Country Link
CN (1) CN113837256B (en)
WO (1) WO2023040195A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023040195A1 (en) * 2021-09-15 2023-03-23 上海商汤智能科技有限公司 Object recognition method and apparatus, network training method and apparatus, device, medium, and product

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116340833B (en) * 2023-05-25 2023-10-13 中国人民解放军海军工程大学 Fault diagnosis method based on countermeasure migration network in improved field


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11138469B2 (en) * 2019-01-15 2021-10-05 Naver Corporation Training and using a convolutional neural network for person re-identification
CN111860823A (en) * 2019-04-30 2020-10-30 北京市商汤科技开发有限公司 Neural network training method, neural network training device, neural network image processing method, neural network image processing device, neural network image processing equipment and storage medium
CN111476168B (en) * 2020-04-08 2022-06-21 山东师范大学 Cross-domain pedestrian re-identification method and system based on three stages
CN112396119A (en) * 2020-11-25 2021-02-23 上海商汤智能科技有限公司 Image processing method and device, electronic equipment and storage medium
CN113837256B (en) * 2021-09-15 2023-04-07 深圳市商汤科技有限公司 Object recognition method, network training method and device, equipment and medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190286986A1 (en) * 2018-01-11 2019-09-19 Huawei Technologies Co., Ltd. Machine Learning Model Training Method And Apparatus
WO2020186914A1 (en) * 2019-03-20 2020-09-24 北京沃东天骏信息技术有限公司 Person re-identification method and apparatus, and storage medium
CN111126360A (en) * 2019-11-15 2020-05-08 西安电子科技大学 Cross-domain pedestrian re-identification method based on unsupervised combined multi-loss model
CN112215280A (en) * 2020-10-12 2021-01-12 西安交通大学 Small sample image classification method based on meta-backbone network
CN112861995A (en) * 2021-03-15 2021-05-28 中山大学 Unsupervised few-sample image classification method and system based on model independent meta learning and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHIJIE YU ET AL.: "Multiple Domain Experts Collaborative Learning: Multi-Source Domain Generalization for Person Re-identification", arXiv *
CHEN CHEN; WANG YALI; QIAO YU: "Research on task-related deep learning methods for few-shot image classification" *


Also Published As

Publication number Publication date
CN113837256B (en) 2023-04-07
WO2023040195A1 (en) 2023-03-23

Similar Documents

Publication Publication Date Title
Zhu et al. Multi-attention Meta Learning for Few-shot Fine-grained Image Recognition.
CN108182394B (en) Convolutional neural network training method, face recognition method and face recognition device
Dioşan et al. Improving classification performance of support vector machine by genetically optimising kernel shape and hyper-parameters
CN113837256B (en) Object recognition method, network training method and device, equipment and medium
CN112541458A (en) Domain-adaptive face recognition method, system and device based on meta-learning
De Mathelin et al. Adversarial weighting for domain adaptation in regression
Xia et al. Metalearning-based alternating minimization algorithm for nonconvex optimization
Niu et al. Feature-based distant domain transfer learning
CN114548428B (en) Intelligent attack detection method and device of federated learning model based on instance reconstruction
Kong et al. Discriminative relational representation learning for RGB-D action recognition
Rezatofighi et al. Joint learning of set cardinality and state distribution
Brust et al. Active and incremental learning with weak supervision
Liu et al. Learning explicit shape and motion evolution maps for skeleton-based human action recognition
Chen et al. Sample balancing for deep learning-based visual recognition
Połap Hybrid image analysis model for hashtag recommendation through the use of deep learning methods
CN113920382A (en) Cross-domain image classification method based on class consistency structured learning and related device
CN112560823B (en) Adaptive variance and weight face age estimation method based on distribution learning
van Laarhoven et al. Unsupervised domain adaptation with random walks on target labelings
Savchenko et al. Fast search of face recognition model for a mobile device based on neural architecture comparator
Luo et al. Attention regularized Laplace graph for domain adaptation
Dong et al. A supervised dictionary learning and discriminative weighting model for action recognition
Zhang et al. Vehicle verification based on deep siamese network with similarity metric
Wang et al. TIToK: A solution for bi-imbalanced unsupervised domain adaptation
Wang et al. End-to-end training of CNN-CRF via differentiable dual-decomposition
Deng et al. Adversarial multi-label prediction for spoken and visual signal tagging

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40063418

Country of ref document: HK

GR01 Patent grant