CN112329826A

CN112329826A - Training method of image recognition model, image recognition method and device

Info

Publication number: CN112329826A
Application number: CN202011150035.5A
Authority: CN
Inventors: 吴元明; 袁利娟; 孙茂; 李冰; 万军; 刘梓谕
Original assignee: Air Force Medical University of PLA
Current assignee: Air Force Medical University of PLA
Priority date: 2020-10-24
Filing date: 2020-10-24
Publication date: 2021-02-05

Abstract

The application discloses a training method of an image recognition model, an image recognition method, a device, equipment and a storage medium, and belongs to the field of image processing. In this embodiment of the present application, the server may train the image recognition model by using three images, including a reference sample image, a positive sample image, and a negative sample image, where the three images all include a sample object, the reference sample image and the positive sample image correspond to the same sample object, and the reference sample image and the negative sample image correspond to different sample objects. In the training process, the server is dedicated to identifying that the similarity between the reference sample image and the positive sample image is as high as possible and the similarity between the reference sample image and the negative sample image is as low as possible through the model, so that whether the shot image and the stored image contain the same face or not can be accurately determined in the subsequent use process.

Description

Training method of image recognition model, image recognition method and device

Technical Field

The present application relates to the field of image processing, and in particular, to a training method for an image recognition model, an image recognition method, an image recognition device, a server, and a storage medium.

Background

With the development of computer technology, the artificial intelligence technology is developed vigorously, and the image recognition technology is used as a branch of the artificial intelligence technology, and the application range of the image recognition technology is wider and wider.

The recognition accuracy of the image recognition model in the related art is limited by the number of sample images, and the number of sample images is often limited, so that the training effect of the image recognition model is poor.

Disclosure of Invention

The embodiment of the application provides a training method of an image recognition model, an image recognition method, an image recognition device, a server and a storage medium, and can improve the training effect of the image recognition model. The technical scheme is as follows:

in one aspect, a method for training an image recognition model is provided, the method including:

in an iteration process, acquiring a reference sample image, a positive sample image and a negative sample image, wherein the reference sample image and the positive sample image correspond to a first sample object, the negative sample image corresponds to a second sample object, and the first sample object and the second sample object are different sample objects;

inputting the reference sample image, the positive sample image and the negative sample image into an image recognition model, and determining a first similarity between the reference sample image and the positive sample image and a second similarity between the reference sample image and the negative sample image through the image recognition model;

and in response to the difference between the first similarity and the second similarity meeting a target condition, taking the image recognition model as a trained image recognition model.

In one possible embodiment, the determining, by the image recognition model, a first similarity between the reference sample image and the positive sample image, and a second similarity between the reference sample image and the negative sample image includes:

extracting reference characteristic information of the reference sample image, positive sample characteristic information of the positive sample image and negative sample characteristic information of the negative sample image through the image recognition model;

obtaining the first similarity according to the reference characteristic information and the positive sample characteristic information;

and obtaining the second similarity according to the reference characteristic information and the negative sample characteristic information.

In one possible embodiment, before the acquiring the reference sample image, the positive sample image and the negative sample image, the method further includes:

performing feature extraction on a plurality of first sample images in the first sample image set to obtain a plurality of first sample image features;

determining similarity between a plurality of first sample images in the first sample image set according to the plurality of first sample image features;

determining two first sample images with the lowest similarity as a first image and a second image;

performing feature extraction on a plurality of second sample images in the second sample image set to obtain a plurality of second sample image features;

determining a third image and a fourth image which have the lowest similarity with the first image and the second image from the second sample image set according to the plurality of second sample image features and the first sample image features of the first image and the second image;

determining the first image as the reference image, the second image as the positive sample image, and the third image as the negative sample image in response to a first similarity between the first image and the third image being less than a second similarity between the second image and the fourth image.

In a possible implementation, after determining, by the image recognition model, a first similarity between the reference sample image and the positive sample image, and a second similarity between the reference sample image and the negative sample image, the method further includes:

and adjusting model parameters of the image recognition model in response to the difference between the first similarity and the second similarity not meeting the target condition.

In a possible embodiment, the adjusting the model parameters of the image recognition model includes:

determining a loss function according to a difference value between the first similarity and the second similarity;

determining a resulting gradient of the image recognition model according to the loss function;

and adjusting the model parameters of the image recognition model according to a gradient descent method.

In a possible implementation, before the image recognition model is used as a trained image recognition model, the method further includes:

acquiring a first positive sample image, a second positive sample image, a first negative sample image and a second negative sample image, wherein the first positive sample image and the second positive sample image correspond to the same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects;

inputting the first positive sample image, the second positive sample image, the first negative sample image, and the second negative sample image into an image recognition model, determining a third similarity between the first positive sample image and the second positive sample image, and a fourth similarity between the first negative sample image and the second negative sample image by the image recognition model;

and adjusting the model parameters of the image recognition model according to the difference information between the third similarity and the fourth similarity.

In one aspect, an image recognition method is provided, and the method includes:

acquiring a first image to be recognized and a second image to be recognized, wherein the first image to be recognized comprises a first object to be recognized, and the second image to be recognized comprises a second object to be recognized;

inputting the first image to be recognized and the second image to be recognized into an image recognition model, and extracting a first image feature of the first image to be recognized and a second image feature of the second image to be recognized through the image recognition model;

the image recognition model is obtained by training based on a plurality of reference sample images, positive sample images and negative sample images, wherein the reference sample images and the positive sample images correspond to first sample objects, the negative sample images correspond to second sample objects, and the first sample objects and the second sample objects are different sample objects;

and outputting the similarity between the first object to be recognized and the second object to be recognized according to the first image feature and the first image feature.

In one aspect, an apparatus for training an image recognition model is provided, the apparatus comprising:

a sample image obtaining module, configured to obtain a reference sample image, a positive sample image, and a negative sample image in an iteration process, where the reference sample image and the positive sample image correspond to a first sample object, the negative sample image corresponds to a second sample object, and the first sample object and the second sample object are different sample objects;

a first input module, configured to input the reference sample image, the positive sample image, and the negative sample image into an image recognition model, and determine a first similarity between the reference sample image and the positive sample image and a second similarity between the reference sample image and the negative sample image through the image recognition model;

and the training module is used for responding to the fact that the difference value between the first similarity and the second similarity meets a target condition, and taking the image recognition model as a trained image recognition model.

In a possible implementation manner, the first input module is configured to extract, through the image recognition model, reference feature information of the reference sample image, positive sample feature information of the positive sample image, and negative sample feature information of the negative sample image; obtaining the first similarity according to the reference characteristic information and the positive sample characteristic information; and obtaining the second similarity according to the reference characteristic information and the negative sample characteristic information.

In a possible embodiment, the apparatus further comprises:

the characteristic extraction module is used for carrying out characteristic extraction on a plurality of first sample images in the first sample image set to obtain a plurality of first sample image characteristics;

a similarity determining module, configured to determine, according to the plurality of first sample image features, a similarity between a plurality of first sample images in the first sample image set;

the image determining module is used for determining two first sample images with the lowest similarity as a first image and a second image;

the similarity determining module is further configured to perform feature extraction on a plurality of second sample images in the second sample image set to obtain a plurality of second sample image features;

the image determining module is further configured to determine, according to the plurality of second sample image features and first sample image features of the first image and the second image, a third image and a fourth image which are the lowest in similarity to the first image and the second image from the second sample image set, respectively;

the image determination module is further configured to determine the first image as the reference image, the second image as the positive sample image, and the third image as the negative sample image in response to a first similarity between the first image and the third image being less than a second similarity between the second image and the fourth image.

In a possible embodiment, the apparatus further comprises:

and the parameter adjusting module is used for adjusting the model parameters of the image recognition model in response to the fact that the difference value between the first similarity and the second similarity does not accord with the target condition.

In a possible implementation manner, the parameter adjusting module is configured to determine a loss function according to a difference between the first similarity and the second similarity; determining a resulting gradient of the image recognition model according to the loss function; and adjusting the model parameters of the image recognition model according to a gradient descent method.

In a possible implementation manner, the sample image obtaining module is further configured to obtain a first positive sample image, a second positive sample image, a first negative sample image, and a second negative sample image, where the first positive sample image and the second positive sample image correspond to the same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects;

the first input module is further configured to input the first positive sample image, the second positive sample image, the first negative sample image, and the second negative sample image into an image recognition model, and determine a third similarity between the first positive sample image and the second positive sample image and a fourth similarity between the first negative sample image and the second negative sample image through the image recognition model;

the training module is further configured to adjust a model parameter of the image recognition model according to difference information between the third similarity and the fourth similarity.

In one aspect, an image recognition apparatus is provided, the apparatus including:

the image acquisition module is used for acquiring a first image to be recognized and a second image to be recognized, wherein the first image to be recognized comprises a first object to be recognized, and the second image to be recognized comprises a second object to be recognized;

the second input module is used for inputting the first image to be recognized and the second image to be recognized into an image recognition model, and extracting a first image feature of the first image to be recognized and a second image feature of the second image to be recognized through the image recognition model;

the image recognition model is obtained by training based on a plurality of reference sample images, a plurality of positive sample images and a negative sample image in a multi-order, the reference sample images and the positive sample images correspond to a first sample object, the negative sample images correspond to a second sample object, and the first sample object and the second sample object are different sample objects;

and the output module is used for outputting the similarity between the first object to be recognized and the second object to be recognized according to the first image characteristic and the first image characteristic.

In one aspect, a computer device is provided that includes one or more processors and one or more memories having at least one instruction stored therein, the instruction being loaded and executed by the one or more processors to implement operations performed by a training method or an image recognition method of the image recognition model.

In one aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the instruction is loaded and executed by a processor to implement the operation performed by the training method or the image recognition method of the image recognition model.

In this embodiment of the present application, the server may train the image recognition model by using three images, including a reference sample image, a positive sample image, and a negative sample image, where the three images all include a sample object, the reference sample image and the positive sample image correspond to the same sample object, and the reference sample image and the negative sample image correspond to different sample objects. In the training process, the server is dedicated to identifying that the similarity between the reference sample image and the positive sample image is as high as possible and the similarity between the reference sample image and the negative sample image is as low as possible through the model, so that whether the shot image and the stored image contain the same face or not can be accurately determined in the subsequent use process.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a schematic structural diagram of an image recognition model provided in an embodiment of the present application;

FIG. 2 is a flowchart of a training method of an image recognition model according to an embodiment of the present disclosure;

FIG. 3 is a flowchart of a training method of an image recognition model according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram illustrating a training method of an image recognition model according to an embodiment of the present disclosure;

fig. 5 is a schematic diagram of an acquisition method of a cross-batch difficult-case triple image according to an embodiment of the present application;

FIG. 6 is a schematic diagram illustrating a training method of an image recognition model according to an embodiment of the present disclosure;

FIG. 7 is a schematic diagram illustrating a network update method according to an embodiment of the present disclosure;

FIG. 8 is a schematic diagram illustrating a training method of an image recognition model according to an embodiment of the present disclosure;

FIG. 9 is a schematic diagram of a training apparatus structure of an image recognition model according to an embodiment of the present disclosure;

fig. 10 is a schematic diagram of a structure of an image recognition apparatus according to an embodiment of the present application;

fig. 11 is a schematic structural diagram of a terminal according to an embodiment of the present application;

fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution.

The term "at least one" in this application means one or more, "a plurality" means two or more, for example, a plurality of reference face images means two or more reference face images.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.

Machine Learning (ML) is a multi-domain cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge submodel to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.

Monotonic functions include monotonically increasing functions in which the dependent variable increases with increasing independent variable, such as the presence of a monotonically increasing function F_i() In the presence of an argument a < b, then F_i(a)＜F_i(b) (ii) a For example, there is a monotone decreasing function F_d() In the presence of an argument a < b, then F_i(a)＞F_i(b)。

Cloud computing (cloud computing) is a computing model that distributes computing tasks over a pool of resources formed by a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user as being infinitely expandable and available at any time, available on demand, expandable at any time, and paid for on-demand.

As a basic capability provider of cloud computing, a cloud computing resource pool (called as an ifas (Infrastructure as a Service) platform for short is established, and multiple types of virtual resources are deployed in the resource pool and are selectively used by external clients.

According to the logic function division, a PaaS (Platform as a Service) layer can be deployed on an IaaS (Infrastructure as a Service) layer, a SaaS (Software as a Service) layer is deployed on the PaaS layer, and the SaaS can be directly deployed on the IaaS. PaaS is a platform on which software runs, such as a database, a web container, etc. SaaS is a variety of business software, such as web portal, sms, and mass texting. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.

In the embodiment of the present application, the server or the terminal may be used as an execution subject to implement the technical solution provided in the embodiment of the present application, or the technical method provided in the present application may be implemented through interaction between the terminal and the server, which is not limited in the embodiment of the present application. The following description will take the execution subject as a server as an example:

in the embodiment of the present application, the image recognition model may be used to recognize whether the objects included in the two images are the same object. For example, it is identified in the two images containing vehicles whether the two vehicles are the same vehicle or whether the two fruits are the same type of fruit. Of course, the image recognition model can also compare whether the faces in the two images are the same face, for example, compare whether the faces in the image shot in real time and the identification card stored in the database are the same face.

In order to more clearly explain the technical solution provided by the present application, a training method of an image recognition model provided by the present application is first introduced:

it should be noted that, in training the image recognition model provided in the present application, a pre-training mode may be used to initialize parameters of the model, for example, a common face recognition data set (e.g., MS-Celeb-1M) is used to train the model, so that a network obtains good initialization. Therefore, the image recognition model can be well trained, and a good image recognition effect is achieved.

The pre-training steps are as follows:

step 1, the server carries out preprocessing on all pre-training images, including the steps of detecting whether human faces are included, aligning, cutting and the like. The server divides the preprocessed images into a plurality of batches, and one batch of images is adopted in each training process.

And 2, randomly selecting n images by the server according to batches, and inputting the n images into a network for training, wherein n is a positive integer.

And 3, forward transmission is carried out by the server to obtain network output, and L2 normalization is carried out on the characteristics.

And 4, the server calculates an AM-Softmax loss function based on the formula (1):

wherein n represents the number of pre-training images; y is_iA true attribute label representing the ith training image; j is other than y_iA true attribute tag; s is a scaling factor set to 30; cos θ_jRepresenting the angle cosine values of the feature vectors and the weight vectors; m is a cosine edge term.

Step 5, the server judges whether the training loss is converged, if so, the training is terminated to obtain a pre-training model; otherwise, continuing the next operation.

And 6, calculating the Gradient of the network parameters by the server, and updating the network parameters by adopting a Stochastic Gradient Descent (SGD) algorithm.

And 7, the server returns to the step (2).

Of course, the server can also acquire other open-source image recognition models from the network as pre-training models for training the image recognition models provided by the application, so that the calculation amount of the server can be reduced, and the model training efficiency can be improved.

In order to more clearly describe the training method of the image recognition model provided in the present application, a structure of the image recognition model provided in the embodiment of the present application is first described, and referring to fig. 1, the image recognition model may include: an input layer 101, a feature extraction layer 102, and an output layer 103.

Wherein the input layer 101 is used to input images into the model. The feature extraction layer 102 is used to extract feature information of the image, which may include, but is not limited to, geometric feature information and size feature information of an object in the image. The output layer 103 is configured to process the image feature information to obtain difference information for model training. Three or four input layers 101 may exist in the image recognition model, and if three input layers 101 exist in the image recognition model, the three input layers 101 are respectively used for inputting a reference sample image, a positive sample image and a negative sample image, that is, a triplet image. If there are four input layers 101 in the image recognition model, the four input layers 101 are respectively used for inputting a first positive sample image, a second positive sample image, a first negative sample image, and a second negative sample image, that is, a quadruple image. Correspondingly, a feature extraction layer 102 may be connected behind each input layer 101, and is respectively used for extracting reference feature information of a reference sample image, positive sample feature information of a positive sample image, and negative sample feature information of a negative sample, or first positive sample feature information of a first positive sample image, second positive sample feature information of a second positive sample image, first negative sample feature information of a first negative sample image, and second negative sample feature information of a second negative sample image, and parameters between the feature extraction layers may be shared. The number of output layers 103 may be two, where one output layer 1031 is used for outputting a first similarity between the positive sample image and the reference sample image, or for outputting a third similarity between the first positive sample image and the second positive sample image, and one output layer 1032 is used for outputting a second similarity between the reference sample image and the negative sample image, or for outputting a fourth similarity between the first negative sample image and the second negative sample image.

Of course, the structure of the image recognition model is shown only for exemplary description, and in other possible embodiments, an image recognition model with other structures may exist.

Fig. 2 is a flowchart of a training method of an image recognition model provided in an embodiment of the present application, and referring to fig. 2, the method includes:

201. in one iteration process, the server obtains a reference sample image, a positive sample image and a negative sample image, wherein the reference sample image and the positive sample image correspond to a first sample object, the negative sample image corresponds to a second sample object, and the first sample object and the second sample object are different sample objects.

202. The server inputs the reference sample image, the positive sample image and the negative sample image into an image recognition model, and determines a first similarity between the reference sample image and the positive sample image and a second similarity between the reference sample image and the negative sample image through the image recognition model.

203. And in response to the difference between the first similarity and the second similarity meeting the target condition, the server takes the image recognition model as a trained image recognition model.

In one possible embodiment, determining a first similarity between the reference sample image and the positive sample image and a second similarity between the reference sample image and the negative sample image by the image recognition model comprises:

and extracting reference characteristic information of the reference sample image, positive sample characteristic information of the positive sample image and negative sample characteristic information of the negative sample image through the image recognition model.

And obtaining a first similarity according to the reference characteristic information and the positive sample characteristic information.

And obtaining a second similarity according to the reference characteristic information and the negative sample characteristic information.

In one possible embodiment, before acquiring the reference sample image, the positive sample image, and the negative sample image, the method further includes:

and performing feature extraction on the plurality of first sample images in the first sample image set to obtain a plurality of first sample image features.

Determining similarity between the plurality of first sample images in the first sample image set according to the plurality of first sample image features.

And determining two first sample images with the lowest similarity as the first image and the second image.

And performing feature extraction on a plurality of second sample images in the second sample image set to obtain a plurality of second sample image features.

And determining a third image and a fourth image which have the lowest similarity with the first image and the second image from the second sample image set according to the plurality of second sample image characteristics and the first sample image characteristics of the first image and the second image.

In response to a first similarity between the first image and the third image being less than a second similarity between the second image and the fourth image, the first image is determined to be a reference image, the second image is determined to be a positive sample image, and the third image is determined to be a negative sample image.

In one possible embodiment, after determining a first similarity between the reference sample image and the positive sample image and a second similarity between the reference sample image and the negative sample image by the image recognition model, the method further includes:

and adjusting the model parameters of the image recognition model in response to the difference between the first similarity and the second similarity not meeting the target condition.

In one possible embodiment, adjusting the model parameters of the image recognition model comprises:

determining a loss function according to a difference between the first similarity and the second similarity.

From the loss function, the resulting gradient of the image recognition model is determined.

In one possible embodiment, before the image recognition model is used as the trained image recognition model, the method further includes:

the method comprises the steps of obtaining a first positive sample image, a second positive sample image, a first negative sample image and a second negative sample image, wherein the first positive sample image and the second positive sample image correspond to the same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects.

And inputting the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image into an image recognition model, and determining a third similarity between the first positive sample image and the second positive sample image and a fourth similarity between the first negative sample image and the second negative sample image through the image recognition model.

Since the training of the image recognition model may include a plurality of iterative processes, the following steps 301-304 are described by taking an iterative process as an example. Fig. 3 is a flowchart of a training method of an image recognition model according to an embodiment of the present application, and referring to fig. 3, the method includes:

301. in one iteration process, the server obtains a reference sample image, a positive sample image and a negative sample image, wherein the reference sample image and the positive sample image correspond to a first sample object, the negative sample image corresponds to a second sample object, and the first sample object and the second sample object are different sample objects.

The sample object may be determined according to the purpose of the image recognition model, for example, if the image recognition model is used to recognize a human face in an image, the sample object may be a human face; if the image recognition model is used for recognizing cell nuclei in the image, the sample object can be the cell nuclei; if the image recognition model is used to identify a vehicle in the image, then the sample object may be a vehicle. The embodiment of the present application does not limit this. The following description takes a sample object as a face as an example:

in one possible implementation, a server may obtain a first set of images and a second set of images from a network, where the images in the first set of images correspond to different faces than the images in the second set of images. The server can divide the images in the first image set and the second image set into a plurality of batches, and one batch of images is adopted for training in each training process. The server may obtain the reference sample image and the positive sample image from the first set of images and the negative sample image from the second set of images. The server can label the reference sample image, the positive sample image and the negative sample image to indicate the corresponding human faces.

After the server acquires the reference sample image, the positive sample image and the negative sample image, the reference sample image, the positive sample image and the negative sample image can be cut to obtain sample images with the same size. Technicians can screen the cut sample images and remove the sample images without human faces. The image recognition model is trained based on the sample images with the same size, so that all numerical values in model parameters of the image recognition model can be obtained through massive training, and the accuracy of image recognition of the image recognition model can be improved.

In addition, in order to further improve the recognition capability of the image recognition model, a technician may manually screen the sample images in the two image sets to determine that the sample images in each of the two image sets correspond to the same face image. In this implementation, the server may obtain a more accurate recognition effect by using the model trained by the sample image.

Of course, the server may also acquire a plurality of images from the network, perform image recognition on the plurality of images, and discard one image if the image does not have a human face. The server may perform key point detection on the image including the face to obtain a position of the face in the image, and perform cropping on the image according to the position of the face, for example, the image is cropped to a predetermined size of 120 × 120. The server can classify the images according to the faces of the images to generate at least two image sets, wherein the images in each image set correspond to the same face. The server may determine a first set of images and a second set of images from at least two sets of images. The server may obtain the reference sample image and the positive sample image from the first set of images and the negative sample image from the second set of images.

The following describes a method for acquiring reference sample images, positive sample images and negative sample images from a first image set and a second image set by a server:

the reference sample image, the positive sample image and the negative sample image may be referred to as a triplet, wherein the reference sample image is also referred to as an anchor sample, the anchor sample and the positive sample are from the same image set, that is, corresponding to the same face, and the anchor sample and the negative sample are from different identities, that is, corresponding to different faces. The positive samples and the anchor samples are from the same image set, namely, the positive samples and the anchor samples are corresponding to the same face, and the positive samples and the anchor samples are from different identities, namely, the positive samples and the anchor samples are corresponding to different faces, and the positive samples and the anchor samples are paired to form positive sample pairs. The negative sample, the positive sample and the anchor sample do not belong to the same identity, and are matched with the anchor sample to form a negative sample pair.

In order to train an image recognition model more comprehensively and obtain an image recognition model with a better image recognition effect, the technical scheme provided by the application adopts a Batch hard mining algorithm (BHM) when acquiring a reference sample image, a positive sample image and a negative sample image, and in the sample image acquired by the BHM algorithm, the similarity between the reference sample image and the positive sample image is relatively low, and the similarity between the reference sample image and the negative sample image is relatively high, so that the difficulty of training the image recognition model is improved, and the image recognition effect of the image recognition model is finally improved.

For example, the server performs feature extraction on a plurality of first sample images in the first sample image set to obtain a plurality of first sample image features. The server determines the similarity between the plurality of first sample images in the first sample image set according to the characteristics of the plurality of first sample images. The server determines two first sample images with the lowest similarity as a first image and a second image. And the server performs feature extraction on the plurality of second sample images in the second sample image set to obtain a plurality of second sample image features. And the server determines a third image and a fourth image which have the lowest similarity with the first image and the second image from the second sample image set respectively according to the plurality of second sample image characteristics and the first sample image characteristics of the first image and the second image. In response to a first similarity between the first image and the third image being less than a second similarity between the second image and the fourth image, the server determines the first image as a reference image, the second image as a positive sample image, and the third image as a negative sample image.

In addition to employing a Batch Hard Mining (BHM), the present application also provides a cross-Batch Hard Mining algorithm to obtain reference sample images, positive sample images, and negative sample images. The method comprises the following steps:

since the training is performed in units of batches during the training process, it is considered that although the network parameters change along with the training process, the iteration characteristics of adjacent batches do not change too much. Therefore, the reference images of past batches can be used as important references when selecting the sample images. In other words, the selection of sample images is not limited to the current lot, but extends to the past M lots. Compared with Batch hard case mining (BHM), the selection space of the algorithm sample is larger, and correspondingly, the hard case can be better acquired for training. The principle is shown in fig. 5, which mainly comprises the following steps: and acquiring a sample image set and a training difficult example triple to obtain a cross-batch triple.

1. The server determines the distance between any two samples in the current batch to form a distance matrix D^currentWherein D_ij ^currentRepresenting the distance between the ith and jth samples. The server bases on the distance matrix and the label of the batch sample (two samples with the same label can be regarded as a positive sample image pair which is a reference sampleThis image and one positive sample image), the most difficult positive sample image pair is selected, i.e. the most distant positive sample image pair P₁ ^tri，P₂ ^tri。

2. Calculating the selected difficult positive sample image pair P₁ ^triAnd P₁ ^triAnd the previous batch sample Q_XMatrix of distance between D^crossAnd according to the sample label Q_XThe queue picks the corresponding most difficult negative sample N₁ ^tri，N₂ ^tri(two samples with different labels can be treated as a negative sample image pair). At this time P is compared₁ ^triAnd N₁ ^tri、P₂ ^triAnd N₂ ^triThe distance between them, only the most difficult, i.e. the most distant negative examples remain, and is marked N^triAnd is with N^triThe corresponding positive sample image is used as the reference sample image.

302. The server inputs the reference sample image, the positive sample image and the negative sample image into an image recognition model, and determines a first similarity between the reference sample image and the positive sample image and a second similarity between the reference sample image and the negative sample image through the image recognition model.

In one possible implementation, the server extracts the reference feature information of the reference sample image, the positive sample feature information of the positive sample image, and the negative sample feature information of the negative sample image through the image recognition model. And the server obtains the first similarity according to the reference characteristic information and the positive sample characteristic information. And the server obtains a second similarity according to the reference characteristic information and the negative sample characteristic information.

For example, the server may input the reference sample image, the positive sample image, and the amplified negative sample image into the image recognition model through the input layer 101 of the image recognition model, and perform convolution processing on the reference sample image, the positive sample image, and the negative sample image through the feature extraction layer 102 of the image recognition model to obtain reference feature information of the reference sample image, positive sample feature information of the positive sample image, and negative sample feature information of the negative sample image. The server may input the reference feature information of the reference sample image, the positive sample feature information of the positive sample image, and the negative sample feature information of the negative sample image into the output layer 1031 of the image recognition model, and obtain the first size difference information and the second size difference information through the output layer 1031.

For example, the server may perform convolution processing on the reference sample image, the positive sample image, and the negative sample image through the feature extraction layer 102 of the image recognition model to obtain a 128-dimensional reference feature vector, a 128-dimensional positive sample feature vector, and a 128-dimensional negative sample feature vector, respectively. The server may determine a first similarity from the reference feature vector and the positive sample feature vector, and determine a second similarity from the reference feature vector and the negative sample feature vector, i.e., determine a cosine similarity between the vectors.

It should be noted that, after the server performs step 302, it may determine whether the difference between the first similarity and the second similarity meets the target condition. In response to the difference not meeting the target condition, the server may perform step 303; in response to the difference meeting the target condition, the server may perform step 304.

303. In response to that the difference between the similarity and the second similarity does not meet the target condition, the server retrieves the reference sample image, the positive sample image, and the negative sample image based on adjusting the model parameters of the image recognition model according to the difference, and performs

steps

301 and 302.

In one possible embodiment, the server determines the loss function based on a difference between the first similarity and the second similarity. The server determines a resulting gradient of the image recognition model based on the loss function. And the server adjusts the model parameters of the image recognition model according to a gradient descent method.

For example, the server may construct a first loss function according to the first similarity and the second similarity, and adjust the model parameters of the image recognition model through the first loss function. Since the reference sample image and the positive sample image correspond to the same face, that is, the features of the face in the positive sample image and the reference sample image are close to each other, the negative sample image and the reference sample image or the positive sample image of the input image recognition model correspond to different faces. Therefore, the purpose of the server training the image recognition model is to make the first similarity as large as possible, and the second similarity as small as possible, that is, the difference between the first similarity and the second similarity is as large as possible. In this implementation manner, the server may adjust the model parameters of the image recognition model through the first similarity and the second similarity, so as to improve the recognition capability of the image recognition model for the size of the sample object in the image.

For example, the server may construct the first loss function by the first similarity and the second similarity through formula (2).

Wherein L is_triIs a triplet loss function, i.e. the first loss function, x_aFor reference sample image, x_pFor positive sample image, x_nAs a negative sample image, y_a、y_pAnd y_nFor the reference sample image, the positive sample image, the negative sample image corresponding to a human face, d (r)₁，r₂) Representing a metric function to measure the vector r₁，r₂And d (r) is greater than d₁，r₂) Is set to r₁And r₂Distance between m₁Representing edge hyper-parameters representing inter-class differences within control classes, where [ z]₊Max (z,0), i.e. the larger of z and 0 is chosen, if z is greater than 0 then z is taken, if z is less than 0 then 0 is taken. Under such a setting, the triplet penalty function expands the inter-class gap while reducing the intra-class gap.

The server may adjust the model parameters of the image recognition model according to the first loss function by using a Gradient Descent method, where the Gradient Descent method may be a Stochastic Gradient Descent (SGD) method, a Batch Gradient Descent (Batch Gradient Descent), a Mini-Batch Gradient Descent (Mini-Batch Gradient Descent), and the like, and the embodiment of the present application is not limited thereto. In addition, the server can also adopt a gradient descent method and a polynomial learning rate attenuation strategy to adjust the model parameters of the image recognition model. Under the implementation mode, the server can dynamically adjust the learning rate according to the training process, and the training effect of the image recognition model is improved.

304. And in response to the difference between the first similarity and the second similarity meeting the target condition, the server takes the image recognition model as a trained image recognition model.

It should be noted that, in the above step 201 and 204, the server is taken as an execution subject for illustration, and in other possible embodiments, the terminal may be taken as an execution subject to execute the method, or the method may be implemented by interaction between the terminal and the server, that is, a user inputs an image through the terminal, the terminal sends the image to the server, the server executes a training process of an image recognition model, the server sends the trained image recognition model to the terminal, and the user may use the image recognition model on the terminal to perform image recognition, which is not limited in this embodiment of the present application.

The above-mentioned step 301-:

401. the server obtains a first positive sample image, a second positive sample image, a first negative sample image and a second negative sample image, wherein the first positive sample image and the second positive sample image correspond to the same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects.

in one possible implementation, the server may obtain a first sample image set, a second image set, and a third image set from the network, wherein images in the first sample image set, images in the second image set, and images in the third image set correspond to different human faces. The server can divide the images in the first image set, the second image set and the third image set into a plurality of batches, and one batch of images is adopted for training in each training process. The server may obtain a first positive sample image and a second positive sample image from the first set of sample images, a first negative sample image from the second set of images, and a second negative sample image from the third set of images. The server can add labels to the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image to indicate the corresponding faces of the people.

After the server obtains the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image, the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image can be cut to obtain sample images with the same size. Technicians can screen the cut sample images and remove the sample images without human faces. The image recognition model is trained based on the sample images with the same size, so that all numerical values in model parameters of the image recognition model can be obtained through massive training, and the accuracy of image recognition of the image recognition model can be improved.

Of course, the server may also acquire a plurality of images from the network, perform image recognition on the plurality of images, and discard one image if the image does not have a human face. The server may perform key point detection on the image including the face to obtain a position of the face in the image, and perform cropping on the image according to the position of the face, for example, the image is cropped to a predetermined size of 120 × 120. The server can classify the images according to the faces of the images to generate at least two image sets, wherein the images in each image set correspond to the same face. The server may determine a first set of sample images, a second set of images, and a third set of images from the at least three sets of images. The server may obtain a first positive sample image and a second positive sample image from the first set of sample images, a first negative sample image from the second set of images, and a second negative sample image from the third set of images.

The following describes a method for acquiring a first positive sample image, a second positive sample image, a first negative sample image and a second negative sample image from a first sample image set, a second image set and a third image set by a server:

the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image may be referred to as a quadruple, wherein the first positive sample and the second positive sample are from the same image set, i.e. correspond to the same face, and the first negative sample and the second negative sample are from different image sets, i.e. correspond to different faces. Any negative sample is from a different set of images than any positive sample, corresponding to a different face.

In order to train an image recognition model more comprehensively and obtain an image recognition model with a better image recognition effect, the technical scheme provided by the application adopts a Batch Hard Mining algorithm (BHM) when acquiring a reference sample image, a positive sample image and a negative sample image, and in the sample images acquired by the BHM algorithm, the similarity between the selected first positive sample image and the second positive sample image is relatively low, and the similarity between the first negative sample image and the second negative sample image is relatively high, so that the difficulty of image recognition model training is improved, and the image recognition effect of the image recognition model is finally improved.

For example, the server performs feature extraction on a plurality of first sample images in the first sample image set to obtain a plurality of first sample image features. The server determines the similarity between the plurality of first sample images in the first sample image set according to the characteristics of the plurality of first sample images. The server determines two first sample images with the lowest similarity as a first image and a second image. And the server performs feature extraction on the plurality of second sample images in the second sample image set to obtain a plurality of second sample image features. And the server determines a third image with the lowest similarity to the first image and the second image from the second sample image set according to the plurality of second sample image characteristics and the first sample image characteristics of the first image and the second image. The server determines a third image with the lowest similarity to the first image and the second image from the second sample image set according to the plurality of second sample image features and the first sample image features of the first image and the second image. And the server determines a fourth image which has the lowest similarity with the first image and the second image and the highest similarity with the third image from the third sample image set according to the plurality of third sample image characteristics, the first sample image characteristics of the first image and the second sample image characteristics of the third image. The server determines the first image as a first positive sample image, the second image as a second positive sample image, the third image as a first negative sample image, and the fourth image as a second negative sample image.

402. The server inputs the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image into an image recognition model, and determines a third similarity between the first positive sample image and the second positive sample image and a fourth similarity between the first negative sample image and the second negative sample image through the image recognition model.

In a possible implementation manner, the server extracts first positive sample feature information, second positive sample feature information, first negative sample feature information, and second negative sample feature information of the first positive sample image, the second positive sample image, the first negative sample image, and the second negative sample image, respectively, through the image recognition model. And the server obtains a third similarity according to the first positive sample characteristic information and the second positive sample characteristic information. And the server obtains a fourth similarity according to the first negative sample characteristic information and the second negative sample characteristic information.

For example, the server may input the first positive sample image, the second positive sample image, the first negative sample image, and the second negative sample image into the image recognition model through the input layer 101 of the image recognition model, and perform convolution processing on the first positive sample image, the second positive sample image, the first negative sample image, and the second negative sample image through the feature extraction layer 102 of the image recognition model to obtain first positive sample feature information, second positive sample feature information, first negative sample feature information, and second negative sample feature information. The server may input the first positive sample feature information, the second positive sample feature information, the first negative sample feature information, and the second negative sample feature information into the output layer 1031 of the image recognition model, and obtain the third similarity and the fourth similarity through the output layer 1031.

For example, the server may perform convolution processing on the first positive sample image, the second positive sample image, the first negative sample image, and the second negative sample image through the feature extraction layer 102 of the image recognition model to obtain a 128-dimensional first positive sample feature vector, a 128-dimensional second positive sample feature vector, a 128-dimensional first negative sample feature vector, and a 128-dimensional second negative sample feature vector, respectively. The server may determine a third similarity according to the first positive sample feature vector and the second positive sample feature vector, and determine a fourth similarity according to the first negative sample feature vector and the second negative sample feature vector, that is, determine cosine similarities between vectors.

It should be noted that, after the server executes step 402, it may determine whether the difference between the third similarity and the fourth similarity meets the target condition. In response to the difference not meeting the target condition, the server may perform step 403; in response to the difference meeting the target condition, the server may perform step 404.

403. And the server adjusts the model parameters of the image recognition model according to the difference information between the third similarity and the fourth similarity.

In one possible embodiment, the server determines the loss function based on a difference between the third similarity and the fourth similarity. The server determines a resulting gradient of the image recognition model based on the loss function. And the server adjusts the model parameters of the image recognition model according to a gradient descent method.

For example, the server may construct a second loss function according to the third similarity and the fourth similarity, and adjust the model parameters of the image recognition model through the second loss function. The first positive sample image and the second positive sample image correspond to the same face, that is, the features of the face in the second positive sample image and the face in the first positive sample image are close to each other, and the first negative sample image and the second negative sample image of the input image recognition model correspond to different faces, that is, the features of the face in the second negative sample image and the face in the first negative sample image are not close to each other. The purpose of the server training the image recognition model is therefore to make the third similarity as large as possible and the fourth similarity as small as possible, i.e. any third similarity is larger than any fourth similarity, in other words, the difference between the third similarity and the fourth similarity is as large as possible. In this implementation manner, the server may adjust the model parameters of the image recognition model through the third similarity and the fourth similarity, so as to improve the recognition capability of the image recognition model for the size of the sample object in the image.

For example, the server may construct the second loss function by the third similarity and the fourth similarity through formula (3).

Wherein L is_quiIs a quadruple loss function, i.e. a second loss function, x_iIs the first positive sample image, x_jIs the second positive sample image, x_lIs a first negative sample image, xk is a second negative sample image, y_i、y_j、y_lAnd y_lThe first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image correspond to a human face, d (r)₁，r₂) Representing a metric function to measure the vector r₁，r₂And d (r) is greater than d₁，r₂) Is set to r₁And r₂Distance between m₂Representing edge hyper-parameters representing inter-class differences within control classes, where [ z]₊Max (z,0), i.e. the larger of z and 0 is chosen, if z is greater than 0 then z is taken, if z is less than 0 then 0 is taken. With such an arrangement, the quadruple penalty function expands the inter-class gap while reducing the intra-class gap.

404. And in response to the difference value between the third similarity and the fourth similarity meeting the target condition, the server takes the image recognition model as a trained image recognition model.

Through the steps 801 and 805, the server may train an image recognition model using two images, a positive sample image and a negative sample image with a size smaller than that of the positive sample image, where the two images both include sample objects, and the negative sample image is from the positive sample image. The server may enlarge the negative sample image to be the same as the positive sample image, which may result in the number of sample objects in the enlarged negative sample image being less than or equal to the number of sample objects in the positive sample image, i.e. an annotated sample image is generated, i.e. the number of sample objects in the negative sample image is naturally less than the number of sample objects in the positive sample image. The method comprises the steps of inputting a positive sample image and an amplified negative sample image into an image recognition model, determining quantity information of sample objects in the positive sample image and quantity information of the sample objects in the amplified negative sample image through the image recognition model, and training to enable the difference between the two quantity information to be large enough, wherein when the difference is large enough, the model can recognize the quantity of the sample objects, namely the quantity information of the sample objects, so that priori knowledge is provided for subsequent model training, and the training effect of the image recognition model is improved.

In this embodiment of the present application, the server may train the image recognition model by using four images, each of the four images includes a sample object, the first positive sample image and the second positive sample image correspond to the same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects. In the training process, the server is dedicated to recognize that the similarity between the first positive sample image and the second positive sample image is as high as possible and the similarity between the first negative sample image and the second negative sample image is as low as possible through the model, so that whether the shot image and the stored image contain the same face or not can be accurately determined in the subsequent use process.

On the basis of the steps 301-304 and 401-404, the present application further provides a triple and quadruple based joint training method, in which the reference sample image in the triple corresponds to the first positive sample image in the quadruple, the positive sample image in the triple corresponds to the second positive sample image in the quadruple, and the negative sample image in the triple corresponds to the first negative sample image in the quadruple, as shown in fig. 5, the method is as follows.

601. The server obtains a first positive sample image, a second positive sample image, a first negative sample image and a second negative sample image, wherein the first positive sample image and the second positive sample image correspond to the same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects.

The image acquisition process is described in the above 401, and is not described in detail here.

602. The server inputs the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image into an image recognition model, determines a third similarity between the first positive sample image and the second positive sample image and a fourth similarity between the first negative sample image and the second negative sample image through the image recognition model, and determines a loss function according to a difference value between the third similarity and the fourth similarity.

In one possible approach, the server initializes a feature extraction network F (·; θ), where θ is a network parameter, and the server initializes a pseudo-large batch sequence x_plbIs empty, i.e. x_plb＝[]. The server randomly selects a sample image forming set I ═ I₁，…，I_n]And extracting the corresponding batch feature X ═ F (I; theta). The server removes the sample characteristics X extracted in the current iteration from the graph nodes (the pytorech has a special term called detach, and can remove the gradient and other information of X to make it not occupy the video memory), and inputs the sample characteristics X into the sequence X_plbIn (1). The server selects the difficult-to-sample triplets and quadruplets from the current iteration batch (according to the batch difficult-to-sample mining mode), and respectively calculates the loss

And

in addition, the server is based on the current batch sample image X and the past batch sample image X_plbSelecting cross-iteration hard-example triples, and calculating a loss function according to the following formula (4)

Wherein x_aIs the anchor sample image, i.e. the first positive sample image, x_pFor the positive samples, i.e. the second positive sample image, these two samples will only come from the current iteration, because if the samples from the previous iteration are used as the positive sample pair, the calculated gradients will not be transmitted back to them, so that the training is meaningless (the samples from the previous iteration are both truncated). x is the number of_nIs the first negative sample image, x_nMay be selected from past iterations (i.e. from x)_plbSelected from) is obtained, so that the selection space is larger, and the difficult negative sample pair is easy to select. In addition, y_iIs the label of the corresponding sample, and N is the number of samples.

The server calculates the joint loss of the current batch according to the following equation (5), and the principle diagram of the joint loss is shown in fig. 7.

603. The server determines a resulting gradient of the image recognition model based on the loss function. And the server adjusts the model parameters of the image recognition model according to a gradient descent method.

In one possible implementation, the server performs gradient back-propagation and collects the gradient magnitude generated by the current iteration according to the following formula (6).

And the server judges whether the iteration times reach a fixed time pseudo-N, calculates the network parameter gradient if the iteration times reach the fixed time pseudo-N, and updates the network parameters by adopting an SGD algorithm, which is shown in a formula (7).

604. And in response to the difference value between the third similarity and the fourth similarity meeting the target condition, the server takes the image recognition model as a trained image recognition model.

In a possible implementation manner, the server judges whether the training loss converges, and if so, terminates the training to obtain the image recognition model.

In addition to the training method of the image recognition model, an embodiment of the present application further provides an image recognition method, where the image recognition method is implemented based on the image recognition model trained by the training method of the image recognition model, and the method includes:

801. the server acquires a first image to be recognized and a second image to be recognized, wherein the first image to be recognized comprises a first object to be recognized, and the second image to be recognized comprises a second object to be recognized.

The first image to be recognized may be an image obtained in real time, the second image to be recognized is an image stored in the server, for example, the second image to be recognized is an identification card photo of a person to be recognized, and the first image to be recognized is an image of the person to be recognized shot by the camera on site.

802. The server inputs the first image to be recognized and the second image to be recognized into the image recognition model, and extracts the first image feature of the first image to be recognized and the second image feature of the second image to be recognized through the image recognition model.

The image recognition model is obtained by training based on a plurality of reference sample images, positive sample images and negative sample images, the reference sample images and the positive sample images correspond to first sample objects, the negative sample images correspond to second sample objects, and the first sample objects and the second sample objects are different sample objects.

803. And the server outputs the similarity between the first object to be recognized and the second object to be recognized according to the first image characteristic and the first image characteristic.

804. And in response to the fact that the similarity between the first object to be recognized and the second object to be recognized meets the similarity condition, the server determines that the first generation recognition object and the second object to be recognized are the same object.

Fig. 9 is a schematic structural diagram of a training apparatus for an image recognition model according to an embodiment of the present application, and referring to fig. 9, the apparatus includes a sample image obtaining module 901, a first input module 902, and a training module 903.

A sample image obtaining module 901, configured to obtain, in an iterative process, a reference sample image, a positive sample image, and a negative sample image, where the reference sample image and the positive sample image correspond to a first sample object, the negative sample image corresponds to a second sample object, and the first sample object and the second sample object are different sample objects.

A first input module 902, configured to input the reference sample image, the positive sample image and the negative sample image into an image recognition model, and determine a first similarity between the reference sample image and the positive sample image and a second similarity between the reference sample image and the negative sample image through the image recognition model.

And the training module 903 is used for responding to the difference value between the first similarity and the second similarity meeting the target condition, and taking the image recognition model as the trained image recognition model.

In a possible implementation manner, the first input module is configured to extract, through the image recognition model, reference feature information of the reference sample image, positive sample feature information of the positive sample image, and negative sample feature information of the negative sample image. And obtaining a first similarity according to the reference characteristic information and the positive sample characteristic information. And obtaining a second similarity according to the reference characteristic information and the negative sample characteristic information.

In one possible embodiment, the apparatus further comprises:

and the characteristic extraction module is used for carrying out characteristic extraction on the plurality of first sample images in the first sample image set to obtain a plurality of first sample image characteristics.

And the similarity determining module is used for determining the similarity among the plurality of first sample images in the first sample image set according to the characteristics of the plurality of first sample images.

And the image determining module is used for determining the two first sample images with the lowest similarity as the first image and the second image.

And the similarity determining module is further used for performing feature extraction on the plurality of second sample images in the second sample image set to obtain a plurality of second sample image features.

And the image determining module is further used for determining a third image and a fourth image which have the lowest similarity with the first image and the second image from the second sample image set respectively according to the plurality of second sample image characteristics and the first sample image characteristics of the first image and the second image.

And the image determining module is further used for determining the first image as a reference image, determining the second image as a positive sample image and determining the third image as a negative sample image in response to the first similarity between the first image and the third image being smaller than the second similarity between the second image and the fourth image.

In one possible embodiment, the apparatus further comprises:

In a possible embodiment, the parameter adjusting module is configured to determine the loss function according to a difference between the first similarity and the second similarity. From the loss function, the resulting gradient of the image recognition model is determined. And adjusting the model parameters of the image recognition model according to a gradient descent method.

In a possible embodiment, the sample image obtaining module is further configured to obtain a first positive sample image, a second positive sample image, a first negative sample image, and a second negative sample image, where the first positive sample image and the second positive sample image correspond to the same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects.

The first input module is further used for inputting the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image into an image recognition model, and determining a third similarity between the first positive sample image and the second positive sample image and a fourth similarity between the first negative sample image and the second negative sample image through the image recognition model.

And the training module is also used for adjusting the model parameters of the image recognition model according to the difference information between the third similarity and the fourth similarity.

Fig. 10 is a schematic structural diagram of an image recognition apparatus provided in an embodiment of the present application, and referring to fig. 10, the apparatus includes: an image acquisition module 1001, a second input module 1002, and an output module 1003.

The image acquiring module 1001 is configured to acquire a first image to be recognized and a second image to be recognized, where the first image to be recognized includes a first object to be recognized, and the second image to be recognized includes a second object to be recognized;

a second input module 1002, configured to input the first image to be recognized and the second image to be recognized into an image recognition model, and extract a first image feature of the first image to be recognized and a second image feature of the second image to be recognized through the image recognition model;

an output module 1003, configured to output a similarity between the first object to be recognized and the second object to be recognized according to the first image feature and the first image feature.

The embodiment of the present application provides a computer device, configured to execute the methods provided in the foregoing embodiments, where the computer device may be implemented as a terminal or a server, and the structure of the terminal is described below first:

fig. 11 is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 1100 may be: a tablet, laptop, or desktop computer. Terminal 1100 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and so forth.

In general, terminal 1100 includes: one or more processors 1101 and one or more memories 1102.

Processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable logic Array). The processor 1101 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit) that is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1101 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 can also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one instruction for execution by processor 1101 to implement the methods provided by the method embodiments herein.

In some embodiments, the terminal 1100 may further include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102 and peripheral interface 1103 may be connected by a bus or signal lines. Various peripheral devices may be connected to the peripheral interface 1103 by buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, display screen 1105, camera assembly 1106, audio circuitry 1107, positioning assembly 1108, and power supply 1109.

The peripheral interface 1103 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1101 and the memory 1102. In some embodiments, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1101, the memory 1102 and the peripheral device interface 1103 may be implemented on separate chips or circuit boards, which is not limited by this embodiment.

The Radio Frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1104 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1104 converts an electric signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electric signal. Optionally, the radio frequency circuit 1104 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth.

The display screen 1105 is used to display a UI (user interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1105 is a touch display screen, the display screen 1105 also has the ability to capture touch signals on or over the surface of the display screen 1105. The touch signal may be input to the processor 1101 as a control signal for processing. At this point, the display screen 1105 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard.

Camera assembly 1106 is used to capture images or video. Optionally, camera assembly 1106 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal.

The audio circuitry 1107 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1101 for processing or inputting the electric signals to the radio frequency circuit 1104 to achieve voice communication.

Positioning component 1108 is used to locate the current geographic position of terminal 1100 for purposes of navigation or LBS (Location Based Service).

Power supply 1109 is configured to provide power to various components within terminal 1100. The power supply 1109 may be alternating current, direct current, disposable or rechargeable.

In some embodiments, terminal 1100 can also include one or more sensors 1110. The one or more sensors 1110 include, but are not limited to: acceleration sensor 1111, gyro sensor 1112, pressure sensor 1113, fingerprint sensor 1114, optical sensor 1115, and proximity sensor 1116.

Acceleration sensor 1111 may detect acceleration levels in three coordinate axes of a coordinate system established with terminal 1100.

The gyro sensor 1112 may detect a body direction and a rotation angle of the terminal 1100, and the gyro sensor 1112 may cooperate with the acceleration sensor 1111 to acquire a 3D motion of the user with respect to the terminal 1100.

Pressure sensor 1113 may be disposed on a side bezel of terminal 1100 and/or underlying display screen 1105. When the pressure sensor 1113 is disposed on the side frame of the terminal 1100, the holding signal of the terminal 1100 from the user can be detected, and the processor 1101 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 1113. When the pressure sensor 1113 is disposed at the lower layer of the display screen 1105, the processor 1101 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 1105.

The fingerprint sensor 1114 is configured to collect a fingerprint of the user, and the processor 1101 identifies the user according to the fingerprint collected by the fingerprint sensor 1114, or the fingerprint sensor 1114 identifies the user according to the collected fingerprint.

Optical sensor 1115 is used to collect ambient light intensity. In one embodiment, the processor 1101 may control the display brightness of the display screen 1105 based on the ambient light intensity collected by the optical sensor 1115. Proximity sensor 1116 is used to capture the distance between the user and the front face of terminal 1100.

Those skilled in the art will appreciate that the configuration shown in fig. 11 does not constitute a limitation of terminal 1100, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.

The computer device may also be implemented as a server, and the following describes a structure of the server:

fig. 12 is a schematic structural diagram of a server 1200 according to an embodiment of the present application, where the server 1200 may generate a relatively large difference due to a difference in configuration or performance, and may include one or more processors (CPUs) 1201 and one or more memories 1202, where the one or more memories 1202 store at least one instruction, and the at least one instruction is loaded and executed by the one or more processors 1201 to implement the methods provided by the foregoing method embodiments. Certainly, the server 1200 may further have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input and output, and the server 1200 may further include other components for implementing the functions of the device, which is not described herein again.

In an exemplary embodiment, a computer-readable storage medium, such as a memory, is also provided that includes instructions executable by a processor to perform the methods provided by the various method embodiments described above. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A training method of an image recognition model is characterized by comprising the following steps:

2. The method of claim 1, wherein the determining, by the image recognition model, a first similarity between the reference sample image and the positive sample image, and a second similarity between the reference sample image and the negative sample image comprises:

3. The method of claim 1, wherein prior to the acquiring the reference sample image, the positive sample image, and the negative sample image, the method further comprises:

4. The method of claim 1, wherein after determining, by the image recognition model, a first similarity between the reference sample image and the positive sample image and a second similarity between the reference sample image and the negative sample image, the method further comprises:

5. The method of claim 4, wherein the adjusting model parameters of the image recognition model comprises:

6. The method of claim 1, wherein prior to applying the image recognition model as a trained image recognition model, the method further comprises:

7. An image recognition method, characterized in that the image recognition method comprises:

8. An apparatus for training an image recognition model, comprising:

9. An image recognition apparatus, characterized in that the image recognition apparatus comprises:

10. A computer device, comprising one or more processors and one or more memories having stored therein at least one instruction, the instruction being loaded and executed by the one or more processors to implement the operations performed by the training method of the image recognition model according to any one of claims 1 to 6; or as performed by the image recognition method of claim 7.