CN111368934B - Image recognition model training method, image recognition method and related device


Info

Publication number
CN111368934B
Authority
CN
China
Prior art keywords
target
sample
image
tag
training
Prior art date
Legal status
Active
Application number
CN202010187873.3A
Other languages
Chinese (zh)
Other versions
CN111368934A
Inventor
卓炜
范琦
戴宇榮
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010187873.3A
Publication of CN111368934A
Application granted
Publication of CN111368934B


Classifications

    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 16/5866: Retrieval characterised by using metadata, using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image recognition model training method, an image recognition method and a related device. A target sample is obtained based on a target recognition image; a training triplet is then determined according to the target sample; and the training triplet is input into a preset network model for training to obtain a target network model. Network model training based on triplets is thus realized. Because the triplet contains a positive sample indicating similarity between labels and a negative sample indicating difference between labels, the trained network model can represent image features more comprehensively. The triplet construction process requires no manual intervention and can be applied to the recognition of new samples, which greatly saves training time and improves the accuracy and efficiency of network model training.

Description

Image recognition model training method, image recognition method and related device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to an image recognition model training method, an image recognition method and a related device.
Background
As users' requirements for picture processing keep rising, target detection technology is being applied ever more widely. A model must be trained with a large number of high-quality target detection training samples before it can be used in a target detection task. In practical application scenarios, however, labeling such a large number of high-quality samples consumes considerable manpower and material resources, so the samples often cannot be obtained quickly, and a detection model cannot be deployed rapidly to detect new samples. Small sample target detection methods address this problem well.
Generally, small sample target detection methods use a large number of training samples for model training, so that the trained model can recognize images similar to the training samples.
However, as the number of small sample categories in images grows, a fixed set of training samples cannot fully capture the samples' characteristics, and collecting training samples requires considerable manpower and material resources, which affects the accuracy and efficiency of model training.
Disclosure of Invention
In view of the above, the application provides a method for training an image recognition model, which avoids the inefficiency and incompleteness caused by manually labeling training samples and improves the efficiency and accuracy of the model training process.
The first aspect of the present application provides a method for training an image recognition model, which can be applied to a system or a program including a model training function in a terminal device, and specifically includes: acquiring a target sample based on the target identification image;
determining a training triplet according to the target sample, wherein the training triplet comprises at least one positive sample pair and at least one negative sample pair, the positive sample pair consists of the target sample and the positive sample, the negative sample pair consists of the target sample and the negative sample, the positive sample is obtained based on a similar label corresponding to the target sample, and the negative sample is obtained based on a difference label corresponding to the target sample;
and performing supervised learning on a preset network model based on the positive sample pair and the negative sample pair to obtain a target network model, wherein the target network model is used for identifying the target identification image.
Optionally, in some possible implementations of the present application, the performing, based on the positive pair of samples and the negative pair of samples, comparative training on a preset network model to obtain a target network model includes:
determining a matching tag of the positive sample pair, the matching tag being determined based on a similarity of the positive sample and the target sample;
determining a foreground region and a background region of the target sample based on the matching tag to obtain a classified positive sample pair;
and inputting the positive sample pair and the negative sample pair after classification into the preset network model for supervised learning so as to obtain the target network model.
Optionally, in some possible implementations of the present application, the performing supervised learning on the classified positive sample pair and the negative sample pair input to the preset network model to obtain the target network model includes:
acquiring a first loss value according to the feature similarity of the foreground region and the positive sample;
acquiring a second loss value according to the feature similarity of the background area and the positive sample, wherein the type of the label indicated by the second loss value is opposite to that indicated by the first loss value;
obtaining a third loss value according to the feature similarity of the negative sample pair, wherein the type of the label indicated by the third loss value is opposite to that indicated by the first loss value;
and carrying out back propagation calculation on the preset network model based on the first loss value, the second loss value and the third loss value to obtain the target network model.
Optionally, in some possible implementations of the present application, the determining a training triplet according to the target sample includes:
extracting image features of the target sample based on an attention mechanism;
determining a corresponding positive sample and negative sample according to the target sample;
extracting image features of the positive sample by adopting a detection frame, and generating a positive sample pair with the image features of the target sample;
extracting image features of the negative sample by adopting a detection frame, and generating a negative sample pair with the image features of the target sample;
a training triplet is determined based on the positive sample pair and the negative sample pair.
Optionally, in some possible implementations of the present application, the determining the corresponding positive sample and negative sample according to the target sample includes:
determining a target label in the target sample;
acquiring a sample with the same label based on the target label so as to obtain the positive sample;
and acquiring samples with different labels based on the target label so as to obtain the negative sample.
Optionally, in some possible implementations of the present application, the target tag is included in a tag database, and the determining the target tag in the target sample includes:
determining a corresponding candidate tag in response to at least one tag selection instruction;
traversing in the tag database based on the candidate tag to obtain the target tag.
Optionally, in some possible implementations of the present application, traversing the tag database based on the candidate tag to obtain the target tag includes:
traversing in the tag database based on the candidate tags to obtain at least one retrieval tag;
acquiring the label similarity of the candidate label and the search label;
and determining the target label based on the label similarity.
Optionally, in some possible implementations of the present application, the determining the target tag in the target sample includes:
determining a template picture in response to an image framed by at least one target detection frame;
and determining the target label in the target sample according to the template picture.
Optionally, in some possible implementations of the present application, the acquiring the target sample based on the target identification image includes:
acquiring an image tag in the target identification image;
and determining the target sample meeting a preset condition according to the image tag, wherein the preset condition is determined based on the matching degree of the image tag and the target sample.
Optionally, in some possible implementations of the present application, the preset network model is a small sample target detection model, and the target network model is a trained small sample target detection model.
A second aspect of the present application provides an apparatus for training an image recognition model, including: an acquisition unit configured to acquire a target sample based on a target recognition image;
a determining unit, configured to determine a training triplet according to the target sample, where the training triplet includes at least one positive sample pair and at least one negative sample pair, where the positive sample pair is composed of the target sample and a positive sample, the negative sample pair is composed of the target sample and a negative sample, the positive sample is obtained based on a similar label corresponding to the target sample, and the negative sample is obtained based on a difference label corresponding to the target sample;
a training unit, configured to perform supervised learning on a preset network model based on the positive sample pair and the negative sample pair to obtain a target network model, where the target network model is used for identifying the target identification image.
Optionally, in some possible implementations of the present application, the training unit is specifically configured to determine a matching tag of the positive sample pair, where the matching tag is determined based on a similarity between the positive sample and the target sample;
the training unit is specifically configured to determine a foreground area and a background area of the target sample based on the matching tag, so as to obtain a classified positive sample pair;
the training unit is specifically configured to perform supervised learning on the classified positive sample pair and the negative sample pair input to the preset network model, so as to obtain the target network model.
Optionally, in some possible implementations of the present application, the training unit is specifically configured to obtain a first loss value according to a feature similarity between the foreground area and the positive sample;
the training unit is specifically configured to obtain a second loss value according to the feature similarity between the background area and the positive sample, where the second loss value is opposite to the label indicated by the first loss value;
the training unit is specifically configured to obtain a third loss value according to the feature similarity of the negative sample pair, where the third loss value is opposite to the label indicated by the first loss value;
the training unit is specifically configured to perform back propagation calculation on the preset network model based on the first loss value, the second loss value, and the third loss value, so as to obtain the target network model.
Optionally, in some possible implementations of the present application, the determining unit is specifically configured to extract image features of the target sample based on an attention mechanism;
the determining unit is specifically configured to determine a positive sample and a negative sample according to the target sample;
the determining unit is specifically configured to extract an image feature of the positive sample by using a detection frame, so as to generate a positive sample pair with the image feature of the target sample;
the determining unit is specifically configured to extract an image feature of the negative sample by using a detection frame, so as to generate a negative sample pair with the image feature of the target sample;
the determining unit is specifically configured to determine a training triplet based on the positive sample pair and the negative sample pair.
Optionally, in some possible implementations of the present application, the determining unit is specifically configured to determine a target tag in the target sample;
the determining unit is specifically configured to obtain a sample with the same label based on the target label, so as to obtain the positive sample;
the determining unit is specifically configured to obtain samples with different tags based on the target tag, so as to obtain the negative sample.
Optionally, in some possible implementations of the present application, the determining unit is specifically configured to determine the corresponding candidate tag in response to at least one tag selection instruction;
the determining unit is specifically configured to traverse in the tag database based on the candidate tag, so as to obtain the target tag.
Optionally, in some possible implementations of the present application, the determining unit is specifically configured to traverse in the tag database based on the candidate tag to obtain at least one search tag;
the determining unit is specifically configured to obtain a tag similarity between the candidate tag and the search tag;
the determining unit is specifically configured to determine the target tag based on the tag similarity.
Optionally, in some possible implementations of the present application, the determining unit is specifically configured to determine the template picture in response to an image framed by at least one target detection frame;
the determining unit is specifically configured to determine the target tag in the target sample according to the template picture.
Optionally, in some possible implementations of the present application, the acquiring unit is specifically configured to acquire an image tag in the target identification image;
the acquisition unit is specifically configured to determine, according to the image tag, the target sample that meets a preset condition, where the preset condition is determined based on a matching degree between the image tag and the target sample.
The third aspect of the present application provides a method for image recognition, specifically including: responding to the identification instruction to obtain at least one template picture, wherein the template picture is used for indicating an identification target in a target identification image;
inputting the template picture and the target recognition image into a target network model to obtain a recognition result, wherein the recognition result is a set of recognition targets, and the target network model is trained based on the model training method of any one of the first aspect.
A fourth aspect of the present application provides an image recognition apparatus, specifically including: an acquisition unit for acquiring at least one template picture in response to an identification instruction, the template picture being used for indicating an identification target in a target identification image;
a recognition unit for inputting the template picture and the target recognition image into a target network model to obtain a recognition result, where the recognition result is a set of recognition targets, and the target network model is trained based on the model training method of any one of the first aspect.
A fifth aspect of the present application provides a computer apparatus comprising: a memory, a processor, and a bus system; the memory is used for storing program codes; the processor is configured to perform the method of model training according to the first aspect or any one of the first aspects, or the method of image recognition according to the third aspect, according to instructions in the program code.
A sixth aspect of the application provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of model training of the first aspect or any of the first aspects described above, or the method of image recognition of the third aspect.
From the above technical solutions, the embodiment of the present application has the following advantages:
a target sample is acquired based on the target recognition image; then a corresponding positive sample and negative sample are determined according to the target sample, where the positive sample has the same label as the target sample and the negative sample has a different label from the target sample; and the target sample, the positive sample and the negative sample are input into a preset network model for training to obtain a target network model, where the target network model is used for identifying the target identification image. Network model training based on triplets is thus realized. Because the triplet contains a positive sample indicating similarity between labels and a negative sample indicating difference between labels, the trained network model can cover image features under different labels more comprehensively; the triplet construction process requires no manual intervention, which greatly saves training time and improves the accuracy and efficiency of network model training.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of a network architecture in which a model training system operates;
FIG. 2 is a flowchart of training an image recognition model according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for training an image recognition model according to an embodiment of the present application;
FIG. 4 is a schematic view of a scenario for training an image recognition model according to an embodiment of the present application;
FIG. 5 is a flow chart of a model training method according to an embodiment of the present application;
FIG. 6 is a flowchart of a method for image recognition according to an embodiment of the present application;
FIG. 7 is a schematic view of a scenario of an image recognition method according to an embodiment of the present application;
FIG. 8 is a flowchart of a terminal architecture implementation provided in an embodiment of the present application;
FIG. 9 is a schematic view of a scene of another image recognition method according to an embodiment of the present application;
FIG. 10 is a schematic view of a scene of another image recognition method according to an embodiment of the present application;
FIG. 11 is a schematic view of a scene of another image recognition method according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an image recognition model training device according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of an image recognition device according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a method and a related device for training an image recognition model, which can be applied to a system or a program containing a model training function in a terminal device. A target sample is acquired based on a target recognition image; then a corresponding positive sample and negative sample are determined according to the target sample, where the positive sample has the same label as the target sample and the negative sample has a different label from the target sample; and the target sample, the positive sample and the negative sample are input into a preset network model for training to obtain a target network model, where the target network model is used for identifying the target recognition image. Network model training based on triplets is thus realized. Because the triplet contains a positive sample indicating similarity between labels and a negative sample indicating difference between labels, the trained network model can cover image features under different labels more comprehensively; the triplet construction process requires no manual intervention, which greatly saves training time and improves the accuracy and efficiency of network model training.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "includes" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
First, some terms that may appear in the embodiments of the present application will be explained.
Object detection (Object Detection): each object in the picture is marked with a rectangular box, and the class of the object is given.
Small sample object detection techniques (Few-shot Object Detection, FSOD) refer to training an object detection model using only a small number of samples and allowing objects of the same class in a picture to be detected from a given small number of template objects when performing object detection.
Twin network (Siamese network): two different inputs are processed simultaneously using a weight-shared network; such a network is a twin network.
Feature map (Feature map) is image information obtained by convolving an image with a filter; the Feature map may be convolved with a filter to generate a new Feature map.
Attention feature map (Attention feature map): a feature map in which, through the attention mechanism, the region containing the target object produces a stronger response.
The target recognition image (Query image) is a picture for target detection, and the network model detects objects in the Query image.
Template picture (Support image): the template picture used for small sample target detection; the model detects all objects of the same category in the query image according to the template picture.
Depth-wise cross correlation (Depth-wise Cross correlation) refers to using the feature map of the support image as a filter and convolving it channel by channel, one-to-one, over the query feature map. The feature maps of the support image and the query image have the same number of channels, and the number of output channels equals the number of input channels.
Feature pooling (RoI Pooling) refers to pooling the corresponding region of a feature map to a fixed-size feature map according to the position of an input rectangular box.
Comparative training (Contrastive training) refers to a method of training a multiple twin network with (target sample, positive sample, negative sample) triplet training samples.
It should be understood that the model training method provided by the application can be applied to a system or a program containing a model training function in a terminal device, such as an image recognition program. Specifically, the model training system can run in the network architecture shown in fig. 1, which is a diagram of the network architecture in which the model training system operates. As shown in fig. 1, the model training system can provide model training for a plurality of information sources: a terminal establishes a connection with a server through a network and sends a recognition request and a target recognition image to the server; the server builds triplet training samples according to the target recognition image, trains the preset network model, recognizes the target recognition image, and returns the result to the terminal. Several terminal devices are shown in fig. 1; in an actual scenario, more or fewer terminal devices may participate in the model training process, the specific number and types depending on the actual scenario and not being limited here. In addition, one server is shown in fig. 1, but in an actual scenario multiple servers may participate, particularly in scenarios of multi-content application interaction, the specific number of servers depending on the actual scenario.
It should be noted that the model training method provided in this embodiment may also be performed offline, that is, without the participation of a server: a terminal connects locally to another terminal, and the model training process is then performed between terminals.
It will be appreciated that the model training system described above may run on a personal mobile terminal, for example as an image recognition application; it may also run on a server, or on a third-party device that provides model training to obtain model training results for an information source. The specific model training system may run in the form of a program, run as a system component in the device, or be used as a cloud service program, the specific operation mode depending on the actual scenario and not being limited here.
As users' requirements for picture processing keep rising, target detection technology is being applied ever more widely. A model must be trained with a large number of high-quality target detection training samples before it can be used in a target detection task. In practical application scenarios, however, labeling such a large number of high-quality samples consumes considerable manpower and material resources, so the samples often cannot be obtained quickly, and a detection model cannot be deployed rapidly to detect new samples. Small sample target detection methods address this problem well.
Generally, small sample target detection methods use a large number of training samples for model training, so that the trained model can recognize images similar to the training samples.
However, as the number of small sample categories in images grows, a fixed set of training samples cannot fully capture the samples' characteristics, and collecting training samples requires considerable manpower and material resources, which affects the accuracy and efficiency of model training.
To solve the above problems, the present application provides a method for training an image recognition model, which is applied to the model training flow framework shown in fig. 2; fig. 2 is a flow framework diagram of image recognition model training provided in an embodiment of the present application.
It can be understood that the method provided by the application may be written as a program to serve as processing logic in a hardware system, or as an image recognition model training device that realizes the above processing logic in an integrated or external form. As one implementation, the model training device acquires a target sample based on a target recognition image; then determines a corresponding positive sample and negative sample according to the target sample, where the positive sample has the same label as the target sample and the negative sample has a different label from the target sample; and inputs the target sample, the positive sample and the negative sample into a preset network model for training to obtain a target network model, where the target network model is used for identifying the target recognition image. Network model training based on triplets is thus realized. Because the triplet contains a positive sample indicating similarity between labels and a negative sample indicating difference between labels, the trained network model can cover image features under different labels more comprehensively; the triplet construction process requires no manual intervention, which greatly saves training time and improves the accuracy and efficiency of network model training.
With reference to fig. 3, fig. 3 is a flowchart of a method for training an image recognition model according to an embodiment of the present application, where the method includes at least the following steps:
301. A target sample is acquired based on the target identification image.
In this embodiment, the process of obtaining the target sample based on the target identification image may be performed based on a preset template picture; that is, the template picture in the target identification image is extracted, and a target sample containing the template picture is then obtained based on it. For example: if the template picture of the target identification image is a cap, a corresponding target sample containing the image features of the cap is acquired according to those image features.
Alternatively, the process of acquiring the target sample based on the template picture may also be performed based on the image tag in the target identification image, that is, the tag corresponding to the template picture. For example: the template picture of the target identification image is a bicycle, and a corresponding picture containing a bicycle label or bicycle image characteristics is obtained as a target sample according to the bicycle label; the target sample is obtained through the tag, so that a quick searching process can be realized, and the model training efficiency is improved.
It can be understood that multiple target samples may be recalled when acquiring target samples based on tags. In this case, the picture with the higher matching degree between the image tag and the candidate can be determined as the target sample, which improves the relevance between the target sample and the target identification image and thus the accuracy of the subsequent training process.
In this embodiment, the source of the target sample may be a preset training database, in which case no training samples need to be set manually, saving time; the target sample may also be obtained over a network. The target sample indicates a small but highly representative part of the target identification image, which is what makes this a small sample detection process; compared with extracting features from the whole image, it saves system resources.
302. A training triplet is determined according to the target sample.
In this embodiment, the training triplet includes at least one positive sample pair and at least one negative sample pair; the positive sample pair is composed of the target sample and a positive sample, and the negative sample pair is composed of the target sample and a negative sample. The positive sample is obtained based on a similar label corresponding to the target sample, and the negative sample is obtained based on a different label. That is, the positive sample and the target sample belong to the same category, while the negative sample and the target sample belong to different categories. Fig. 4 is a schematic view of a scene of training an image recognition model provided by an embodiment of the present application; the figure includes a target sample A1 in a target recognition image, the corresponding positive sample A2 is a bicycle with similar image features or a similar label, and the negative sample A3 is a car with a different label.
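For concreteness, the following Python sketch shows one way the triplet and its two sample pairs could be represented; the class and field names are illustrative assumptions, not taken from the patent.

```python
from typing import NamedTuple, Tuple

class Sample(NamedTuple):
    image_path: str  # path to the sample image
    label: str       # category tag, e.g. "bicycle"

class TrainingTriplet(NamedTuple):
    target: Sample    # target sample taken from the target recognition image
    positive: Sample  # carries the same label as the target sample
    negative: Sample  # carries a different label from the target sample

    def positive_pair(self) -> Tuple[Sample, Sample]:
        return (self.target, self.positive)

    def negative_pair(self) -> Tuple[Sample, Sample]:
        return (self.target, self.negative)
```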
It can be understood that the labels of the positive and negative samples can be preset; that is, there is a database containing a large number of small sample materials, and each small sample carries a corresponding label. Samples whose labels differ from the target label serve as negative samples; alternatively, a preset correspondence can be configured manually as the basis for selecting positive and negative samples.
Specifically, positive and negative samples may be determined through association between labels. First, the target label in the target sample is determined; then a sample with the same label is obtained based on the target label to get a positive sample; and samples with different labels are obtained based on the target label to get negative samples. Positive and negative samples are thus obtained rapidly, with no need to manually label training samples, which greatly saves training time.
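As a hedged illustration of this label-driven lookup, reusing the Sample and TrainingTriplet sketches above (the in-memory database list is a hypothetical stand-in for the patent's label database):

```python
import random
from typing import List

def build_triplet(target: Sample, database: List[Sample]) -> TrainingTriplet:
    """Select a positive sample (same label) and a negative sample (different
    label) for the target from a sample database, with no manual labeling step."""
    positives = [s for s in database if s.label == target.label]
    negatives = [s for s in database if s.label != target.label]
    if not positives or not negatives:
        raise ValueError(f"no candidates found for label {target.label!r}")
    return TrainingTriplet(target=target,
                           positive=random.choice(positives),
                           negative=random.choice(negatives))
```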
Optionally, owing to the diversity of labels, a target sample may correspond to multiple candidate labels. In this case, the determination can be made based on which labels are present in the label database: first, corresponding candidate labels are determined in response to at least one label selection instruction; then the label database is traversed based on the candidate labels to obtain the target label. For example: the label indicated by the label selection instruction is a bicycle, and the candidate labels might be synonyms such as "bicycle", "bike" or "cycle"; if only the label "bicycle" is found when traversing the database, "bicycle" is determined as the target label, ensuring the stability of the label association process.
In one possible scenario, no exactly corresponding label can be retrieved from the label database. In this case, the target label may be determined according to the label similarity between the candidate label and the retrieved labels; that is, a label with higher similarity is selected as the target label.
Alternatively, the target label may be determined in response to a template picture framed by the user, with the label corresponding to the template picture taken as the target label. For example: if the template picture framed by the user contains a puppy, the target label is "puppy". This enriches the user's modes of operation and improves the user experience.
303. Supervised learning is performed on the preset network model based on the positive sample pair and the negative sample pair to obtain a target network model.
In this embodiment, the target network model is used to identify the target identification image.
Specifically, the contrastive training process over the target sample, the positive sample and the negative sample may proceed as follows: a positive sample pair and a negative sample pair are generated, where the positive sample pair contains the target sample and the positive sample, and the negative sample pair contains the target sample and the negative sample; the feature similarity of the positive sample pair and of the negative sample pair is then acquired; and supervised learning is performed on the preset network model according to these feature similarities to obtain the target network model.
It can be appreciated that the supervised learning process involves parameter training in different dimensions: the true labels of the target sample and the positive sample match, while the true labels of the target sample and the negative sample do not. Loss values therefore need to be obtained from the feature similarities of the pairs. Because the foreground and background regions of the target sample are determined based on the matching label (for example, the foreground region of the target sample is a horse and the background region is a tree), obtaining loss values from the positive sample pair includes obtaining a first loss value from the feature similarity between the foreground region and the positive sample, and a second loss value from the feature similarity between the background region and the positive sample. A third loss value is then obtained from the feature similarity of the negative sample pair, where the type of label indicated by the third loss value is opposite to that indicated by the first loss value. Back propagation is performed on the preset network model based on the first, second and third loss values to obtain the target network model. This training process takes both the similarity and the difference between sample labels into account, which improves the accuracy of model training; the trained model is not only suitable for labeled samples but can also recognize newly added samples.
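A minimal PyTorch-style sketch of this three-loss back propagation step follows; `model.pair_loss` is an assumed interface standing in for the patent's matching losses, not an actual API.

```python
def training_step(model, optimizer, fg_feat, bg_feat, q_feat, pos_feat, neg_feat):
    """One training step: foreground/positive should match (label 1), while
    background/positive and target/negative should not match (label 0)."""
    loss1 = model.pair_loss(fg_feat, pos_feat, match=1.0)  # first loss value
    loss2 = model.pair_loss(bg_feat, pos_feat, match=0.0)  # second loss value
    loss3 = model.pair_loss(q_feat, neg_feat, match=0.0)   # third loss value
    total = loss1 + loss2 + loss3
    optimizer.zero_grad()
    total.backward()   # back propagation over the preset network model
    optimizer.step()
    return total.item()
```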
In one possible scenario, the model training process may refer to fig. 5; fig. 5 is a flow chart of a model training method provided by an embodiment of the present application, showing the following steps:
(1) Training triplets (target sample, positive sample, negative sample) are constructed.
A positive sample pair is formed by (target sample, positive sample), and a negative sample pair by (target sample, negative sample). The ground-truth value of each sample pair is then set: since the target recognition image is a scene picture, the ground truth is the matching label that decides whether a rectangular frame (anchor) in the target recognition image matches the positive sample.
Optionally, the anchors of the target sample may be generated at intervals of 16 pixels, selecting several different sizes and aspect ratios; for example: 4 sizes and 3 aspect ratios are selected according to the image size, the specific parameters depending on the actual scenario.
Specifically, since a positive sample pair may be mismatched, positive sample pairs need to be screened. In the positive sample pair (target sample, positive sample), screening is performed based on the overlap between the anchors in the target sample and the identification target. For example: an anchor whose Intersection-over-Union (IoU) with the target is greater than 0.5 is matched successfully and given matching label A; correspondingly, an unsuccessful match is given label B. Thus, in the positive sample pair, some prediction candidate boxes are given label A, such as candidate boxes that coincide with the target, while others are given label B, such as candidate boxes whose region is background.
In addition, in the negative sample pair (target sample, negative sample), the labels of all prediction boxes of the target sample are B, i.e., no match.
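The anchor screening above reduces to computing IoU against the identification target's box; a small sketch under the stated 0.5 threshold (the (x1, y1, x2, y2) box format is an assumption):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_label(anchor, target_box, threshold=0.5):
    """Label A (1) for anchors overlapping the identification target enough;
    label B (0) otherwise, e.g. anchors covering only background."""
    return 1 if iou(anchor, target_box) > threshold else 0
```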
(2) Sample features are extracted based on the twinning network.
After the sample pairs are constructed, sample features are extracted separately; that is, the triplet (target sample, positive sample, negative sample) is input into the backbone network (ResNet-50) to extract features. The positive sample is cropped by its detection box and globally pooled to obtain a 1 x 1 x 1024 feature f_p; the negative sample is cropped by its box and globally pooled to obtain a 1 x 1 x 1024 feature f_n; and the target sample is pooled to obtain an M x N x 1024 feature f_q, where M and N are image parameters of the target sample.
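The twin-network extraction can be sketched with a shared torchvision backbone; truncating ResNet-50 after its third stage gives 1024-channel features matching the dimensions above. The pooling details are assumptions, not the patent's exact implementation.

```python
import torch
import torchvision

# shared backbone: ResNet-50 truncated after layer3, so features have 1024 channels
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-3])

def extract_features(target_img, positive_crop, negative_crop):
    """Weight-shared (twin) feature extraction over the triplet."""
    f_q = backbone(target_img)                        # (1, 1024, M, N) target feature map
    f_p = backbone(positive_crop).mean(dim=(2, 3))    # global pooling -> (1, 1024)
    f_n = backbone(negative_crop).mean(dim=(2, 3))    # global pooling -> (1, 1024)
    return f_q, f_p, f_n
```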
(3) Feature vectors of the positive and negative sample pairs are extracted based on an attention mechanism. That is, the positive sample pair and the negative sample pair are input into the region candidate network (attention RPN) to extract attention features; on the attention feature of the positive sample pair, the probability that a candidate box in the target sample matches the positive sample is predicted, and on the attention feature of the negative sample pair, the probability that a candidate box matches the negative sample is predicted.
The attention feature extraction based on the attention RPN is a depthwise convolution process, and the loss function is as follows:

L = \frac{1}{N_{pqf}} \sum_i L_{p-qf}(f_p, f_{qf,i}) + \frac{1}{N_{pqb}} \sum_j L_{p-qb}(f_p, f_{qb,j}) + \frac{1}{N_{nq}} \sum_k L_{n-q}(f_n, f_k)

where f_p is the positive sample feature; f_{qf,i} are the foreground region features of the target sample; f_{qb,j} are the background region features of the target sample; f_n is the negative sample feature; f_k are the target sample features; L_{p-qf}, L_{p-qb} and L_{n-q} are standard losses of the region candidate network; and N_{pqf}, N_{pqb} and N_{nq} are the sample counts, which can be set here as N_{pqf} : N_{pqb} : N_{nq} = 1:1:1, the specific ratio depending on the actual scenario.
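For the depthwise convolution step, a sketch of depth-wise cross correlation with a pooled 1x1 support filter (the kernel size is an assumption; the definition above only fixes the channel-wise grouping):

```python
import torch
import torch.nn.functional as F

def depthwise_cross_correlation(query_fmap: torch.Tensor,
                                support_feat: torch.Tensor) -> torch.Tensor:
    """Slide the support feature over the query feature map channel by channel,
    so the output keeps the same number of channels as the input."""
    b, c, h, w = query_fmap.shape
    kernel = support_feat.view(c, 1, 1, 1)           # one 1x1 filter per channel
    return F.conv2d(query_fmap, kernel, groups=c)    # (b, c, h, w) attention feature map

# example: a pooled 1024-channel support feature against a 32 x 32 query map
attention = depthwise_cross_correlation(torch.randn(1, 1024, 32, 32),
                                        torch.randn(1, 1024))
```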
It will be appreciated that the loss objective function includes the matching loss between the foreground region of the target sample and the positive sample, i.e. the first loss value; the matching loss between the background region of the target sample and the positive sample, i.e. the second loss value; and the matching loss between the target sample and the negative sample, i.e. the third loss value.
After candidate boxes are filtered by the attention RPN, the remaining candidate boxes pass through a detection network to obtain the final matching prediction, i.e. the matching result.
It will be appreciated that each matching loss function consists of the loss functions of a standard detection task, namely a detection box regression loss (box regression loss) and a classification loss (classification loss), where the classification loss is a binary cross-entropy loss over whether the pair matches, i.e. the matching label assignment described above; for example, label A takes the value 1 and label B takes the value 0.
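Under those definitions, each matching loss could look like the following sketch (the equal weighting of the two terms is an assumption):

```python
import torch.nn.functional as F

def matching_loss(match_logits, match_labels, box_preds, box_targets):
    """Binary cross-entropy over the match label (A -> 1, B -> 0) plus a
    standard detection box regression loss."""
    cls_loss = F.binary_cross_entropy_with_logits(match_logits, match_labels)
    reg_loss = F.smooth_l1_loss(box_preds, box_targets)
    return cls_loss + reg_loss
```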
In one possible scenario, the training process of the FSOD model can be improved based on the model training method provided by the application; that is, the sample input of FSOD adopts the triplet construction described in this embodiment. The method thus inherits all the advantages of FSOD while greatly improving its performance.
It should be noted that FSOD is used as the preset network model in this embodiment only as an example; in an actual scenario, the model training method described in this embodiment may also use a different basic network structure.
As can be seen from the above embodiments, a target sample is acquired based on a target recognition image; then a corresponding positive sample and negative sample are determined according to the target sample, where the positive sample has the same label as the target sample and the negative sample has a different label; and the target sample, the positive sample and the negative sample are input into a preset network model for training to obtain a target network model, where the target network model is used for identifying the target identification image. Network model training based on triplets is thus realized. Because the triplet contains a positive sample indicating similarity between labels and a negative sample indicating difference between labels, the trained network model can cover image features under different labels more comprehensively; the triplet construction process requires no manual intervention, can be applied to the recognition of new samples, greatly saves training time, and improves the accuracy and efficiency of network model training.
The foregoing embodiment describes a model training process, and is described below with reference to image recognition as a specific scenario, referring to fig. 6, fig. 6 is a flowchart of a method for image recognition according to an embodiment of the present application, where the embodiment of the present application at least includes the following steps:
601. At least one template picture is acquired in response to the identification instruction.
In this embodiment, the template picture is used to indicate the identification target in the target identification image. The template picture may be an image obtained through the identification instruction, or may be obtained from a tag determined by the identification instruction. The identification instruction can be initiated by a user, specifically by clicking the identification button, by frame-selecting a part of the image, or by inputting a label, the specific form depending on the actual scenario.
It should be noted that one or more template pictures may be acquired in response to the identification instruction; when there is more than one, each is identified separately and the corresponding identification results are output.
Optionally, if the object of the identification instruction is a video, the target identification image is the video frame at the moment the identification instruction is sent, and the image features in that frame are associated with similar images in subsequent frames, realizing recognition in the video.
602. The template picture and the target identification image are input into a target network model to obtain an identification result.
In this embodiment, the recognition result is a set of recognition targets. Fig. 7 is a scene schematic diagram of an image recognition method provided in an embodiment of the present application: two template pictures, a hat and a bicycle, are input; these template pictures and the target recognition image are then input into the target network model for recognition, so that elements in the target recognition image similar to the template pictures are obtained and framed for highlighting.
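As an illustration of this inference flow, a sketch assuming the trained model returns (boxes, scores) per template; the interface and the score threshold are assumptions, not the patent's API.

```python
import torch

def recognize(model, templates, query_img, score_threshold=0.5):
    """Run small-sample detection once per template picture and collect the
    boxes of matched recognition targets."""
    results = {}
    with torch.no_grad():
        for name, template in templates.items():     # e.g. {"hat": ..., "bicycle": ...}
            boxes, scores = model(template, query_img)
            results[name] = boxes[scores > score_threshold]
    return results
```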
It should be understood that frame-selection highlighting in the above embodiment is merely an example; other ways of highlighting the recognition result, such as highlighting or changing hue, also fall within the solutions provided by the present application.
In this embodiment, the relevant features of the training process of the target network model are similar to those of steps 301-303 in the embodiment described in fig. 3, and reference may be made thereto, which is not repeated here.
It can be understood that the result of image recognition may be the framed image feature region, the output tag information of the framed region, or continuous tracking of the framed image feature, i.e. a feature recognition process in video.
Because the training process of the preset network model is carried out on the target identification image, and the small sample detection network is trained effectively by constructing efficient (target sample, positive sample, negative sample) training triplets, the detection performance of the network on small samples is greatly improved and the training time is shortened. The model training process is convenient, efficient and fast, expanding the application scenarios of image recognition.
The image recognition method provided by the present application is described below with reference to a hardware architecture of a terminal, as shown in fig. 8, which is a flowchart of terminal architecture execution provided by an embodiment of the present application, where the method includes:
801. front end a receives input data.
In this embodiment, the input data includes a target identification picture and a template picture, where the template picture may be user-defined or may be generated based on a tag.
802. The back end performs model training.
In this embodiment, the back end's model training process is as described in the embodiment of fig. 3 and is not repeated here.
It can be understood that the back end can be the terminal's own processing system, a server, or a cloud service device, for example performing the model training process through Tencent Cloud.
803. The front end B displays the recognition result.
In this embodiment, the front end B and the front end a may be display interfaces of the same terminal, or may be display interfaces of different terminals, where the specific form depends on the actual scenario.
Alternatively, recognition may proceed in the interactive manner shown in fig. 9, which is a schematic scene of another image recognition method provided by an embodiment of the present application. In the figure, the user first clicks the identification button to recognize the target identification image; the template pictures here may be a person and a bicycle, and the image regions corresponding to the person and the bicycle are then framed in the display interface. Further, the user may click the detail button to inspect the specific recognition process, namely the positive and negative samples involved in the model training process of the present application, for example: the positive samples are bicycles of different forms, and the negative samples are cars and trucks. Further still, the user can view the matching process of the positive and negative samples by clicking the more button; the user can then check whether the matching is accurate and, if not, click to report an error so that the background can revise the model training parameters, guaranteeing the accuracy of model training.
Alternatively, the identification process may proceed in the interactive manner shown in fig. 10, which is a schematic view of a scene of another image identification method according to an embodiment of the present application. Here the front end A adopts label input: the user can enter "bicycle" in the label field B1, the background generates a target sample for the bicycle and correspondingly generates positive and negative samples, and the model training process described above is then carried out, so that a training result is obtained by clicking the identification button.
Optionally, the identification process may also proceed in the interactive manner shown in fig. 11, a schematic view of another image identification scene provided by an embodiment of the present application. The template picture is determined by frame selection, which in turn determines the target sample. In the figure, the user wants to mark all bicycle image features, so the user slides on the touch screen to draw a sample frame C1; the background then generates a target sample for the bicycle within sample frame C1, correspondingly generates positive and negative samples, and performs the model training process described above, so that a training result is obtained by clicking the identification button.
Through the above interaction modes, a quick image recognition process can be performed. Because of the triplet-based training process, the detection performance of the network model on small samples is greatly improved and the training time is shortened; moreover, the user can conveniently perform image recognition in various modes, which improves the user experience.
In order to better implement the above aspects of the embodiments of the present application, related apparatuses for implementing the above aspects are provided below. Referring to fig. 12, fig. 12 is a schematic structural diagram of an image recognition model training apparatus according to an embodiment of the present application. The model training apparatus 1200 includes:
an acquisition unit 1201 for acquiring a target sample based on a target recognition image;
a determining unit 1202, configured to determine a training triplet according to the target sample, where the training triplet includes at least one positive sample pair and at least one negative sample pair, the positive sample pair is composed of the target sample and a positive sample, the negative sample pair is composed of the target sample and a negative sample, the positive sample is obtained based on a similar label corresponding to the target sample, and the negative sample is obtained based on a difference label corresponding to the target sample;
the training unit 1203 is configured to perform supervised learning on a preset network model based on the positive sample pair and the negative sample pair, so as to obtain a target network model, where the target network model is used for identifying the target identification image.
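By way of illustration only, the following minimal Python sketch shows how a determining unit of this kind might assemble positive and negative sample pairs from a label-indexed sample pool; the function name `build_triplets`, the `(sample, label)` pool layout, and the pair count are assumptions made for illustration and do not reflect the actual implementation of the apparatus.

```python
import random
from collections import defaultdict

def build_triplets(target_sample, target_label, sample_pool, pairs_per_kind=4):
    """Sketch: form positive pairs from same-label samples and negative
    pairs from different-label samples, yielding a training triplet set."""
    by_label = defaultdict(list)
    for sample, label in sample_pool:  # sample_pool: iterable of (sample, label)
        by_label[label].append(sample)

    positives = by_label[target_label]
    negatives = [s for label, samples in by_label.items()
                 if label != target_label for s in samples]

    positive_pairs = [(target_sample, p) for p in
                      random.sample(positives, min(pairs_per_kind, len(positives)))]
    negative_pairs = [(target_sample, n) for n in
                      random.sample(negatives, min(pairs_per_kind, len(negatives)))]
    return positive_pairs, negative_pairs
```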
Optionally, in some possible implementations of the present application, the training unit 1203 is specifically configured to determine a matching tag of the positive sample pair, where the matching tag is determined based on a similarity between the positive sample and the target sample;
the training unit 1203 is specifically configured to determine a foreground area and a background area of the target sample based on the matching tag, so as to obtain a classified positive sample pair;
the training unit 1203 is specifically configured to input the classified positive sample pair and the negative sample pair into the preset network model for supervised learning, so as to obtain the target network model.
Optionally, in some possible implementations of the present application, the training unit 1203 is specifically configured to obtain a first loss value according to a feature similarity between the foreground area and the positive sample;
the training unit 1203 is specifically configured to obtain a second loss value according to the feature similarity between the background area and the positive sample, where the label indicated by the second loss value is opposite in type to the label indicated by the first loss value;
the training unit 1203 is specifically configured to obtain a third loss value according to the feature similarity of the negative sample pair, where the label indicated by the third loss value is opposite in type to the label indicated by the first loss value;
the training unit 1203 is specifically configured to perform back propagation calculation on the preset network model based on the first loss value, the second loss value, and the third loss value, so as to obtain the target network model.
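As a hedged sketch of how the three loss values could be computed and combined for back propagation, the following PyTorch snippet scores feature similarity with cosine similarity mapped to [0, 1] and targets label 1 for the foreground/positive pair and label 0 for the other two pairs; the similarity function and the binary targets are illustrative assumptions, not the application's prescribed formulas.

```python
import torch
import torch.nn.functional as F

def similarity(a, b):
    # Cosine similarity mapped from [-1, 1] to [0, 1] so that it can be
    # read as a match probability by binary cross-entropy.
    return (F.cosine_similarity(a, b, dim=-1) + 1.0) / 2.0

def triplet_region_loss(fg_feat, bg_feat, pos_feat, tgt_feat, neg_feat):
    """Sketch: the first loss pulls the foreground region toward the positive
    sample; the second and third losses push the background region and the
    negative pair apart (their indicated labels are of the opposite type)."""
    sim_fg = similarity(fg_feat, pos_feat)    # foreground vs. positive sample
    sim_bg = similarity(bg_feat, pos_feat)    # background vs. positive sample
    sim_neg = similarity(tgt_feat, neg_feat)  # target sample vs. negative sample

    first_loss = F.binary_cross_entropy(sim_fg, torch.ones_like(sim_fg))
    second_loss = F.binary_cross_entropy(sim_bg, torch.zeros_like(sim_bg))
    third_loss = F.binary_cross_entropy(sim_neg, torch.zeros_like(sim_neg))
    return first_loss + second_loss + third_loss

# Back propagation over the summed loss then updates the preset network model:
#   loss = triplet_region_loss(fg, bg, pos, tgt, neg)
#   loss.backward(); optimizer.step()
```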
Optionally, in some possible implementations of the present application, the determining unit 1202 is specifically configured to extract image features of the target sample based on an attention mechanism;
the determining unit 1202 is specifically configured to determine a positive sample and a negative sample according to the target sample;
the determining unit 1202 is specifically configured to extract an image feature of the positive sample by using a detection frame, so as to generate a positive sample pair with the image feature of the target sample;
the determining unit 1202 is specifically configured to extract an image feature of the negative sample by using a detection frame, so as to generate a negative sample pair with the image feature of the target sample;
the determining unit 1202 is specifically configured to determine a training triplet based on the positive sample pair and the negative sample pair.
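The pairing flow above can be sketched as follows; the attention-pooling module and the box-crop helper are illustrative stand-ins, since the application does not specify the backbone or the concrete form of the attention mechanism.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Sketch: learn a spatial weighting over a feature map and pool it,
    standing in for the attention-based extraction of target-sample features."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, fmap):                                        # fmap: (B, C, H, W)
        weights = torch.softmax(self.score(fmap).flatten(2), dim=-1)  # (B, 1, H*W)
        return (fmap.flatten(2) * weights).sum(dim=-1)              # pooled: (B, C)

def crop_detection_frame(image, box):
    """Sketch: crop a detection frame (x1, y1, x2, y2) out of an image tensor
    before extracting positive/negative sample features."""
    x1, y1, x2, y2 = box
    return image[..., y1:y2, x1:x2]
```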
Optionally, in some possible implementations of the present application, the determining unit 1202 is specifically configured to determine a target tag in the target sample;
the determining unit 1202 is specifically configured to obtain a sample with the same label based on the target label, so as to obtain the positive sample;
the determining unit 1202 is specifically configured to obtain samples of different labels based on the target label, so as to obtain the negative sample.
Optionally, in some possible implementations of the present application, the determining unit 1202 is specifically configured to determine the corresponding candidate tag in response to at least one tag selection instruction;
the determining unit 1202 is specifically configured to traverse in the tag database based on the candidate tag, so as to obtain the target tag.
Optionally, in some possible implementations of the present application, the determining unit 1202 is specifically configured to traverse in the tag database based on the candidate tag to obtain at least one search tag;
the determining unit 1202 is specifically configured to obtain a tag similarity between the candidate tag and the search tag;
the determining unit 1202 is specifically configured to determine the target tag based on the tag similarity.
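For illustration, a simple string-overlap score can stand in for the tag similarity. The following is a minimal sketch, assuming the tag database is a plain list of tag strings; a deployed system might instead use learned tag embeddings.

```python
import difflib

def find_target_tag(candidate_tag, tag_database, threshold=0.5):
    """Sketch: traverse the tag database, score each retrieved tag against
    the candidate tag, and keep the best match above a threshold."""
    scored = [(difflib.SequenceMatcher(None, candidate_tag, tag).ratio(), tag)
              for tag in tag_database]
    best_score, best_tag = max(scored)
    return best_tag if best_score >= threshold else None

# e.g. find_target_tag("bike", ["bicycle", "car", "truck"]) returns "bicycle"
```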
Optionally, in some possible implementations of the present application, the determining unit 1202 is specifically configured to determine the template picture in response to the image in the at least one target detection frame;
the determining unit 1202 is specifically configured to determine a target tag in the target sample according to the template picture.
Optionally, in some possible implementations of the present application, the acquiring unit 1201 is specifically configured to acquire an image tag in the target identification image;
the obtaining unit 1201 is specifically configured to determine, according to the image tag, the target sample that meets a preset condition, where the preset condition is determined based on a matching degree between the image tag and the target sample.
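A minimal sketch of such a preset condition follows, assuming the matching degree is exposed as a score function in [0, 1] and using a hypothetical threshold of 0.8; both are assumptions for illustration.

```python
def select_target_samples(image_tag, candidates, match_fn, min_degree=0.8):
    """Sketch: keep only candidate samples whose tag-matching degree with
    the image tag satisfies the preset condition."""
    # candidates: iterable of (sample, tag); match_fn scores tag agreement in [0, 1]
    return [sample for sample, tag in candidates
            if match_fn(image_tag, tag) >= min_degree]
```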
In summary, a target sample is acquired based on the target identification image; corresponding positive and negative samples are then determined according to the target sample, where the positive sample shares a label with the target sample and the negative sample carries a different label; the target sample, the positive sample, and the negative sample are input into a preset network model for training to obtain a target network model, and the target network model is used for identifying the target identification image. Network model training based on triplets is thus realized. Because the triplets contain positive samples indicating label similarity and negative samples indicating label difference, the trained network model covers image features under different labels more comprehensively; moreover, the triplet construction process requires no manual intervention, which greatly saves training time and improves the accuracy and efficiency of network model training.
The embodiment of the present application further provides an image recognition apparatus 1300. As shown in fig. 13, the apparatus for image recognition according to the embodiment of the present application specifically includes: an obtaining unit 1301, configured to obtain at least one template picture in response to an identification instruction, where the template picture is used to indicate an identification target in a target identification image;
the identifying unit 1302 is configured to input the template picture and the target identification image into a target network model to obtain a recognition result, where the recognition result is a set of identification targets, and the target network model is trained based on the model training method according to any one of the first aspect.
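At inference time the flow reduces to one forward pass. Below is a hedged sketch, assuming the trained model returns a `(boxes, scores)` pair for a (template, target) input; the actual model interface is not specified by the application.

```python
import torch

@torch.no_grad()
def recognize(model, template_img, target_img, score_threshold=0.5):
    """Sketch: run the trained target network model on a template picture and
    a target identification image, keeping detections above a confidence bar."""
    boxes, scores = model(template_img, target_img)  # assumed model interface
    keep = scores >= score_threshold
    return boxes[keep], scores[keep]                 # the set of recognized targets
```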
The embodiment of the present application further provides a terminal device. Fig. 14 is a schematic structural diagram of another terminal device provided in the embodiment of the present application; for convenience of description, only the portion related to the embodiment of the present application is shown, and for specific technical details that are not disclosed, please refer to the method portion of the embodiment of the present application. The terminal may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point of sale (POS) terminal, a vehicle-mounted computer, and the like; the following takes a mobile phone as an example.
Fig. 14 is a block diagram showing a part of the structure of a mobile phone related to the terminal provided by an embodiment of the present application. Referring to fig. 14, the mobile phone includes: radio frequency (RF) circuitry 1410, memory 1420, input unit 1430, display unit 1440, sensor 1450, audio circuitry 1460, wireless fidelity (WiFi) module 1470, processor 1480, and power supply 1490. It will be appreciated by those skilled in the art that the handset structure shown in fig. 14 does not limit the handset, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The following describes the components of the mobile phone in detail with reference to fig. 14:
the RF circuit 1410 may be used for receiving and transmitting signals during a message exchange or a call. In particular, after downlink information of a base station is received, it is handed to the processor 1480 for processing; in addition, uplink data is sent to the base station. Typically, the RF circuitry 1410 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 1410 may also communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short message service (SMS), and the like.
The memory 1420 may be used to store software programs and modules, and the processor 1480 performs various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1420. The memory 1420 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone. In addition, the memory 1420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input unit 1430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile phone. In particular, the input unit 1430 may include a touch panel 1431 and other input devices 1432. The touch panel 1431, also referred to as a touch screen, may collect touch operations by the user on or near it (e.g., operations performed on or near the touch panel 1431 with a finger, a stylus, or any other suitable object or accessory, as well as hovering touch operations within a certain range of the touch panel 1431) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 1431 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 1480; it can also receive commands from the processor 1480 and execute them. Further, the touch panel 1431 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 1431, the input unit 1430 may include other input devices 1432, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a switch key), a trackball, a mouse, and a joystick.
The display unit 1440 may be used to display information input by a user or information provided to the user and various menus of the mobile phone. The display unit 1440 may include a display panel 1441, and alternatively, the display panel 1441 may be configured in the form of a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 1431 may overlay the display panel 1441, and when the touch panel 1431 detects a touch operation thereon or nearby, the touch operation is transferred to the processor 1480 to determine the type of the touch event, and then the processor 1480 provides a corresponding visual output on the display panel 1441 according to the type of the touch event. Although in fig. 14, the touch panel 1431 and the display panel 1441 are two separate components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 1431 may be integrated with the display panel 1441 to implement the input and output functions of the mobile phone.
The handset can also include at least one sensor 1450, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 1441 according to the brightness of ambient light, and a proximity sensor that may turn off the display panel 1441 and/or the backlight when the phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the acceleration in all directions (generally three axes), and can detect the gravity and direction when stationary, and can be used for applications of recognizing the gesture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer and knocking), and the like; other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc. that may also be configured with the handset are not described in detail herein.
The audio circuit 1460, the speaker 1461, and the microphone 1462 may provide an audio interface between the user and the mobile phone. The audio circuit 1460 may transmit the electrical signal converted from the received audio data to the speaker 1461, which converts it into a sound signal for output; on the other hand, the microphone 1462 converts a collected sound signal into an electrical signal, which is received by the audio circuit 1460 and converted into audio data; after the audio data is processed by the processor 1480, it is sent via the RF circuit 1410 to, for example, another mobile phone, or output to the memory 1420 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1470, the mobile phone can help the user send and receive e-mails, browse web pages, access streaming media, and the like, providing the user with wireless broadband Internet access. Although fig. 14 shows the WiFi module 1470, it can be understood that the module is not an essential component of the mobile phone and can be omitted as needed without changing the essence of the invention.
The processor 1480 is the control center of the mobile phone. It connects the various parts of the entire phone using various interfaces and lines, and performs the various functions of the phone and processes data by running or executing the software programs and/or modules stored in the memory 1420 and invoking the data stored in the memory 1420. Optionally, the processor 1480 may include one or more processing units; optionally, the processor 1480 may integrate an application processor, which mainly handles the operating system, user interfaces, and applications, and a modem processor, which mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 1480.
The mobile phone further includes a power supply 1490 (e.g., a battery) for powering the various components. Optionally, the power supply is logically connected to the processor 1480 through a power management system, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which will not be described herein.
In the embodiment of the present application, the processor 1480 included in the terminal also has the function of executing the steps of the model training and image recognition methods described above.
Embodiments of the present application also provide a computer readable storage medium having stored therein model training instructions which, when executed on a computer, cause the computer to perform the steps performed by the model training apparatus in the method described in the embodiments of fig. 3 to 11.
An embodiment of the present application also provides a computer program product comprising model training instructions which, when run on a computer, cause the computer to perform the steps performed by the model training apparatus in the methods described in the embodiments of fig. 3 to 11.
The embodiment of the application also provides a model training system, which can comprise the model training device in the embodiment shown in fig. 12 or the terminal equipment shown in fig. 14.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a model training apparatus, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (13)

1. A method of training an image recognition model, comprising:
acquiring a target sample based on the target identification image;
determining a training triplet according to the target sample, wherein the training triplet comprises at least one positive sample pair and at least one negative sample pair, the positive sample pair consists of the target sample and the positive sample, the negative sample pair consists of the target sample and the negative sample, the positive sample is obtained based on a similar label corresponding to the target sample, and the negative sample is obtained based on a difference label corresponding to the target sample;
determining a matching tag of the positive sample pair, the matching tag being determined based on a similarity of the positive sample and the target sample;
determining a foreground region and a background region of the target sample based on the matching tag to obtain a classified positive sample pair;
acquiring a first loss value according to the feature similarity of the foreground region and the positive sample;
acquiring a second loss value according to the feature similarity between the background area and the positive sample, wherein the type of the label indicated by the second loss value is opposite to that indicated by the first loss value;
obtaining a third loss value according to the feature similarity of the negative sample pair, wherein the type of the label indicated by the third loss value is opposite to that indicated by the first loss value;
and carrying out back propagation calculation on a preset network model based on the first loss value, the second loss value and the third loss value to obtain a target network model, wherein the target network model is used for identifying the target identification image.
2. The method of claim 1, wherein said determining a training triplet from said target sample comprises:
extracting image features of the target sample based on an attention mechanism;
determining a corresponding positive sample and negative sample according to the target sample;
extracting image features of the positive sample by adopting a detection frame, and generating a positive sample pair with the image features of the target sample;
extracting image features of the negative sample by adopting a detection frame, and generating a negative sample pair with the image features of the target sample;
a training triplet is determined based on the positive sample pair and the negative sample pair.
3. The method of claim 2, wherein the determining the corresponding positive and negative samples from the target sample comprises:
determining a target label in the target sample;
acquiring a sample with the same label based on the target label so as to obtain the positive sample;
and acquiring samples of different labels based on the target label, so as to obtain the negative sample.
4. A method according to claim 3, wherein the target tag is contained in a tag database, and wherein the determining the target tag in the target sample comprises:
determining a corresponding candidate tag in response to at least one tag selection instruction;
traversing in the tag database based on the candidate tag to obtain the target tag.
5. The method of claim 4, wherein traversing the tag database based on the candidate tag to obtain the target tag comprises:
traversing in the tag database based on the candidate tags to obtain at least one retrieval tag;
acquiring the label similarity of the candidate label and the search label;
and determining the target label based on the label similarity.
6. The method of claim 3, wherein the determining the target tag in the target sample comprises:
determining a template picture in response to the image in the at least one target detection frame;
and determining the target label in the target sample according to the template picture.
7. The method of claim 1, wherein the acquiring the target sample based on the target identification image comprises:
acquiring an image tag in the target identification image;
and determining the target sample meeting a preset condition according to the image tag, wherein the preset condition is determined based on the matching degree of the image tag and the target sample.
8. The method of claim 1, wherein the predetermined network model is a small sample target detection model and the target network model is a trained small sample target detection model.
9. A method of image recognition, comprising:
responding to an identification instruction to obtain at least one template picture, wherein the template picture is used for indicating an identification target in a target identification image;
inputting the template picture and the target recognition image into a target network model to obtain a recognition result, wherein the recognition result is a set of recognition targets, and the target network model is trained based on the model training method according to any one of claims 1-8.
10. An apparatus for training an image recognition model, comprising:
an acquisition unit configured to acquire a target sample based on a target recognition image;
a determining unit, configured to determine a training triplet according to the target sample, where the training triplet includes at least one positive sample pair and at least one negative sample pair, where the positive sample pair is composed of the target sample and a positive sample, the negative sample pair is composed of the target sample and a negative sample, the positive sample is obtained based on a similar label corresponding to the target sample, and the negative sample is obtained based on a difference label corresponding to the target sample;
the training unit is used for determining a matching label of the positive sample pair, and the matching label is determined based on the similarity of the positive sample and the target sample; determining a foreground region and a background region of the target sample based on the matching tag to obtain a classified positive sample pair; acquiring a first loss value according to the feature similarity of the foreground region and the positive sample; acquiring a second loss value according to the feature similarity between the background area and the positive sample, wherein the type of the label indicated by the second loss value is opposite to that indicated by the first loss value; obtaining a third loss value according to the feature similarity of the negative sample pair, wherein the type of the label indicated by the third loss value is opposite to that indicated by the first loss value; and carrying out back propagation calculation on a preset network model based on the first loss value, the second loss value and the third loss value to obtain a target network model, wherein the target network model is used for identifying the target identification image.
11. An apparatus for image recognition, comprising:
an acquisition unit for acquiring at least one template picture in response to an identification instruction, the template picture being used for indicating an identification target in a target identification image;
the recognition unit is used for inputting the template picture and the target recognition image into a target network model to obtain a recognition result, wherein the recognition result is a set of recognition targets, and the target network model is trained based on the model training method according to any one of claims 1-8.
12. A computer device, the computer device comprising a processor and a memory:
the memory is used for storing program codes; the processor is configured to perform the method of model training of any one of claims 1 to 8, or the method of image recognition of claim 9, according to instructions in the program code.
13. A computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of model training of any of the preceding claims 1 to 8 or the method of image recognition of claim 9.
CN202010187873.3A 2020-03-17 2020-03-17 Image recognition model training method, image recognition method and related device Active CN111368934B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010187873.3A CN111368934B (en) 2020-03-17 2020-03-17 Image recognition model training method, image recognition method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010187873.3A CN111368934B (en) 2020-03-17 2020-03-17 Image recognition model training method, image recognition method and related device

Publications (2)

Publication Number Publication Date
CN111368934A CN111368934A (en) 2020-07-03
CN111368934B (en) 2023-09-19

Family

ID=71211894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010187873.3A Active CN111368934B (en) 2020-03-17 2020-03-17 Image recognition model training method, image recognition method and related device

Country Status (1)

Country Link
CN (1) CN111368934B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115994A (en) * 2020-09-11 2020-12-22 北京达佳互联信息技术有限公司 Training method and device of image recognition model, server and storage medium
CN112241764B (en) * 2020-10-23 2023-08-08 北京百度网讯科技有限公司 Image recognition method, device, electronic equipment and storage medium
CN113850179A (en) * 2020-10-27 2021-12-28 深圳市商汤科技有限公司 Image detection method, and training method, device, equipment and medium of related model
CN112329785A (en) * 2020-11-25 2021-02-05 Oppo广东移动通信有限公司 Image management method, device, terminal and storage medium
CN112529913A (en) * 2020-12-14 2021-03-19 北京达佳互联信息技术有限公司 Image segmentation model training method, image processing method and device
CN112766406A (en) * 2021-01-29 2021-05-07 北京达佳互联信息技术有限公司 Article image processing method and device, computer equipment and storage medium
CN112861963A (en) * 2021-02-04 2021-05-28 北京三快在线科技有限公司 Method, device and storage medium for training entity feature extraction model
CN113239746B (en) * 2021-04-26 2024-05-17 深圳市安思疆科技有限公司 Electric vehicle detection method, device, terminal equipment and computer readable storage medium
CN113191436A (en) * 2021-05-07 2021-07-30 广州博士信息技术研究院有限公司 Talent image tag identification method and system and cloud platform
CN113407689A (en) * 2021-06-15 2021-09-17 北京三快在线科技有限公司 Method and device for model training and business execution
CN113344890B (en) * 2021-06-18 2024-04-12 北京百度网讯科技有限公司 Medical image recognition method, recognition model training method and device
CN113469253B (en) * 2021-07-02 2024-05-14 河海大学 Electric larceny detection method based on triple twinning network
CN113420847B (en) * 2021-08-24 2021-11-16 平安科技(深圳)有限公司 Target object matching method based on artificial intelligence and related equipment
CN113705425B (en) * 2021-08-25 2022-08-16 北京百度网讯科技有限公司 Training method of living body detection model, and method, device and equipment for living body detection
CN113705689A (en) * 2021-08-30 2021-11-26 上海商汤智能科技有限公司 Training data acquisition method and abnormal behavior recognition network training method
CN116049660B (en) * 2021-10-28 2024-07-12 腾讯科技(深圳)有限公司 Data processing method, apparatus, device, storage medium, and program product
CN116150470A (en) * 2021-11-19 2023-05-23 腾讯科技(深圳)有限公司 Content recommendation method, device, apparatus, storage medium and program product
CN114120420B (en) * 2021-12-01 2024-02-13 北京百度网讯科技有限公司 Image detection method and device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006039686A2 (en) * 2004-10-01 2006-04-13 University Of Southern California User preference techniques for support vector machines in content based image retrieval
CN103927387A (en) * 2014-04-30 2014-07-16 成都理想境界科技有限公司 Image retrieval system, method and device
CN104281572A (en) * 2013-07-01 2015-01-14 中国科学院计算技术研究所 Target matching method and system based on mutual information
CN105589929A (en) * 2015-12-09 2016-05-18 东方网力科技股份有限公司 Image retrieval method and device
WO2018133666A1 (en) * 2017-01-17 2018-07-26 腾讯科技(深圳)有限公司 Method and apparatus for tracking video target
CN109934249A (en) * 2018-12-14 2019-06-25 网易(杭州)网络有限公司 Data processing method, device, medium and calculating equipment
CN110163236A (en) * 2018-10-15 2019-08-23 腾讯科技(深圳)有限公司 The training method and device of model, storage medium, electronic device
WO2019196130A1 (en) * 2018-04-12 2019-10-17 广州飒特红外股份有限公司 Classifier training method and device for vehicle-mounted thermal imaging pedestrian detection
CN110472090A (en) * 2019-08-20 2019-11-19 腾讯科技(深圳)有限公司 Image search method and relevant apparatus, storage medium based on semantic label
CN110569721A (en) * 2019-08-01 2019-12-13 平安科技(深圳)有限公司 Recognition model training method, image recognition method, device, equipment and medium
CN110781711A (en) * 2019-01-21 2020-02-11 北京嘀嘀无限科技发展有限公司 Target object identification method and device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120283574A1 (en) * 2011-05-06 2012-11-08 Park Sun Young Diagnosis Support System Providing Guidance to a User by Automated Retrieval of Similar Cancer Images with User Feedback
EP3234871B1 (en) * 2014-12-17 2020-11-25 Google LLC Generating numeric embeddings of images
GB2536493B (en) * 2015-03-20 2020-11-18 Toshiba Europe Ltd Object pose recognition
US10747811B2 (en) * 2018-05-22 2020-08-18 Adobe Inc. Compositing aware digital image search


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector;Qi Fan et al.;《https://arxiv.org/abs/1908.01998v2》;1-16 *
Qi Fan et al..Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector.《https://arxiv.org/abs/1908.01998v2》.2019,1-16. *

Also Published As

Publication number Publication date
CN111368934A (en) 2020-07-03

Similar Documents

Publication Publication Date Title
CN111368934B (en) Image recognition model training method, image recognition method and related device
CN108334539B (en) Object recommendation method, mobile terminal and computer-readable storage medium
CN106407984B (en) Target object identification method and device
CN107729815B (en) Image processing method, image processing device, mobile terminal and computer readable storage medium
CN110443366B (en) Neural network optimization method and device, and target detection method and device
CN111125523B (en) Searching method, searching device, terminal equipment and storage medium
CN110674662B (en) Scanning method and terminal equipment
JP2017514204A (en) Contact grouping method and apparatus
CN104573597A (en) Two-dimension code identification method and identification device
CN111612093A (en) Video classification method, video classification device, electronic equipment and storage medium
CN111209423B (en) Image management method and device based on electronic album and storage medium
WO2019105457A1 (en) Image processing method, computer device and computer readable storage medium
CN107346200B (en) Interval screenshot method and terminal
CN110347858B (en) Picture generation method and related device
CN109325518B (en) Image classification method and device, electronic equipment and computer-readable storage medium
CN109495616B (en) Photographing method and terminal equipment
CN110633438B (en) News event processing method, terminal, server and storage medium
CN108021669B (en) Image classification method and device, electronic equipment and computer-readable storage medium
CN111737520B (en) Video classification method, video classification device, electronic equipment and storage medium
CN114547428A (en) Recommendation model processing method and device, electronic equipment and storage medium
CN112181564A (en) Wallpaper generation method, mobile terminal and storage medium
CN113822427A (en) Model training method, image matching device and storage medium
CN110083742B (en) Video query method and device
CN109726726B (en) Event detection method and device in video
WO2021073434A1 (en) Object behavior recognition method and apparatus, and terminal device

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40025587; Country of ref document: HK)
SE01 Entry into force of request for substantive examination
GR01 Patent grant