CN111368934A - Image recognition model training method, image recognition method and related device

Info

Publication number
CN111368934A
CN111368934A
Authority
CN
China
Prior art keywords
target
sample
image
training
label
Prior art date
Legal status
Granted
Application number
CN202010187873.3A
Other languages
Chinese (zh)
Other versions
CN111368934B (en)
Inventor
卓炜
范琦
戴宇榮
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010187873.3A
Publication of CN111368934A
Application granted
Publication of CN111368934B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image recognition model training method, an image recognition method and a related device. A target sample is obtained based on a target recognition image; a training triplet is then determined according to the target sample; and the training triplet is input into a preset network model for training to obtain a target network model. Training of the network model based on triplets is thereby realized. Because the triplet comprises a positive sample indicating similarity between labels and a negative sample indicating difference between labels, the trained network model can represent image features more comprehensively; moreover, the triplet construction process requires no manual intervention and can be applied to the recognition of new samples, which greatly saves training time and improves the accuracy and efficiency of network model training.

Description

Image recognition model training method, image recognition method and related device
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an image recognition model training method, an image recognition method, and a related apparatus.
Background
With the continuous improvement of users' requirements on image processing, target detection technology is applied increasingly widely. A user needs to train a model with a large number of high-quality target detection training samples and then use the model in a target detection task. However, in a practical application scenario, labeling a large number of high-quality target detection training samples requires substantial manpower and material resources, and such samples often cannot be obtained quickly, so a detection model cannot be deployed quickly for the detection of new samples; a small sample target detection method can solve this problem well.
Generally, a small sample target detection method uses a large number of training samples for model training, so that the trained model has recognition capability for images similar to the training samples.
However, as the number of small sample categories in images increases, a fixed set of training samples cannot fully capture the features of the samples, and collecting training samples consumes substantial manpower and material resources, which affects the accuracy and efficiency of model training.
Disclosure of Invention
In view of this, the present application provides a method for training an image recognition model, which can effectively avoid the inefficiency and incompleteness caused by manually labeling training samples, and improve the efficiency and accuracy of the model training process.
A first aspect of the present application provides a method for training an image recognition model, which can be applied to a system or a program containing a model training function in a terminal device, and specifically includes: acquiring a target sample based on the target identification image;
determining a training triplet according to the target sample, wherein the training triplet comprises at least one positive sample pair and at least one negative sample pair, the positive sample pair is composed of the target sample and the positive sample, the negative sample pair is composed of the target sample and the negative sample, the positive sample is obtained based on a similar label corresponding to the target sample, and the negative sample is obtained based on a difference label corresponding to the target sample;
and performing supervised learning on a preset network model based on the positive sample pair and the negative sample pair to obtain a target network model, wherein the target network model is used for identifying the target identification image.
Optionally, in some possible implementation manners of the present application, the performing supervised learning on a preset network model based on the positive sample pair and the negative sample pair to obtain a target network model includes:
determining a matching label of the positive sample pair, the matching label being determined based on a similarity of the positive sample and the target sample;
determining a foreground region and a background region of the target sample based on the matching labels to obtain a classified positive sample pair;
and inputting the classified positive sample pairs and the classified negative sample pairs into the preset network model for supervised learning to obtain the target network model.
Optionally, in some possible implementations of the present application, the inputting the classified positive sample pair and the classified negative sample pair into the preset network model for supervised learning to obtain the target network model includes:
acquiring a first loss value according to the feature similarity of the foreground region and the positive sample;
obtaining a second loss value according to the feature similarity of the background area and the positive sample, wherein the second loss value and the first loss value indicate opposite label types;
obtaining a third loss value according to the feature similarity of the negative sample pair, wherein the third loss value and the first loss value indicate opposite label types;
and performing back propagation calculation on the preset network model based on the first loss value, the second loss value and the third loss value to obtain the target network model.
Optionally, in some possible implementations of the present application, the determining a training triplet according to the target sample includes:
extracting image features of the target sample based on an attention mechanism;
determining corresponding positive samples and negative samples according to the target samples;
extracting image features of the positive sample by using a detection frame, and generating a positive sample pair by using the image features of the positive sample and the image features of the target sample;
extracting the image characteristics of the negative sample by using a detection frame to generate a negative sample pair with the image characteristics of the target sample;
determining a training triplet based on the pair of positive samples and the pair of negative samples.
Optionally, in some possible implementations of the present application, the determining the corresponding positive sample and the negative sample according to the target sample includes:
determining a target label in the target sample;
obtaining a sample with the same label based on the target label to obtain the positive sample;
and obtaining samples with different labels based on the target label to obtain the negative sample.
Optionally, in some possible implementations of the present application, the determining the target label in the target sample includes:
determining a corresponding candidate tag in response to at least one tag selection instruction;
and traversing in the label database based on the candidate label to obtain the target label.
Optionally, in some possible implementation manners of the present application, the traversing in the tag database based on the candidate tag to obtain the target tag includes:
traversing in the tag database based on the candidate tag to obtain at least one retrieval tag;
acquiring the label similarity of the candidate label and the retrieval label;
determining the target label based on the label similarity.
Optionally, in some possible implementations of the present application, the determining a target label in the target sample includes:
determining a template picture in response to an image in at least one target detection frame;
and determining a target label in the target sample according to the template picture.
Optionally, in some possible implementations of the present application, the acquiring a target sample based on a target recognition image includes:
acquiring an image tag in the target identification image;
and determining the target sample meeting a preset condition according to the image label, wherein the preset condition is determined based on the matching degree of the image label and the target sample.
Optionally, in some possible implementation manners of the present application, the preset network model is a small sample target detection model, and the target network model is a trained small sample target detection model.
The second aspect of the present application provides an apparatus for training an image recognition model, comprising: an acquisition unit configured to acquire a target sample based on a target recognition image;
a determining unit, configured to determine a training triplet according to the target sample, where the training triplet includes at least one positive sample pair and at least one negative sample pair, the positive sample pair is composed of the target sample and a positive sample, the negative sample pair is composed of the target sample and a negative sample, the positive sample is obtained based on a similar label corresponding to the target sample, and the negative sample is obtained based on a difference label corresponding to the target sample;
and the training unit is used for performing supervised learning on a preset network model based on the positive sample pair and the negative sample pair to obtain a target network model, and the target network model is used for identifying the target identification image.
Optionally, in some possible implementations of the present application, the training unit is specifically configured to determine a matching label of the positive sample pair, where the matching label is determined based on a similarity between the positive sample and the target sample;
the training unit is specifically configured to determine a foreground region and a background region of the target sample based on the matching labels to obtain a classified positive sample pair;
the training unit is specifically configured to input the classified positive sample pair and the classified negative sample pair into the preset network model for supervised learning, so as to obtain the target network model.
Optionally, in some possible implementation manners of the present application, the training unit is specifically configured to obtain a first loss value according to a feature similarity between the foreground region and the positive sample;
the training unit is specifically configured to obtain a second loss value according to the feature similarity between the background region and the positive sample, where the second loss value and the first loss value indicate a label type that is opposite to each other;
the training unit is specifically configured to obtain a third loss value according to the feature similarity of the negative sample pair, where the third loss value and the first loss value indicate a label type that is opposite to each other;
the training unit is specifically configured to perform back propagation calculation on the preset network model based on the first loss value, the second loss value, and the third loss value to obtain the target network model.
Optionally, in some possible implementations of the present application, the determining unit is specifically configured to extract an image feature of the target sample based on an attention mechanism;
the determining unit is specifically configured to determine corresponding positive samples and negative samples according to the target samples;
the determining unit is specifically configured to extract image features of the positive sample by using a detection frame, and generate a positive sample pair with the image features of the target sample;
the determining unit is specifically configured to extract image features of the negative sample by using a detection frame, so as to generate a negative sample pair with the image features of the target sample;
the determining unit is specifically configured to determine a training triplet based on the pair of positive samples and the pair of negative samples.
Optionally, in some possible implementations of the present application, the determining unit is specifically configured to determine a target label in the target sample;
the determining unit is specifically configured to obtain a sample with the same label based on the target label to obtain the positive sample;
the determining unit is specifically configured to obtain samples with different labels based on the target label to obtain the negative sample.
Optionally, in some possible implementations of the present application, the determining unit is specifically configured to determine, in response to at least one tag selection instruction, a corresponding candidate tag;
the determining unit is specifically configured to traverse in the tag database based on the candidate tag to obtain the target tag.
Optionally, in some possible implementation manners of the present application, the determining unit is specifically configured to traverse in the tag database based on the candidate tag to obtain at least one retrieval tag;
the determining unit is specifically configured to obtain tag similarity between the candidate tag and the search tag;
the determining unit is specifically configured to determine the target tag based on the tag similarity.
Optionally, in some possible implementations of the present application, the determining unit is specifically configured to determine a template picture in response to an image in at least one target detection frame;
the determining unit is specifically configured to determine a target label in the target sample according to the template picture.
Optionally, in some possible implementation manners of the present application, the obtaining unit is specifically configured to obtain an image tag in the target identification image;
the obtaining unit is specifically configured to determine, according to the image tag, the target sample that meets a preset condition, where the preset condition is determined based on a matching degree between the image tag and the target sample.
A third aspect of the present application provides an image recognition method, which specifically includes: at least one template picture is obtained in response to the identification instruction, wherein the template picture is used for indicating an identification target in a target identification image;
inputting the template picture and the target recognition image into a target network model to obtain the recognition result, wherein the recognition result is the set of the recognition targets, and the target network model is obtained by training based on the model training method of any one of the first aspect.
A fourth aspect of the present application provides an image recognition apparatus, which specifically includes: an acquisition unit configured to acquire at least one template picture in response to an identification instruction, the template picture indicating an identification target in a target identification image;
and the identification unit is used for inputting the template picture and the target identification image into a target network model to obtain the identification result, wherein the identification result is the set of the identification targets, and the target network model is obtained by training based on the model training method of any one of the first aspect.
A fifth aspect of the present application provides a computer device, comprising: a memory, a processor, and a bus system; the memory is used for storing program code; the processor is configured to perform, according to instructions in the program code, the model training method of the first aspect or any implementation of the first aspect, or the image recognition method of the third aspect.
A sixth aspect of the present application provides a computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the model training method of the first aspect or any implementation of the first aspect, or the image recognition method of the third aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
a target sample is obtained based on a target recognition image; a corresponding positive sample and a corresponding negative sample are then determined according to the target sample, wherein the positive sample and the target sample contain the same label, and the negative sample and the target sample contain different labels; and the target sample, the positive sample and the negative sample are input into a preset network model for training to obtain a target network model, wherein the target network model is used for recognizing the target recognition image. Training of the network model based on triplets is thereby realized. Because the triplet comprises a positive sample indicating similarity between labels and a negative sample indicating difference between labels, the trained network model can cover image features under different labels more comprehensively; moreover, the triplet construction process requires no manual intervention, which greatly saves training time and improves the accuracy and efficiency of network model training.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a network architecture diagram of the operation of a model training system;
fig. 2 is a flowchart of an image recognition model training process provided in an embodiment of the present application;
FIG. 3 is a flowchart of a method for training an image recognition model according to an embodiment of the present disclosure;
fig. 4 is a scene schematic diagram of an image recognition model training provided in an embodiment of the present application;
FIG. 5 is a flowchart of a model training method according to an embodiment of the present disclosure;
fig. 6 is a flowchart of a method for image recognition according to an embodiment of the present application;
fig. 7 is a scene schematic diagram of an image recognition method according to an embodiment of the present application;
fig. 8 is a flowchart illustrating a terminal architecture according to an embodiment of the present disclosure;
fig. 9 is a scene schematic diagram of another image recognition method according to an embodiment of the present application;
fig. 10 is a scene schematic diagram of another image recognition method according to an embodiment of the present application;
fig. 11 is a scene schematic diagram of another image recognition method according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an image recognition model training apparatus according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The embodiments of the present application provide a method and a related device for training an image recognition model, which can be applied to a system or a program containing a model training function in a terminal device. A target sample is obtained based on a target recognition image; a corresponding positive sample and a corresponding negative sample are then determined according to the target sample, wherein the positive sample and the target sample contain the same label, and the negative sample and the target sample contain different labels; and the target sample, the positive sample and the negative sample are input into a preset network model for training to obtain a target network model, wherein the target network model is used for recognizing the target recognition image. Training of the network model based on triplets is thereby realized. Because the triplet comprises a positive sample indicating similarity between labels and a negative sample indicating difference between labels, the trained network model can cover image features under different labels more comprehensively; moreover, the triplet construction process requires no manual intervention, which greatly saves training time and improves the accuracy and efficiency of network model training.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some nouns that may appear in the embodiments of the present application are explained.
Object detection: each object in a picture is marked with a rectangular box, and the class of the object is given.
Small sample object detection (Few-Shot Object Detection, FSOD) means that only a small number of samples are used to train an object detection model; during detection, objects of the same category in a picture can be detected according to a given small number of template objects.
A twin network (Siamese network) processes two different inputs simultaneously using a weight-shared network; such a network is a twin network.
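The weight sharing can be made concrete with a minimal PyTorch sketch; the fragment below is illustrative only, and all names in it are assumptions rather than anything defined by the patent:

```python
import torch
import torch.nn as nn

class TwinNetwork(nn.Module):
    """One backbone instance serves both branches, so the two inputs
    are processed with shared weights (a twin/Siamese network)."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        # Gradients from both branches accumulate into one parameter set.
        return self.backbone(x1), self.backbone(x2)

net = TwinNetwork(nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU()))
a, b = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```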
A feature map is the image information obtained by convolving an image with a filter; a feature map may in turn be convolved with a filter to generate a new feature map.
An attention feature map is a feature map in which the region containing the target object is emphasized by an attention mechanism, so that region produces a stronger response.
The target recognition image (query image) is the picture on which object detection is performed; the network model detects objects in the query image.
The template picture (support image) is a template picture used in small sample object detection; according to the template pictures, the model detects all objects in the query image that have the same category as the template picture.
Depth-wise cross correlation is a channel-by-channel, one-to-one convolution of the feature map of the support image, used as a filter, over the feature map of the query image. The feature maps of the support image and the query image have the same number of input channels, and the number of output channels equals the number of input channels.
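The channel-by-channel convolution can be expressed with grouped convolution. The sketch below is an assumed implementation (the function and variable names are not from the patent); note that the channel count is preserved, as stated above:

```python
import torch
import torch.nn.functional as F

def depthwise_cross_correlation(query_feat: torch.Tensor,
                                support_feat: torch.Tensor) -> torch.Tensor:
    # query_feat:   [B, C, H, W] feature map of the query image
    # support_feat: [B, C, h, w] feature map of the support image
    b, c, h, w = support_feat.shape
    # Fold batch into channels so each support channel convolves only
    # its matching query channel (groups = B * C).
    q = query_feat.reshape(1, b * c, query_feat.size(2), query_feat.size(3))
    k = support_feat.reshape(b * c, 1, h, w)
    out = F.conv2d(q, k, groups=b * c)
    return out.reshape(b, c, out.size(2), out.size(3))

q = torch.randn(2, 256, 32, 32)
s = torch.randn(2, 256, 8, 8)
print(depthwise_cross_correlation(q, s).shape)  # torch.Size([2, 256, 25, 25])
```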
Feature pooling (RoI pooling) refers to pooling the corresponding area of a feature map to a fixed-size feature map according to the position of an input rectangular box.
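For reference, torchvision ships an RoI pooling operator; this usage sketch assumes a stride-16 feature map, which is not a value fixed by the patent:

```python
import torch
from torchvision.ops import roi_pool

# Feature map [B, C, H, W]; each box row is [batch_index, x1, y1, x2, y2]
# in input-image coordinates, mapped onto the feature map by spatial_scale.
features = torch.randn(1, 1024, 38, 50)
boxes = torch.tensor([[0.0, 16.0, 16.0, 320.0, 240.0]])
pooled = roi_pool(features, boxes, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([1, 1024, 7, 7])
```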
Contrast training refers to a method of training a multi-branch twin network with (target sample, positive sample, negative sample) triplet training samples.
It should be understood that the model training method provided by the present application may be applied to a system or a program containing a model training function in a terminal device, such as an image recognition program. Specifically, the model training system may operate in the network architecture shown in fig. 1, which is a network architecture diagram of the model training system. As can be seen from the figure, the model training system may provide model training with a plurality of information sources: the terminal establishes a connection with the server through a network and sends a recognition request and a target recognition image to the server; the server constructs triplet training samples according to the target recognition image, trains a preset network model, recognizes the target recognition image, and returns the result to the terminal. It can be understood that although fig. 1 shows various terminal devices, in an actual scenario more or fewer types of terminal devices may participate in the model training process, with the specific number and types depending on the actual scenario and not limited here. In addition, fig. 1 shows one server, but multiple servers may also participate in an actual scenario, especially in a scenario of multi-content application interaction, with the specific number of servers depending on the actual scenario.
It should be noted that the model training method provided in this embodiment may also be performed offline, that is, without the participation of a server, at this time, the terminal is connected with other terminals locally, and then the process of model training between terminals is performed.
It can be understood that the model training system described above may run on a personal mobile terminal, for example as an image recognition application; it may also run on a server, or on third-party equipment that provides model training, to obtain a model training result for an information source. The specific model training system may run in the above device in the form of a program, may run as a system component in the above device, or may serve as one of a set of cloud service programs; the specific operation mode depends on the actual scene and is not limited here.
With the continuous improvement of users' requirements on image processing, target detection technology is applied increasingly widely. A user needs to train a model with a large number of high-quality target detection training samples and then use the model in a target detection task. However, in a practical application scenario, labeling a large number of high-quality target detection training samples requires substantial manpower and material resources, and such samples often cannot be obtained quickly, so a detection model cannot be deployed quickly for the detection of new samples; a small sample target detection method can solve this problem well.
Generally, a small sample target detection method uses a large number of training samples for model training, so that the trained model has recognition capability for images similar to the training samples.
However, as the number of small sample categories in images increases, a fixed set of training samples cannot fully capture the features of the samples, and collecting training samples consumes substantial manpower and material resources, which affects the accuracy and efficiency of model training.
In order to solve the above problem, the present application provides a method for training an image recognition model, which is applied to the model training process framework shown in fig. 2. As shown in fig. 2, which is a process framework diagram of image recognition model training provided in an embodiment of the present application, a target sample is determined according to a target recognition image; a positive sample and a negative sample are then obtained based on a label in the target sample to form triplet training samples; a preset model is subjected to contrast training to recognize the target recognition image, and the result is returned to the client.
It can be understood that the method provided by the present application may be a program written as processing logic in a hardware system, or an image recognition model training apparatus in which that processing logic is implemented in an integrated or external manner. As one implementation, the model training apparatus obtains a target sample based on a target recognition image; then determines a corresponding positive sample and a corresponding negative sample according to the target sample, wherein the positive sample and the target sample contain the same label, and the negative sample and the target sample contain different labels; and inputs the target sample, the positive sample and the negative sample into a preset network model for training to obtain a target network model, wherein the target network model is used for recognizing the target recognition image. Training of the network model based on triplets is thereby realized. Because the triplet comprises a positive sample indicating similarity between labels and a negative sample indicating difference between labels, the trained network model can cover image features under different labels more comprehensively; moreover, the triplet construction process requires no manual intervention, which greatly saves training time and improves the accuracy and efficiency of network model training.
With reference to the above flow architecture, the following describes a method for training a model in the present application, please refer to fig. 3, where fig. 3 is a flow chart of a method for training an image recognition model according to an embodiment of the present application, and the embodiment of the present application at least includes the following steps:
301. a target sample is acquired based on the target identification image.
In this embodiment, the process of obtaining the target sample based on the target identification image may be performed based on a preset template picture; that is, the template picture in the target identification image is extracted, and a target sample containing the template picture is then obtained based on the template picture. For example, if the template picture of the target identification image is a hat, a corresponding target sample containing hat image features is acquired according to the image features of the hat.
Optionally, the process of obtaining the target sample based on the template picture may also be performed based on an image tag in the target identification image, that is, a tag corresponding to the template picture. For example, if the template picture of the target identification image is a bicycle, a corresponding picture containing a bicycle label or bicycle image features is acquired as the target sample according to the bicycle label. Obtaining the target sample through the label enables a fast search process and improves model training efficiency.
It can be understood that a plurality of target samples may be recalled when obtaining the target sample based on the label. In this case, the picture whose matching degree with the image label is higher may be determined as the target sample, which improves the relevance between the target sample and the target identification image and thus the accuracy of the subsequent training process.
In this embodiment, the source of the target sample may be a preset training database, in which case training samples do not need to be set manually, saving time; the target sample may also be obtained over a network. Because the target sample indicates a small but highly representative part of the target identification image, this is a small sample detection process, which saves system resources compared with feature recognition over the whole image.
302. And determining a training triplet according to the target sample.
In this embodiment, the training triplet includes at least one positive sample pair and at least one negative sample pair; the positive sample pair is composed of the target sample and a positive sample, the negative sample pair is composed of the target sample and a negative sample, the positive sample is obtained based on a similar label corresponding to the target sample, and the negative sample is obtained based on a difference label corresponding to the target sample. That is, the positive sample and the target sample belong to the same category, while the negative sample and the target sample belong to different categories. As shown in fig. 4, which is a scene schematic diagram of image recognition model training provided in an embodiment of the present application, the figure includes the target sample A1 in the target recognition image; the corresponding positive sample A2 is a bicycle with similar image features or a similar label, and the negative sample A3 is a car with a different label.
It can be understood that the labels of the positive sample and the negative sample may be preset; that is, there is a database containing a large amount of small sample material in which every small sample carries a corresponding label, and samples whose labels differ from the target's serve as negative samples. Alternatively, a manually preset correspondence may be used as the selection basis for the positive and negative samples.
Specifically, the positive and negative samples may be determined through the correlation between labels, as sketched below. First, the target label in the target sample is determined; then a sample with the same label is obtained based on the target label to yield the positive sample; and a sample with a different label is obtained based on the target label to yield the negative sample. Positive and negative samples are thus obtained quickly, without manually labeling training samples, which greatly saves training time.
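A minimal sketch of this label-based selection, assuming the sample database is a list of dicts with "image" and "label" keys (an illustrative layout, not one defined by the patent):

```python
import random

def build_training_triplet(target_sample: dict, database: list) -> tuple:
    # Samples with the same label as the target yield the positive sample;
    # samples with a different label yield the negative sample.
    same = [s for s in database if s["label"] == target_sample["label"]
            and s is not target_sample]
    diff = [s for s in database if s["label"] != target_sample["label"]]
    return (target_sample, random.choice(same), random.choice(diff))

db = [{"image": "a.jpg", "label": "bicycle"},
      {"image": "b.jpg", "label": "bicycle"},
      {"image": "c.jpg", "label": "car"}]
triplet = build_training_triplet(db[0], db)
```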
Optionally, due to the diversity of labels, the target sample may have a plurality of target labels. In this case, the determination may be performed based on which labels exist in the label database; that is, corresponding candidate labels are determined in response to at least one label selection instruction, and the label database is traversed based on the candidate labels to obtain the target label. For example, if the label indicated by the label selection instruction is a bicycle, the candidate labels may be several synonymous terms for bicycle; if only the label "bicycle" is found when traversing the database, "bicycle" is determined to be the target label, which ensures the stability of the label association process.
In a possible scenario, a completely corresponding tag may not be retrieved from the tag database, and at this time, the target tag may be determined according to the tag similarity between the candidate tag and the retrieved tag, that is, the tag with higher similarity is selected as the target tag.
Optionally, the target label may also be determined by responding to a template picture framed by the user and taking the label corresponding to the template picture as the target label. For example, if the template picture framed by the user contains a puppy, the target label is "puppy". This enriches the user's operation modes and improves user experience.
303. And performing supervised learning on the preset network model based on the positive sample pair and the negative sample pair to obtain a target network model.
In this embodiment, the target network model is used to identify the target identification image.
Specifically, the contrast training process over the target sample, the positive sample and the negative sample may be as follows: a positive sample pair and a negative sample pair are generated from the target sample, the positive sample and the negative sample, where the positive sample pair includes the target sample and the positive sample, and the negative sample pair includes the target sample and the negative sample; the feature similarity of the positive sample pair and the feature similarity of the negative sample pair are then obtained respectively; and supervised learning is performed on the preset network model according to these feature similarities to obtain the target network model.
It can be understood that the supervised learning process involves training parameters of different dimensions: the true labels of the target sample and the positive sample match, while the true labels of the target sample and the negative sample do not. A loss value therefore needs to be obtained according to the feature similarity of the positive sample pair. Since the foreground region and the background region of the target sample are determined based on the matching labels (for example, the foreground region of the target sample is a horse and the background region is a tree), obtaining the loss value according to the feature similarity of the positive sample pair comprises obtaining a first loss value according to the feature similarity of the foreground region and the positive sample, and obtaining a second loss value according to the feature similarity of the background region and the positive sample. A third loss value is then obtained according to the feature similarity of the negative sample pair, where the third loss value and the first loss value indicate opposite label types. Back propagation calculation is performed on the preset network model based on the first loss value, the second loss value and the third loss value to obtain the target network model. Because this training process considers both the similarity and the difference between sample labels, it can improve the accuracy of model training, and the trained model is not only suitable for labeled samples but can also recognize newly added samples. A sketch of the three loss terms follows.
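The three loss values can be sketched with binary cross-entropy, treating the first loss as a "match" target and the second and third as "no match" targets; all names and shapes below are assumptions, not an API defined by the patent:

```python
import torch
import torch.nn.functional as F

def triplet_matching_loss(p_fg: torch.Tensor,
                          p_bg: torch.Tensor,
                          p_neg: torch.Tensor) -> torch.Tensor:
    # p_fg:  predicted match probability, target foreground vs. positive sample
    # p_bg:  predicted match probability, target background vs. positive sample
    # p_neg: predicted match probability, target sample vs. negative sample
    first = F.binary_cross_entropy(p_fg, torch.ones_like(p_fg))     # should match
    second = F.binary_cross_entropy(p_bg, torch.zeros_like(p_bg))   # should not match
    third = F.binary_cross_entropy(p_neg, torch.zeros_like(p_neg))  # should not match
    return first + second + third

# Back propagation on the combined loss updates the preset network model.
p_fg, p_bg, p_neg = (torch.rand(8, requires_grad=True) for _ in range(3))
loss = triplet_matching_loss(p_fg, p_bg, p_neg)
loss.backward()
```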
In a possible scenario, the process of model training may refer to fig. 5, where fig. 5 is a flowchart architecture diagram of a model training method provided in an embodiment of the present application; the figure shows the following steps:
(1) Construct training triplets (target sample, positive sample, negative sample).
Here, (target sample, positive sample) constitutes a positive sample pair and (target sample, negative sample) constitutes a negative sample pair. The ground-truth values of the sample pairs are then set: since the target recognition image is a scene image, the matching labels between the positive sample and the rectangular boxes (anchors) in the target recognition image must be determined.
Optionally, the anchors of the target sample may be generated every 16 pixels with different sizes and different aspect ratios; for example, 4 sizes and 3 aspect ratios are selected according to the image size, with the specific parameters determined by the actual scene.
Specifically, positive sample pairs may be mismatched, so they need to be screened. In a positive sample pair (target sample, positive sample), screening is performed based on the overlap between the anchors in the target sample and the labels of the recognition targets. For example, the preset condition may be that a pair with an Intersection-over-Union (IoU) greater than 0.5 is matched successfully and given matching label A; correspondingly, an unsuccessful match is given label B. Then, within the positive sample pair, some prediction candidate boxes are given label A, such as candidate boxes that coincide with the target, while others are given label B, such as candidate boxes whose region image is background.
In addition, in a negative sample pair (target sample, negative sample), the labels of all prediction boxes of the target sample are B, i.e., not matched. A sketch of this label assignment follows.
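A sketch of the IoU-based label assignment using torchvision's box_iou; encoding label A as 1 and label B as 0 is an assumption made for illustration:

```python
import torch
from torchvision.ops import box_iou

def assign_matching_labels(anchors: torch.Tensor,
                           gt_boxes: torch.Tensor,
                           iou_thresh: float = 0.5) -> torch.Tensor:
    # anchors: [N, 4], gt_boxes: [M, 4], both as (x1, y1, x2, y2).
    # Label A (1): best IoU with any recognition-target box exceeds the
    # threshold; label B (0): otherwise (background, or any box when the
    # pair is a negative pair, where gt_boxes is empty).
    if gt_boxes.numel() == 0:
        return torch.zeros(anchors.size(0), dtype=torch.long)
    best_iou, _ = box_iou(anchors, gt_boxes).max(dim=1)
    return (best_iou > iou_thresh).long()

anchors = torch.tensor([[0.0, 0.0, 100.0, 100.0], [200.0, 200.0, 260.0, 260.0]])
gt = torch.tensor([[10.0, 10.0, 110.0, 110.0]])
print(assign_matching_labels(anchors, gt))  # tensor([1, 0])
```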
(2) Extract sample features based on the twin network.
After the sample pairs are constructed, sample features are extracted separately; that is, the triplet (target sample, positive sample, negative sample) is input into a backbone network (ResNet-50) to extract features. The positive sample is cropped by its detection box and globally pooled to obtain a 1 × 1024 feature f_p; the negative sample is cropped by its box and globally pooled to obtain a 1 × 1024 feature f_n; and the target sample yields an M × N × 1024 feature f_q, where M and N are image parameters of the target sample.
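A sketch of the shared feature extraction under the stated shapes; truncating ResNet-50 after its third stage (1024 channels, stride 16) is an assumption consistent with the 1024-dimensional features above, not a detail fixed by the patent:

```python
import torch
import torch.nn as nn
import torchvision

# Shared trunk: ResNet-50 without layer4, avgpool and fc (output: 1024 ch).
resnet = torchvision.models.resnet50(weights=None)
backbone = nn.Sequential(*list(resnet.children())[:-3])

def pooled_support_feature(crop: torch.Tensor) -> torch.Tensor:
    # Global average pooling of a box-cropped support image gives the
    # 1 x 1024 feature f_p (positive) or f_n (negative) once flattened.
    feat = backbone(crop)            # [1, 1024, h, w]
    return feat.mean(dim=(2, 3))     # [1, 1024]

f_q = backbone(torch.randn(1, 3, 608, 800))  # [1, 1024, M, N] for the target sample
f_p = pooled_support_feature(torch.randn(1, 3, 320, 320))
```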
(3) Extract feature vectors of the positive and negative sample pairs based on an attention mechanism. That is, the positive sample pair and the negative sample pair are input into a region proposal network (Attention RPN) to extract attention features; on the attention feature of the positive sample pair, the probability that a candidate box in the target sample matches the positive sample is predicted, and on the attention feature of the negative sample pair, the probability that a candidate box in the target sample matches the negative sample is predicted.
Specifically, attention feature extraction based on the Attention RPN is a depth-wise convolution process, and the loss function of this process is as follows:
L = L_p-qf(f_p, f_qf,i) / N_pqf + L_p-qb(f_p, f_qb,i) / N_pqb + L_n-q(f_n, f_q) / N_nq

where f_p is the positive sample feature; f_qf,i are the foreground region features of the target sample; f_qb,i are the background region features of the target sample; f_n is the negative sample feature; f_q is the target sample feature; L_p-qf, L_p-qb and L_n-q are all standard region proposal network losses; and N_pqf, N_pqb and N_nq are the numbers of samples, where the ratio N_pqf : N_pqb : N_nq may be set to 1:1:1, with the specific ratio depending on the actual scene.
It can be understood that the above loss objective function comprises the matching loss between the foreground region of the target sample and the positive sample, i.e., the first loss value; the matching loss between the background region of the target sample and the positive sample, i.e., the second loss value; and the matching loss between the target sample and the negative sample, i.e., the third loss value.
After the Attention RPN filters the candidate boxes, the remaining candidate boxes pass through a detection network to obtain the final matching prediction value, i.e., the matching result.
It can be understood that each matching loss function consists of the loss functions of a standard detection task, namely a detection box regression loss and a classification loss, where the classification loss is a binary cross-entropy loss on whether the pair matches; that is, it takes the value of the matching label, for example, label A takes 1 and label B takes 0.
In a possible scenario, the training process of the FSOD model may be improved based on the model training method provided by the present application; that is, the sample input of FSOD adopts the triplet sample construction method provided by this embodiment. In this way, the method inherits all the advantages of FSOD while greatly improving its performance.
It should be noted that FSOD is taken as the preset network model in this embodiment only as an example; in an actual scenario, the model training method described in this embodiment may also use a different underlying network structure.
With reference to the above embodiments, a target sample is obtained based on a target recognition image; a corresponding positive sample and a corresponding negative sample are then determined according to the target sample, wherein the positive sample and the target sample contain the same label, and the negative sample and the target sample contain different labels; and the target sample, the positive sample and the negative sample are input into a preset network model for training to obtain a target network model, wherein the target network model is used for recognizing the target recognition image. Training of the network model based on triplets is thereby realized. Because the triplet comprises a positive sample indicating similarity between labels and a negative sample indicating difference between labels, the trained network model can cover image features under different labels more comprehensively; moreover, the triplet construction process requires no manual intervention and can be applied to the recognition of new samples, which greatly saves training time and improves the accuracy and efficiency of network model training.
The foregoing embodiment describes a process of model training, and in the following, the description is given by taking image recognition as a specific scenario, please refer to fig. 6, where fig. 6 is a flowchart of a method for image recognition provided in an embodiment of the present application, and the embodiment of the present application at least includes the following steps:
601. and acquiring at least one template picture in response to the identification instruction.
In this embodiment, the template picture is used to indicate an identification target in the target identification image. The template picture may be a picture obtained through the identification instruction, or may be obtained through a tag determined by the identification instruction. In addition, the identification instruction may be initiated by the user; specifically, it may be a click on an identification button, a frame selection of a certain portion of the image, or a tag input by the user, with the specific form determined by the actual scene.
It should be noted that the template picture obtained in response to the identification instruction may be one or multiple, specifically, when there are multiple template pictures, the template pictures are respectively identified, and corresponding identification results are output.
Optionally, if the action object of the identification instruction is a video, the target identification image is the video frame at the moment the identification instruction is issued, and the image features of that video frame are associated with similar images in subsequent video frames, so as to implement the recognition process for the video.
602. And inputting the template picture and the target identification image into the target network model to obtain an identification result.
In this embodiment, the recognition result is the set of recognition targets. As shown in fig. 7, which is a scene schematic diagram of the image recognition method provided in an embodiment of the present application, two template pictures are input, a hat and a bicycle respectively; these template pictures and the target recognition image are input into the target network model for recognition, so that the elements in the target recognition image similar to the template pictures can be obtained and framed for highlighting. A sketch of this calling pattern follows.
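A sketch of the inference calling pattern; the detector class here is a stand-in stub with assumed names and return values, since the patent does not define a code-level API:

```python
import torch
import torch.nn as nn

class FewShotDetector(nn.Module):
    # Stand-in for the trained target network model; returns dummy
    # (boxes, scores) purely to illustrate the calling pattern.
    def forward(self, query: torch.Tensor, support: torch.Tensor):
        return torch.zeros(0, 4), torch.zeros(0)

model = FewShotDetector().eval()
query_image = torch.randn(1, 3, 600, 800)
templates = [torch.randn(1, 3, 320, 320) for _ in range(2)]  # e.g. hat, bicycle

with torch.no_grad():
    # One pass per template picture; the union of the per-template
    # detections is the recognition result set to be framed and shown.
    results = [model(query_image, t) for t in templates]
```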
It should be understood that the form of the frame selection highlighting in the above embodiments is only an example, and other specific ways of highlighting the recognition result should also be taken as the solution provided by the present application, for example: highlighting, changing hue, etc.
In this embodiment, the training process of the target network model is similar to steps 301 to 303 of the embodiment described in fig. 3; reference may be made thereto, and details are not repeated here.
It is understood that the result of image recognition may be a framed image feature region, or output label information of the framed region, or continuous tracking of the framed image feature, that is, a feature recognition process in a video.
In the training process of the preset network model for the target recognition image, the small sample detection network is effectively trained by constructing efficient (target sample, positive sample, negative sample) training triplets, which greatly improves the detection performance of the network on small samples and shortens the training time. The method realizes a convenient, efficient and fast model training process and expands the application scenarios of image recognition.
The image recognition method provided by the present application is described below with reference to a hardware architecture of a terminal, and as shown in fig. 8, is a flowchart executed by a terminal architecture provided by an embodiment of the present application, and the method includes:
801. front end a receives input data.
In this embodiment, the input data includes a target identification picture and a template picture, where the template picture may be framed by a user or generated based on a tag.
802. And carrying out model training at the back end.
In this embodiment, the process of performing model training at the back end refers to the embodiment described in fig. 3, and details are not repeated here.
It can be understood that the back end may be a processing system of the terminal, or may be a server, or may be a service device in the cloud, for example, to perform a model training process through the Tencent cloud.
803. The front end B displays the recognition result.
In this embodiment, the front end B and the front end a may be display interfaces of the same terminal or display interfaces of different terminals, and the specific form depends on an actual scene.
Optionally, the recognition result may be presented in the interactive manner shown in fig. 9, which is a scene schematic diagram of another image recognition method provided in an embodiment of the present application. In the figure, the user first clicks the identification button to recognize the target identification image; at this time, the template pictures may be a person and a bicycle, and the image parts corresponding to the person and the bicycle are framed in the display interface. Further, the user may click the detail button to query the specific recognition process, i.e., the positive and negative samples involved in the model training process of the present application, for example: the positive samples are bicycles in different forms, and the negative samples are cars and trucks. Further, the user can view the matching process of the positive and negative samples by clicking the more button; at this point, the user can check whether the matching process is accurate, and if not, can click to report an error so that the background records the mistake and revises the model training parameters, ensuring the accuracy of model training.
Optionally, the above recognition process may proceed in the interactive manner shown in fig. 10, which is a scene schematic diagram of another image recognition method provided in an embodiment of the present application. That is, image recognition is performed through label input at front end A: the user may enter "bicycle" in label column B1, the background generates a target sample of the bicycle and the corresponding positive and negative samples, the above model training process is then performed, and the training result is obtained by clicking the recognition button.
Optionally, the above recognition process may also proceed in the interactive manner shown in fig. 11, which is a scene schematic diagram of another image recognition method provided in an embodiment of the present application. That is, the template picture, and with it the target sample, is determined by frame selection. In the figure, the user wants to mark all bicycle image features and may slide on the touch screen to obtain sample box C1; the background then generates a target sample of the bicycle corresponding to sample box C1 and the corresponding positive and negative samples, the above model training process is performed, and the training result is obtained by clicking the recognition button.
Through the above interaction modes, a rapid image recognition process can be performed. Because training is performed through triplets, the detection performance of the network model on small samples is greatly improved and the training time is shortened, so that the user can conveniently and quickly perform image recognition in various ways, improving the user experience.
In order to better implement the above-mentioned aspects of the embodiments of the present application, the following also provides related apparatuses for implementing the above-mentioned aspects. Referring to fig. 12, fig. 12 is a schematic structural diagram of an image recognition model training apparatus according to an embodiment of the present application, where the model training apparatus 1200 includes:
an obtaining unit 1201 for obtaining a target sample based on a target recognition image;
a determining unit 1202, configured to determine a training triplet according to the target sample, where the training triplet includes at least one positive sample pair and at least one negative sample pair, the positive sample pair is composed of the target sample and a positive sample, the negative sample pair is composed of the target sample and a negative sample, the positive sample is obtained based on a similar label corresponding to the target sample, and the negative sample is obtained based on a difference label corresponding to the target sample;
a training unit 1203, configured to perform supervised learning on a preset network model based on the positive sample pair and the negative sample pair to obtain a target network model, where the target network model is used to identify the target identification image.
Optionally, in some possible implementations of the present application, the training unit 1203 is specifically configured to determine a matching label of the positive sample pair, where the matching label is determined based on a similarity between the positive sample and the target sample;
the training unit 1203 is specifically configured to determine a foreground region and a background region of the target sample based on the matching labels, so as to obtain a classified positive sample pair;
the training unit 1203 is specifically configured to input the classified positive sample pair and the classified negative sample pair into the preset network model for supervised learning, so as to obtain the target network model.
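A minimal sketch of the foreground/background split driven by the matching label is given below, assuming the matching label takes the form of a binary spatial mask; this representation is an assumption for illustration:

import torch

def split_by_matching_label(feature_map, match_mask):
    # feature_map: (C, H, W) features of the target sample
    # match_mask: (H, W) float mask in {0, 1}, 1 where the positive sample matches
    foreground = feature_map * match_mask          # foreground region features
    background = feature_map * (1 - match_mask)    # background region features
    return foreground, background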
Optionally, in some possible implementation manners of the present application, the training unit 1203 is specifically configured to obtain a first loss value according to a feature similarity between the foreground region and the positive sample;
the training unit 1203 is specifically configured to obtain a second loss value according to the feature similarity between the background region and the positive sample, where the second loss value indicates a label type opposite to that indicated by the first loss value;
the training unit 1203 is specifically configured to obtain a third loss value according to the feature similarity of the negative sample pair, where the third loss value indicates a label type opposite to that indicated by the first loss value;
the training unit 1203 is specifically configured to perform back propagation calculation on the preset network model based on the first loss value, the second loss value, and the third loss value to obtain the target network model.
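The three loss values may, for example, be realized as binary cross-entropy over a similarity score, as in the hedged sketch below; the concrete loss form (a sigmoid of cosine similarity) is an assumption for illustration, since the application does not fix a particular formula:

import torch
import torch.nn.functional as F

def triplet_losses(fg_feat, bg_feat, pos_feat, neg_feat):
    # All inputs are (batch, dim) feature tensors from the preset network model.
    def match_prob(a, b):
        return torch.sigmoid(F.cosine_similarity(a, b, dim=-1))  # in (0, 1)

    p_fg = match_prob(fg_feat, pos_feat)    # foreground vs positive
    p_bg = match_prob(bg_feat, pos_feat)    # background vs positive
    p_neg = match_prob(fg_feat, neg_feat)   # negative pair

    loss1 = F.binary_cross_entropy(p_fg, torch.ones_like(p_fg))     # should match
    loss2 = F.binary_cross_entropy(p_bg, torch.zeros_like(p_bg))    # opposite label type
    loss3 = F.binary_cross_entropy(p_neg, torch.zeros_like(p_neg))  # opposite label type
    return loss1 + loss2 + loss3

Calling .backward() on the returned total would then perform the back-propagation calculation on the preset network model.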
Optionally, in some possible implementations of the present application, the determining unit 1202 is specifically configured to extract an image feature of the target sample based on an attention mechanism;
the determining unit 1202 is specifically configured to determine corresponding positive samples and negative samples according to the target sample;
the determining unit 1202 is specifically configured to extract an image feature of the positive sample by using a detection frame, so as to generate a positive sample pair with the image feature of the target sample;
the determining unit 1202 is specifically configured to extract an image feature of the negative sample by using a detection frame, so as to generate a negative sample pair with the image feature of the target sample;
the determining unit 1202 is specifically configured to determine a training triplet based on the pair of positive samples and the pair of negative samples.
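The pairing of detection-frame features with the target's features might look like the following sketch, where the small convolutional backbone stands in for the attention-based feature extractor and the crop logic is an assumption for illustration:

import torch
import torch.nn as nn

backbone = nn.Sequential(   # stand-in for the attention-based feature extractor
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())

def crop(image, box):
    x1, y1, x2, y2 = box                    # detection frame, pixel coordinates
    return image[:, :, y1:y2, x1:x2]

def make_pairs(target_img, pos_img, pos_box, neg_img, neg_box):
    t_feat = backbone(target_img)               # target sample features
    p_feat = backbone(crop(pos_img, pos_box))   # positive detection-frame features
    n_feat = backbone(crop(neg_img, neg_box))   # negative detection-frame features
    return (t_feat, p_feat), (t_feat, n_feat)   # positive pair and negative pair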
Optionally, in some possible implementations of the present application, the determining unit 1202 is specifically configured to determine a target label in the target sample;
the determining unit 1202 is specifically configured to obtain a sample with the same label based on the target label to obtain the positive sample;
the determining unit 1202 is specifically configured to obtain samples with different labels based on the target label to obtain the negative sample.
Optionally, in some possible implementations of the present application, the determining unit 1202 is specifically configured to determine a corresponding candidate tag in response to at least one tag selection instruction;
the determining unit 1202 is specifically configured to traverse in the tag database based on the candidate tag to obtain the target tag.
Optionally, in some possible implementations of the present application, the determining unit 1202 is specifically configured to traverse in the tag database based on the candidate tag to obtain at least one retrieval tag;
the determining unit 1202 is specifically configured to obtain a tag similarity between the candidate tag and the search tag;
the determining unit 1202 is specifically configured to determine the target tag based on the tag similarity.
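One way to realize the traversal and tag-similarity selection is sketched below; the string-similarity measure (difflib's SequenceMatcher) is an assumption for illustration:

from difflib import SequenceMatcher

def tag_similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()  # tag similarity in [0, 1]

def find_target_tag(candidate_tag, tag_database):
    # Traverse the database: every stored tag is a candidate retrieval tag;
    # the retrieval tag most similar to the candidate becomes the target tag.
    return max(tag_database, key=lambda tag: tag_similarity(candidate_tag, tag))

print(find_target_tag("bike", ["bicycle", "car", "truck"]))  # -> "bicycle"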
Optionally, in some possible implementations of the present application, the determining unit 1202 is specifically configured to determine a template picture in response to an image in at least one target detection frame;
the determining unit 1202 is specifically configured to determine a target label in the target sample according to the template picture.
Optionally, in some possible implementation manners of the present application, the obtaining unit 1201 is specifically configured to obtain an image tag in the target identification image;
the obtaining unit 1201 is specifically configured to determine, according to the image tag, the target sample meeting a preset condition, where the preset condition is determined based on a matching degree between the image tag and the target sample.
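A hedged sketch of such a preset condition follows, with the matching-degree function and the threshold value both assumed for illustration:

MATCH_THRESHOLD = 0.8  # assumed matching-degree cutoff for the preset condition

def select_target_samples(candidates, image_tag, matching_degree):
    # matching_degree(image_tag, sample) is assumed to return a value in [0, 1].
    return [s for s in candidates
            if matching_degree(image_tag, s) >= MATCH_THRESHOLD]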
In summary, a target sample is obtained based on the target recognition image; corresponding positive and negative samples are then determined according to the target sample, where the positive sample and the target sample contain the same label and the negative sample and the target sample contain different labels; and the target sample, the positive sample, and the negative sample are input into a preset network model for training to obtain a target network model, where the target network model is used to recognize the target recognition image. Training of the network model based on triplets is thus realized. Because the triplet contains a positive sample indicating similarity between labels and a negative sample indicating difference between labels, the trained network model can cover image features under different labels more comprehensively; moreover, the construction of the triplet requires no manual intervention, which greatly saves training time and improves the accuracy and efficiency of network model training.
An embodiment of the present application provides an image recognition apparatus 1300. As shown in fig. 13, the apparatus specifically includes: an obtaining unit 1301, configured to obtain at least one template picture in response to a recognition instruction, where the template picture is used to indicate a recognition target in the target recognition image;
a recognition unit 1302, configured to input the template picture and the target recognition image into a target network model to obtain a recognition result, where the recognition result is the set of recognition targets, and the target network model is obtained by training based on the model training method according to any one of the implementations of the first aspect.
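Inference with the trained target network model might then look like the sketch below; the model's output format (boxes with scores) and the score threshold are assumptions for illustration:

import torch

@torch.no_grad()
def recognize(model, template_picture, target_image, score_threshold=0.5):
    model.eval()
    boxes, scores = model(template_picture, target_image)  # assumed output format
    keep = scores > score_threshold
    return boxes[keep], scores[keep]  # the set of recognition targets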
An embodiment of the present application further provides a terminal device. Fig. 14 is a schematic structural diagram of another terminal device provided in an embodiment of the present application; for convenience of description, only the portion related to the embodiment of the present application is shown, and for undisclosed technical details, please refer to the method part of the embodiments of the present application. The terminal may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, a vehicle-mounted computer, and the like. The following takes a mobile phone as an example:
fig. 14 is a block diagram illustrating a partial structure of a mobile phone related to a terminal provided in an embodiment of the present application. Referring to fig. 14, the handset includes: radio Frequency (RF) circuitry 1410, memory 1420, input unit 1430, display unit 1440, sensor 1450, audio circuitry 1460, wireless fidelity (WiFi) module 1470, processor 1480, and power supply 1490. Those skilled in the art will appreciate that the handset configuration shown in fig. 14 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 14:
RF circuit 1410 may be used for receiving and transmitting signals during the transmission or reception of information or during a call. In particular, downlink information received from a base station is delivered to processor 1480 for processing, and uplink data is transmitted to the base station. In general, RF circuit 1410 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, RF circuit 1410 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short message service (SMS), and the like.
Memory 1420 may be used to store software programs and modules, and processor 1480 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in memory 1420. Memory 1420 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, an application required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data or a phone book), and the like. Further, memory 1420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
Input unit 1430 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. In particular, input unit 1430 may include touch panel 1431 and other input devices 1432. Touch panel 1431, also referred to as a touch screen, may collect touch operations performed by the user on or near it (such as operations performed on or near touch panel 1431 using a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, touch panel 1431 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends them to processor 1480, and can receive and execute commands sent by processor 1480. In addition, touch panel 1431 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides touch panel 1431, input unit 1430 may also include other input devices 1432. In particular, other input devices 1432 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a switch key), a trackball, a mouse, a joystick, and the like.
Display unit 1440 may be used to display information input by the user or provided to the user, as well as the various menus of the mobile phone. Display unit 1440 may include display panel 1441; optionally, display panel 1441 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Further, touch panel 1431 may overlay display panel 1441; when touch panel 1431 detects a touch operation on or near it, the operation is passed to processor 1480 to determine the type of touch event, and processor 1480 then provides a corresponding visual output on display panel 1441 according to the type of touch event. Although in fig. 14 touch panel 1431 and display panel 1441 are two independent components implementing the input and output functions of the mobile phone, in some embodiments touch panel 1431 and display panel 1441 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 1450, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 1441 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 1441 and/or the backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
Audio circuit 1460, speaker 1461, and microphone 1462 may provide an audio interface between the user and the mobile phone. Audio circuit 1460 may transmit the electrical signal converted from received audio data to speaker 1461, which converts it into a sound signal for output; on the other hand, microphone 1462 converts a collected sound signal into an electrical signal, which audio circuit 1460 receives and converts into audio data; the audio data is then processed by processor 1480 and either transmitted via RF circuit 1410 to, for example, another mobile phone, or output to memory 1420 for further processing.
WiFi is a short-range wireless transmission technology. Through WiFi module 1470, the mobile phone can help the user receive and send e-mails, browse web pages, access streaming media, and the like, providing the user with wireless broadband internet access. Although fig. 14 shows WiFi module 1470, it is understood that it is not an essential part of the mobile phone and may be omitted as needed without changing the essence of the invention.
Processor 1480 is the control center of the mobile phone; it connects the various parts of the entire phone using various interfaces and lines, and performs the various functions of the phone and processes data by running or executing the software programs and/or modules stored in memory 1420 and calling the data stored in memory 1420, thereby monitoring the phone as a whole. Optionally, processor 1480 may include one or more processing units; optionally, processor 1480 may integrate an application processor, which mainly handles the operating system, user interfaces, applications, and the like, and a modem processor, which mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into processor 1480.
The mobile phone also includes a power supply 1490 (e.g., a battery) that powers the various components. Optionally, the power supply may be logically connected to processor 1480 via a power management system, so that functions such as charging, discharging, and power consumption management are implemented via the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In the embodiment of the present application, the processor 1480 included in the terminal also has the function of executing the steps of the methods described above.
An embodiment of the present application also provides a computer-readable storage medium storing model training instructions which, when run on a computer, cause the computer to perform the steps performed by the model training apparatus in the methods described in the embodiments shown in fig. 3 to fig. 11.
An embodiment of the present application also provides a computer program product including model training instructions which, when run on a computer, cause the computer to perform the steps performed by the model training apparatus in the methods described in the embodiments shown in fig. 3 to fig. 11.
The embodiment of the present application further provides a model training system, where the model training system may include the model training apparatus in the embodiment described in fig. 12 or the terminal device described in fig. 14.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application essentially, or the part thereof contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a model training apparatus, or a network device) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A method of image recognition model training, comprising:
acquiring a target sample based on the target identification image;
determining a training triplet according to the target sample, wherein the training triplet comprises at least one positive sample pair and at least one negative sample pair, the positive sample pair is composed of the target sample and the positive sample, the negative sample pair is composed of the target sample and the negative sample, the positive sample is obtained based on a similar label corresponding to the target sample, and the negative sample is obtained based on a difference label corresponding to the target sample;
and performing supervised learning on a preset network model based on the positive sample pair and the negative sample pair to obtain a target network model, wherein the target network model is used for identifying the target identification image.
2. The method of claim 1, wherein the training of the preset network model based on the positive sample pairs and the negative sample pairs to obtain the target network model comprises:
determining a matching label of the positive sample pair, the matching label being determined based on a similarity of the positive sample and the target sample;
determining a foreground region and a background region of the target sample based on the matching labels to obtain a classified positive sample pair;
and inputting the classified positive sample pairs and the classified negative sample pairs into the preset network model for supervised learning to obtain the target network model.
3. The method of claim 2, wherein the inputting the classified positive sample pair and the negative sample pair into the preset network model for supervised learning to obtain the target network model comprises:
acquiring a first loss value according to the feature similarity of the foreground region and the positive sample;
obtaining a second loss value according to the feature similarity of the background area and the positive sample, wherein the second loss value and the first loss value indicate opposite label types;
obtaining a third loss value according to the feature similarity of the negative sample pair, wherein the third loss value and the first loss value indicate opposite label types;
and performing back propagation calculation on the preset network model based on the first loss value, the second loss value and the third loss value to obtain the target network model.
4. The method of claim 1, wherein determining training triples from the target sample comprises:
extracting image features of the target sample based on an attention mechanism;
determining corresponding positive samples and negative samples according to the target samples;
extracting image features of the positive sample by using a detection frame, and generating a positive sample pair by using the image features of the positive sample and the image features of the target sample;
extracting the image characteristics of the negative sample by using a detection frame to generate a negative sample pair with the image characteristics of the target sample;
determining a training triplet based on the pair of positive samples and the pair of negative samples.
5. The method of claim 4, wherein determining corresponding positive and negative samples from the target sample comprises:
determining a target label in the target sample;
obtaining a sample with the same label based on the target label to obtain the positive sample;
and obtaining samples with different labels based on the target label to obtain the negative sample.
6. The method of claim 5, wherein the target tag is included in a tag database, and wherein the determining the target tag in the target sample comprises:
determining a corresponding candidate tag in response to at least one tag selection instruction;
and traversing in the label database based on the candidate label to obtain the target label.
7. The method of claim 6, wherein traversing the tag database based on the candidate tag to obtain the target tag comprises:
traversing in the tag database based on the candidate tag to obtain at least one retrieval tag;
acquiring the label similarity of the candidate label and the retrieval label;
determining the target label based on the label similarity.
8. The method of claim 5, wherein the determining the target label in the target sample comprises:
determining a template picture in response to the image in the at least one target detection frame;
and determining a target label in the target sample according to the template picture.
9. The method of claim 1, wherein obtaining the target sample based on the target identification image comprises:
acquiring an image tag in the target identification image;
and determining the target sample meeting a preset condition according to the image label, wherein the preset condition is determined based on the matching degree of the image label and the target sample.
10. The method of claim 1, wherein the predetermined network model is a small sample target detection model, and the target network model is a trained small sample target detection model.
11. A method of image recognition, comprising:
at least one template picture is obtained in response to the identification instruction, wherein the template picture is used for indicating an identification target in a target identification image;
inputting the template picture and the target recognition image into a target network model to obtain the recognition result, wherein the recognition result is the set of the recognition targets, and the target network model is obtained by training based on the model training method of any one of claims 1 to 10.
12. An apparatus for training an image recognition model, comprising:
an acquisition unit configured to acquire a target sample based on a target recognition image;
a determining unit, configured to determine a training triplet according to the target sample, where the training triplet includes at least one positive sample pair and at least one negative sample pair, the positive sample pair is composed of the target sample and a positive sample, the negative sample pair is composed of the target sample and a negative sample, the positive sample is obtained based on a similar label corresponding to the target sample, and the negative sample is obtained based on a difference label corresponding to the target sample;
and the training unit is used for performing supervised learning on a preset network model based on the positive sample pair and the negative sample pair to obtain a target network model, and the target network model is used for identifying the target identification image.
13. An apparatus for image recognition, comprising:
an acquisition unit configured to acquire at least one template picture in response to an identification instruction, the template picture indicating an identification target in a target identification image;
a recognition unit, configured to input the template picture and the target recognition image into a target network model to obtain the recognition result, where the recognition result is a set of the recognition targets, and the target network model is trained based on the model training method according to any one of claims 1 to 10.
14. A computer device, the computer device comprising a processor and a memory:
the memory is used for storing program codes; the processor is configured to perform the method of model training according to any one of claims 1 to 10, or the method of image recognition according to claim 11, according to instructions in the program code.
15. A computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the method of model training of any one of the preceding claims 1 to 10, or the method of image recognition of claim 11.
CN202010187873.3A 2020-03-17 2020-03-17 Image recognition model training method, image recognition method and related device Active CN111368934B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010187873.3A CN111368934B (en) 2020-03-17 2020-03-17 Image recognition model training method, image recognition method and related device

Publications (2)

Publication Number Publication Date
CN111368934A true CN111368934A (en) 2020-07-03
CN111368934B CN111368934B (en) 2023-09-19

Family

ID=71211894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010187873.3A Active CN111368934B (en) 2020-03-17 2020-03-17 Image recognition model training method, image recognition method and related device

Country Status (1)

Country Link
CN (1) CN111368934B (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006039686A2 (en) * 2004-10-01 2006-04-13 University Of Southern California User preference techniques for support vector machines in content based image retrieval
US20120283574A1 (en) * 2011-05-06 2012-11-08 Park Sun Young Diagnosis Support System Providing Guidance to a User by Automated Retrieval of Similar Cancer Images with User Feedback
CN104281572A (en) * 2013-07-01 2015-01-14 中国科学院计算技术研究所 Target matching method and system based on mutual information
CN103927387A (en) * 2014-04-30 2014-07-16 成都理想境界科技有限公司 Image retrieval system, method and device
US20160180151A1 (en) * 2014-12-17 2016-06-23 Google Inc. Generating numeric embeddings of images
US20160275686A1 (en) * 2015-03-20 2016-09-22 Kabushiki Kaisha Toshiba Object pose recognition
CN105589929A (en) * 2015-12-09 2016-05-18 东方网力科技股份有限公司 Image retrieval method and device
WO2018133666A1 (en) * 2017-01-17 2018-07-26 腾讯科技(深圳)有限公司 Method and apparatus for tracking video target
WO2019196130A1 (en) * 2018-04-12 2019-10-17 广州飒特红外股份有限公司 Classifier training method and device for vehicle-mounted thermal imaging pedestrian detection
US20190361994A1 (en) * 2018-05-22 2019-11-28 Adobe Inc. Compositing Aware Digital Image Search
CN110163236A (en) * 2018-10-15 2019-08-23 腾讯科技(深圳)有限公司 The training method and device of model, storage medium, electronic device
CN109934249A (en) * 2018-12-14 2019-06-25 网易(杭州)网络有限公司 Data processing method, device, medium and calculating equipment
CN110781711A (en) * 2019-01-21 2020-02-11 北京嘀嘀无限科技发展有限公司 Target object identification method and device, electronic equipment and storage medium
CN110569721A (en) * 2019-08-01 2019-12-13 平安科技(深圳)有限公司 Recognition model training method, image recognition method, device, equipment and medium
CN110472090A (en) * 2019-08-20 2019-11-19 腾讯科技(深圳)有限公司 Image search method and relevant apparatus, storage medium based on semantic label

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Qi Fan et al.: "Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector", https://arxiv.org/abs/1908.01998v2, pages 1-16 *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115994A (en) * 2020-09-11 2020-12-22 北京达佳互联信息技术有限公司 Training method and device of image recognition model, server and storage medium
CN112241764A (en) * 2020-10-23 2021-01-19 北京百度网讯科技有限公司 Image recognition method and device, electronic equipment and storage medium
CN112241764B (en) * 2020-10-23 2023-08-08 北京百度网讯科技有限公司 Image recognition method, device, electronic equipment and storage medium
CN112307934A (en) * 2020-10-27 2021-02-02 深圳市商汤科技有限公司 Image detection method, and training method, device, equipment and medium of related model
TWI754515B (en) * 2020-10-27 2022-02-01 大陸商深圳市商湯科技有限公司 Image detection and related model training method, equipment and computer readable storage medium
CN112307934B (en) * 2020-10-27 2021-11-09 深圳市商汤科技有限公司 Image detection method, and training method, device, equipment and medium of related model
CN112329785A (en) * 2020-11-25 2021-02-05 Oppo广东移动通信有限公司 Image management method, device, terminal and storage medium
CN112529913A (en) * 2020-12-14 2021-03-19 北京达佳互联信息技术有限公司 Image segmentation model training method, image processing method and device
CN112766406A (en) * 2021-01-29 2021-05-07 北京达佳互联信息技术有限公司 Article image processing method and device, computer equipment and storage medium
CN112861963A (en) * 2021-02-04 2021-05-28 北京三快在线科技有限公司 Method, device and storage medium for training entity feature extraction model
CN113239746A (en) * 2021-04-26 2021-08-10 深圳市安思疆科技有限公司 Electric vehicle detection method and device, terminal equipment and computer readable storage medium
CN113239746B (en) * 2021-04-26 2024-05-17 深圳市安思疆科技有限公司 Electric vehicle detection method, device, terminal equipment and computer readable storage medium
CN113191436A (en) * 2021-05-07 2021-07-30 广州博士信息技术研究院有限公司 Talent image tag identification method and system and cloud platform
CN113407689A (en) * 2021-06-15 2021-09-17 北京三快在线科技有限公司 Method and device for model training and business execution
CN113344890A (en) * 2021-06-18 2021-09-03 北京百度网讯科技有限公司 Medical image recognition method, recognition model training method and device
CN113344890B (en) * 2021-06-18 2024-04-12 北京百度网讯科技有限公司 Medical image recognition method, recognition model training method and device
CN113469253A (en) * 2021-07-02 2021-10-01 河海大学 Electricity stealing detection method based on triple twin network
CN113469253B (en) * 2021-07-02 2024-05-14 河海大学 Electric larceny detection method based on triple twinning network
CN113420847B (en) * 2021-08-24 2021-11-16 平安科技(深圳)有限公司 Target object matching method based on artificial intelligence and related equipment
CN113420847A (en) * 2021-08-24 2021-09-21 平安科技(深圳)有限公司 Target object matching method based on artificial intelligence and related equipment
CN113705425A (en) * 2021-08-25 2021-11-26 北京百度网讯科技有限公司 Training method of living body detection model, and method, device and equipment for living body detection
WO2023029397A1 (en) * 2021-08-30 2023-03-09 上海商汤智能科技有限公司 Training data acquisition method, abnormal behavior recognition network training method and apparatus, computer device, storage medium, computer program and computer program product
CN116049660A (en) * 2021-10-28 2023-05-02 腾讯科技(深圳)有限公司 Data processing method, apparatus, device, storage medium, and program product
WO2023087933A1 (en) * 2021-11-19 2023-05-25 腾讯科技(深圳)有限公司 Content recommendation method and apparatus, device, storage medium, and program product
CN114120420A (en) * 2021-12-01 2022-03-01 北京百度网讯科技有限公司 Image detection method and device

Also Published As

Publication number Publication date
CN111368934B (en) 2023-09-19

Similar Documents

Publication Publication Date Title
CN111368934B (en) Image recognition model training method, image recognition method and related device
CN108334539B (en) Object recommendation method, mobile terminal and computer-readable storage medium
CN106407984B (en) Target object identification method and device
CN110674662B (en) Scanning method and terminal equipment
CN110443366B (en) Neural network optimization method and device, and target detection method and device
CN111612093A (en) Video classification method, video classification device, electronic equipment and storage medium
CN109919836A (en) Video keying processing method, video keying processing client and readable storage medium storing program for executing
CN107346200B (en) Interval screenshot method and terminal
CN111125523B (en) Searching method, searching device, terminal equipment and storage medium
CN109145970B (en) Image-based question and answer processing method and device, electronic equipment and storage medium
CN109495616B (en) Photographing method and terminal equipment
CN111737520B (en) Video classification method, video classification device, electronic equipment and storage medium
CN108021669B (en) Image classification method and device, electronic equipment and computer-readable storage medium
CN114547428A (en) Recommendation model processing method and device, electronic equipment and storage medium
CN110347858B (en) Picture generation method and related device
CN110083742B (en) Video query method and device
US10706282B2 (en) Method and mobile terminal for processing image and storage medium
CN111265881B (en) Model training method, content generation method and related device
CN110717486B (en) Text detection method and device, electronic equipment and storage medium
CN115100492B (en) Yolov3 network training and PCB surface defect detection method and device
CN107734049B (en) Network resource downloading method and device and mobile terminal
CN116342940A (en) Image approval method, device, medium and equipment
CN111353422B (en) Information extraction method and device and electronic equipment
CN109656658B (en) Editing object processing method and device and computer readable storage medium
CN111738282A (en) Image recognition method based on artificial intelligence and related equipment

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40025587

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant