CN117809084A - Image recognition model training method, interaction method and device based on image recognition - Google Patents

Image recognition model training method, interaction method and device based on image recognition

Info

Publication number
CN117809084A
Authority
CN
China
Prior art keywords
model
trained
sample
training
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311728963.9A
Other languages
Chinese (zh)
Inventor
尹英杰
丁菁汀
马晨光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202311728963.9A priority Critical patent/CN117809084A/en
Publication of CN117809084A publication Critical patent/CN117809084A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

One or more embodiments of the present disclosure provide an image recognition model training method, and an image recognition-based interaction method and apparatus. A plurality of images sampled from different viewing angles of the same scene are used as samples, and contrastive training on these multi-view sample images yields a target recognition model that can accurately recognize images taken from any viewing angle. During model training, the knowledge distillation principle is applied: a teacher model with high recognition accuracy is trained first and then used to guide a lightweight student model through multi-view contrastive learning, producing a robust, small-scale target recognition model suitable for end-side deployment. In addition, a live-action image acquired in real time is recognized with the trained target recognition model, and if recognition succeeds, a preset interaction action is executed.

Description

Image recognition model training method, interaction method and device based on image recognition
Technical Field
One or more embodiments of the present disclosure relate to the field of augmented reality technologies, and in particular, to an image recognition model training method, and an image recognition-based interaction method and apparatus.
Background
With the development of computer technology, the interaction modes of users and computer devices are also becoming more and more diversified. Among other things, augmented reality (Augmented Reality, AR) technology provides users with a sensory experience that is difficult to achieve in the real world by combining the real world with virtual information.
In practical application scenarios, accurate recognition of the real world is an important precondition for triggering the corresponding AR effect. In large and optically complex environments, such as the stadiums that host sports events or concerts, recognizing a specific location is difficult, the corresponding AR effect is hard to trigger accurately, and the user experience suffers.
Disclosure of Invention
In order to accurately identify an interactable scene and improve interaction experience of a user, one or more embodiments of the present disclosure provide an image recognition model training method, an image recognition-based interaction method and an image recognition-based interaction device.
In a first aspect, one or more embodiments of the present disclosure provide an image recognition model training method, including:
acquiring a multi-view sample set of a sample scene; the multi-view sample set comprises a plurality of sample images corresponding to a plurality of different views in a sample scene; the sample images comprise positive sample images matched with preset identifiers in the sample scene and negative sample images not matched with the preset identifiers;
selecting a preset number of sample images from the multi-view sample set to form a batch data set;
and training the initial recognition model through the batch data set to obtain a target recognition model.
In a possible implementation manner, the multi-view sample set includes a plurality of sample subsets corresponding to the plurality of different views one to one;
the selecting a preset number of the sample images from the multi-view sample set to form a batch data set comprises at least one of the following steps:
randomly selecting one sample image from the sample subsets with the preset quantity respectively to form a first batch of data set;
randomly selecting the preset number of sample images from all the sample images in the multi-view sample set to form a second batch of data set.
In a possible implementation manner, the training of the initial recognition model through the batch data set includes:
inputting the first data set or the second data set into the initial recognition model for training;
wherein a ratio of the number of the first data sets input to the initial recognition model to the number of the second data sets input to the initial recognition model is a preset ratio.
In a possible implementation manner, the training the initial recognition model through the batch data set to obtain the target recognition model includes:
training the teacher model to be trained in the initial recognition model through the batch data set to obtain a trained teacher model;
and carrying out distillation training on the student model to be trained in the initial recognition model according to the trained teacher model to obtain a trained student model, and taking the trained student model as the target recognition model.
In a possible implementation manner, the training the teacher model to be trained in the initial recognition model through the batch data set to obtain a trained teacher model includes:
pairing the sample images in the batch data set in pairs to obtain a plurality of sample image pairs;
inputting the sample image pair into the teacher model to be trained;
calculating a first loss function corresponding to the teacher model to be trained according to a prediction result of the sample image pair output by the teacher model to be trained;
and optimizing the teacher model to be trained according to the first loss function, so that the teacher model to be trained converges, and the trained teacher model is obtained.
In a possible implementation manner, the distillation training is performed on the student model to be trained in the initial recognition model according to the trained teacher model to obtain a trained student model, including:
respectively inputting the sample images into the trained teacher model and the student model to be trained;
calculating a second loss function corresponding to the student model to be trained according to the prediction results of the sample image pair, which are respectively output by the trained teacher model and the student model to be trained; the second loss function comprises a distillation loss function;
and optimizing the student model to be trained according to the second loss function, so that the student model to be trained converges, and the trained student model is obtained.
In a possible implementation manner, the model structure of the teacher model to be trained includes: the system comprises a first backbone network module, a first uncertain prediction network module, a first characteristic generation network module and a first characteristic fusion module;
the model structure of the student model to be trained comprises: the system comprises a second backbone network module, a second uncertain prediction network module, a second characteristic generation network module and a second characteristic fusion module;
the number of convolution layers in the second uncertainty prediction network module is less than the number of convolution layers in the first uncertainty prediction network module, and the number of convolution layers in the second feature generation network module is less than the number of convolution layers in the first feature generation network module.
In a second aspect, one or more embodiments of the present disclosure provide an interaction method based on image recognition, including:
acquiring a live-action image of a current scene;
inputting the live-action image into a target recognition model;
obtaining the matching probability between the live-action image predicted by the target recognition model and a target mark in a target scene;
if the matching probability is greater than a preset threshold, executing a preset action;
the target recognition model is a model trained by the method according to the first aspect.
In a possible implementation manner, the performing a preset action includes:
and acquiring preset virtual information, and loading the preset virtual information by taking the live-action image as a background.
In a possible implementation manner, the method further includes:
and if the matching probability is not greater than a preset threshold value, re-acquiring the live-action image.
In a third aspect, one or more embodiments of the present disclosure provide an image recognition model training apparatus, including:
the sample acquisition unit is used for acquiring a multi-view sample set of a sample scene; the multi-view sample set comprises a plurality of sample images corresponding to a plurality of different views in a sample scene; the sample images comprise positive sample images matched with preset identifiers in the sample scene and negative sample images not matched with the preset identifiers;
a sample processing unit for selecting a preset number of sample images from the multi-view sample set to form a batch data set;
and the model training unit is used for training the initial recognition model through the batch data set to obtain a target recognition model.
In a possible implementation manner, the multi-view sample set includes a plurality of sample subsets corresponding to the plurality of different views one to one;
the sample processing unit is used for selecting a preset number of sample images from the multi-view sample set to form a batch data set, and comprises the sample processing unit is used for executing at least one of the following steps:
randomly selecting one sample image from the sample subsets with the preset quantity respectively to form a first batch of data set;
randomly selecting the preset number of sample images from all the sample images in the multi-view sample set to form a second batch of data set.
In a possible implementation manner, the model training unit is configured to train an initial recognition model through the batch data set, and includes:
the model training unit is used for inputting the first batch of data set or the second batch of data set into the initial recognition model for training;
wherein a ratio of the number of the first data sets input to the initial recognition model to the number of the second data sets input to the initial recognition model is a preset ratio.
In a possible implementation manner, the model training unit is configured to train the initial recognition model through the batch data set to obtain a target recognition model, and includes:
the model training unit is used for training the teacher model to be trained in the initial recognition model through the batch data set to obtain a trained teacher model; and performing distillation training on the student model to be trained in the initial recognition model according to the trained teacher model to obtain a trained student model, and taking the trained student model as the target recognition model.
In a possible implementation manner, the model training unit is configured to train, through the batch data set, a teacher model to be trained in the initial recognition model to obtain a trained teacher model, and the model training unit is configured to:
pairing the sample images in the batch data set in pairs to obtain a plurality of sample image pairs;
inputting the sample image pair into the teacher model to be trained;
calculating a first loss function corresponding to the teacher model to be trained according to a prediction result of the sample image pair output by the teacher model to be trained;
and optimizing the teacher model to be trained according to the first loss function, so that the teacher model to be trained converges, and the trained teacher model is obtained.
In a possible implementation manner, the model training unit is configured to perform distillation training on a student model to be trained in the initial recognition model according to the trained teacher model, to obtain a trained student model, and the model training unit is configured to:
respectively inputting the sample images into the trained teacher model and the student model to be trained;
calculating a second loss function corresponding to the student model to be trained according to the prediction results of the sample image pair, which are respectively output by the trained teacher model and the student model to be trained; the second loss function comprises a distillation loss function;
and optimizing the student model to be trained according to the second loss function, so that the student model to be trained converges, and the trained student model is obtained.
In a possible implementation manner, the model structure of the teacher model to be trained includes: the system comprises a first backbone network module, a first uncertain prediction network module, a first characteristic generation network module and a first characteristic fusion module;
the model structure of the student model to be trained comprises: the system comprises a second backbone network module, a second uncertain prediction network module, a second characteristic generation network module and a second characteristic fusion module;
the number of convolution layers in the second uncertainty prediction network module is less than the number of convolution layers in the first uncertainty prediction network module, and the number of convolution layers in the second feature generation network module is less than the number of convolution layers in the first feature generation network module.
In a fourth aspect, one or more embodiments of the present disclosure provide an interaction device based on image recognition, including:
the image acquisition unit is used for acquiring a live-action image of the current scene;
the image recognition unit is used for inputting the live-action image into a target recognition model and obtaining the matching probability between the live-action image predicted by the target recognition model and a target mark in a target scene;
the interaction action unit is used for executing preset actions if the matching probability is larger than a preset threshold value;
wherein the object recognition model is a model trained by the method of any one of claims 1 to 7.
In a possible implementation manner, the interaction unit is configured to execute a preset action if the matching probability is greater than a preset threshold, and includes:
and the interaction unit is used for acquiring preset virtual information and loading the preset virtual information by taking the live-action image as a background if the matching probability is larger than a preset threshold value.
In a possible implementation manner, the interaction unit is further configured to: and if the matching probability is not greater than a preset threshold value, triggering the image acquisition unit to acquire the live-action image again.
In a fifth aspect, one or more embodiments of the present description also provide an electronic device comprising a memory and a processor; the memory is used for storing a computer program product; the processor is configured to execute a computer program product stored in the memory, and the computer program product, when executed, implements the method of the first or second aspect described above.
In a sixth aspect, one or more embodiments of the present specification further provide a computer readable storage medium storing computer program instructions which, when executed, implement the method of the first or second aspect described above.
In summary, in the image recognition model training method and apparatus provided by one or more embodiments of the present disclosure, images sampled from different viewing angles of the same scene are used as samples, and contrastive training on these multi-view sample images yields a target recognition model capable of accurately recognizing images taken from any viewing angle. Sample images from different viewing angles capture, from as many angles as possible, the interference factors that may affect the recognition result, such as the lighting and physical objects of the sample scene; this ensures the robustness of the finally trained target recognition model, improves its recognition speed and accuracy, and meets the image recognition requirements of complex scenes.
The initial recognition model and the final target recognition model obtained through training in the embodiment of the specification are end-to-end neural network models, namely the input data of the initial recognition model and the final target recognition model are original data, and compared with a non-end-to-end model, the target recognition model obtained through the embodiment of the specification has higher recognition accuracy and is more convenient to directly apply to a terminal or a related application program.
In the embodiments of the present specification, sample images are selected from the multi-view sample set in different sample selection modes to obtain a first batch data set and a second batch data set, which are input into the initial recognition model at a preset ratio so that the model is fully trained. The first batch data set contains sample images from multiple viewing angles at the same time, enabling multi-view contrastive learning during training and improving the recognition accuracy of the model; the sample images in the second batch data set are more random, which improves the robustness of the model.
During model training, the embodiments of the present specification further apply the knowledge distillation principle: a teacher model with high recognition accuracy is trained first and then used to guide a lightweight student model through multi-view contrastive learning, yielding a robust, small-scale target recognition model that supports end-side deployment.
In addition, according to the interaction method and the interaction device based on image recognition provided by one or more embodiments of the present disclosure, a real-time image obtained in real time is recognized by using a trained target recognition model, and if the recognition is successful, a preset interaction action is executed. Because the target recognition model is an end-to-end small model obtained through knowledge distillation training, the accuracy and the robustness of image recognition can be guaranteed, the accurate triggering of preset interaction actions can be guaranteed, the performance requirement on a terminal can be reduced, and the occupation amount of terminal resources can be reduced, so that the interaction method based on the target recognition model can be applied to various different types of terminals, and rich interaction experience is provided for users.
Drawings
In order to more clearly illustrate the technical solution of one or more embodiments of the present description, the drawings that are required for use in the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of one or more embodiments of the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flow diagram of a method for training an image recognition model according to one or more embodiments of the present disclosure;
FIG. 2 is a flow diagram of an interaction method based on image recognition provided in one or more embodiments of the present disclosure;
FIG. 3 is a schematic layout of sampling points for acquiring a sample image according to one or more embodiments of the present disclosure;
FIG. 4 is a schematic diagram of a model training principle provided by one or more embodiments of the present disclosure;
FIG. 5 is a schematic diagram of an interactive interface in an interactive application implemented based on an image recognition interaction method provided in accordance with one or more embodiments of the present disclosure;
FIG. 6 is a block diagram illustrating an image recognition model training apparatus according to one or more embodiments of the present disclosure;
FIG. 7 is a block diagram illustrating an interactive device based on image recognition according to one or more embodiments of the present disclosure;
fig. 8 is a block diagram of an electronic device according to one or more embodiments of the present disclosure.
Detailed Description
One or more embodiments of the present specification are described in further detail below with reference to the drawings and examples. Features and advantages of one or more embodiments of the present description will become apparent from the description.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
In addition, the technical features mentioned in the different implementations of one or more embodiments of the present specification described below may be combined with each other as long as they do not conflict with each other.
In order to facilitate understanding, an application scenario of the technical solution provided in one or more embodiments of the present disclosure is first described below.
Using AR technology, a specific virtual object can be set for a specific scene; the user can acquire the images of the scene of the user in real time through the camera of the mobile phone of the user, when the acquired images are matched with the specific scene, the corresponding AR special effect can be triggered, the user can see the effect of combining the virtual object with the scene of the user, and sensory experience of interaction with the virtual object is provided for the user. For example, in a venue of a sports event, when an image shot by a user through a camera of a mobile phone contains a certain entity object in the venue, a dynamic image of a mascot of the sports event is loaded in a current interface of the mobile phone, and the user can interact with the mascot on the mobile phone, so that the attention and participation of the user on the sports event are improved through the interesting interaction experience.
For interactive scenarios based on AR technology, one key step is accurately identifying whether the current scene is a preset interactive scene, i.e., judging whether the image currently shot by the user contains a preset marker. If the image contains the marker but, owing to interference from light or other physical objects, the judgment result is that it does not, the corresponding AR effect cannot be triggered and the user cannot obtain the corresponding AR interactive experience, which seriously affects user participation and the experience effect.
In order to accurately identify an interactable scene and improve interaction experience of a user, one or more embodiments of the present disclosure provide an image recognition model training method, an interaction method for performing interaction by using an image recognition model obtained by training by the training method, and related devices, which are described in detail below.
FIG. 1 illustrates a flow diagram of a method of training an image recognition model provided in one or more embodiments of the present disclosure. Referring to fig. 1, the image recognition model training method includes the following steps.
Step 102, acquiring a multi-view sample set of a sample scene; the multi-view sample set comprises a plurality of sample images corresponding to a plurality of different views in a sample scene; the sample images comprise positive sample images matched with preset identifiers in the sample scene and negative sample images not matched with the preset identifiers;
In the embodiment of the present disclosure, the sample for model training is a multi-view sample set, that is, a set of sample images obtained by photographing the same preset identifier from a plurality of different viewing angles in a sample scene. Each sample image in the multi-view sample set is marked with a sample label that indicates whether the sample image is a positive sample image or a negative sample image: a positive sample image is a sample image that contains the preset identifier, and a negative sample image is one that does not.
Step 104, selecting a preset number of sample images from the multi-view sample set to form a batch data set;
and step 106, training the initial recognition model through the batch data set to obtain a target recognition model.
Each batch data set contains N sample images, which can be input into the initial recognition model at the same time. For each sample image in the input batch data set, the initial recognition model predicts the probability that it is a positive sample image, and a corresponding loss function can be calculated from the predicted probabilities and the sample labels of the sample images. The loss function measures the degree of difference between the model's predicted values and the actual values; the smaller the loss function, the higher the prediction accuracy of the model, so the model can be optimized according to the calculated loss function, for example by adjusting one or more of its parameters. A loss function is calculated each time a batch data set is input, and the initial recognition model is optimized according to it; the initial recognition model is iteratively optimized over a plurality of batch data sets until the model converges, i.e., training is completed, and the target recognition model is obtained.
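As an illustration only, a training loop of this kind might be sketched as follows in PyTorch; the model interface, loss, and optimizer choices here are assumptions, not details given by the embodiment:

```python
import torch

def train_recognition_model(model, batch_datasets, epochs=1, lr=1e-3):
    """Iteratively optimize an initial recognition model over batch data sets.

    `model` is assumed to map a batch of image tensors to the predicted
    probability that each image is a positive sample (contains the preset
    identifier).  All names here are illustrative.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    bce = torch.nn.BCELoss()  # measures gap between prediction and sample label

    for _ in range(epochs):
        for images, labels in batch_datasets:   # one batch data set at a time
            probs = model(images)               # predicted positive-sample probability
            loss = bce(probs, labels.float())   # compare with sample labels
            optimizer.zero_grad()
            loss.backward()                     # adjust model parameters
            optimizer.step()
    return model                                # trained target recognition model
```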
In the embodiments of the present specification, the sample images for model training are sampled from a plurality of different viewing angles. These multi-view sample images capture, from as many angles as possible, the interference factors that may affect the recognition result, such as the lighting and physical objects of the sample scene, which ensures the robustness of the finally trained target recognition model, improves its recognition speed and accuracy, and meets the image recognition requirements of complex scenes.
In addition, the initial recognition model and the finally trained target recognition model in the embodiments of the present specification are end-to-end neural network models; that is, their input data are raw data: a sample image is input during training, and when the model is applied in a user terminal, the image to be recognized acquired by the terminal can be input directly. The model itself performs feature extraction on the raw data and carries out the whole image recognition process, such as prediction, on the extracted feature vectors, so that the feature extraction and prediction performance of the model are optimized simultaneously during training. Compared with a non-end-to-end model, the target recognition model obtained in the embodiments of the present specification therefore has higher recognition accuracy and is easier to apply directly in a terminal or a related application program.
It should be noted that the sample scene may be the same as or similar to the target application scene of the target recognition model, so as to improve the performance of the target recognition model in the target application scene and meet the application requirements. For example, if the target recognition model is to be used in a performance event held in stadium A, with its image recognition capability providing AR interactive effects for that event, then stadium A, or another stadium similar to it, may be used as the sample scene; that is, images from multiple different viewing angles in stadium A are acquired in step 102 to form the multi-view sample set for model training described above.
In a possible implementation, the multi-view sample set includes a plurality of sample subsets that are in one-to-one correspondence with the plurality of different views.
For each sampling view angle in a sample scene, a plurality of sample images can be acquired to form a sample subset corresponding to the view angle; the sample subsets of the individual views together comprise a multi-view sample set of the sample scene.
Alternatively, the sampling viewing angles of the sample scene may be set according to an interval-point sampling rule; that is, a plurality of sampling points are set in the sample scene at the same or different intervals, with each sampling point corresponding to one viewing angle. It can be understood that the smaller the sampling point interval, the more sampling points and sampling angles there are, the richer the sample set for model training is, and the higher the recognition accuracy of the finally trained target recognition model. The number of sampling viewing angles can be set according to the actual application requirements.
Fig. 3 shows a schematic diagram of sampling point setting of a sample image according to an embodiment of the present disclosure. As shown in fig. 3, the sample scene is a stadium, the central area of the sample scene is a playground of a sports event, performance, etc., the periphery of the sample scene is a multi-layer stand area, and a plurality of sampling points are arranged at intervals in the stand area. In some embodiments, as shown in fig. 3, the multi-layer stand area may be divided into three parts of a lower layer stand, a middle layer stand and an upper layer stand, and 4 sampling points are respectively set, that is, 12 sampling points are set in the stadium in total; at each sampling point, an image of the whole venue can be acquired through an image acquisition device such as a camera, wherein an image containing a preset mark in the activity venue can be used as a positive sample image, and an image not containing the preset mark can be used as a negative sample image. Compared with the activity field in the central area, the sampling points are located at different angles and different heights, and the preset marks and the surrounding environment in the activity field can be sampled in all directions. It can be understood that the specific setting modes of the sampling point positions, namely the sampling visual angles, can be various, and can be set according to factors such as sample scenes, sample quantity and the like so as to meet the model training requirements.
In a possible implementation manner, based on each sample subset in the multi-view sample set, in step 104, a preset number of sample images are selected from the multi-view sample set to form a batch data set, which may specifically include at least one of the following:
step 1042, randomly selecting one sample image from the sample subsets of the preset number to form a first data set;
for example, if M sampling viewing angles are set in the sample scene, M sample subsets are obtained; one sample image can be randomly selected from each sample subset in turn until N sample images have been selected, yielding a first batch data set. N is the number of sample images in the first batch data set, i.e., the preset number. Both M and N are positive integers, and their relative sizes may be arbitrary; this embodiment does not limit this.
Step 1044, randomly selecting the preset number of sample images from all the sample images in the multi-view sample set to form a second batch of data set.
Likewise, based on the multi-view sample set with M sample subsets described above, all sample images are taken as candidates and N sample images are randomly selected from among them, irrespective of which sample subset each comes from, yielding a second batch data set; the number of samples in the second batch data set is also N.
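A minimal sketch of the two selection modes, assuming the multi-view sample set is held as a Python list of per-view sample subsets (all names are illustrative):

```python
import itertools
import random

def make_first_batch(sample_subsets, n):
    """First batch data set: cycle through the view-specific sample subsets,
    randomly drawing one image from each, until n images are collected."""
    batch = []
    for subset in itertools.cycle(sample_subsets):
        batch.append(random.choice(subset))
        if len(batch) == n:
            return batch

def make_second_batch(sample_subsets, n):
    """Second batch data set: draw n images at random from the pooled
    multi-view sample set, regardless of source subset."""
    all_images = [img for subset in sample_subsets for img in subset]
    return random.sample(all_images, n)
```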
In a possible implementation manner, based on the first batch of data set and the second batch of data set, training the initial recognition model through the batch of data set in step 106 may specifically include:
the first data set or the second data set is input into the initial recognition model for training.
Wherein a ratio of the number of the first data sets input to the initial recognition model to the number of the second data sets input to the initial recognition model is a preset ratio.
For example, if the preset ratio between first and second batch data sets is 5:1, then during training one second batch data set is input for every five first batch data sets input into the initial recognition model. Of course, the preset ratio may take other values; this embodiment does not limit it.
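Building on the two helpers above, the preset-ratio interleaving could be sketched as follows (again an illustration, not the embodiment's implementation):

```python
def interleave_batches(sample_subsets, n, num_rounds, ratio=5):
    """Yield batch data sets so that `ratio` first batch data sets are fed to
    the model for every one second batch data set (the 5:1 example above)."""
    for _ in range(num_rounds):
        for _ in range(ratio):
            yield make_first_batch(sample_subsets, n)
        yield make_second_batch(sample_subsets, n)
```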
As can be seen from the above description, in the embodiment of the present disclosure, sample images are selected from the multi-view sample set by different sample data selection manners, so as to obtain a first set of data and a second set of data, and the first set of data and the second set of data are input into the initial recognition model according to a preset proportion, so as to ensure sufficient training of the model. The first batch of data sets simultaneously contain sample images with multiple visual angles, so that multi-visual angle comparison learning can be realized during training, and the recognition accuracy of the model is improved; the sample images in the second data set are more random, so that the robustness of the model can be improved.
In a possible implementation manner, training the initial recognition model through the batch data set in step 106 may specifically include:
step 1062, training the teacher model to be trained in the initial recognition model through the batch data set to obtain a trained teacher model;
and step 1064, performing distillation training on the student model to be trained in the initial recognition model according to the trained teacher model to obtain a trained student model, and taking the trained student model as the target recognition model.
In the embodiments of the present description, model training is performed based on the knowledge distillation principle. In knowledge distillation, a larger, more complex model with more parameters is first trained and taken as the teacher model; a smaller model with fewer parameters is constructed and taken as the student model; the knowledge mastered by the teacher model is transferred into the student model through distillation; and the trained student model is taken as the final target model and applied to the actual scene, thereby reducing the performance requirements on the terminal.
In this embodiment, the initial recognition model includes two parts, a teacher model and a student model; the teacher model before training is completed is referred to as the teacher model to be trained and after training as the trained teacher model, and the student model is likewise divided into the two states of student model to be trained and trained student model. First, the obtained batch data sets are input one by one into the teacher model, which is trained and its parameters optimized so that it acquires good image recognition capability. Batch data sets then continue to be acquired and are input into both the trained teacher model and the student model to be trained, and the parameters of the student model are optimized based on the predicted values output by the two models until the student model converges and its training is completed; the trained student model is used as the target recognition model.
According to the embodiment of the specification, the target recognition model is obtained based on the knowledge distillation principle, so that the recognition accuracy and other performances of the model can be guaranteed, and the target recognition model can be lightened, so that the target recognition model can be applied to terminals with different configurations.
In a possible implementation manner, training the teacher model to be trained in the initial identification model through the batch data set in step 1062 to obtain a trained teacher model may specifically include:
step 10622, pairing the sample images in the batch dataset two by two to obtain a plurality of sample image pairs;
in this embodiment, the sample images may be paired in a random manner, so as to ensure that each sample image exists in at least one sample image pair.
For example, each batch data set contains N sample images I_i (1 ≤ i ≤ N). If N = 4, the batch data set can be expressed as the set {I_1, I_2, I_3, I_4}; pairwise pairing may yield, for instance, the two sample image pairs (I_1, I_3) and (I_2, I_4), or (I_1, I_4) and (I_2, I_3), and so on. As another example, if N = 5, the batch data set can be represented as {I_1, I_2, I_3, I_4, I_5}; pairwise pairing may yield, for instance, the three sample image pairs (I_1, I_3), (I_2, I_4) and (I_3, I_5). A code sketch of this random pairing follows step 10628 below.
Step 10624, inputting the sample image pair into the teacher model to be trained;
step 10626, calculating a first loss function corresponding to the teacher model to be trained according to the prediction result of the sample image pair output by the teacher model to be trained;
in some embodiments, the prediction results output by the teacher model may include probability values characterizing whether both sample images in the sample image pair are positive or negative.
And 10628, optimizing the teacher model to be trained according to the first loss function, so that the teacher model to be trained converges to obtain the trained teacher model.
In some embodiments, the teacher model to be trained may be optimized by using a gradient descent method, that is, the parameters of the teacher model to be trained are iteratively adjusted along the gradient direction of the first loss function until the first loss function reaches a minimum.
In the embodiment of the specification, the teacher model is trained by taking the sample image pair as a unit, the two sample images in each sample image pair can be used for comparing and evaluating the recognition effects of the teacher model on different types of sample images, and the comparison evaluation result is fused into the first loss function, so that the teacher model is optimized according to the first loss function, the difference between the recognition effects of the teacher model on different images can be eliminated or reduced, and the stability of the teacher model is improved.
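A minimal sketch of the random pairing in step 10622, assuming each batch data set is a Python list of image tensors (all names are illustrative):

```python
import random

def pair_batch(batch_images):
    """Randomly pair the images of one batch data set two by two so that every
    image appears in at least one sample image pair; with an odd batch size the
    leftover image is paired with a randomly chosen partner, as in the N = 5
    example above."""
    indices = list(range(len(batch_images)))
    random.shuffle(indices)
    pairs = [(batch_images[indices[k]], batch_images[indices[k + 1]])
             for k in range(0, len(indices) - 1, 2)]
    if len(indices) % 2 == 1:                     # odd batch size: reuse one image
        leftover = indices[-1]
        partner = random.choice(indices[:-1])
        pairs.append((batch_images[leftover], batch_images[partner]))
    return pairs
```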
In a possible implementation manner, based on the teacher model training process described in the foregoing steps 10622 to 10628, in step 1064, distillation training is performed on the student model to be trained in the initial recognition model according to the trained teacher model to obtain a trained student model, which may specifically include:
step 10642, inputting the sample images into the trained teacher model and the student model to be trained respectively;
when training the student model, the sample images in the batch data set are paired in pairs according to step 10622, and each resulting sample image pair is input into both the trained teacher model and the student model to be trained, so that the two models make predictions on the same samples and each outputs its own prediction results.
Step 10644, calculating a second loss function corresponding to the student model to be trained according to the prediction results of the sample image pair output by the trained teacher model and the student model to be trained respectively;
the second loss function comprises a distillation loss function and is used for measuring prediction deviation generated by distillation learning of the teacher model by the student model. Additionally, the second loss function may further include a loss function for evaluating a prediction bias of the student model itself.
And step 10646, optimizing the student model to be trained according to the second loss function, so that the student model to be trained converges to obtain the trained student model.
The second loss function is calculated from the prediction results of the teacher model and the student model on the same samples, and the student model is optimized according to it; in this way the teacher model's predictions influence the direction of the student model's parameter adjustment, so that the knowledge mastered by the teacher model is transferred into the student model, i.e., distillation training is realized, and a student model more compact than the teacher model is obtained.
In some embodiments, a gradient descent algorithm may also be used when optimizing and adjusting parameters of the student model, and specific principles may refer to the foregoing teacher model training process, which is not described herein.
In one possible implementation manner, the model structure of the teacher model to be trained includes: the system comprises a first backbone network module, a first uncertain prediction network module, a first characteristic generation network module and a first characteristic fusion module;
the student model to be trained has the same structure as the teacher model, namely comprises: the system comprises a second backbone network module, a second uncertain prediction network module, a second characteristic generation network module and a second characteristic fusion module.
Wherein the number of convolution layers (Convolutional layer) in the second uncertainty prediction network module is less than the number of convolution layers in the first uncertainty prediction network module and the number of convolution layers in the second feature generation network module is less than the number of convolution layers in the first feature generation network module. In addition, the vector output by the first uncertain prediction network module and the vector output by the second uncertain prediction network module have the same dimension, and the vector output by the first characteristic generation network module and the vector output by the second characteristic generation network module also have the same dimension.
Specifically, the first backbone network module in the teacher model may employ a larger recognition network model, such as ResNet50 or ResNet101; correspondingly, the second backbone network module in the student model may employ a lightweight, end-friendly recognition network model, such as ShuffleNetV2 or MobileNetV2. The first uncertainty prediction network module and the first feature generation network module in the teacher model each comprise a plurality of convolution layers; correspondingly, the second uncertainty prediction network module and the second feature generation network module in the student model each comprise one or a small number of convolution layers. Compared with the teacher model, the lightweight student model therefore has fewer parameters and lower latency, and because it is trained through knowledge distillation it retains high accuracy, making it better suited to deployment in a terminal and reducing the performance requirements on the terminal and the amount of computing resources occupied.
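As a concrete illustration only, the following PyTorch sketch shows one way the shared module layout of the teacher and student models could be realized. The class name, channel sizes, the 1x1 projection to a common feature dimension, and the pooling are assumptions; the patent only fixes the module layout, that the student's uncertainty prediction and feature generation networks use fewer convolution layers, and that the two models' output vectors have the same dimensions. The feature fusion module, which operates on sample image pairs, is sketched separately after the training procedure below.

```python
import torch.nn as nn
import torchvision.models as tvm

class RecognitionModel(nn.Module):
    """Backbone + uncertainty prediction network + feature generation network.
    The feature fusion module acts on pairs of outputs and is sketched later."""
    def __init__(self, backbone, backbone_channels, feat_dim=256, num_conv_layers=3):
        super().__init__()
        self.backbone = backbone

        def head(n_layers):
            # 1x1 conv projects backbone channels to a common feat_dim so that
            # teacher and student output vectors share the same dimension
            layers = [nn.Conv2d(backbone_channels, feat_dim, kernel_size=1)]
            layers += [nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1)
                       for _ in range(n_layers - 1)]
            layers += [nn.ReLU(), nn.AdaptiveAvgPool2d(1)]
            return nn.Sequential(*layers)

        self.uncertainty_net = head(num_conv_layers)   # uncertainty prediction network
        self.feature_net = head(num_conv_layers)       # feature generation network

    def forward(self, x):
        fmap = self.backbone(x)
        sigma = self.uncertainty_net(fmap).flatten(1)  # uncertainty vector
        feat = self.feature_net(fmap).flatten(1)       # feature vector
        return sigma, feat

# teacher: a larger backbone (e.g. ResNet50) and several conv layers per head
teacher_backbone = nn.Sequential(*list(tvm.resnet50().children())[:-2])
teacher = RecognitionModel(teacher_backbone, backbone_channels=2048, num_conv_layers=3)

# student: a lightweight backbone (e.g. MobileNetV2) and a single conv layer per head
student = RecognitionModel(tvm.mobilenet_v2().features, backbone_channels=1280,
                           num_conv_layers=1)
```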
Based on the above model structure, fig. 4 shows training principles of a teacher model and a student model in the initial recognition model according to the embodiment of the present specification, and a model training process will be described in detail with reference to fig. 4.
Referring to fig. 4, the process of training the teacher model alone is as follows:
1.1) The sample image pairs {(I_i, I_j) | 1 ≤ i, j ≤ N} of one batch data set are input into the teacher model to be trained (i.e., step 10624). The first uncertainty prediction network module generates, for each sample image pair (I_i, I_j), a corresponding first uncertainty vector, which measures the difference in prediction accuracy of the current teacher model to be trained on the two sample images of the same sample image pair; at the same time, the first feature generation network module generates, for each sample image pair (I_i, I_j), a corresponding first feature vector. Then, from the first uncertainty vector and the first feature vector, the first feature fusion module calculates the first hybrid feature corresponding to each sample image pair (I_i, I_j).
1.2) The mixed probability is then predicted from the first hybrid feature, and the loss function L_t caused by the prediction deviation of the teacher model to be trained itself, i.e., the first loss function, is calculated. In the first loss function, y_i is the sample label of sample image I_i and y_j is the sample label of sample image I_j; y_i and y_j take the value 0 or 1, where 0 denotes a negative sample label (the corresponding sample image is a negative sample image) and 1 denotes a positive sample label (the corresponding sample image is a positive sample image). The first loss function also involves the weight values of the negative sample image and the positive sample image in the teacher model to be trained, i.e., the weights assigned by the teacher model to be trained to sample images I_i and I_j respectively; it will be appreciated that the weights corresponding to sample images I_i and I_j are associated with their sample labels.
1.3) According to the first loss function L_t, the parameters in the teacher model to be trained are optimized and updated by the gradient descent method or the like until the model converges, i.e., training of the teacher model is finished and the trained teacher model is obtained.
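The formulas for the first hybrid feature and the first loss function L_t appear only as images in the source and are not reproduced above. The sketch below therefore substitutes stand-in choices, an uncertainty-weighted fusion of the pair's feature vectors and a binary cross-entropy on a mixed label, purely to illustrate the overall flow of steps 1.1) to 1.3); it reuses the RecognitionModel sketch above, and `fuse`, `head`, and the mixed label are assumptions, not the patent's formulas.

```python
import torch
import torch.nn.functional as F

def fuse(sigma_i, feat_i, sigma_j, feat_j):
    """Stand-in feature fusion: weight each image's feature vector by the
    softmax-normalised negative mean of its uncertainty vector."""
    w = torch.softmax(torch.stack([-sigma_i.mean(1), -sigma_j.mean(1)]), dim=0)
    return w[0].unsqueeze(1) * feat_i + w[1].unsqueeze(1) * feat_j

def teacher_step(teacher, head, optimizer, img_i, img_j, y_i, y_j):
    """One optimization step of the teacher model on a batch of sample image
    pairs (img_i[k], img_j[k]) with sample labels y_i[k], y_j[k] in {0, 1}.
    `head` maps the first hybrid feature to the mixed probability; `optimizer`
    covers the parameters of both `teacher` and `head`."""
    s_i, f_i = teacher(img_i)
    s_j, f_j = teacher(img_j)
    mixed = fuse(s_i, f_i, s_j, f_j)               # first hybrid feature
    prob = torch.sigmoid(head(mixed)).squeeze(1)   # mixed probability
    target = 0.5 * (y_i.float() + y_j.float())     # stand-in mixed label
    loss_t = F.binary_cross_entropy(prob, target)  # stand-in first loss L_t
    optimizer.zero_grad()
    loss_t.backward()                              # gradient descent update
    optimizer.step()
    return loss_t.item()
```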
Still referring to fig. 4, in training a student model through a trained teacher model, the process of knowledge distillation training is as follows:
2.1) The sample image pairs {(I_i, I_j) | 1 ≤ i, j ≤ N} of one batch data set are input into the trained teacher model and the student model to be trained respectively. The trained teacher model still generates the first hybrid feature corresponding to each sample image pair (I_i, I_j); the specific steps are the same as in the teacher model training process and are not repeated here. Similarly, the student model to be trained calculates, through its second uncertainty prediction network module and second feature generation network module, the second uncertainty vector and the second feature vector corresponding to each sample image pair (I_i, I_j), and from these generates the second hybrid feature corresponding to each sample image pair (I_i, I_j); the second hybrid feature is calculated in the same way as the first hybrid feature.
2.2) The distillation loss function L_dis is then calculated from the first hybrid feature and the second hybrid feature, and the loss function L_s of the student model to be trained is calculated from the second hybrid feature; the second loss function is then L = L_s + L_dis. In the distillation loss function L_dis, T is the temperature coefficient required for knowledge distillation; its value may be set according to the practical application, for example T may take the value 3. The loss function L_s of the student model to be trained involves the weight values of the negative sample image and the positive sample image in the student model to be trained; its calculation is similar to that of the first loss function L_t and is not repeated here.
2.3) According to the calculation result of the second loss function L, the parameters in the student model to be trained are optimized and updated by the gradient descent method until the model converges, i.e., training of the student model is completed and the trained student model, namely the target recognition model used for image recognition in this embodiment, is obtained.
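Again, the concrete forms of L_dis and L_s are given only as images in the source; the sketch below uses a common temperature-softened KL divergence for the distillation loss and the same stand-in loss as in the teacher sketch for L_s, reusing `fuse` and the models defined above. It illustrates the overall flow of steps 2.1) to 2.3), not the patent's formulas.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, student_head, optimizer,
                 img_i, img_j, y_i, y_j, T=3.0):
    """One knowledge-distillation step: the same sample image pairs pass
    through the trained teacher (frozen) and the student; the student is
    optimized with the second loss L = L_s + L_dis."""
    with torch.no_grad():                          # teacher is already trained
        ts_i, tf_i = teacher(img_i)
        ts_j, tf_j = teacher(img_j)
        t_mixed = fuse(ts_i, tf_i, ts_j, tf_j)     # teacher's first hybrid feature

    ss_i, sf_i = student(img_i)
    ss_j, sf_j = student(img_j)
    s_mixed = fuse(ss_i, sf_i, ss_j, sf_j)         # student's second hybrid feature

    # stand-in distillation loss L_dis: match temperature-softened distributions
    l_dis = F.kl_div(F.log_softmax(s_mixed / T, dim=1),
                     F.softmax(t_mixed / T, dim=1),
                     reduction="batchmean") * (T * T)

    # stand-in student loss L_s on the student's mixed probability
    prob = torch.sigmoid(student_head(s_mixed)).squeeze(1)
    target = 0.5 * (y_i.float() + y_j.float())
    l_s = F.binary_cross_entropy(prob, target)

    loss = l_s + l_dis                             # second loss function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```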
After the target recognition model is obtained, the target recognition model can be deployed in a terminal and used for providing real-time image recognition functions for application scenes such as AR interaction and the like.
As can be seen from the above description, the model training method provided in the embodiments of the present disclosure uses images sampled from multiple viewing angles of the same scene as sample images and, through multi-view contrastive learning and knowledge distillation, obtains an end-to-end small model that supports end-side deployment, i.e., the student model. This model can cope with challenging conditions in complex scenes such as large stadiums, including varied user shooting angles, complex lighting, and changing arrangements of people in the venue, ensuring accurate and robust recognition of the preset identifier in such scenes while requiring only limited computing resources on the end side.
In an application program realizing the AR interactive effect, an interactive guiding frame may be disposed in the interactive interface 60; the rounded hexagonal frame 61 shown in fig. 5 is the interactive guiding frame. Guiding information 63 may also be displayed, such as "please align with the field center to trigger the AR special effect" shown in fig. 5. The interactive guiding frame, guiding information and the like guide the user to place the object to be recognized inside the guiding frame as far as possible, so as to improve recognition accuracy and the display effect of the AR special effect.
In view of this, in one possible implementation, the recognition capability of the recognition model for a preset area of the input image may be trained at the same time during model training, and several strategies are possible. For example, the feature extraction module of the initial recognition model may be optimized to extract features only from a preset area, such as the central area, of the input sample image. As another example, only sample images in which the preset identifier lies within the preset area may be used as positive sample images, while images that do not contain the preset identifier, or in which it lies outside the preset area, are used as negative sample images. As yet another example, an image cropping module may be set in the initial recognition model to crop the input sample image so that only the part within the preset area is retained, after which the cropped image goes through the subsequent recognition flow of the feature extraction module and so on. Referring to fig. 5, the preset area may be the rectangular area within the interactive guide frame, i.e., the area indicated by dashed box 62 in fig. 5.
In an actual application scenario, the important area for image recognition, i.e., the preset area, can be determined in advance according to the design of the interactive interface of the application program in which the target recognition model is to be deployed, and model training can then be performed with any of the above strategies according to the position of the preset area, so that the finally trained target recognition model focuses its recognition mainly on the preset area of the input image.
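As one hypothetical realization of the image cropping strategy above, the preset area could be cut out of the input image before feature extraction; the fractional box used here is an arbitrary example, not a value given by the embodiment:

```python
def crop_to_preset_region(image, region=(0.2, 0.3, 0.8, 0.7)):
    """Image cropping module sketch: keep only the part of the input image that
    falls inside the preset area, given as fractional (x0, y0, x1, y1)
    coordinates roughly matching the rectangle inside the interactive guide
    frame.  `image` is assumed to be a CHW or NCHW tensor / array."""
    h, w = image.shape[-2], image.shape[-1]
    x0, y0, x1, y1 = region
    return image[..., int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)]
```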
Based on the same inventive concept, one or more embodiments of the present disclosure further provide an interaction method based on image recognition, which is similar to the training method principle of the image recognition model, and can be referred to with each other. Fig. 2 is a flowchart of the interaction method based on image recognition according to an embodiment of the present disclosure, and referring to fig. 2, the method includes the following steps.
Step 202, obtaining a live-action image of a current scene;
the interaction method of the embodiment can be applied to an intelligent terminal, and the live-action image is obtained by calling an image acquisition unit, such as a camera, of the intelligent terminal to take a snapshot of the current scene in real time.
Step 204, inputting the live-action image into a target recognition model;
Step 206, obtaining the matching probability between the live-action image predicted by the target recognition model and the target mark in the target scene;
step 208, if the matching probability is greater than a preset threshold, executing a preset action;
the target recognition model is a model obtained by training the image recognition model training method according to any embodiment.
According to the related description about the model structure and the training process in the foregoing embodiment, the target recognition model may perform feature extraction and probability prediction on the input live-action image, so as to obtain the probability that the live-action image includes the target identifier in the preset target scene, that is, the matching probability. When the matching probability corresponding to the live-action image is larger than a preset threshold, the identification is considered to be successful, namely the live-action image contains the target mark, so that the preset action is executed, and the corresponding interaction effect is triggered. The above-mentioned preset threshold may be set according to the actual application scenario, and may be set to 0.5 or higher, for example.
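A minimal sketch of steps 202 to 208, assuming the deployed target recognition model exposes a callable that returns the matching probability for one live-action image (all names and the example threshold are illustrative):

```python
PRESET_THRESHOLD = 0.5   # example value; chosen per application scenario

def recognize_and_trigger(target_model, live_image, perform_preset_action):
    """Run the deployed target recognition model on a live-action image and
    trigger the preset action when the predicted matching probability with the
    target identifier exceeds the preset threshold."""
    matching_prob = float(target_model(live_image))
    if matching_prob > PRESET_THRESHOLD:
        perform_preset_action()          # e.g. load the preset AR virtual information
        return True
    return False                         # caller should re-acquire a live-action image
```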
For example, during an event session at a sports stadium, a live audience member may open, on their smartphone, an AR interactive application based on the above interaction method, and the application may implement the following interactive functions: the smartphone camera is invoked to capture a live-action image inside the stadium, the captured live-action image is recognized using the target recognition model, and when the live-action image is recognized as containing the central area of the stadium, a preset AR interactive effect is displayed for the audience member. In this AR interactive application, the central area of the stadium serves as the target identifier for image recognition, and the preset AR interactive effect is triggered if and only if the audience member's (user's) mobile phone camera is aimed at the central area of the stadium.
As can be seen from the above description, in the embodiments of the present disclosure, a live-action image acquired in real time is recognized and predicted by a pre-trained target recognition model, and when the predicted probability that the target identifier exists in the live-action image is greater than a preset threshold, the target identifier is considered to be present, thereby triggering the preset action. Because the target recognition model is a lightweight small model suited to end-side deployment and obtained through knowledge distillation training, it not only guarantees the accuracy and robustness of image recognition, and thus the accurate triggering of the preset interaction action, but also reduces the performance requirements on the terminal and the amount of terminal resources occupied. The interaction method based on the target recognition model can therefore be applied to many different types of terminals and provide users with a rich interaction experience.
In a possible implementation manner, when the matching probability is higher than a preset threshold in step 208, a preset action is performed, which may specifically include:
and acquiring preset virtual information, and loading the preset virtual information by taking the live-action image as a background.
The preset virtual information can be designed according to the application scenario and the interaction effect to be achieved. Taking the on-site interaction scenario of the above sports event as an example, the preset virtual information may include a virtual mascot of the sports event generated with a three-dimensional modeling technique; when recognition succeeds, that is, when the predicted matching probability of the live-action image is greater than the preset threshold, the virtual mascot is loaded and displayed with the live-action image as the background. In addition, the preset virtual information can be static or dynamic after being loaded and displayed; for example, the virtual mascot can move randomly within the display area, introduce event information to the user, and so on.
The interaction method based on image recognition may further include:
step 210, if the matching probability is not greater than the preset threshold, returning to step 202, and re-acquiring the live-action image.
In this embodiment, live-action images may be captured continuously, for example at fixed intervals (such as every 0.5 seconds). As the user operates, the pointing angle of the camera may change slightly, so different live-action images are captured at different moments. Each captured live-action image is recognized by the target recognition model; if recognition is unsuccessful, the next captured live-action image is recognized, and so on, until recognition succeeds and the preset action is triggered.
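A minimal sketch of this capture-and-recognize loop follows; the camera.capture() and show_ar_effect() calls are hypothetical placeholders for the terminal's image acquisition and AR rendering interfaces, and matches_target() refers to the earlier sketch.

```python
import time


def interaction_loop(camera, model, show_ar_effect, interval_s: float = 0.5):
    """Keep capturing live-action frames until recognition succeeds, then trigger the preset action."""
    while True:
        frame = camera.capture()           # hypothetical camera API of the intelligent terminal
        if matches_target(model, frame):   # matches_target() as sketched earlier
            show_ar_effect(frame)          # e.g. load the preset virtual information over the live-action background
            return
        time.sleep(interval_s)             # re-acquire a live-action image after the interval
```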
In a possible implementation manner, when the interaction method provided in the embodiments of the present disclosure is implemented, the interaction interface 60 shown in fig. 5 may be used: the guiding information 63 prompts the user as to which target identifier is to be recognized, and the interaction guide frame 61 guides the user to place the target identifier in the middle of the display screen, thereby improving recognition accuracy and efficiency and providing a better interaction experience for the user.
It should be understood that the foregoing embodiments are merely examples and may be modified in actual implementation; modifications that those skilled in the art can derive from the foregoing embodiments without inventive effort fall within the protection scope of one or more embodiments of the present disclosure and are not described again here.
Based on the same inventive concept, one or more embodiments of the present disclosure further provide an image recognition model training apparatus, and since the principle of the problem solved by the apparatus is similar to that of the foregoing image recognition model training method, implementation of the image recognition model training apparatus may refer to implementation of the foregoing image recognition model training method, and repeated parts will not be repeated.
Fig. 6 is a block diagram of an image recognition model training apparatus according to one or more embodiments of the present disclosure. As shown in fig. 6, the image recognition model training apparatus 300 may include:
a sample acquiring unit 301, configured to acquire a multi-view sample set of a sample scene; the multi-view sample set comprises a plurality of sample images corresponding to a plurality of different views in a sample scene; the sample images comprise positive sample images matched with preset identifiers in the sample scene and negative sample images not matched with the preset identifiers;
a sample processing unit 302, configured to select a preset number of sample images from the multi-view sample set to form a batch data set;
and the model training unit 303 is configured to train the initial recognition model through the batch data set, so as to obtain a target recognition model.
In a possible implementation manner, the multi-view sample set includes a plurality of sample subsets corresponding to the plurality of different views one to one;
the sample processing unit 302 is configured to select a preset number of the sample images from the multi-view sample set to form a batch data set, where the sample processing unit 302 is configured to perform at least one of the following (both strategies are sketched after this list):
randomly selecting one sample image from each of a preset number of the sample subsets to form a first batch of data set;
randomly selecting the preset number of sample images from all the sample images in the multi-view sample set to form a second batch of data set.
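The two batch-formation strategies can be illustrated with the following minimal sketch; the function names and the representation of sample subsets as plain Python lists are assumptions for illustration.

```python
import random


def build_first_batch(sample_subsets, batch_size):
    """Strategy 1: one image from each of `batch_size` randomly chosen view subsets."""
    chosen = random.sample(sample_subsets, batch_size)
    return [random.choice(subset) for subset in chosen]


def build_second_batch(all_samples, batch_size):
    """Strategy 2: `batch_size` images drawn from the pooled multi-view sample set."""
    return random.sample(all_samples, batch_size)
```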
In a possible implementation manner, the model training unit 303 is configured to train an initial recognition model through the batch data set, including:
the model training unit 303 is configured to input the first set of data or the second set of data into the initial recognition model for training;
wherein a ratio of the number of the first data sets input to the initial recognition model to the number of the second data sets input to the initial recognition model is a preset ratio.
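A minimal sketch of feeding the two batch types to the model at a preset ratio follows, building on the batch-formation functions sketched above; the 3:1 ratio and the step-counter scheduling are assumed values, not values specified in this disclosure.

```python
def next_training_batch(step, sample_subsets, all_samples, batch_size, ratio=(3, 1)):
    """Alternate the two batch types at a preset ratio, e.g. 3 first batches to 1 second batch (assumed)."""
    period = ratio[0] + ratio[1]
    if step % period < ratio[0]:
        return build_first_batch(sample_subsets, batch_size)
    return build_second_batch(all_samples, batch_size)
```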
In a possible implementation manner, the model training unit 303 is configured to train the initial recognition model through the batch dataset to obtain a target recognition model, and includes:
The model training unit 303 is configured to train a teacher model to be trained in the initial recognition model through the batch data set, so as to obtain a trained teacher model; and performing distillation training on the student model to be trained in the initial recognition model according to the trained teacher model to obtain a trained student model, and taking the trained student model as the target recognition model.
In a possible implementation manner, the model training unit 303 is configured to train, through the batch data set, a teacher model to be trained in the initial recognition model, to obtain a trained teacher model, where the model training unit 303 is configured to:
pairing the sample images in the batch data set in pairs to obtain a plurality of sample image pairs;
inputting the sample image pair into the teacher model to be trained;
calculating a first loss function corresponding to the teacher model to be trained according to a prediction result of the sample image pair output by the teacher model to be trained;
and optimizing the teacher model to be trained according to the first loss function, so that the teacher model to be trained converges, and the trained teacher model is obtained.
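The teacher-training steps above can be sketched as follows; the embedding output, the pairwise squared-similarity term standing in for the first loss function, and all names are assumptions, since the exact loss form is not reproduced in this section.

```python
import itertools

import torch
import torch.nn.functional as F


def teacher_training_step(teacher, images, is_positive, optimizer):
    """One assumed training step for the teacher model to be trained.

    `images` is a list of sample-image tensors from one batch data set and
    `is_positive` marks whether each image matches the preset identifier.
    """
    feats = teacher(torch.stack(images))  # assumption: teacher returns one embedding per image
    loss = feats.new_zeros(())
    pairs = list(itertools.combinations(range(len(images)), 2))
    for i, j in pairs:
        similarity = F.cosine_similarity(feats[i], feats[j], dim=0)
        same_label = float(is_positive[i] == is_positive[j])
        # Placeholder pairwise term: pull same-label pairs together, push others apart.
        loss = loss + (similarity - same_label) ** 2
    loss = loss / len(pairs)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```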
In a possible implementation manner, the model training unit 303 is configured to perform distillation training on a student model to be trained in the initial recognition model according to the trained teacher model, to obtain a trained student model, where the model training unit 303 is configured to:
respectively inputting the sample images into the trained teacher model and the student model to be trained;
calculating a second loss function corresponding to the student model to be trained according to the prediction results of the sample image pair, which are respectively output by the trained teacher model and the student model to be trained; the second loss function comprises a distillation loss function;
and optimizing the student model to be trained according to the second loss function, so that the student model to be trained converges, and the trained student model is obtained.
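A minimal sketch of the distillation training step follows; the KL-divergence distillation loss with temperature scaling is a common formulation used here for illustration, and the temperature and weighting values are assumed. Scaling the KL term by the squared temperature keeps its gradient magnitude comparable across temperature choices, a common practice in knowledge distillation.

```python
import torch
import torch.nn.functional as F


def distillation_step(trained_teacher, student, images, optimizer,
                      temperature: float = 2.0, alpha: float = 0.5):
    """One assumed distillation step for the student model to be trained."""
    batch = torch.stack(images)
    with torch.no_grad():
        teacher_logits = trained_teacher(batch)   # the trained teacher supplies soft targets
    student_logits = student(batch)

    # Distillation component of the second loss function (KL divergence on softened outputs).
    distill_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # The second loss function also covers the student's own prediction loss; that term is
    # omitted here because its exact form is not given in this section.
    loss = alpha * distill_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```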
In a possible implementation manner, the model structure of the teacher model to be trained includes: a first backbone network module, a first uncertainty prediction network module, a first feature generation network module and a first feature fusion module;
the model structure of the student model to be trained includes: a second backbone network module, a second uncertainty prediction network module, a second feature generation network module and a second feature fusion module;
The number of convolution layers in the second uncertainty prediction network module is less than the number of convolution layers in the first uncertainty prediction network module, and the number of convolution layers in the second feature generation network module is less than the number of convolution layers in the first feature generation network module.
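Purely to illustrate the "fewer convolution layers" relationship between the student and teacher modules, a minimal sketch follows; the channel widths and layer counts are assumed values.

```python
import torch.nn as nn


def conv_stack(channels: int, num_layers: int) -> nn.Sequential:
    """A stack of 3x3 convolution layers, each followed by a ReLU activation."""
    layers = []
    for _ in range(num_layers):
        layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)


# Assumed layer counts: the student's uncertainty prediction and feature generation
# modules use fewer convolution layers than the corresponding teacher modules.
teacher_uncertainty_net = conv_stack(channels=256, num_layers=4)
student_uncertainty_net = conv_stack(channels=256, num_layers=2)
teacher_feature_net = conv_stack(channels=256, num_layers=6)
student_feature_net = conv_stack(channels=256, num_layers=3)
```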
Based on the same inventive concept, one or more embodiments of the present disclosure further provide an interaction device based on image recognition, and fig. 7 is a block diagram of a structure of the interaction device 400. Referring to fig. 7, the interaction device 400 includes:
an image acquisition unit 401, configured to acquire a live-action image of a current scene;
an image recognition unit 402, configured to input the live-action image into a target recognition model, and obtain a matching probability between the live-action image predicted by the target recognition model and a target identifier in a target scene;
an interaction unit 403, configured to execute a preset action if the matching probability is greater than a preset threshold;
wherein the object recognition model is a model trained by the method of any one of claims 1 to 7.
In a possible implementation manner, the interaction unit 403 is configured to perform a preset action if the matching probability is greater than a preset threshold, including:
The interaction unit 403 is configured to obtain preset virtual information if the matching probability is greater than a preset threshold, and load the preset virtual information with the live-action image as a background.
In a possible implementation, the interaction unit 403 is further configured to: and if the matching probability is not greater than a preset threshold value, triggering the image acquisition unit to acquire the live-action image again.
Referring to fig. 8, fig. 8 is a block diagram of an electronic device according to one or more embodiments of the present disclosure. As shown in fig. 8, the electronic device 500 may include a processor 501 and a memory 502, where the memory 502 may be coupled to the processor 501. It should be noted that fig. 8 is exemplary; other types of structures may also be used, in addition to or in place of those shown, to implement telecommunications or other functions.
In a possible implementation, the functionality of the image recognition model training apparatus 300 may be integrated into the processor 501. Wherein the processor 501 may be configured to:
acquiring a multi-view sample set of a sample scene; the multi-view sample set comprises a plurality of sample images corresponding to a plurality of different views in a sample scene; the sample images comprise positive sample images matched with preset identifiers in the sample scene and negative sample images not matched with the preset identifiers;
Selecting a preset number of sample images from the multi-view sample set to form a batch data set;
and training the initial recognition model through the batch data set to obtain a target recognition model.
In another possible implementation, the image recognition model training apparatus 300 may be configured separately from the processor 501, for example, the image recognition model training apparatus 300 may be configured as a chip connected to the processor 501, and the training process of the image recognition model is implemented under the control of the processor 501.
In a possible implementation, the functionality of the image recognition based interaction means 400 may be integrated into the processor 501. Wherein the processor 501 may be configured to:
acquiring a live-action image of a current scene;
inputting the live-action image into a target recognition model;
obtaining the matching probability between the live-action image predicted by the target recognition model and a target mark in a target scene;
if the matching probability is greater than a preset threshold, executing a preset action;
the target recognition model is obtained through training by the image recognition model training method or the image recognition model training device according to any embodiment.
In another possible implementation, the image recognition based interaction device 400 may be configured separately from the processor 501, for example, the image recognition based interaction device 400 may be configured as a chip connected to the processor 501, and the image recognition based interaction method is implemented under the control of the processor 501.
Furthermore, in some alternative implementations, the electronic device 500 may further include a communication module, an input unit, an audio processor, a display, a power supply, and the like. It should be noted that the electronic device 500 need not include all of the components shown in fig. 8; in addition, the electronic device 500 may further include components not shown in fig. 8, for which reference may be made to the related art.
In some alternative implementations, the processor 501, also sometimes referred to as a controller or operational control, may include a microprocessor or other processor device and/or logic device, with the processor 501 receiving inputs and controlling the operation of the various components of the electronic device 500.
The memory 502 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or another suitable device. It may store the above-described information relating to the image recognition model training apparatus 300 or the image recognition-based interaction apparatus 400, as well as the programs that process such information, and the processor 501 may execute the programs stored in the memory 502 to implement information storage, processing, and the like.
The input unit may provide input to the processor 501. The input unit is for example a key or a touch input device. The power source may be used to provide power to the electronic device 500. The display can be used for displaying display objects such as images and characters. The display may be, for example, but not limited to, an LCD display.
The memory 502 may be a solid-state memory, such as a Read Only Memory (ROM), a Random Access Memory (RAM), a SIM card, and the like. It may also be a memory that retains information even when powered down, and that can be selectively erased and rewritten with new data; an example of such a memory is sometimes referred to as an EPROM or the like. The memory 502 may also be some other type of device. The memory 502 includes a buffer memory (sometimes referred to as a buffer). The memory 502 may include an application/function storage for storing application programs and function programs, or flows by which the processor 501 executes the operations of the electronic device 500.
Memory 502 may also include a data store for storing data, such as contacts, digital data, pictures, sounds, and/or any other data used by an electronic device. The driver store of memory 502 may include various drivers for the computer device for communication functions and/or for performing other functions of the computer device (e.g., messaging applications, address book applications, etc.).
The communication module is a transmitter/receiver that transmits and receives signals via an antenna. A communication module (transmitter/receiver) is coupled to the processor 501 to provide input signals and receive output signals, as may be the case with conventional mobile communication terminals.
Based on different communication technologies, a plurality of communication modules, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, etc., may be provided in the same computer device. The communication module (transmitter/receiver) is also coupled to the speaker and microphone via the audio processor to provide audio output via the speaker and to receive audio input from the microphone to implement the usual telecommunications functions. The audio processor may include any suitable buffers, decoders, amplifiers and so forth. In addition, an audio processor is coupled to the processor 501 such that sound can be recorded locally through a microphone and sound stored locally can be played through a speaker.
One or more embodiments of the present specification further provide a computer-readable storage medium capable of implementing all the steps of the image recognition model training method or the image recognition-based interaction method in the above embodiments, where the computer-readable storage medium stores a computer program that, when executed by a processor, implements all the steps of the image recognition model training method or the image recognition-based interaction method in the above embodiments. Specific steps may be referred to the previous embodiments, and will not be repeated here.
Although one or more embodiments of the present specification provide the method operation steps described in the embodiments or flowcharts, more or fewer operation steps may be included on the basis of conventional or non-inventive effort. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only execution order. When an actual apparatus or client product executes the steps, they may be executed sequentially or in parallel according to the methods shown in the embodiments or the drawings (for example, in a parallel-processor or multi-threaded processing environment).
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, apparatus (system) or computer program product. Accordingly, the present specification embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
One or more embodiments of the present specification are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to one or more embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for apparatus and system embodiments, the description is relatively simple, as it is substantially similar to method embodiments, with reference to the description of method embodiments in part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The specific meaning of the terms in one or more embodiments of the present specification may be understood by those of ordinary skill in the art in view of the specific circumstances.
It should be noted that, without conflict, one or more embodiments and features of the embodiments may be combined with each other. The one or more embodiments of the present specification are not limited to any single aspect, nor to any single embodiment, nor to any combination and/or permutation of these aspects and/or embodiments. Moreover, each aspect and/or embodiment of one or more embodiments of the present description may be utilized alone or in combination with one or more other aspects and/or embodiments.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of one or more embodiments of the present disclosure, not to limit them. Although one or more embodiments of the present disclosure have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of one or more embodiments of the present disclosure, and are intended to be covered by the claims and the specification.
The foregoing description of one or more embodiments of the present specification has been presented in conjunction with alternative embodiments, but such embodiments are merely exemplary and serve only as illustrations. On the basis of the above, various substitutions and improvements can be made on one or more embodiments of the present specification, and all of them fall within the protection scope of one or more embodiments of the present specification.

Claims (22)

1. An image recognition model training method, comprising the steps of:
Acquiring a multi-view sample set of a sample scene; the multi-view sample set comprises a plurality of sample images corresponding to a plurality of different views in a sample scene; the sample images comprise positive sample images matched with preset identifiers in the sample scene and negative sample images not matched with the preset identifiers;
selecting a preset number of sample images from the multi-view sample set to form a batch data set;
and training the initial recognition model through the batch data set to obtain a target recognition model.
2. The method of claim 1, wherein the multi-view sample set comprises a plurality of sample subsets in one-to-one correspondence with the plurality of different views;
the selecting a preset number of the sample images from the multi-view sample set to form a batch data set comprises at least one of the following steps:
randomly selecting one sample image from each of a preset number of the sample subsets to form a first batch of data set;
randomly selecting the preset number of sample images from all the sample images in the multi-view sample set to form a second batch of data set.
3. The method of claim 2, wherein the training of the initial recognition model by the set of batch data comprises:
Inputting the first data set or the second data set into the initial recognition model for training;
wherein a ratio of the number of the first data sets input to the initial recognition model to the number of the second data sets input to the initial recognition model is a preset ratio.
4. The method of claim 1, wherein training the initial recognition model with the batch dataset results in a target recognition model, comprising:
training the teacher model to be trained in the initial recognition model through the batch data set to obtain a trained teacher model;
and carrying out distillation training on the student model to be trained in the initial recognition model according to the trained teacher model to obtain a trained student model, and taking the trained student model as the target recognition model.
5. The method of claim 4, wherein training the teacher model to be trained in the initial recognition model via the batch data set to obtain a trained teacher model, comprises:
pairing the sample images in the batch data set in pairs to obtain a plurality of sample image pairs;
Inputting the sample image pair into the teacher model to be trained;
calculating a first loss function corresponding to the teacher model to be trained according to a prediction result of the sample image pair output by the teacher model to be trained;
and optimizing the teacher model to be trained according to the first loss function, so that the teacher model to be trained converges, and the trained teacher model is obtained.
6. The method of claim 5, wherein the performing distillation training on the student model to be trained in the initial recognition model according to the trained teacher model to obtain a trained student model comprises:
respectively inputting the sample images into the trained teacher model and the student model to be trained;
calculating a second loss function corresponding to the student model to be trained according to the prediction results of the sample image pair, which are respectively output by the trained teacher model and the student model to be trained; the second loss function comprises a distillation loss function;
and optimizing the student model to be trained according to the second loss function, so that the student model to be trained converges, and the trained student model is obtained.
7. The method of claim 4, wherein the model structure of the teacher model to be trained comprises: the system comprises a first backbone network module, a first uncertain prediction network module, a first characteristic generation network module and a first characteristic fusion module;
the model structure of the student model to be trained comprises: the system comprises a second backbone network module, a second uncertain prediction network module, a second characteristic generation network module and a second characteristic fusion module;
the number of convolution layers in the second uncertainty prediction network module is less than the number of convolution layers in the first uncertainty prediction network module, and the number of convolution layers in the second feature generation network module is less than the number of convolution layers in the first feature generation network module.
8. An interactive method based on image recognition, comprising:
acquiring a live-action image of a current scene;
inputting the live-action image into a target recognition model;
obtaining the matching probability between the live-action image predicted by the target recognition model and a target mark in a target scene;
if the matching probability is greater than a preset threshold, executing a preset action;
wherein the object recognition model is a model trained by the method of any one of claims 1 to 7.
9. The method of claim 8, wherein the performing a preset action comprises:
and acquiring preset virtual information, and loading the preset virtual information by taking the live-action image as a background.
10. The method as recited in claim 8, further comprising:
and if the matching probability is not greater than a preset threshold value, re-acquiring the live-action image.
11. An image recognition model training device, comprising:
the sample acquisition unit is used for acquiring a multi-view sample set of a sample scene; the multi-view sample set comprises a plurality of sample images corresponding to a plurality of different views in a sample scene; the sample images comprise positive sample images matched with preset identifiers in the sample scene and negative sample images not matched with the preset identifiers;
a sample processing unit for selecting a preset number of sample images from the multi-view sample set to form a batch data set;
and the model training unit is used for training the initial recognition model through the batch data set to obtain a target recognition model.
12. The apparatus of claim 11, wherein the multi-view sample set comprises a plurality of sample subsets in one-to-one correspondence with the plurality of different views;
The sample processing unit is used for selecting a preset number of sample images from the multi-view sample set to form a batch data set, and comprises the sample processing unit is used for executing at least one of the following steps:
randomly selecting one sample image from each of a preset number of the sample subsets to form a first batch of data set;
randomly selecting the preset number of sample images from all the sample images in the multi-view sample set to form a second batch of data set.
13. The apparatus of claim 12, wherein the model training unit is configured to train an initial recognition model through the batch dataset, comprising:
the model training unit is used for inputting the first batch of data set or the second batch of data set into the initial recognition model for training;
wherein a ratio of the number of the first data sets input to the initial recognition model to the number of the second data sets input to the initial recognition model is a preset ratio.
14. The apparatus of claim 11, wherein the model training unit is configured to train the initial recognition model through the batch dataset to obtain the target recognition model, and comprises:
The model training unit is used for training the teacher model to be trained in the initial recognition model through the batch data set to obtain a trained teacher model; and performing distillation training on the student model to be trained in the initial recognition model according to the trained teacher model to obtain a trained student model, and taking the trained student model as the target recognition model.
15. The apparatus according to claim 14, wherein the model training unit is configured to train a teacher model to be trained in the initial recognition model through the lot data set to obtain a trained teacher model, and the model training unit is configured to:
pairing the sample images in the batch data set in pairs to obtain a plurality of sample image pairs;
inputting the sample image pair into the teacher model to be trained;
calculating a first loss function corresponding to the teacher model to be trained according to a prediction result of the sample image pair output by the teacher model to be trained;
and optimizing the teacher model to be trained according to the first loss function, so that the teacher model to be trained converges, and the trained teacher model is obtained.
16. The apparatus according to claim 15, wherein the model training unit is configured to perform distillation training on a student model to be trained in the initial recognition model according to the trained teacher model, to obtain a trained student model, and the model training unit is configured to:
respectively inputting the sample images into the trained teacher model and the student model to be trained;
calculating a second loss function corresponding to the student model to be trained according to the prediction results of the sample image pair, which are respectively output by the trained teacher model and the student model to be trained; the second loss function comprises a distillation loss function;
and optimizing the student model to be trained according to the second loss function, so that the student model to be trained converges, and the trained student model is obtained.
17. The apparatus of claim 14, wherein the model structure of the teacher model to be trained comprises: the system comprises a first backbone network module, a first uncertain prediction network module, a first characteristic generation network module and a first characteristic fusion module;
the model structure of the student model to be trained comprises: the system comprises a second backbone network module, a second uncertain prediction network module, a second characteristic generation network module and a second characteristic fusion module;
The number of convolution layers in the second uncertainty prediction network module is less than the number of convolution layers in the first uncertainty prediction network module, and the number of convolution layers in the second feature generation network module is less than the number of convolution layers in the first feature generation network module.
18. An image recognition-based interaction device, comprising:
the image acquisition unit is used for acquiring a live-action image of the current scene;
the image recognition unit is used for inputting the live-action image into a target recognition model and obtaining the matching probability between the live-action image predicted by the target recognition model and a target mark in a target scene;
the interaction action unit is used for executing preset actions if the matching probability is larger than a preset threshold value;
wherein the object recognition model is a model trained by the method of any one of claims 1 to 7.
19. The apparatus of claim 18, wherein the interaction unit is configured to perform a preset action if the matching probability is greater than a preset threshold, comprising:
and the interaction unit is used for acquiring preset virtual information and loading the preset virtual information by taking the live-action image as a background if the matching probability is larger than a preset threshold value.
20. The apparatus of claim 18, wherein the interactive action unit is further configured to: and if the matching probability is not greater than a preset threshold value, triggering the image acquisition unit to acquire the live-action image again.
21. An electronic device, the electronic device comprising:
a memory for storing a computer program product;
a processor for executing a computer program product stored in the memory, which, when executed, implements the method of any of the preceding claims 1-10.
22. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon computer program instructions which, when executed, implement the method of any of the preceding claims 1-10.


