CN116597293A - Multi-mode scene recognition method, device, computer equipment and storage medium - Google Patents

Multi-mode scene recognition method, device, computer equipment and storage medium

Info

Publication number
CN116597293A
CN116597293A (application number CN202310005427.XA)
Authority
CN
China
Prior art keywords
image
trained
scene recognition
recognition model
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310005427.XA
Other languages
Chinese (zh)
Inventor
卢波
肖塞
曲晓超
刘洛麒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meitu Technology Co Ltd
Original Assignee
Xiamen Meitu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meitu Technology Co Ltd filed Critical Xiamen Meitu Technology Co Ltd
Priority to CN202310005427.XA priority Critical patent/CN116597293A/en
Publication of CN116597293A publication Critical patent/CN116597293A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures
    • G06V30/19093 Proximity measures, i.e. similarity or distance measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to a multi-modal scene recognition method, apparatus, computer device, storage medium and computer program product. The method comprises the following steps: determining a pre-trained first multi-modal scene recognition model, where the first multi-modal scene recognition model comprises a first image encoding network and a trained second multi-modal scene recognition model comprises a second image encoding network; inputting sample images into the first image encoding network and the second image encoding network respectively for encoding, and inputting the corresponding encoding results into a pre-trained first auxiliary branch and a trained second auxiliary branch respectively for image recognition, to obtain a first image recognition result and a second image recognition result; adjusting the first image encoding network based on the difference between the first image recognition result and the second image recognition result, to obtain a trained first multi-modal scene recognition model; and performing scene recognition based on the trained first multi-modal scene recognition model. By adopting the method, the accuracy of scene recognition can be improved.

Description

Multi-mode scene recognition method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of deep learning, and in particular to a multi-modal scene recognition method, apparatus, computer device, storage medium, and computer program product.
Background
With the development of computer and internet technologies, scene recognition is used ever more widely; for example, scene recognition can be performed on a captured image or video in order to add text that matches the scene to the image or video.
In conventional approaches, scene recognition can be performed with a multi-modal representation model, a model that extracts information from data in multiple domains, such as images, text, video and speech, and converts and fuses that information to improve model performance. Because the structure of a multi-modal representation model is generally complex, knowledge distillation can be used to reduce model complexity.
However, existing distillation schemes for multi-modal representation models mainly rely on a teacher network training a student network, mutual distillation among several student networks, self-distillation of a single student network, and the like, and the accuracy of models obtained with these conventional schemes still needs improvement.
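Knowledge distillation of the kind referred to above is typically implemented by matching the student's softened output distribution to the teacher's. The following is a minimal, framework-free sketch of that idea; the temperature value and the logit interface are assumptions for illustration, not details taken from this application:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence from the teacher's soft targets to the student's predictions."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

When the student reproduces the teacher's logits exactly, the loss is zero; any divergence gives a positive loss that can drive a gradient update of the student.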
Disclosure of Invention
In view of the foregoing, it is desirable to provide a multi-modal scene recognition method, apparatus, computer device, computer-readable storage medium, and computer program product that can improve recognition accuracy.
In a first aspect, the present application provides a multi-modal scene recognition method. The method comprises the following steps: determining a pre-trained first multi-modal scene recognition model, wherein the pre-trained first multi-modal scene recognition model comprises a first image coding network and a first text coding network, the pre-trained first multi-modal scene recognition model is obtained by training based on a trained second multi-modal scene recognition model, the second multi-modal scene recognition model comprises a second image coding network and a second text coding network, and the network complexity of the first image coding network is lower than that of the second image coding network; inputting a sample image into the first image coding network for coding, and inputting the coding result into a pre-trained first auxiliary branch for image recognition, to obtain a first image recognition result; inputting the sample image into the second image coding network for coding, and inputting the coding result into a trained second auxiliary branch for image recognition, to obtain a second image recognition result; adjusting the first image coding network based on the difference between the first image recognition result and the second image recognition result, to obtain a trained first multi-modal scene recognition model; and performing scene recognition based on the trained first multi-modal scene recognition model.
In a second aspect, the present application further provides a multi-modal scene recognition apparatus. The apparatus comprises: a model determining module configured to determine a pre-trained first multi-modal scene recognition model, wherein the pre-trained first multi-modal scene recognition model comprises a first image coding network and a first text coding network, the pre-trained first multi-modal scene recognition model is obtained by training based on a trained second multi-modal scene recognition model, the second multi-modal scene recognition model comprises a second image coding network and a second text coding network, and the network complexity of the first image coding network is lower than that of the second image coding network; a first image recognition module configured to input a sample image into the first image coding network for coding and input the coding result into a pre-trained first auxiliary branch for image recognition, to obtain a first image recognition result; a second image recognition module configured to input the sample image into the second image coding network for coding and input the coding result into a trained second auxiliary branch for image recognition, to obtain a second image recognition result; a model adjustment module configured to adjust the first image coding network based on the difference between the first image recognition result and the second image recognition result, to obtain a trained first multi-modal scene recognition model; and a scene recognition module configured to perform scene recognition based on the trained first multi-modal scene recognition model.
In some embodiments, the model adjustment module is further configured to: obtain a first loss value based on the difference between the first image recognition result and the second image recognition result; input the sample image into the first image coding network in the pre-trained first multi-modal scene recognition model for coding to obtain a first coding feature; input the sample image into the second image coding network in the trained second multi-modal scene recognition model for coding to obtain a second coding feature; obtain a second loss value based on the feature difference between the first coding feature and the second coding feature; and adjust the first image coding network in the pre-trained first multi-modal scene recognition model based on the first loss value and the second loss value, to obtain the trained first multi-modal scene recognition model.
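The two loss values described by this embodiment can be combined into a single objective before updating the first image coding network. The following is a minimal sketch; the use of mean squared error for both terms and the weighting factor `alpha` are assumptions, since the application does not specify the loss functions:

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def combined_loss(first_result, second_result,
                  first_feature, second_feature, alpha=0.5):
    # First loss value: difference between the two image recognition results.
    first_loss = mse(first_result, second_result)
    # Second loss value: feature difference between the two coding features.
    second_loss = mse(first_feature, second_feature)
    # Weighted sum used to adjust the first image coding network.
    return alpha * first_loss + (1 - alpha) * second_loss
```

In a real training loop this scalar would be back-propagated through the student's image coding network and auxiliary branch.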
In some embodiments, the multi-modal scene recognition apparatus further includes a first training module configured to: input the sample image into the second image coding network in the trained second multi-modal scene recognition model for coding, and input the coding result into a second auxiliary branch to be trained for image recognition, to obtain a third image recognition result of the sample image; and adjust the second auxiliary branch to be trained based on the difference between the third image recognition result and a standard image recognition result of the sample image, to obtain the trained second auxiliary branch.
In some embodiments, the first training module is further configured to: adjust the second auxiliary branch to be trained based on the difference between the third image recognition result and the standard image recognition result, to obtain a preliminarily trained second auxiliary branch; input the sample image into the second image coding network in the trained second multi-modal scene recognition model for coding, and input the coding result into the preliminarily trained second auxiliary branch for image recognition, to obtain a fourth image recognition result of the sample image; and adjust the preliminarily trained second auxiliary branch based on the difference between the fourth image recognition result and the standard image recognition result of the sample image, to obtain the trained second auxiliary branch.
In some embodiments, the multi-modal scene recognition apparatus further includes a second training module configured to: input a sample image and a sample text into the trained second multi-modal scene recognition model for similarity calculation to generate a first similarity, where the first similarity characterizes the similarity between the sample image and the sample text; input the sample image and the sample text into a first multi-modal scene recognition model to be trained for similarity calculation to generate a second similarity; and adjust parameters of the first multi-modal scene recognition model to be trained based on the difference between the first similarity and the second similarity, to obtain the pre-trained first multi-modal scene recognition model.
In some embodiments, the scene recognition module is further configured to: input a target scene image into the first image coding network of the trained first multi-modal scene recognition model to obtain a target image feature; input a candidate scene text into the first text coding network of the trained first multi-modal scene recognition model to obtain a candidate text feature; and determine the candidate scene text as the target scene text matching the target scene image when the similarity between the target image feature and the candidate text feature is greater than a similarity threshold.
In a third aspect, the present application further provides a computer device. The computer device comprises a memory and a processor, the memory stores a computer program, and the processor, when executing the computer program, implements the steps of the multi-modal scene recognition method described above.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the multi-modal scene recognition method described above.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the multi-modal scene recognition method described above.
According to the multi-modal scene recognition method, apparatus, computer device, storage medium and computer program product described above, the first multi-modal scene recognition model comprises the first image coding network, the second multi-modal scene recognition model comprises the second image coding network, and the network complexity of the first image coding network is lower than that of the second image coding network. Having been trained on a large amount of data, the trained second multi-modal scene recognition model can calculate the similarity between data of different modalities to realize scene recognition with high recognition accuracy. Adjusting the first image coding network based on the difference between the first image recognition result and the second image recognition result strengthens the image representation capability of the first image coding network, so that the trained first multi-modal scene recognition model also achieves high recognition accuracy; that is, the recognition accuracy of the first multi-modal scene recognition model is improved, and the accuracy of scene recognition is improved accordingly.
Drawings
FIG. 1 is a diagram of an application environment of a multi-modal scene recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart of a multi-modal scene recognition method according to an embodiment of the present application;
FIG. 3A is a schematic structural diagram of a first multi-modal scene recognition model according to an embodiment of the present application;
FIG. 3B is a schematic structural diagram of a second multi-modal scene recognition model according to an embodiment of the present application;
FIG. 3C is a schematic structural diagram of an auxiliary model according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a model distillation method according to an embodiment of the present application;
FIG. 5 is a block diagram of a multi-modal scene recognition apparatus according to an embodiment of the present application;
FIG. 6 is an internal structural diagram of a computer device according to an embodiment of the present application;
FIG. 7 is an internal structural diagram of another computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The multi-modal scene recognition method provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1, in which a terminal 102 communicates with a server 104 via a network. A data storage system may store the data that the server 104 needs to process; the data storage system may be integrated on the server 104, or may be located on a cloud or other network server.
Specifically, the server 104 determines a pre-trained first multi-modal scene recognition model, wherein the pre-trained first multi-modal scene recognition model comprises a first image coding network and a first text coding network, the pre-trained first multi-modal scene recognition model is obtained by training based on a trained second multi-modal scene recognition model, the second multi-modal scene recognition model comprises a second image coding network and a second text coding network, and the network complexity of the first image coding network is lower than that of the second image coding network. The server 104 then inputs a sample image into the first image coding network for coding and inputs the coding result into a pre-trained first auxiliary branch for image recognition, to obtain a first image recognition result; inputs the sample image into the second image coding network for coding and inputs the coding result into a trained second auxiliary branch for image recognition, to obtain a second image recognition result; and adjusts the first image coding network based on the difference between the first image recognition result and the second image recognition result, to obtain a trained first multi-modal scene recognition model. The server 104 receives a scene image sent by the terminal 102, performs scene recognition based on the trained first multi-modal scene recognition model to obtain a scene recognition result, and may send the scene recognition result to the terminal 102, which may store it.
The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, internet-of-things device or portable wearable device, where the internet-of-things device may be a smart speaker, smart television, smart air conditioner, smart vehicle device, or the like, and the portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In some embodiments, as shown in FIG. 2, a multi-modal scene recognition method is provided. Taking the method as applied to the server 104 in FIG. 1 as an example, it includes the following steps:
Step 202: determine a pre-trained first multi-modal scene recognition model, wherein the pre-trained first multi-modal scene recognition model comprises a first image coding network and a first text coding network, the pre-trained first multi-modal scene recognition model is obtained by training based on a trained second multi-modal scene recognition model, the second multi-modal scene recognition model comprises a second image coding network and a second text coding network, and the network complexity of the first image coding network is lower than that of the second image coding network.
Both the first multi-modal scene recognition model and the second multi-modal scene recognition model are multi-modal representation models, which can process multi-modal data. Multi-modal data contains at least two modalities, where a modality is at least one of image, text, video or speech; for example, data containing both images and text is multi-modal data, while image-only data is single-modal data. The first multi-modal scene recognition model comprises a first image coding network and a first text coding network. As shown in FIG. 3A, which presents a schematic structural diagram of the first multi-modal scene recognition model, the first image coding network encodes an image to obtain image coding features, the first text coding network encodes a text to obtain text coding features, the similarity between the image coding features and the text coding features is then calculated, and the similarity between the image and the text is output.
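The dual-encoder similarity computation described above can be sketched as encoding each modality separately and comparing the resulting feature vectors. The cosine measure and the encoder interface below are illustrative assumptions; the application does not fix a particular similarity function:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def image_text_similarity(image, text, image_encoder, text_encoder):
    """Encode each modality separately, then compare the coding features."""
    return cosine_similarity(image_encoder(image), text_encoder(text))
```

With real models, `image_encoder` and `text_encoder` would be the first image coding network and the first text coding network respectively.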
The pre-trained first multi-modal scene recognition model is trained based on the second multi-modal scene recognition model and already has the scene recognition function, but because the number of samples used for pre-training is small, its recognition accuracy is low. The second multi-modal scene recognition model comprises a second image coding network and a second text coding network; its inputs and outputs are the same as those of the first multi-modal scene recognition model, that is, the two models have the same function but different structures and parameter counts. The network complexity of the first image coding network is lower than that of the second image coding network, so the network complexity of the first multi-modal scene recognition model is lower than that of the second multi-modal scene recognition model. The first text coding network is determined based on the second text coding network, and the two text coding networks may be the same or different.
Specifically, the server trains a first multi-modal scene recognition model to be trained using sample images and sample texts, to obtain the pre-trained first multi-modal scene recognition model. The sample images and sample texts are used for model training and may be images and descriptive texts of different scenes; the sample images and sample texts used to train different networks may be the same or different.
In some embodiments, the server may input a sample image and a sample text into the first multi-modal scene recognition model to be trained for similarity calculation to generate a first similarity, input the sample image and the sample text into the trained second multi-modal scene recognition model for similarity calculation to generate a second similarity, calculate the difference between the first similarity and the second similarity, and adjust the parameters of the first multi-modal scene recognition model using this difference, to obtain the pre-trained first multi-modal scene recognition model.
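This pre-training step can be read as similarity distillation: over a batch of (image, text) pairs, the student's similarity scores are regressed toward the teacher's. A hedged sketch follows; the squared-error objective and the callable-model interface are assumptions for illustration:

```python
def pretraining_loss(pairs, student_model, teacher_model):
    """Average squared difference between the similarities the student and
    teacher models assign to the same (image, text) pairs."""
    total = 0.0
    for image, text in pairs:
        # Each model maps an (image, text) pair to a scalar similarity.
        total += (student_model(image, text) - teacher_model(image, text)) ** 2
    return total / len(pairs)
```

Minimizing this loss over many pairs transfers the teacher's image-text matching behaviour to the smaller student model.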
In some embodiments, the trained second multi-modal scene recognition model is obtained by training a second multi-modal scene recognition model to be trained with the sample images and sample texts of a very large dataset; because the amount of training data is very large, the training time long, and the network complexity high, the trained second multi-modal scene recognition model has high recognition accuracy. However, because of its high network complexity, the second multi-modal scene recognition model is difficult to deploy on a mobile terminal. Therefore, the trained second multi-modal scene recognition model can be used as a teacher model and the pre-trained first multi-modal scene recognition model as a student model, and knowledge distillation is performed on the student model using the teacher model, so as to improve the recognition accuracy of the first multi-modal scene recognition model.
Step 204: input a sample image into the first image coding network for coding, and input the coding result into the pre-trained first auxiliary branch for image recognition, to obtain a first image recognition result.
The first auxiliary branch is used for identifying scene features in an image, for example, the image is a beach image comprising sky, sea waves, beach and shells, and the image identification result is "sky", "sea waves", "beach" and "shells". The first auxiliary branch may assist in the distillation training of the model. The first image recognition result is obtained by performing image recognition on the sample image by a pre-trained first auxiliary branch, and the pre-trained first auxiliary branch is obtained by training based on a trained second auxiliary branch.
Specifically, the server acquires a sample image and, as shown in FIG. 3A, inputs the sample image into the first image coding network for coding, then inputs the coding result into the first auxiliary branch for image recognition, thereby obtaining the first image recognition result.
In some embodiments, the output layer corresponding to the first image coding network is a fully connected layer, and the first auxiliary branch is attached after the first image coding network as a fully connected operator. The part of the first multi-modal scene recognition model that processes images then contains two output branches: the output layer corresponding to the first image coding network outputs the image coding features, and the first auxiliary branch outputs the image recognition result.
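The two-branch layout described above can be sketched as a shared encoder feeding both its original output layer and an auxiliary fully connected head. The layer shapes and the list-based arithmetic below are hypothetical simplifications:

```python
def fully_connected(features, weights, biases):
    """One fully connected layer: `weights` holds one row per output unit."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def forward_two_branches(image, encoder, out_weights, out_biases,
                         aux_weights, aux_biases):
    """Shared encoder, then two branches: coding features and recognition scores."""
    features = encoder(image)
    coding_features = fully_connected(features, out_weights, out_biases)
    recognition_scores = fully_connected(features, aux_weights, aux_biases)
    return coding_features, recognition_scores
```

The first branch feeds the image-text similarity computation; the second exists only to support the auxiliary-branch distillation and can be removed after training.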
In some embodiments, the server may use the trained second auxiliary branch to perform distillation training on the first auxiliary branch to be trained, obtaining the pre-trained first auxiliary branch, which has the function of identifying scene features in an image. The server inputs the sample image into the first image coding network in the first multi-modal scene recognition model to be trained for coding and inputs the coding result into the first auxiliary branch to be trained for image recognition, to obtain a sample recognition result; inputs the sample image into the trained second image coding network for coding and inputs the coding result into the trained second auxiliary branch for image recognition, to obtain a target recognition result; and then adjusts the parameters of the first auxiliary branch to be trained based on the difference between the sample recognition result and the target recognition result until the network converges, to obtain the pre-trained first auxiliary branch.
Step 206: input the sample image into the second image coding network for coding, and input the coding result into the trained second auxiliary branch for image recognition, to obtain a second image recognition result.
Wherein the second auxiliary branch is used to identify scene features in the image.
Specifically, the server inputs the sample image into the second image coding network in the trained second multi-modal scene recognition model for coding, and inputs the coding result into the trained second auxiliary branch for image recognition, to obtain the second image recognition result.
In some embodiments, the combination of the second auxiliary branch and the second image coding network can be regarded as an auxiliary model corresponding to the second multi-modal scene recognition model; the auxiliary model is derived from the second multi-modal scene recognition model. FIG. 3B shows a schematic structural diagram of the second multi-modal scene recognition model, and FIG. 3C shows a schematic structural diagram of the auxiliary model. The second auxiliary branch may be attached, as a fully connected operator, after the second image coding network of the second multi-modal scene recognition model, while the output layer corresponding to the second image coding network and everything after it are discarded; the resulting auxiliary model is shown in FIG. 3C.
Step 208: adjust the first image coding network based on the difference between the first image recognition result and the second image recognition result, to obtain a trained first multi-modal scene recognition model.
Specifically, the server may adjust the parameters of the first auxiliary branch and the parameters of the first image coding network in the first multi-modal scene recognition model based on the difference between the first image recognition result and the second image recognition result, to obtain the trained first multi-modal scene recognition model.
In some embodiments, the pre-trained first multi-modal scene recognition model is the pre-trained student model, the pre-trained first auxiliary branch is the auxiliary branch of the student model, and the first image coding network is the original branch of the student model; the trained second multi-modal scene recognition model is the teacher model, and the model formed by the trained second auxiliary branch and the second image coding network is called the auxiliary model. The auxiliary model and the teacher model are two independent models. The server inputs the sample image into the pre-trained student model, the trained teacher model and the trained auxiliary model respectively: in the student model, the first image coding network and its output layer output the first coding feature while the auxiliary branch outputs the first image recognition result; in the teacher model, the second image coding network and its output layer output the second coding feature; and the auxiliary model outputs the second image recognition result. The server obtains a first loss value based on the difference between the first image recognition result and the second image recognition result, and a second loss value based on the feature difference between the first coding feature and the second coding feature. The server then computes a weighted sum of the first loss value and the second loss value to obtain a weighted loss value, and uses the weighted loss value to adjust the first image coding network and the auxiliary branch of the student model. Because the parameters of the two output branches of the student model are iteratively updated using both the teacher model and the auxiliary model, this distillation process is called mixed distillation training of the student model.
In some embodiments, as shown in fig. 4, the server uses the trained second multi-modal scene recognition model as a teacher model, connects a second auxiliary branch after the second image coding network in the teacher model to construct an auxiliary model, and trains the auxiliary model with sample images to obtain a trained auxiliary model; the server then uses the pre-trained first multi-modal scene recognition model as a student model, connects a first auxiliary branch after the first image coding network in the student model, and performs distillation training on the auxiliary branch of the student model with the trained auxiliary model to obtain a pre-trained first auxiliary branch; the server then performs mixed distillation training on the original branch and the auxiliary branch of the student model with the teacher model and the auxiliary model to obtain a trained student model, and removes the auxiliary branch of the student model to obtain the trained first multi-modal scene recognition model.
At step 210, scene recognition is performed based on the trained first multi-modal scene recognition model.
The scene is a recognizable location such as a beach, flower shop, station, park, or school, and scene recognition may be determining a corresponding scene text based on scene features in a scene image. The scene image is an image corresponding to a scene; for example, when the scene is a station, the scene image may be a photograph of the station. The scene text may be a scene name or a scene description text.
Specifically, the server determines a target scene image and a candidate scene text, and inputs the target scene image and the candidate scene text into the trained first multi-modal scene recognition model to obtain the similarity between the target scene image and the candidate scene text; the server then determines the target scene text based on the similarity to realize scene recognition. For example, the similarity between the target scene image and each of a plurality of candidate scene texts may be calculated, and the candidate scene text with the maximum similarity may be used as the target scene text; alternatively, the similarity may be compared with a similarity threshold, and the candidate scene text is used as the target scene text when the similarity is greater than the similarity threshold. The target scene image is a scene image to be recognized, and the candidate scene text may be determined from a scene text library, which is preset and includes a plurality of scene texts. The target scene text is the scene text that matches the target scene image.
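The two selection strategies above (maximum similarity, or similarity-threshold comparison) can be combined in a short sketch. The function name, the feature dimensions, and the 0.9 threshold are illustrative assumptions; the features stand in for the model's image and text coding outputs.

```python
import numpy as np

def select_scene_text(image_feat, text_feats, texts, threshold=0.9):
    """Pick the candidate scene text whose feature best matches the image.

    Returns (text, similarity) for the highest-cosine candidate, or
    (None, similarity) when no candidate clears the threshold.
    """
    image_feat = image_feat / np.linalg.norm(image_feat)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = text_feats @ image_feat            # cosine similarity per candidate
    best = int(np.argmax(sims))
    if sims[best] > threshold:
        return texts[best], float(sims[best])
    return None, float(sims[best])

rng = np.random.default_rng(0)
img = rng.normal(size=8)                      # target scene image feature
cands = rng.normal(size=(3, 8))               # candidate scene text features
cands[1] = img + 0.01 * rng.normal(size=8)    # make "station" nearly identical
text, sim = select_scene_text(img, cands, ["beach", "station", "park"])
print(text, round(sim, 3))                    # "station", similarity close to 1.0
```

Raising the threshold trades recall for precision: with `threshold=1.1` no candidate can qualify and the function reports no match.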
In the above multi-modal scene recognition method, the first multi-modal scene recognition model includes the first image coding network, the second multi-modal scene recognition model includes the second image coding network, and the network complexity of the first image coding network is smaller than that of the second image coding network. The trained second multi-modal scene recognition model is trained on a large amount of data, and can calculate the similarity between data of different modalities to realize scene recognition with high recognition precision. Adjusting the first image coding network based on the difference between the first image recognition result and the second image recognition result enhances the image expression capability of the first image coding network, so that the trained first multi-modal scene recognition model also achieves high recognition precision, thereby improving the accuracy of scene recognition.
In some embodiments, step 208 further comprises: obtaining a first loss value based on the difference between the first image recognition result and the second image recognition result; inputting a sample image into the first image coding network in the pre-trained first multi-modal scene recognition model for coding processing to obtain a first coding feature; inputting the sample image into the second image coding network in the trained second multi-modal scene recognition model for coding processing to obtain a second coding feature; obtaining a second loss value based on the feature difference between the first coding feature and the second coding feature; and adjusting the first image coding network in the pre-trained first multi-modal scene recognition model based on the first loss value and the second loss value to obtain the trained first multi-modal scene recognition model.
The first loss value is obtained based on the difference between the first image recognition result and the second image recognition result, the first coding feature is obtained by coding the sample image through the first image coding network, and the second coding feature is obtained by coding the sample image through the second image coding network.
Specifically, the server obtains the first loss value according to the difference between the first image recognition result and the second image recognition result; for example, the first loss value may be calculated with a binary cross-entropy (BCE) loss function. The server obtains the second loss value based on the feature difference between the first coding feature and the second coding feature; for example, the second loss value may be calculated with a mean squared error (MSE) loss function. The server then performs a weighted calculation on the first loss value and the second loss value to obtain a weighted loss value, and adjusts the first image coding network in the pre-trained first multi-modal scene recognition model with the weighted loss value until the model converges. The output branch formed by the first auxiliary branch connected to the first multi-modal scene recognition model is then removed, and the structure and parameters of the trained first image coding network are retained, to obtain the trained first multi-modal scene recognition model.
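The weighted combination of the BCE and MSE losses can be written out directly. A minimal sketch, assuming sigmoid-normalized recognition outputs and equal weights; the function names and the weight values are illustrative, not taken from the patent.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy with soft targets (the auxiliary model's outputs)."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def mse(a, b):
    """Mean squared error between coding features."""
    return float(np.mean((a - b) ** 2))

def mixed_distillation_loss(student_pred, aux_model_pred,
                            student_feat, teacher_feat,
                            w_bce=1.0, w_mse=1.0):
    """Weighted sum of the two distillation losses; the weights are
    example hyperparameters, not values specified by the patent."""
    first_loss = bce(student_pred, aux_model_pred)   # recognition-result difference
    second_loss = mse(student_feat, teacher_feat)    # coding-feature difference
    return w_bce * first_loss + w_mse * second_loss

s_pred = np.array([0.8, 0.2, 0.6])   # student auxiliary-branch outputs
t_pred = np.array([0.9, 0.1, 0.7])   # auxiliary-model outputs
s_feat = np.array([0.5, -0.5])       # first coding feature
t_feat = np.array([0.4, -0.6])       # second coding feature
print(round(mixed_distillation_loss(s_pred, t_pred, s_feat, t_feat), 4))
```

The loss shrinks as the student's predictions and features approach those of the auxiliary model and teacher, which is what drives the adjustment of the first image coding network.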
In this embodiment, a first loss value is obtained based on the first image recognition result and the second image recognition result, a second loss value is obtained based on the first coding feature and the second coding feature, and the first image coding network in the pre-trained first multi-modal scene recognition model is adjusted based on the first loss value and the second loss value, so that the expression capability of the first image coding network on images is improved, the trained first multi-modal scene recognition model achieves higher recognition precision, and the accuracy of scene recognition is improved.
In some embodiments, the step of determining the trained second auxiliary branch comprises: inputting the sample image into the second image coding network in the trained second multi-modal scene recognition model for coding processing, and inputting the result of the coding processing into the second auxiliary branch to be trained for image recognition, to obtain a third image recognition result of the sample image; and adjusting the second auxiliary branch to be trained based on the difference between the third image recognition result and the standard image recognition result of the sample image, to obtain the trained second auxiliary branch.
The third image recognition result is obtained by the second auxiliary branch to be trained performing image recognition on the sample image. The standard image recognition result is obtained by manually labeling the sample image and is also called the label of the sample image; one sample image may have one or more standard recognition results. For example, if the sample image includes sky, sea waves, a beach and shells, the standard recognition results of the sample image are "sky", "sea waves", "beach" and "shells", i.e. the sample image may also be called multi-label single-modal data.
Specifically, the server trains the second auxiliary branch to be trained that is connected to the second multi-modal scene recognition model, i.e. trains the auxiliary model. The server may input the sample image into the second image coding network in the trained second multi-modal scene recognition model for coding processing, and input the result of the coding processing into the second auxiliary branch to be trained for image recognition, to obtain the third image recognition result of the sample image; the server then adjusts the second auxiliary branch to be trained based on the difference between the third image recognition result and the standard image recognition result of the sample image, to obtain the trained second auxiliary branch.
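Training the auxiliary branch against multi-label standard recognition results can be sketched as follows. The frozen encoder stands in for the already-trained second image coding network; the sizes, learning rate, and random labels are illustrative assumptions only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical stand-in for the trained second image coding network.
encoder = nn.Linear(16, 8)
for p in encoder.parameters():
    p.requires_grad = False             # teacher encoder stays fixed here

aux_branch = nn.Linear(8, 4)            # second auxiliary branch to be trained
opt = torch.optim.SGD(aux_branch.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()        # multi-label loss vs. the standard labels

images = torch.randn(32, 16)            # toy sample images (flattened)
labels = (torch.rand(32, 4) > 0.5).float()  # e.g. sky / sea-wave / beach / shell tags

losses = []
for _ in range(100):
    logits = aux_branch(encoder(images))    # coding processing, then recognition
    loss = loss_fn(logits, labels)          # third result vs. standard result
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
print(round(losses[0], 3), round(losses[-1], 3))  # loss should drop over training
```

Only `aux_branch` is updated; the encoder's parameters carry no gradients, matching the first stage of the auxiliary-model training described below.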
In this embodiment, the second auxiliary branch to be trained is adjusted based on the difference between the third image recognition result and the standard image recognition result of the sample image, to obtain the trained second auxiliary branch, so that the trained second auxiliary branch can be used to perform distillation training on the first auxiliary branch, which improves the recognition accuracy of the first multi-modal scene recognition model.
In some embodiments, adjusting the second auxiliary branch to be trained based on the difference between the third image recognition result and the standard image recognition result of the sample image to obtain the trained second auxiliary branch includes: adjusting the second auxiliary branch to be trained based on the difference between the third image recognition result and the standard image recognition result, to obtain a preliminarily trained second auxiliary branch; inputting the sample image into the second image coding network in the trained second multi-modal scene recognition model for coding processing, and inputting the result of the coding processing into the preliminarily trained second auxiliary branch for image recognition, to obtain a fourth image recognition result of the sample image; and adjusting the preliminarily trained second auxiliary branch based on the difference between the fourth image recognition result and the standard image recognition result of the sample image, to obtain the trained second auxiliary branch.
The fourth image recognition result is obtained by the preliminarily trained second auxiliary branch performing image recognition on the sample image.
Specifically, the server first fixes the parameters of the second image coding network in the second multi-modal scene recognition model and adjusts the second auxiliary branch to be trained based on the difference between the third image recognition result and the standard image recognition result, to obtain the preliminarily trained second auxiliary branch; the parameters of the second image coding network remain unchanged during this stage of training. The server then releases the parameters of the second image coding network in the second multi-modal scene recognition model so that they participate in training at a preset learning rate, i.e. adjusts the preliminarily trained second auxiliary branch and the second image coding network based on the difference between the fourth image recognition result and the standard image recognition result of the sample image, and takes the second auxiliary branch with the highest precision during training as the trained second auxiliary branch.
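The two-stage schedule above (freeze the coding network, then release it at a preset, typically smaller, learning rate) is a standard fine-tuning pattern. A minimal sketch in PyTorch; the module sizes and learning-rate values are illustrative assumptions.

```python
import torch.nn as nn
import torch.optim as optim

encoder = nn.Linear(16, 8)   # stand-in for the second image coding network
aux = nn.Linear(8, 4)        # stand-in for the second auxiliary branch

# Stage 1: fix the coding network, train only the auxiliary branch.
for p in encoder.parameters():
    p.requires_grad = False
stage1_opt = optim.SGD(aux.parameters(), lr=1e-2)

# Stage 2: release the coding network so it participates in training
# at a preset (here, smaller) learning rate alongside the branch.
for p in encoder.parameters():
    p.requires_grad = True
stage2_opt = optim.SGD([
    {"params": encoder.parameters(), "lr": 1e-4},  # gentle fine-tuning
    {"params": aux.parameters(), "lr": 1e-2},
])
print([g["lr"] for g in stage2_opt.param_groups])  # [0.0001, 0.01]
```

Keeping the encoder's learning rate low in stage 2 lets the feature extractor adapt slightly without destroying what it learned during its original multi-modal training.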
In this embodiment, based on the difference between the third image recognition result and the standard image recognition result, the second auxiliary branch to be trained is adjusted to obtain the second auxiliary branch for preliminary training, and based on the difference between the fourth image recognition result and the standard image recognition result of the sample image, the second auxiliary branch for preliminary training is adjusted to obtain the trained second auxiliary branch, so that the recognition accuracy of the trained second auxiliary branch is further improved, and the distillation training effect is better.
In some embodiments, determining the pre-trained first multi-modal scene recognition model includes: inputting a sample image and a sample text into the first multi-modal scene recognition model to be trained for similarity calculation, generating a first similarity; inputting the sample image and the sample text into the trained second multi-modal scene recognition model for similarity calculation, generating a second similarity, where the first similarity and the second similarity each characterize the similarity between the sample image and the sample text; and adjusting parameters of the first multi-modal scene recognition model to be trained based on the difference between the first similarity and the second similarity, to obtain the pre-trained first multi-modal scene recognition model.
The first similarity is generated by the first multi-modal scene recognition model to be trained, and the second similarity is generated by the trained second multi-modal scene recognition model. The sample images and sample texts employed to train the different networks may be the same or different.
Specifically, the server adjusts parameters of the first image coding network and parameters of the first text coding network in the first multi-modal scene recognition model to be trained based on the difference between the first similarity and the second similarity until the model converges, to obtain the pre-trained first multi-modal scene recognition model.
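The pre-training signal described here is the difference between the student's and the teacher's image-text similarities. A minimal sketch, assuming cosine similarity over coding features and an MSE penalty on the difference; the patent only requires "a difference", so MSE is one illustrative choice among several.

```python
import numpy as np

def cosine_similarity_matrix(img_feats, txt_feats):
    """Image-text similarity: rows index images, columns index texts."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    return img @ txt.T

def similarity_distillation_loss(student_sim, teacher_sim):
    """Penalize the student for deviating from the teacher's similarities."""
    return float(np.mean((student_sim - teacher_sim) ** 2))

rng = np.random.default_rng(1)
imgs = rng.normal(size=(4, 8))    # sample-image coding features (toy)
txts = rng.normal(size=(4, 8))    # sample-text coding features (toy)
teacher_sim = cosine_similarity_matrix(imgs, txts)            # second similarity
student_sim = cosine_similarity_matrix(                       # first similarity
    imgs + 0.1 * rng.normal(size=(4, 8)), txts)               # imperfect student
print(round(similarity_distillation_loss(student_sim, teacher_sim), 4))
```

The loss is zero exactly when the student reproduces the teacher's similarity matrix, which is the convergence condition the pre-training drives toward.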
In this embodiment, the trained second multi-modal scene recognition model is used to train the first multi-modal scene recognition model to be trained, so as to obtain the pre-trained first multi-modal scene recognition model, so that the pre-trained first multi-modal scene recognition model has the capability of calculating the similarity between the image and the text.
In some embodiments, step 210 further comprises: inputting the target scene image into the first image coding network of the trained first multi-modal scene recognition model to obtain target image features; inputting the candidate scene text into the first text coding network of the trained first multi-modal scene recognition model to obtain candidate text features; and determining the candidate scene text as the target scene text matching the target scene image when the similarity between the target image features and the candidate text features is greater than a similarity threshold.
Wherein the target scene image is a scene image to be identified and the target scene text is a scene text matching the target scene image. The similarity threshold is preset, and for example, the similarity threshold may be set to 0.9.
Specifically, the server may obtain the target scene image to be recognized from the terminal and determine a candidate scene text from the scene text library. The server inputs the target scene image into the first image coding network of the trained first multi-modal scene recognition model for coding processing, and obtains target image features through the output layer corresponding to the first image coding network; the server inputs the candidate scene text into the first text coding network of the trained first multi-modal scene recognition model for coding processing to obtain candidate text features, and then calculates the similarity between the target image features and the candidate text features, which characterizes the degree of matching between the target scene image and the candidate scene text. When the similarity is greater than the similarity threshold, the candidate scene text is determined as the target scene text, which is the scene name or scene description text corresponding to the scene in the scene image; when the similarity is less than or equal to the similarity threshold, the server returns to the step of determining a candidate scene text from the scene text library until the similarity is greater than the similarity threshold.
In this embodiment, the similarity between the target scene image and the candidate scene text is obtained based on the target image features and the candidate text features, and the candidate scene text is determined as the target scene text when the similarity is greater than the similarity threshold, thereby realizing scene recognition.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated. Unless explicitly stated herein, there is no strict limitation on the order of execution, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; the order of execution of these sub-steps or stages is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps, sub-steps or stages.
Based on the same inventive concept, an embodiment of the application further provides a multi-modal scene recognition device for implementing the multi-modal scene recognition method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitations in the embodiments of the multi-modal scene recognition device or devices provided below may refer to the limitations on the multi-modal scene recognition method above, and will not be repeated herein.
In some embodiments, as shown in fig. 5, there is provided a multi-modal scene recognition apparatus, including: a model determination module 502, a first image recognition module 504, a second image recognition module 506, a model adjustment module 508, and a scene recognition module 510, wherein:
a model determination module 502 for determining a pre-trained first multi-modal scene recognition model; the pre-trained first multi-modal scene recognition model comprises a first image coding network and a first text coding network, the pre-trained first multi-modal scene recognition model is obtained by training based on a trained second multi-modal scene recognition model, the second multi-modal scene recognition model comprises a second image coding network and a second text coding network, and the network complexity of the first image coding network is smaller than that of the second image coding network.
The first image recognition module 504 is configured to input the sample image into the first image encoding network for encoding, and input a result of the encoding into the first auxiliary branch for image recognition, so as to obtain a first image recognition result.
The second image recognition module 506 is configured to input the sample image into the second image encoding network for encoding, and input the result of encoding into the trained second auxiliary branch for image recognition, so as to obtain a second image recognition result.
The model adjustment module 508 is configured to adjust the first image coding network based on the difference between the first image recognition result and the second image recognition result, to obtain a trained first multi-modal scene recognition model.
The scene recognition module 510 is configured to perform scene recognition based on the trained first multi-modal scene recognition model.
In some embodiments, the model adjustment module 508 is further configured to: obtain a first loss value based on the difference between the first image recognition result and the second image recognition result; input a sample image into the first image coding network in the pre-trained first multi-modal scene recognition model for coding processing to obtain a first coding feature; input the sample image into the second image coding network in the trained second multi-modal scene recognition model for coding processing to obtain a second coding feature; obtain a second loss value based on the feature difference between the first coding feature and the second coding feature; and adjust the first image coding network in the pre-trained first multi-modal scene recognition model based on the first loss value and the second loss value to obtain the trained first multi-modal scene recognition model.
In some embodiments, the multi-modal scene recognition device further includes a first training module, the first training module being configured to: input the sample image into the second image coding network in the trained second multi-modal scene recognition model for coding processing, and input the result of the coding processing into the second auxiliary branch to be trained for image recognition, to obtain a third image recognition result of the sample image; and adjust the second auxiliary branch to be trained based on the difference between the third image recognition result and the standard image recognition result of the sample image, to obtain the trained second auxiliary branch.
In some embodiments, the first training module is further configured to: adjust the second auxiliary branch to be trained based on the difference between the third image recognition result and the standard image recognition result, to obtain a preliminarily trained second auxiliary branch; input the sample image into the second image coding network in the trained second multi-modal scene recognition model for coding processing, and input the result of the coding processing into the preliminarily trained second auxiliary branch for image recognition, to obtain a fourth image recognition result of the sample image; and adjust the preliminarily trained second auxiliary branch based on the difference between the fourth image recognition result and the standard image recognition result of the sample image, to obtain the trained second auxiliary branch.
In some embodiments, the multi-modal scene recognition device further includes a second training module, the second training module being configured to: input the sample image and the sample text into the first multi-modal scene recognition model to be trained for similarity calculation, generating a first similarity; input the sample image and the sample text into the trained second multi-modal scene recognition model for similarity calculation, generating a second similarity, where the first similarity and the second similarity each characterize the similarity between the sample image and the sample text; and adjust parameters of the first multi-modal scene recognition model to be trained based on the difference between the first similarity and the second similarity, to obtain the pre-trained first multi-modal scene recognition model.
In some embodiments, the scene recognition module 510 is further configured to: input the target scene image into the first image coding network of the trained first multi-modal scene recognition model to obtain target image features; input the candidate scene text into the first text coding network of the trained first multi-modal scene recognition model to obtain candidate text features; and determine the candidate scene text as the target scene text matching the target scene image when the similarity between the target image features and the candidate text features is greater than a similarity threshold.
The modules in the above multi-modal scene recognition device can be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor may call and execute the operations corresponding to the above modules.
In some embodiments, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, an input/output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data related to the multi-modal scene recognition method. The input/output interface of the computer device is used to exchange information between the processor and an external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a multi-modal scene recognition method.
In some embodiments, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input means. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program, when executed by a processor, implements a multi-modal scene recognition method. The display unit of the computer device is used for forming a visual picture, and can be a display screen, a projection device or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be a key, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by persons skilled in the art that the structures shown in fig. 6 and 7 are block diagrams of only portions of structures associated with the present inventive arrangements and are not limiting of the computer device to which the present inventive arrangements are applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In some embodiments, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps in the multi-modal scene recognition method described above when the computer program is executed.
In some embodiments, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the above-described multi-modal scene recognition method.
In some embodiments, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the above-described multi-modal scene recognition method.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (10)

1. A method for multi-modal scene recognition, the method comprising:
determining a pre-trained first multi-modal scene recognition model; the pre-trained first multi-modal scene recognition model comprises a first image coding network and a first text coding network, the pre-trained first multi-modal scene recognition model is obtained by training based on a trained second multi-modal scene recognition model, the second multi-modal scene recognition model comprises a second image coding network and a second text coding network, and the network complexity of the first image coding network is smaller than that of the second image coding network; the first text coding network is determined based on the second text coding network;
inputting a sample image into the first image coding network for coding processing, and inputting the result of the coding processing into a pre-trained first auxiliary branch for image recognition, to obtain a first image recognition result;
inputting the sample image into the second image coding network for coding processing, and inputting the result of the coding processing into a trained second auxiliary branch for image recognition, to obtain a second image recognition result;
adjusting the first image coding network based on the difference between the first image recognition result and the second image recognition result, to obtain a trained first multi-modal scene recognition model; and
performing scene recognition based on the trained first multi-modal scene recognition model.
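The training step recited in claim 1 amounts to a knowledge-distillation update: the lightweight student image encoder is adjusted so that its recognition result (produced through its auxiliary branch) approaches that of the heavier teacher encoder. The following PyTorch sketch is purely illustrative — the module names, the KL-divergence loss, and the freezing strategy are assumptions for exposition and are not recited in the claims:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneRecognitionDistiller(nn.Module):
    """Illustrative sketch of the claim-1 step: a student image coding
    network (first) is distilled from a frozen teacher (second) by
    matching the recognition outputs of their auxiliary branches."""

    def __init__(self, student_encoder, teacher_encoder,
                 first_aux_branch, second_aux_branch):
        super().__init__()
        self.student_encoder = student_encoder    # first image coding network
        self.teacher_encoder = teacher_encoder    # second image coding network
        self.first_aux = first_aux_branch         # pre-trained first auxiliary branch
        self.second_aux = second_aux_branch       # trained second auxiliary branch
        # the teacher path stays fixed during distillation
        for p in self.teacher_encoder.parameters():
            p.requires_grad_(False)
        for p in self.second_aux.parameters():
            p.requires_grad_(False)

    def distillation_loss(self, sample_image):
        # first image recognition result (student path)
        first_result = self.first_aux(self.student_encoder(sample_image))
        # second image recognition result (teacher path, no gradients)
        with torch.no_grad():
            second_result = self.second_aux(self.teacher_encoder(sample_image))
        # "difference between the first and second image recognition result"
        # modeled here as a KL divergence between class distributions
        return F.kl_div(F.log_softmax(first_result, dim=-1),
                        F.softmax(second_result, dim=-1),
                        reduction="batchmean")
```

In such a setup only the student encoder (and, depending on the embodiment, its auxiliary branch) would receive gradient updates; the teacher path serves solely as a target.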
2. The method of claim 1, wherein adjusting the first image coding network based on the difference between the first image recognition result and the second image recognition result to obtain a trained first multi-modal scene recognition model comprises:
obtaining a first loss value based on the difference between the first image recognition result and the second image recognition result;
inputting a sample image into the first image coding network in the pre-trained first multi-modal scene recognition model for coding processing, to obtain a first coding feature;
inputting the sample image into the second image coding network in the trained second multi-modal scene recognition model for coding processing, to obtain a second coding feature;
obtaining a second loss value based on the feature difference between the first coding feature and the second coding feature; and
adjusting the first image coding network in the pre-trained first multi-modal scene recognition model based on the first loss value and the second loss value, to obtain the trained first multi-modal scene recognition model.
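Claim 2 combines two signals: a first loss from the recognition-result difference and a second loss from the coding-feature difference. A hedged sketch of such a combined objective — the KL and MSE loss forms and the `alpha` weighting are assumptions for illustration; the claim itself specifies only "difference":

```python
import torch
import torch.nn.functional as F

def combined_distillation_loss(first_result, second_result,
                               first_feat, second_feat, alpha=0.5):
    """Combine a recognition-output loss (first loss value) with a
    coding-feature loss (second loss value), as in claim 2."""
    # first loss value: difference between the two recognition results
    first_loss = F.kl_div(F.log_softmax(first_result, dim=-1),
                          F.softmax(second_result, dim=-1),
                          reduction="batchmean")
    # second loss value: difference between the two coding features;
    # matching feature dimensions are assumed here (a projection layer
    # would be needed if the two encoders' widths differ)
    second_loss = F.mse_loss(first_feat, second_feat)
    return first_loss + alpha * second_loss
```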
3. The method of claim 1, wherein the step of determining the trained second auxiliary branch comprises:
inputting the sample image into the second image coding network in the trained second multi-modal scene recognition model for coding processing, and inputting the result of the coding processing into a second auxiliary branch to be trained for image recognition, to obtain a third image recognition result of the sample image; and
adjusting the second auxiliary branch to be trained based on the difference between the third image recognition result and a standard image recognition result of the sample image, to obtain the trained second auxiliary branch.
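The step of claim 3 trains only the second auxiliary branch on top of the frozen teacher encoder, against the standard (ground-truth) recognition result. An illustrative training-step sketch, with cross-entropy as an assumed choice of loss and all function names hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_second_aux_step(teacher_encoder, second_aux, optimizer,
                          sample_image, standard_labels):
    """One claim-3 update: the trained teacher encoder is frozen and
    only the second auxiliary branch is adjusted toward the standard
    image recognition result."""
    with torch.no_grad():
        coding_result = teacher_encoder(sample_image)   # frozen encoder
    third_result = second_aux(coding_result)            # third recognition result
    # difference against the standard result, here as cross-entropy
    loss = F.cross_entropy(third_result, standard_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```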
4. The method according to claim 3, wherein adjusting the second auxiliary branch to be trained based on the difference between the third image recognition result and the standard image recognition result of the sample image to obtain the trained second auxiliary branch comprises:
adjusting the second auxiliary branch to be trained based on the difference between the third image recognition result and the standard image recognition result, to obtain a preliminarily trained second auxiliary branch;
inputting the sample image into the second image coding network in the trained second multi-modal scene recognition model for coding processing, and inputting the result of the coding processing into the preliminarily trained second auxiliary branch for image recognition, to obtain a fourth image recognition result of the sample image; and
adjusting the preliminarily trained second auxiliary branch based on the difference between the fourth image recognition result and the standard image recognition result of the sample image, to obtain the trained second auxiliary branch.
5. The method of claim 1, wherein determining the pre-trained first multi-modal scene recognition model comprises:
inputting a sample image and a sample text into a first multi-modal scene recognition model to be trained for similarity calculation, to generate a first similarity;
inputting the sample image and the sample text into the trained second multi-modal scene recognition model for similarity calculation, to generate a second similarity; the second similarity characterizes the similarity between the sample image and the sample text; and
adjusting parameters of the first multi-modal scene recognition model to be trained based on the difference between the first similarity and the second similarity, to obtain the pre-trained first multi-modal scene recognition model.
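Claim 5's pre-training aligns the student model's image-text similarity with the teacher's. A sketch assuming cosine similarity and an MSE alignment loss — the claim specifies only "similarity calculation" and "difference", so both choices are assumptions:

```python
import torch
import torch.nn.functional as F

def similarity_distillation_loss(student_img_feat, student_txt_feat,
                                 teacher_img_feat, teacher_txt_feat):
    """Push the student's image-text similarity (first similarity)
    toward the teacher's (second similarity), as in claim 5."""
    first_sim = F.cosine_similarity(student_img_feat, student_txt_feat, dim=-1)
    second_sim = F.cosine_similarity(teacher_img_feat, teacher_txt_feat, dim=-1)
    # difference between the first and second similarity
    return F.mse_loss(first_sim, second_sim)
```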
6. The method of claim 1, wherein performing scene recognition based on the trained first multi-modal scene recognition model comprises:
inputting a target scene image into the first image coding network of the trained first multi-modal scene recognition model, to obtain a target image feature;
inputting a candidate scene text into the first text coding network of the trained first multi-modal scene recognition model, to obtain a candidate text feature; and
determining the candidate scene text as the target scene text matching the target scene image when the similarity between the target image feature and the candidate text feature is greater than a similarity threshold.
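At inference (claim 6), the target image feature is compared against each candidate scene text feature, and candidates whose similarity exceeds the threshold are taken as matching scene texts. A sketch assuming cosine similarity and an illustrative threshold value:

```python
import torch
import torch.nn.functional as F

def match_scene_texts(image_feature, candidate_features, candidate_texts,
                      similarity_threshold=0.5):
    """Return the candidate scene texts whose feature similarity with
    the target image feature exceeds the threshold, as in claim 6."""
    sims = F.cosine_similarity(image_feature.unsqueeze(0),
                               candidate_features, dim=-1)
    return [text for text, sim in zip(candidate_texts, sims.tolist())
            if sim > similarity_threshold]
```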
7. A multi-modal scene recognition apparatus, the apparatus comprising:
a model determining module, configured to determine a pre-trained first multi-modal scene recognition model; the pre-trained first multi-modal scene recognition model comprises a first image coding network and a first text coding network, and is obtained by training based on a trained second multi-modal scene recognition model; the second multi-modal scene recognition model comprises a second image coding network and a second text coding network, and the network complexity of the first image coding network is lower than that of the second image coding network; the first text coding network is determined based on the second text coding network;
a first image recognition module, configured to input a sample image into the first image coding network for coding processing, and input the result of the coding processing into a pre-trained first auxiliary branch for image recognition, to obtain a first image recognition result;
a second image recognition module, configured to input the sample image into the second image coding network for coding processing, and input the result of the coding processing into a trained second auxiliary branch for image recognition, to obtain a second image recognition result;
a model adjustment module, configured to adjust the first image coding network based on the difference between the first image recognition result and the second image recognition result, to obtain a trained first multi-modal scene recognition model; and
a scene recognition module, configured to perform scene recognition based on the trained first multi-modal scene recognition model.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 6.
CN202310005427.XA 2023-01-04 2023-01-04 Multi-mode scene recognition method, device, computer equipment and storage medium Pending CN116597293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310005427.XA CN116597293A (en) 2023-01-04 2023-01-04 Multi-mode scene recognition method, device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN116597293A true CN116597293A (en) 2023-08-15

Family

ID=87594283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310005427.XA Pending CN116597293A (en) 2023-01-04 2023-01-04 Multi-mode scene recognition method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116597293A (en)

Similar Documents

Publication Publication Date Title
CN112418292B (en) Image quality evaluation method, device, computer equipment and storage medium
US10217224B2 (en) Method and system for sharing-oriented personalized route planning via a customizable multimedia approach
CN113204659B (en) Label classification method and device for multimedia resources, electronic equipment and storage medium
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN115083435B (en) Audio data processing method and device, computer equipment and storage medium
CN117078790B (en) Image generation method, device, computer equipment and storage medium
CN112950640A (en) Video portrait segmentation method and device, electronic equipment and storage medium
CN112307243B (en) Method and apparatus for retrieving images
CN116630630B (en) Semantic segmentation method, semantic segmentation device, computer equipment and computer readable storage medium
CN116030466B (en) Image text information identification and processing method and device and computer equipment
CN115272667B (en) Farmland image segmentation model training method and device, electronic equipment and medium
CN116597293A (en) Multi-mode scene recognition method, device, computer equipment and storage medium
CN116630629B (en) Domain adaptation-based semantic segmentation method, device, equipment and storage medium
CN114490996B (en) Intention recognition method and device, computer equipment and storage medium
CN117938951B (en) Information pushing method, device, computer equipment and storage medium
US20240104132A1 (en) Determining 3d models corresponding to an image
CN117975473A (en) Bill text detection model training and detection method, device, equipment and medium
CN117648579A (en) Data information and event information matching method and device and computer equipment
CN117934654A (en) Image generation model training, image generation method and device and computer equipment
CN117974707A (en) Training method of image segmentation model, image segmentation method and device
CN117670686A (en) Video frame enhancement method, device, computer equipment and storage medium
CN116230022A (en) Audio conversion method, device, computer equipment and storage medium
CN116910241A (en) Information classification method, apparatus, computer device and storage medium
CN118227107A (en) Code generation model training method, code generation method and device
CN113722444A (en) Text processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination