CN114283316A - Image identification method and device, electronic equipment and storage medium


Info

Publication number
CN114283316A
Authority
CN
China
Prior art keywords
image, sample, classification, feature, semantic
Prior art date
Legal status
Pending
Application number
CN202111089047.6A
Other languages
Chinese (zh)
Inventor
郭卉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111089047.6A
Publication of CN114283316A

Abstract

The present application relates to the field of image processing technologies, and in particular to an image recognition method, an image recognition apparatus, an electronic device, and a storage medium, which are used to improve the accuracy and efficiency of image classification. The method comprises the following steps: extracting a first shallow global feature of an image to be detected based on a metric learning embedding module of an image classification joint model; extracting a first deep semantic feature of the image to be detected based on a deep semantic embedding module of the image classification joint model; and, based on a semantic prediction module of the image classification joint model, performing feature fusion on the first shallow global feature and the first deep semantic feature, and performing multi-label classification on the image to be detected based on the obtained first fusion feature to obtain a classification result of the image to be detected. The image classification joint model is obtained by jointly training the metric learning embedding module and the deep semantic embedding module, which strengthens the joint learning effect, reduces the time consumed by feature extraction, and effectively improves the accuracy and efficiency of image classification.

Description

Image identification method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image recognition method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of internet technology and the rapid growth of multimedia resources, internet content has gradually evolved from mainly text information to various kinds of multimedia data; with the continuous upgrading of electronic equipment, imaging devices keep being simplified and improved, and image data resources are expanding rapidly.
In the fields of image retrieval and the like, a related search task is often required to be executed based on the overall global features and the main semantic features of an image, and therefore how to efficiently and accurately extract the features of the image becomes an urgent problem to be solved.
Disclosure of Invention
The embodiment of the application provides an image identification method, an image identification device, electronic equipment and a storage medium, which are used for improving the accuracy and efficiency of image classification.
An image recognition method provided by an embodiment of the present application includes:
extracting a first shallow global feature of an image to be detected based on a metric learning embedded module in a trained image classification combined model, wherein the first shallow global feature represents global basic information of the image to be detected; and
extracting a first deep semantic feature of the image to be detected based on a depth semantic embedding module in the image classification combined model, wherein the first deep semantic feature represents image semantic information of the image to be detected;
based on a semantic prediction module in the image classification combined model, performing feature fusion on the first shallow layer global feature and the first deep layer semantic feature, and performing multi-label classification on the image to be detected based on the obtained first fusion feature to obtain a classification result of the image to be detected;
the image classification joint model is obtained by performing joint training on the metric learning embedding module and the depth semantic embedding module.
An image recognition apparatus provided in an embodiment of the present application includes:
the first feature extraction unit is used for extracting a first shallow global feature of an image to be detected based on a metric learning embedding module in a trained image classification combined model, wherein the first shallow global feature represents global basic information of the image to be detected; and
the second feature extraction unit is used for extracting a first deep semantic feature of the image to be detected based on a depth semantic embedding module in the image classification combined model, and the first deep semantic feature represents the image semantic information of the image to be detected;
the classification prediction unit is used for performing feature fusion on the first shallow global feature and the first deep semantic feature based on a semantic prediction module in the image classification combined model, and performing multi-label classification on the image to be detected based on the obtained first fusion feature to obtain a classification result of the image to be detected;
the image classification joint model is obtained by performing joint training on the metric learning embedding module and the depth semantic embedding module.
Optionally, the apparatus further comprises:
the image retrieval unit is used for respectively extracting second shallow global features of each candidate image in an image library based on the metric learning embedding module, and respectively extracting second deep semantic features of each candidate image based on the deep semantic embedding module;
respectively performing feature fusion on each second shallow layer global feature and the corresponding second deep layer semantic feature based on the semantic prediction module to obtain second fusion features corresponding to each candidate image;
determining the similarity of each candidate image and the image to be detected respectively based on the first fusion features and the second fusion features corresponding to each candidate image;
and determining a similar image corresponding to the image to be detected based on each similarity.
Optionally, the image classification joint model further includes: a base convolution module; the device further comprises:
the basic processing unit is used for inputting the image to be detected into a basic convolution module in the image classification combined model before the first feature extraction unit extracts the first shallow global feature of the image to be detected based on a metric learning embedded module in a trained image classification combined model, and performing global feature extraction on the image to be detected based on the basic convolution module to obtain a global feature map corresponding to the image to be detected;
the first feature extraction unit is specifically configured to:
and based on the metric learning embedding module, embedding the global feature map to obtain a first shallow global feature corresponding to the image to be detected.
Optionally, the depth semantic embedding module includes a convolution layer and an embedding layer composed of residual structures; the second feature extraction unit is specifically configured to:
and extracting the features of the semantic information in the global feature map based on the convolutional layer in the deep semantic embedding module, and embedding the extracted semantic information based on the embedding layer in the deep semantic embedding module to obtain a first deep semantic feature corresponding to the image to be detected.
Optionally, the learning rate of the semantic prediction module is higher than the learning rates of other modules, where the other modules include the basic convolution module, the metric learning embedding module, and the deep semantic embedding module.
Optionally, the base convolution module includes a plurality of convolution layers; network parameters corresponding to the convolutional layers in the basic convolutional module are obtained based on the pre-trained parameter initialization of the specified sample library, and network parameters corresponding to the convolutional layers in the deep semantic embedding module are obtained based on random initialization.
Optionally, the apparatus further comprises:
the training unit is used for obtaining the image classification combined model through training in the following modes:
acquiring a training sample data set, and selecting a training sample group from the training sample data set;
inputting the selected training sample group into the image classification combined model to be trained, and acquiring the third shallow global features output by the metric learning embedding module in the image classification combined model, the third deep semantic features output by the deep semantic embedding module, and the classification vectors output by the semantic prediction module based on the third shallow global features and the third deep semantic features;
and constructing a target loss function based on the third shallow global feature, the third deep semantic feature and the classification vector, adjusting network parameters of the image classification combined model for multiple times based on the target loss function until the image classification combined model is converged, and outputting the trained image classification combined model.
Optionally, the training unit is specifically configured to:
constructing a first triple loss function based on the third shallow global features corresponding to the training samples in the training sample group; constructing a second triple loss function based on the third deep semantic features corresponding to the training samples;
constructing a multi-label loss function based on the classification vectors and the corresponding multi-label vectors corresponding to the training samples, wherein the classification vectors represent the prediction probabilities of the training samples corresponding to the classification labels, and the multi-label vectors represent the real probabilities of the training samples corresponding to the classification labels;
obtaining a classification entropy loss function based on the classification vector corresponding to each training sample;
and carrying out weighted summation on the first triple loss function, the second triple loss function, the multi-label loss function and the classification entropy loss function to obtain a target loss function.
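A hedged sketch of how the four loss terms above could be combined is given below; the weight values, the use of binary cross entropy for the multi-label loss, and the exact form of the classification entropy term are assumptions, since the text does not fix them here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def target_loss(shallow_a, shallow_p, shallow_n,      # third shallow global features (anchor/positive/negative)
                deep_a, deep_p, deep_n,               # third deep semantic features
                logits, multi_label_targets,          # classification vectors and 0/1 multi-label vectors
                margin=0.2, weights=(1.0, 1.0, 1.0, 0.1)):
    triplet = nn.TripletMarginLoss(margin=margin)
    l_tri1 = triplet(shallow_a, shallow_p, shallow_n)                             # first triple loss
    l_tri2 = triplet(deep_a, deep_p, deep_n)                                      # second triple loss
    l_multi = F.binary_cross_entropy_with_logits(logits, multi_label_targets)     # multi-label loss (assumed BCE)
    probs = torch.sigmoid(logits)
    l_entropy = -(probs * torch.log(probs + 1e-8)).mean()                         # classification entropy loss (assumed form)
    w1, w2, w3, w4 = weights                                                      # weights are assumptions
    return w1 * l_tri1 + w2 * l_tri2 + w3 * l_multi + w4 * l_entropy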
Optionally, each training sample set includes three training samples: an anchor sample, a positive sample and a negative sample; the training unit is specifically configured to:
determining a first distance between a third shallow global feature corresponding to an anchor sample in the training sample group and a third shallow global feature corresponding to a positive sample, and a second distance between the third shallow global feature corresponding to the anchor sample and a third shallow global feature corresponding to a negative sample;
taking the sum of the difference between the first distance and the second distance and a specified boundary value as a first target value, wherein the specified boundary value is used for representing the margin by which the similarity of the positive sample and that of the negative sample should differ;
and taking the maximum of the first target value and a reference parameter as the first triple loss function.
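A direct, illustrative translation of the first triple loss as described above, assuming Euclidean distances and a reference parameter of 0:

```python
import torch


def first_triple_loss(anchor, positive, negative, margin=0.2, reference=0.0):
    d_ap = torch.norm(anchor - positive, dim=1)         # first distance (anchor vs. positive)
    d_an = torch.norm(anchor - negative, dim=1)         # second distance (anchor vs. negative)
    target = d_ap - d_an + margin                       # first target value: difference plus boundary value
    return torch.clamp(target, min=reference).mean()    # max(first target value, reference parameter)
```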
Optionally, each training sample set includes three training samples: an anchor sample, a positive sample and a negative sample; the training unit is specifically configured to:
determining a third distance between the third deep semantic feature corresponding to the anchor sample in the training sample group and the third deep semantic feature corresponding to the positive sample, and a fourth distance between the third deep semantic feature corresponding to the anchor sample and the third deep semantic feature corresponding to the negative sample;
taking the sum of the difference between the third distance and the fourth distance and a specified boundary value as a second target value, wherein the specified boundary value is used for representing the margin by which the similarity of the positive sample and that of the negative sample should differ;
and taking the maximum of the second target value and a reference parameter as the second triple loss function.
Optionally, the training unit is further configured to:
before network parameters of the image classification combined model are adjusted for multiple times based on the target loss function, performing weighted summation on the first triple loss function, the multi-label loss function and the classification entropy loss function to obtain a middle loss function;
and adjusting the network parameters of the deep semantic module in the image classification combined model for the specified times based on the intermediate loss function.
Optionally, the apparatus includes:
a sample construction unit, configured to obtain a training sample set in the training sample data set in the following manner:
according to the similarity between the collected sample images, forming a positive sample pair by every two sample images with the similarity higher than a specified threshold value;
performing sample recombination based on each positive sample pair to obtain a plurality of training sample groups for constituting the training sample data set, each training sample group including three training samples: an anchor sample, a positive sample and a negative sample, wherein the anchor sample and the positive sample are sample images whose similarity is higher than the specified threshold, and the negative sample and the positive sample are sample images whose similarity is not higher than the specified threshold.
Optionally, the sample constructing unit is specifically configured to:
selecting one sample image in a positive sample pair as a target sample, and respectively selecting one sample image from other sample pairs as a candidate sample;
respectively determining the distance between the target sample and each candidate sample;
sorting the candidate samples according to the respective corresponding distances, and selecting at least one candidate sample with a designated sorting position as a negative sample corresponding to the target sample;
and respectively combining each selected negative sample with the positive sample pair to form a training sample group, wherein a target sample in the positive sample pair is a positive sample in the training sample group, and the other sample in the positive sample pair is an anchor sample in the training sample group.
An electronic device provided by an embodiment of the present application includes a processor and a memory, where the memory stores program codes, and when the program codes are executed by the processor, the processor is caused to execute any one of the steps of the image recognition method.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the electronic device to perform the steps of any one of the image recognition methods described above.
An embodiment of the present application provides a computer-readable storage medium, which includes program code for causing an electronic device to perform any one of the steps of the image recognition method described above when the program product runs on the electronic device.
The beneficial effect of this application is as follows:
the embodiment of the application provides an image identification method and device, electronic equipment and a storage medium. According to the image classification method and device, the model structure beneficial to learning of two features is obtained through the metric learning embedding module used for extracting the shallow global features and the depth semantic embedding module used for extracting the deep semantic features, when the images to be detected are classified based on the image joint classification model, the images can be characterized in a combined mode through the bottom image characterization and the deep semantic characterization, the combined learning effect of the bottom image characterization and the deep semantic characterization is enhanced, time consumption of feature extraction is reduced, image classification is carried out based on the obtained image fusion features, more accurate classification results can be obtained, and accuracy and efficiency of image classification are improved.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic diagram of an application scenario in an embodiment of the present application;
fig. 2 is a flowchart of an implementation of an image recognition method in an embodiment of the present application;
FIG. 3 is a flowchart illustrating an implementation of an image retrieval method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an image classification combination model in an embodiment of the present application;
FIG. 5 is a schematic diagram of a residual module according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating an implementation of yet another image recognition method in an embodiment of the present application;
fig. 7 is a flowchart of an implementation of a method for constructing a training sample data set in the embodiment of the present application;
FIG. 8 is a flowchart illustrating a method for training an image classification combination model according to an embodiment of the present disclosure;
FIG. 9 is a flow chart of a method of calculating a loss function in an embodiment of the present application;
fig. 10 is a schematic structural diagram of an image recognition apparatus in an embodiment of the present application;
fig. 11 is a schematic diagram of a hardware component of an electronic device to which an embodiment of the present application is applied;
fig. 12 is a schematic diagram of a hardware component structure of another electronic device to which the embodiment of the present application is applied.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments described in the present application are within the scope of the protection of the present application.
Some concepts related to the embodiments of the present application are described below.
Image recognition: refers to class-level recognition, which only considers the class of an object (e.g., person, dog, cat, bird, etc.) and gives the class to which the object belongs, without regard to a particular instance of the object. A typical example is large-scale generic object recognition, such as the recognition task on the open-source ImageNet dataset, which identifies which of 1000 categories an object belongs to.
Image multi-label identification: whether the image has a combination of the specified attribute labels is recognized by the computer. An image may have multiple attributes, and the multi-label identification task is to determine which preset attribute labels a certain image has.
Image shallow global feature: characterizes the global basic information of an image, namely an embedded (embedding) representation of the image's basic feature information, such as texture features or Scale-Invariant Feature Transform (SIFT) corner features; in deep learning, the output of a shallow neural network belongs to this bottom-layer image representation. SIFT is a descriptor used in the field of image processing; it is scale-invariant, can detect key points in an image, and is a local feature descriptor.
Image deep semantic feature: characterizes the semantic information of an image, namely a feature (semantic embedding) that carries semantic information; in the present application, such a feature is output by the layer immediately before the classification layer, so the semantic embedding can predict the semantic labels of the image through the classification layer.
Metric (Metric) and Metric Learning (Metric Learning): in mathematics, a metric (or distance function) is a function that defines the distance between elements in a set. One set with metrics is called a metric space. Metric learning, also called similarity learning, is a common machine learning method for comparing and measuring similarity between data, and has a wide application and an extremely important position in computer vision, such as in the important fields of face recognition, image retrieval and the like.
The embodiments of the present application relate to Artificial Intelligence (AI) and Machine Learning technologies, and are designed based on a computer vision technology and Machine Learning (ML) in the AI.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence.
Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology mainly comprises a computer vision technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions. With the research and progress of artificial intelligence technology, artificial intelligence is researched and applied in a plurality of fields, such as common smart homes, smart customer service, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, robots, smart medical treatment and the like.
Machine learning is a multi-field cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Compared with the method for finding mutual characteristics among big data by data mining, the machine learning focuses on the design of an algorithm, so that a computer can automatically learn rules from the data and predict unknown data by using the rules.
Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and the like. The image classification combined model in the embodiment of the application is obtained by training through a machine learning or deep learning technology. The training method based on the image classification combined model in the embodiment of the application can be used for classifying, retrieving and the like images.
The method for training the image classification combined model provided by the embodiment of the application can be divided into two parts, including a training part and an application part; the training part relates to the technical field of machine learning, and in the training part, an image classification combined model is trained through the technology of machine learning. Specifically, the image classification joint model is trained by using a training sample group in a training sample data set given in the embodiment of the application, after the training sample group passes through the image classification joint model, an output result of the image classification joint model is obtained, model parameters are continuously adjusted by combining the output result, and the trained image classification joint model is output; the application part is used for classifying, searching and the like the images by using the image classification joint model obtained by training in the training part.
The following briefly introduces the design concept of the embodiments of the present application:
with the rapid development of internet technology and the rapid growth of multimedia resources, internet content has gradually evolved from mainly text information to various kinds of multimedia data; with the continuous upgrading of electronic equipment, imaging devices keep being simplified and improved, and image data resources are expanding rapidly. Through neural network models, embedding features can be extracted from image data, and downstream tasks can be executed based on the extracted features.
When the embedding features of the image are trained based on the deep neural network, the overall global features and the main semantic features of the image are often required to be represented at the same time, and some downstream tasks have great requirements on semantic representation, such as searching similar category images by using images; some downstream tasks focus more on overall features, such as simple image deduplication; while some tasks require two features to be considered simultaneously, such as image semantic deduplication.
However, in the related art, the two kinds of features are generally trained and learned by different models in the deep learning stage. Training and learning them separately causes trouble in both training and application: separate learning makes training and convergence slow for both models, and because the underlying network structures are largely the same, a large amount of model parameter learning is repeated; not only do two sets of models and parameters need to be maintained separately, but running two networks for prediction is also very time-consuming.
In view of this, embodiments of the present application provide an image recognition method, an image recognition apparatus, an electronic device, and a storage medium. In the present application, a metric learning embedding module for extracting shallow global features and a deep semantic embedding module for extracting deep semantic features together form a model structure that benefits the learning of both kinds of features. When an image to be detected is classified based on the image classification joint model, the image can be characterized jointly by the bottom-layer image representation and the deep semantic representation, which strengthens their joint learning effect and reduces the time consumed by feature extraction; classifying the image based on the resulting fusion features yields a more accurate classification result, thereby improving the accuracy and efficiency of image classification.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it should be understood that the preferred embodiments described herein are merely for illustrating and explaining the present application, and are not intended to limit the present application, and that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Fig. 1 is a schematic view of an application scenario in the embodiment of the present application. The application scenario diagram includes two terminal devices 110 and a server 120. The terminal device 110 in the embodiment of the present application may be installed with a client, where the client is used for image classification, image retrieval, and the like. The server 120 may comprise a server. The server is used for providing image materials for the client, for example, the image library in the embodiment of the present application may be located on the server side, and stores a plurality of candidate images. Alternatively, the image library may be local to the client. In addition, the server in the embodiment of the present application may also be used for image classification, image retrieval, and the like, and is not particularly limited herein.
In the embodiment of the present application, the image classification joint model may be deployed on the terminal device 110 for training, and may also be deployed on the server 120 for training. The server 120 may store a plurality of training samples, including a plurality of training sample sets, for training the image classification joint model. Optionally, after the image classification joint model is obtained based on training in the embodiment of the present application, the trained image classification joint model may be directly deployed on the server 120 or the terminal device 110. The image classification joint model is generally deployed directly on the server 120, and in the embodiment of the present application, the image classification joint model is often used for image classification, image retrieval, and the like.
It should be noted that the image recognition method for training the image classification combined model provided in the embodiment of the present application may be applied to various application scenarios including an image classification task and an image retrieval task, and training samples used in different scenarios are different and are not listed here.
Specifically, in an image classification scene, the fusion features and the like of an image can be acquired based on the image recognition method in the present application, which preset attribute labels the image has is judged based on these features, and the judgment result is taken as the classification result of the image.
In image retrieval scenes, which can be further subdivided into semantic retrieval, similarity retrieval, and the like, the shallow global features, deep semantic features, fusion features, and the like of an image can be obtained based on the image recognition method in the present application, and image retrieval can be performed based on these features, such as image deduplication retrieval or retrieval of images containing a similar subject.
In an alternative embodiment, terminal device 110 and server 120 may communicate via a communication network.
In an alternative embodiment, the communication network is a wired network or a wireless network.
In the embodiment of the present application, the terminal device 110 is a device that has a certain computing capability and runs instant messaging software and a website or social contact software and a website, such as a personal computer, a mobile phone, a tablet computer, a notebook, an e-book reader, a vehicle-mounted terminal, and the like used by a user. Each terminal device 110 is connected to a server 120 through a wireless network, and the server 120 is a server or a server cluster or a cloud computing center formed by a plurality of servers, or is a virtualization platform.
It should be noted that fig. 1 is only an example, and the number of the terminal devices and the servers is not limited in practice, and is not specifically limited in the embodiment of the present application.
The image recognition method provided by the exemplary embodiments of the present application is described below with reference to the accompanying drawings in conjunction with the application scenarios described above. It should be noted that the application scenarios described above are only shown for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect.
It should be noted that, the image recognition method in the embodiment of the present application may be executed by a server or a terminal device alone, or may be executed by both the server and the terminal device, and is not limited herein.
Referring to fig. 2, an implementation flow chart of an image recognition method provided in the embodiment of the present application is shown, taking a server as an execution subject, and a specific implementation flow of the method is as follows:
s21: the server extracts a first shallow global feature of the image to be detected based on a metric learning embedding module in a trained image classification combined model, and the first shallow global feature represents global basic information of the image to be detected.
S22: the server extracts a first deep semantic feature of the image to be detected based on a depth semantic embedding module in the image classification combined model, and the first deep semantic feature represents image semantic information of the image to be detected.
It should be noted that the shallow global features in the embodiment of the present application mainly refer to the overall bottom-layer information of an image, while the deep semantic features are obtained by performing deeper feature abstraction on the embedding that describes the image as a whole. The present application therefore provides a model combining bottom-layer features and deep semantics, namely the image classification joint model in the present application, which includes a metric learning embedding module and a deep semantic embedding module; image classification is performed after the corresponding features are extracted based on these two modules respectively.
S23: the server performs feature fusion on the first shallow layer global feature and the first deep layer semantic feature based on a semantic prediction module in the image classification combined model, performs multi-label classification on the image to be detected based on the obtained first fusion feature, and obtains a classification result of the image to be detected.
The image classification combined model is obtained by performing combined training on a metric learning embedding module and a depth semantic embedding module.
In the above embodiment, a metric learning embedding module for extracting shallow global features and a deep semantic embedding module for extracting deep semantic features together form a model structure that benefits the learning of both kinds of features. When an image to be detected is classified based on the image classification joint model in the present application, the image can be characterized jointly by the bottom-layer image representation and the deep semantic representation, which strengthens their joint learning effect and reduces the time consumed by feature extraction; classifying the image based on the resulting fusion features yields a more accurate classification result, improving the accuracy and efficiency of image classification.
In an alternative embodiment, the application can also perform image retrieval based on the output result of the image classification model. Referring to fig. 3, which is a flowchart illustrating an implementation of an image retrieval method in an embodiment of the present application, the method specifically includes the following steps:
S301: the server respectively extracts second shallow global features of each candidate image in the image library based on the metric learning embedding module, and respectively extracts second deep semantic features of each candidate image based on the deep semantic embedding module.
The second shallow global feature represents the same kind of content as the first shallow global feature, namely the global basic information of an image; "first" and "second" distinguish the images concerned: the first shallow global feature is for the image to be detected, and the second shallow global feature is for a candidate image. Similarly, the second deep semantic feature represents the same kind of content as the first deep semantic feature, namely the image semantic information. The same holds for the second fusion feature and the first fusion feature, and repeated descriptions are omitted below.
S302: and the server performs feature fusion on each second shallow global feature and the corresponding second deep semantic feature respectively based on the semantic prediction module to obtain second fusion features corresponding to each candidate image.
S303: and the server determines the similarity of each candidate image and the image to be detected respectively based on the first fusion characteristics and the second fusion characteristics corresponding to each candidate image.
S304: and the server determines a similar image corresponding to the image to be detected based on each similarity.
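The retrieval steps S301-S304 can be sketched as follows, reusing the ImageClassificationJointModel from the earlier sketch and assuming cosine similarity between fusion features (the text does not fix the similarity measure):

```python
import torch
import torch.nn.functional as F


def retrieve_similar(model, query_img, candidate_imgs, top_k=10):
    model.eval()
    with torch.no_grad():
        _, _, q_fused, _ = model(query_img.unsqueeze(0))        # first fusion feature of the image to be detected
        _, _, c_fused, _ = model(torch.stack(candidate_imgs))   # second fusion features of the candidates
        sims = F.cosine_similarity(q_fused, c_fused)            # S303: similarity per candidate
        scores, indices = sims.topk(min(top_k, len(candidate_imgs)))
    # S304: the highest-scoring candidates are treated as similar images.
    return list(zip(indices.tolist(), scores.tolist()))
```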
For example, in an image deduplication retrieval scene, similar images corresponding to the image to be detected in the image library, including images that are the same as or extremely similar to the image to be detected (specifically, candidate images whose similarity is higher than a certain threshold), may be determined in the above manner and then removed from the image library. For another example, retrieval of images containing a similar subject can also be performed in the above manner: for instance, a user inputs a clothing image as the image to be detected and retrieves similar clothing images from the image library.
In practical applications, the global texture of an image and its semantic features are correlated; for example, certain semantic categories only appear in specific global image environments, such as polar bears being associated with ice, while sheep and polar bears generally do not appear together. The image recognition method in the present application makes good use of this correlation to improve the learning effect of both kinds of features, and can effectively improve the accuracy of image classification and image retrieval.
Optionally, the image classification joint model in the embodiment of the present application further includes a basic convolution module. That is, in the embodiment of the present application, the model structure can be divided into 4 modules: a basic convolution module, a metric learning embedding module, a deep semantic embedding module, and a semantic prediction module.
Fig. 4 is a schematic structural diagram of an image classification combination model in the embodiment of the present application. The model specifically comprises: a base convolution module 401, a metric learning embedding module 402, a deep semantic embedding module 403, and a semantic prediction module 404.
Optionally, the basic convolution module 401 includes a plurality of convolutional layers, such as the Convolutional Neural Network (CNN) shown as 401 in fig. 4, where the CNN may be a convolutional network composed of residual structures. In addition to adopting the ResNet101 basic structure for global feature extraction, the image classification joint model in the present application adds a deep semantic embedding module composed of a residual-structured Conv6_x module, followed by a semantic multi-label classification module FC (namely the semantic prediction module). The deep semantic embedding module specifically includes a convolutional layer and an embedding layer: the convolutional layer is composed of residual structures and forms the semantic base part (Deep CNN shown in 403 in fig. 4, exemplified in detail below mainly by Conv6), and the embedding layer forms the semantic embedding part (Deep Embedding shown in 403 in fig. 4, exemplified in detail below mainly by Embedding2).
In the present application, a residual structure extracts information better than an ordinary convolutional layer, so the deep semantic embedding module also adopts a residual structure. In practical applications, the number of stacked residual structures, here 2, can be adjusted as required.
The following briefly describes network parameters of the image classification joint model in the embodiment of the present application with reference to table 1, table 2, and table 3.
In the embodiment of the present application, the network parameters corresponding to the convolutional layers in the basic convolutional module are determined by initialization based on parameters pre-trained in a specified sample library, for example, ResNet101 is used for training the basic convolutional module, and the parameters are shown in table 1:
TABLE 1 ResNet101 feature Module Structure Table
(The layer-by-layer contents of Table 1 are provided as an image in the original publication.)
Here, x3 blocks in table 1 denotes a stack of 3 such network structures (residual blocks), and x2 blocks likewise denotes a stack of 2. The basic convolution module in this application includes five convolutional layers, shown in table 1 as Conv1-Conv5_x, which use the parameters of a ResNet101 pre-trained on the ImageNet dataset as initial network parameters.
It should be noted that the network structure of the basic convolution module listed in table 1 is only an example. Besides ResNet101, different network structures and different pre-trained model weights may be used for the basic convolution module, such as ResNet50 or Inception v4; for retrieval over a large amount of data, a small network such as ResNet18 may be used, and the embedding dimension may be reduced, for example to 64 dimensions, to reduce the feature storage space. In addition, besides ImageNet, other large-scale datasets, such as the Open Image dataset, may be considered for pre-training the model.
Referring to fig. 5, a schematic structural diagram of a residual block in the embodiment of the present application is shown. The residual block includes three convolutional layers, namely a 1x1 convolutional layer, a 3x3 convolutional layer, and another 1x1 convolutional layer; all the blocks of types Conv2_x, Conv3_x, etc. in table 1 are generated by stacking such residual blocks multiple times with different parameters.
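A sketch of such a residual block (1x1, 3x3, 1x1 convolutions with a skip connection) is given below; the channel widths and the use of batch normalization follow common ResNet practice and are assumptions here:

```python
import torch.nn as nn


class Bottleneck(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when the shape changes, identity otherwise.
        self.shortcut = (nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch))
                         if (in_ch != out_ch or stride != 1) else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))
```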
The following briefly introduces the network structure and parameters corresponding to the metric learning embedded module in the embodiment of the present application, and refer to table 2:
table 2 metric learning module structure table based on ResNet101
Layer name | Output size | Layer
Pool_cr1 | 1x2048 | Max pool
Embedding | 1x512 | Full connection
Wherein 512 is the embedding dimension of the shallow global feature.
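The metric learning embedding head of table 2 can be sketched as follows (assuming the 2048-channel output of Conv5_x as its input):

```python
import torch.nn as nn

metric_head = nn.Sequential(
    nn.AdaptiveMaxPool2d(1),   # Pool_cr1: max pool -> 1x2048
    nn.Flatten(),
    nn.Linear(2048, 512),      # Embedding: full connection -> 1x512 shallow global feature
)
```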
Optionally, in this embodiment of the application, the network parameters corresponding to the convolutional layer in the deep semantic embedding module may be obtained based on random initialization. The initialization is mainly exemplified herein by using a gaussian distribution with variance of 0.01 and mean of 0. Specifically, the network structures and parameters corresponding to the deep semantic embedding module and the semantic prediction module are shown in table 3:
table 3 classification module structure table based on ResNet101
(The contents of Table 3 are provided as an image in the original publication.)
Referring to table 3, Conv6_ x, Pool _ cr1 and Embedding2 are deep semantic Embedding modules, and FC is a semantic prediction module. Wherein 512 is the Embedding dimension of the deep semantic features, Embedding2 is the semantic Embedding of 512 dimensions, and Nclass is the number of multi-label categories, that is, the number of classification labels. When using Imagenet to open source image training, 1000 labels are provided, and Nclass is 1000. Wherein, Conv6_ x is a newly added convolution module for feature extraction after the global feature Conv5_ x convolution module. The x2 blocks are similar to the x3 blocks.
The convolutional layer in the deep semantic embedding module may be denoted as Conv6_x, the embedding layer in the deep semantic embedding module may be denoted as Embedding2, and the semantic prediction module may be denoted as the fully connected layer FC (full connection). The newly added layers, namely the Conv6_x, Embedding2 and FC layers, are initialized with a Gaussian distribution with a variance of 0.01 and a mean of 0.
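A sketch of this initialization for the newly added layers; the Gaussian's standard deviation is taken literally as the square root of the stated variance of 0.01:

```python
import torch.nn as nn


def init_new_layers(module):
    """Apply Gaussian init (mean 0, variance 0.01) to a newly added module's layers."""
    for m in module.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.normal_(m.weight, mean=0.0, std=0.01 ** 0.5)   # variance 0.01
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```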
It should be noted that, in the embodiment of the present application, the branches in tables 2 and 3 may also adopt other model structures, and the tables only show one possible scheme, which is not specifically limited herein.
In the embodiment of the present application, a model structure beneficial to the learning of both kinds of features is obtained by designing a deep residual feature extraction module on top of a conventional basic feature extraction network structure; the new module is made to converge on the classification task through multi-stage learning of pre-training and fine-tuning; the convergence of the two kinds of features is then achieved by taking the iteration efficiency of the different learning tasks into account; and finally the image is characterized by the combination of the bottom-layer image representation and the deep semantic representation for downstream retrieval or other related applications.
Optionally, in order to further improve the model learning effect, an asynchronous learning rate is also designed: the basic convolution module, the metric learning embedding module, and the deep semantic embedding module all use a learning rate of lr1 = 0.0005, while the multi-label classification FC layer (i.e., the semantic prediction module) uses lr = 0.005 (= lr1 x 10). That is, the learning rate of the semantic prediction module is higher than the learning rates of the other modules, which include the basic convolution module, the metric learning embedding module, and the deep semantic embedding module.
The multi-label classification task drives two samples of the same class to output the same prediction, i.e., to fit the same learning target; under overfitting, the label predictions for many dissimilar images become identical. Because the FC layer overfits the target more easily, overfitting of the semantics would in turn make the semantic embedding overfit, i.e., the embeddings of two images of the same category would become the same (the embedding overfits to the classification target), so that embeddings of different images lose discrimination. The asynchronous learning rate adopted in this application makes the parameter update of the embedding layer slower than that of the FC layer (its update efficiency is 0.1 times that of the FC layer), which effectively prevents the embedding from overfitting to the multi-label target at each batch parameter update.
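A sketch of the asynchronous learning rate using optimizer parameter groups, reusing the module names from the earlier model sketch; the choice of SGD with momentum is an assumption, and only the two learning rates come from the text:

```python
import torch

model = ImageClassificationJointModel()   # from the earlier sketch
base_lr = 0.0005                          # lr1 for the backbone and both embedding modules
optimizer = torch.optim.SGD([
    {"params": model.base_conv.parameters(),    "lr": base_lr},
    {"params": model.metric_embed.parameters(), "lr": base_lr},
    {"params": model.deep_conv.parameters(),    "lr": base_lr},
    {"params": model.deep_embed.parameters(),   "lr": base_lr},
    {"params": model.classifier.parameters(),   "lr": base_lr * 10},   # FC layer: lr = 0.005
], momentum=0.9)
```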
In the embodiment of the present application, the semantic embedding is regarded as a deeper feature abstraction of the embedding that describes the image as a whole. The present application therefore provides a model combining bottom-layer features and deep semantics: a model structure beneficial to the learning of both kinds of features is obtained by designing a deep residual feature extraction module on top of a conventional basic feature extraction network structure; the new module is made to converge on the classification task through multi-stage learning of pre-training and fine-tuning; the convergence of the two kinds of features is then achieved by taking the iteration efficiency of the different learning tasks into account; and finally the image is characterized by the combination of the bottom-layer image representation and the deep semantic representation for downstream retrieval or other related applications.
After the structure of an image classification combined model in the embodiment of the present application is described, a process of performing image classification based on the image classification combined model in the embodiment of the present application is described below with reference to fig. 4.
Referring to fig. 6, which is a schematic flowchart of an image classification method in an embodiment of the present application, the method specifically includes the following steps:
s601: and the server inputs the image to be detected into a basic convolution module in the image classification combined model, and performs global feature extraction on the image to be detected based on the basic convolution module to obtain a global feature map corresponding to the image to be detected.
S602: and the server carries out embedding processing on the global feature map based on the metric learning embedding module to obtain a first shallow global feature corresponding to the image to be detected.
S603: the server extracts a first deep semantic feature of the image to be detected based on a deep semantic embedding module in the image classification combined model.
S604: and the server extracts the features of the semantic information in the global feature map based on the convolutional layer in the deep semantic embedding module, and performs embedding processing on the extracted semantic information based on the embedding layer in the deep semantic embedding module to obtain a first deep semantic feature corresponding to the image to be detected.
S605: the server performs feature fusion on the first shallow layer global feature and the first deep layer semantic feature based on a semantic prediction module in the image classification combined model, performs multi-label classification on the image to be detected based on the obtained first fusion feature, and obtains a classification result of the image to be detected.
The following describes the model training process in the embodiment of the present application in detail:
in the embodiment of the present application, the training samples are the triplet samples shown in fig. 4, and one training sample group includes three training samples: an anchor sample, a positive sample and a negative sample can be represented as triples (anchor, positive, negative), where anchor is the anchor sample, positive is the positive sample, and negative is the negative sample. The anchor-positive is a similar or identical image, that is, the anchor sample and the positive sample are sample images with the similarity higher than a specified threshold, and the anchor-negative is a dissimilar image, that is, the negative sample and the positive sample are sample images with the similarity not higher than the specified threshold.
It should be noted that the present application supports annotating image sample pairs for triplet learning. Specifically, the training data for image metric learning may be labeled according to the following rule: the task of each annotation is to pick out, from the full set of images, 3 pictures that satisfy the rule and compose a triplet (anchor, positive, negative).
However, a large number of easy samples exist in randomly generated triplet data. These samples help model learning at the beginning, but the model quickly establishes a good ability to distinguish the easy samples (e.g., after the 3rd epoch); at that point the loss contributed by the large number of easy samples far exceeds that of the hard samples, so the hard samples are buried among the easy ones, and whether enough hard samples are obtained in the middle and later stages has a large influence on the model. An easy sample refers to a sample pair that is simple and easy to distinguish, and a hard sample refers to a sample pair that is hard to distinguish, such as a similar sample group with a large feature difference or a dissimilar sample group with similar features.
Based on this, in the embodiment of the present application, the collected sample images may be divided into the training sample groups based on the following manner. Referring to fig. 7, it is a flowchart of an implementation of a method for constructing a training sample data set in the embodiment of the present application, and specifically includes the following steps:
s701: and the server combines every two sample images with similarity higher than a specified threshold value into a positive sample pair according to the similarity between the collected sample images.
S702: and the server performs sample recombination on the basis of each positive sample pair to obtain a plurality of training sample groups for forming a training sample data set.
That is, the acquired sample images are first divided into positive sample pairs. Since the two samples in each positive sample pair belong to the same or similar images, one of them can be used as the anchor sample of a triplet and the other as the positive sample. Two samples from different positive sample pairs, however, may have low similarity; based on this, sample recombination can be performed on the positive sample pairs, and a plurality of training sample groups can be constructed from one positive sample pair.
In an alternative embodiment, step S702 may be implemented based on the following process, and the process of constructing multiple training sample sets based on one positive sample pair is implemented, specifically including the following steps:
S7021: The server selects one sample image in one positive sample pair as a target sample, and respectively selects one sample image from each of the other sample pairs as a candidate sample.
S7022: the server determines the distance between the target sample and each candidate sample, respectively.
S7023: and the server sorts each candidate sample according to the corresponding distance, and selects at least one candidate sample with a designated sorting position as a negative sample corresponding to the target sample.
The designated ranking positions may refer to the last N candidate samples when ranked by their corresponding distances from small to large, or equivalently the first N candidate samples when ranked by their corresponding distances from large to small, where N is a positive integer. That is, the candidate samples that are farther away from the target sample are selected as negative samples.
S7024: and the server respectively forms a training sample group by each selected negative sample and positive sample pair, wherein the target sample in the positive sample pair is a positive sample in the training sample group, and the other sample in the positive sample pair is an anchor point sample in the training sample group.
In the above embodiment, each selected negative sample may form a training sample group together with the positive sample pair; for one positive sample pair, N negative samples are selected, so N training sample groups are formed. If N negative samples are selected for each of N positive sample pairs, N × N training sample groups can be formed.
For example, positive sample pair 1 includes two sample images: sample image 1 and sample image 2. Assume that sample image 1 is selected as the target sample, and that 5 sample images are selected from the positive sample pairs 2-6 as candidate images: sample image 3, sample image 5, sample image 7, sample image 9, and sample image 11. The distances between sample image 1 and the five candidate images are calculated respectively. Assuming that N is 3 and the candidates farthest from sample image 1 are sample image 3, sample image 5, and sample image 7, three training sample groups can be constructed based on these three candidate samples, namely {sample image 1, sample image 2, sample image 3}, {sample image 1, sample image 2, sample image 5} and {sample image 1, sample image 2, sample image 7}.
In the embodiment of the present application, the triplet labeling is adjusted as follows: only positive sample pairs are labeled, i.e. similar sample pairs are collected, and triplets are then mined within each batch of bs (batch-size) sample pairs as follows. For a sample x (i.e., the target sample) in a certain positive sample pair, one image is randomly selected from each of the remaining bs-1 positive sample pairs as a candidate sample, the distance between each candidate sample and x is calculated, the candidate samples are sorted by distance from small to large, and the first 10 candidate samples are taken as negative samples; each negative sample then forms a triplet with x and the positive sample paired with x, so that each sample generates 10 triplets and the whole batch yields 10 × bs triplets. Here bs needs to be set to a relatively large value, such as 256; the batch size is a hyper-parameter that defines the number of samples processed before the internal model parameters are updated.
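Purely as an illustration, the in-batch mining just described might be sketched as follows in NumPy; the embeddings are assumed to have been extracted already, and the function name and the choice of returning embedding triplets (rather than image indices) are illustrative assumptions rather than part of the application:

import numpy as np

def mine_triplets(anchor_emb, positive_emb, num_neg=10, rng=None):
    """Mine (anchor, positive, negative) triplets inside one batch of positive pairs.

    anchor_emb / positive_emb: (bs, d) arrays holding one embedding for the anchor
    image and the positive image of each positive sample pair in the batch.
    """
    rng = rng or np.random.default_rng(0)
    bs = anchor_emb.shape[0]
    triplets = []
    for i in range(bs):                                  # sample x = anchor of pair i
        candidates = [j for j in range(bs) if j != i]
        pick = rng.integers(0, 2, size=len(candidates))  # one random image per other pair
        cand_emb = np.where(pick[:, None] == 0,
                            anchor_emb[candidates],
                            positive_emb[candidates])
        dists = np.linalg.norm(cand_emb - anchor_emb[i], axis=1)  # distance to x
        order = np.argsort(dists)[:num_neg]              # sort small -> large, keep first 10
        for k in order:
            triplets.append((anchor_emb[i], positive_emb[i], cand_emb[k]))
    return triplets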
Furthermore, semantic labels need to be annotated for the training samples. In the embodiment of the application, multi-label annotation may be performed only on the positive samples. The labels include scene labels such as stage and restaurant; drawing-type labels such as game and animation; and other labels such as selfie and person action, which are not specifically limited herein. When an image contains a certain label, the true probability corresponding to that label is 1; if the image does not contain the label, the true probability corresponding to the label is 0. Assuming that the image classification joint model in the present application needs to learn 1000 labels, an image may contain none of them, i.e. the true probabilities corresponding to all labels are 0. When the true probabilities corresponding to the 1000 labels are expressed as a vector, this vector is the multi-label vector in the present application, representing the true probability of the training sample for each classification label.
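As a small illustrative sketch of how such a multi-label vector of true probabilities could be assembled (the label vocabulary and the helper name below are hypothetical, not taken from this application):

import numpy as np

NUM_LABELS = 1000  # the joint model is assumed to learn 1000 labels

def build_multilabel_vector(image_labels, label_to_index):
    """Map the annotated labels of one image to a 0/1 vector of true probabilities."""
    target = np.zeros(NUM_LABELS, dtype=np.float32)
    for name in image_labels:
        target[label_to_index[name]] = 1.0   # label present -> true probability 1
    return target                            # stays all zeros if the image has no label

# usage: an image annotated with a "stage" scene label and a "selfie" label
label_to_index = {"stage": 0, "restaurant": 1, "game": 2, "animation": 3, "selfie": 4}
vec = build_multilabel_vector(["stage", "selfie"], label_to_index)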
Having described the preparation of the training data in the embodiment of the present application, the training process of the image classification joint model based on this training data is described in detail below:
in an alternative embodiment, the image classification combined model is obtained by training in the following manner, and specifically, refer to fig. 8, which is a flowchart of a training method of the image classification combined model in the embodiment of the present application, and includes the following steps:
S801: The server obtains a training sample data set, and selects a training sample group from the training sample data set.
S802: the server inputs the selected training sample group into the trained image classification combined model, and obtains a third shallow global feature output by the metric learning embedding module, a third deep semantic feature output by the deep semantic embedding module and a classification vector output by the semantic prediction module based on the metric learning embedding module in the image classification combined model.
S803: the server constructs an intermediate loss function based on the third shallow global feature and the classification vector, and adjusts the network parameters of the deep semantic module in the image classification combined model for the specified times based on the intermediate loss function.
Specifically, a first triplet loss function, a multi-label loss function and a classification entropy loss function are constructed based on the third shallow global feature and the classification vector, and the intermediate loss function is obtained by weighted summation of the first triplet loss function, the multi-label loss function and the classification entropy loss function.
S804: and the server constructs a target loss function based on the third shallow global feature, the third deep semantic feature and the classification vector, adjusts network parameters of the image classification combined model for multiple times based on the target loss function until the image classification combined model is converged, and outputs the trained image classification combined model.
Specifically, a first triple loss function, a second triple loss function, a multi-label loss function and a classification entropy loss function are constructed based on a third shallow global feature, a third deep semantic feature and a classification vector, and the first triple loss function, the second triple loss function, the multi-label loss function and the classification entropy loss function are subjected to weighted summation to obtain a target loss function.
It should be noted that the training method listed above is only a brief introduction. In practice, based on the above training process, epoch rounds of iteration can be performed on the full data of the training sample data set, with each round iterating over all samples; the epoch number is a hyper-parameter that defines how many passes the learning algorithm makes over the entire training data set.
Specifically, the operations in each iteration round are as follows: the full set of samples is divided into Nb batches of batch-size samples (triplet samples) each, and for each batch:
(1) model forward: all parameters of the model are set to be in a state needing learning, and the neural network carries out forward calculation on an input image during training to obtain a prediction result.
(2) loss calculation: a triplet loss is calculated for the embeddings, comprising the first triplet loss function (for the global embedding) and the second triplet loss function (for the semantic embedding); a multi-label loss function (bce loss) and the classification entropy loss function are calculated for the FC output. Summing them yields the total loss. The calculation is described in detail below; assume that the weights corresponding to the first triplet loss function, the second triplet loss function, the multi-label loss function and the classification entropy loss function are w1, w2, w3 and w4, respectively.
(3) model parameter update: a stochastic gradient descent (SGD) method is used to perform gradient back-propagation on the loss from the previous step, the updated values of all model parameters are obtained, and the network is updated.
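The three per-batch steps above could be condensed into a sketch like the following. PyTorch is assumed here; the model, the optimizer and the loss functions are placeholders standing in for the modules and losses described in this application, and their exact signatures are assumptions:

import torch

def train_one_batch(model, optimizer, batch, loss_fns, weights):
    """One iteration of the joint training: forward pass, weighted loss, SGD update.

    Assumed interface (not from the source): model(images) returns a tuple
    (global_embedding, semantic_embedding, class_logits); loss_fns is a dict
    holding the four loss terms sketched later in this section.
    """
    anchor, positive, negative, labels = batch
    w1, w2, w3, w4 = weights
    model.train()                                # (1) all parameters in learning state
    optimizer.zero_grad()
    ga, sa, logits = model(anchor)               # forward pass for each triplet member
    gp, sp, _ = model(positive)
    gn, sn, _ = model(negative)
    loss = (w1 * loss_fns["triplet"](ga, gp, gn)       # first triplet loss (global)
            + w2 * loss_fns["triplet"](sa, sp, sn)     # second triplet loss (semantic)
            + w3 * loss_fns["bce"](logits, labels)     # multi-label loss
            + w4 * loss_fns["entropy"](logits))        # classification entropy loss
    loss.backward()                              # (2)-(3) gradient backward
    optimizer.step()                             # SGD parameter update
    return loss.item()

A matching optimizer could be created, for example, as optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9); the learning rate and momentum values are illustrative assumptions.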
It should be noted that step S803 in the embodiment of the present application is optional. In the process of training the image classification combined model, after step S802 is executed, step S804 may be directly executed, that is, the network parameters of the image classification combined model are directly optimized until the model converges, and in this way, Conv6 has only one stage of learning; alternatively, step S803 is executed before step S804, and in this embodiment, Conv6 has two stages of learning.
In the case of executing step S803, the training in the embodiment of the present application includes two stages: first, N iterations are performed based on the weighted summation of three loss functions; then, training continues based on the weighted summation of four loss functions until convergence. The second stage adjusts the overall parameters of the model, while the first stage adjusts only the fully-connected layer.
Taking the image classification joint model shown in fig. 4 as an example, in the embodiment of the present application, Conv 1-5 in the network are initialized with parameters pre-trained on Imagenet, while Conv6 is randomly initialized. Different initializations lead to different learning effects for the two modules; in order for Conv6 to be effectively initialized for its task, Conv6 may be learned in the following two stages:
The first stage: a limited number of epochs (for example, 10) is trained with three losses, namely the first triplet loss function, the multi-label loss function and the classification entropy loss function, so that the global embedding and the classification converge preliminarily (the weights are set to 1, 1 and 0.5). The purpose of this stage is to let Conv6 find a set of good parameters suited to the current classification task; meanwhile, the global embedding needs to keep its metric-learning characteristics throughout, so it always takes part in the loss. The semantic embedding is not yet added to the learning, because at this point it does not have good representation capability, and because semantic embedding and classification are not the same learning task, conflict with each other to some extent and are adjacent modules, it would easily disturb the optimization of the Conv6 parameters. The effect of the entropy loss is that other possibly related categories retain certain predicted values while semantic classification is performed, but these predicted values are not required to be large enough for unlabeled categories to be recognized, so that overfitting of the network to the labeled categories is avoided.
The second stage: all losses are adopted and training continues until convergence, so that the global embedding and the semantic embedding each learn to convergence (the w1-w4 weights are set to 1:1:0.5:0.5); the semantic classification weight is reduced in this stage so that the two embeddings learn preferentially.
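For illustration, the two stages could be captured as two weight configurations (w1, w2, w3, w4) consumed by a training step like the one sketched earlier; setting w2 to zero is one way, assumed here, of leaving the semantic-embedding triplet loss out of the first stage:

# stage 1: first triplet loss + multi-label loss + entropy loss, a limited number
# of epochs (e.g. 10); the semantic triplet loss is disabled via a zero weight
STAGE1_WEIGHTS = (1.0, 0.0, 1.0, 0.5)

# stage 2: all four losses, trained until convergence
STAGE2_WEIGHTS = (1.0, 1.0, 0.5, 0.5)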
It should be noted that, in the embodiment of the present application, some super parameters, such as weights w1 to w4 corresponding to each loss function, and margin during triple-loss calculation, may be adjusted as needed, which are merely illustrated herein and are not limited herein.
The following describes in detail the calculation of the loss functions in the embodiment of the present application. The present application mainly involves three types of loss functions, namely the triplet loss function, the classification entropy loss function and the multi-label loss function, with a first triplet loss function and a second triplet loss function constructed based on the shallow global feature and the deep semantic feature, respectively. Thus, the target loss function in the present application is mainly obtained by weighted summation of these four loss functions.
Referring to fig. 9, which is a flowchart of a method for calculating a loss function in the embodiment of the present application, the method specifically includes the following steps:
S901: And the server constructs a first triple loss function based on the third shallow global features corresponding to the training samples in the training sample group.
It should be noted that, in the embodiment of the present application, the third shallow global feature characterizes the same content as the first shallow global feature and the second shallow global feature described above; only the objects they correspond to differ, the third shallow global feature being extracted for a training sample. The same applies to the third deep semantic feature and the classification vector below.
In particular, the first triplet loss function may be constructed based on the following process:
S9011: Determining a first distance between a third shallow global feature corresponding to the anchor sample in the training sample group and a third shallow global feature corresponding to the positive sample, and a second distance between the third shallow global feature corresponding to the anchor sample and a third shallow global feature corresponding to the negative sample.
S9012: the difference between the first distance and the second distance is taken as a first target value, and the sum of a specified boundary value for a difference value boundary representing the degree of similarity between the positive and negative examples.
S9013: and taking the first target value and the maximum value in the reference parameter as a first triplet loss function.
For example, after the triplets (a, p, n) are found in the batch sample, a first triple loss function triplet loss1 is computed based on the shallow global features of these triple samples.
The calculation formula of the triple loss function Triplet loss is as follows:
$l_{tri} = \max\left(\left\|x_a - x_p\right\| - \left\|x_a - x_n\right\| + \alpha,\; 0\right)$

where $\alpha$ is the margin (the specified boundary value) and is set to 4. Specifically, $\left\|x_a - x_p\right\|$ may represent the L2 distance between the third shallow global feature corresponding to the anchor sample and the third shallow global feature corresponding to the positive sample, i.e., the first distance; $\left\|x_a - x_n\right\|$ may represent the L2 distance between the third shallow global feature corresponding to the anchor sample and the third shallow global feature corresponding to the negative sample, i.e., the second distance.

In the above calculation formula of the triplet loss, the reference parameter is 0. If the first target value $\left\|x_a - x_p\right\| - \left\|x_a - x_n\right\| + \alpha$ is greater than 0, the first triplet loss function equals the first target value; if it is not greater than 0, the first triplet loss function is 0.
In the embodiment of the present application, the purpose of triplet loss1 is to make the distance between the shallow global features of the anchor and the negative exceed the distance between the shallow global features of the anchor and the positive by more than 4 (i.e., by more than the margin). Based on this loss function, the shallow global features of the anchor sample and the positive sample are pulled close to each other while those of the anchor sample and the negative sample are pushed apart, so that a certain distance is kept between the shallow global features of positive and negative samples, which improves classification accuracy.
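A minimal sketch of this loss (PyTorch assumed, batch-averaged, with the margin value of 4 used above) could be:

import torch

def triplet_loss(anchor, positive, negative, margin=4.0):
    """max(||x_a - x_p|| - ||x_a - x_n|| + margin, 0), averaged over the batch."""
    d_ap = torch.norm(anchor - positive, p=2, dim=1)   # first distance
    d_an = torch.norm(anchor - negative, p=2, dim=1)   # second distance
    return torch.clamp(d_ap - d_an + margin, min=0.0).mean()

torch.nn.TripletMarginLoss(margin=4.0, p=2) provides an equivalent built-in for this computation.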
S902: and the server constructs a second triple loss function based on the third deep semantic features corresponding to the training samples.
Correspondingly, the process of constructing the second triplet loss function is similar to the process of constructing the first triplet loss function, and the method specifically comprises the following implementation steps:
S9021: Determining a third distance between the third deep semantic feature corresponding to the anchor sample in the training sample set and the third deep semantic feature corresponding to the positive sample, and a fourth distance between the third deep semantic feature corresponding to the anchor sample and the third deep semantic feature corresponding to the negative sample.
S9022: the difference between the third distance and the fourth distance and the sum of specified boundary values for the difference boundary representing the degree of similarity between the positive and negative examples are taken as the second target value.
S9023: and taking the second target value and the maximum value in the reference parameter as a second triplet loss function.
In the embodiment of the present application, for the deep semantic features output by the deep semantic embedding module, the above triplet loss formula may also be used to compute the second triplet loss function, triplet loss2, with the same margin.
Here, $\left\|x_a - x_p\right\|$ may represent the L2 distance between the third deep semantic feature corresponding to the anchor sample and the third deep semantic feature corresponding to the positive sample, i.e., the third distance; $\left\|x_a - x_n\right\|$ may represent the L2 distance between the third deep semantic feature corresponding to the anchor sample and the third deep semantic feature corresponding to the negative sample, i.e., the fourth distance.
In the embodiment of the present application, the purpose of the second triplet loss function, triplet loss2, is to make the distance between the deep semantic features of the anchor and the negative exceed the distance between the deep semantic features of the anchor and the positive by more than the margin of 4. Based on this loss function, the deep semantic features of the anchor sample and the positive sample are pulled close to each other while those of the anchor sample and the negative sample are pushed apart, so that a certain distance is kept between the deep semantic features of positive and negative samples, which improves classification accuracy.
S903: the server constructs a multi-label loss function based on the classification vectors corresponding to the training samples and the corresponding multi-label vectors, wherein the classification vectors represent the prediction probabilities of the training samples corresponding to the classification labels, and the multi-label vectors represent the real probabilities of the training samples corresponding to the classification labels.
In the embodiment of the present application, for the probability vector (i.e., the classification vector) output by the FC layer, a multi-label loss function bce loss, i.e., a Class loss shown in fig. 4, between the probability vector and the multi-label vector obtained by multi-label labeling can be calculated.
Specifically, for a sample i out of a total of n samples, with a true label vector t[i] (i.e., the multi-label vector), which is a 0/1 vector of size 1 × 1000, and a predicted value o[i] (i.e., the classification vector), which gives the prediction probability for each of the 1 × 1000 labels, the multi-label loss is calculated according to the following formula:

$L_{bce} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{1000}\left( t[i][j]\,\log\left(o[i][j]\right) + \left(1 - t[i][j]\right)\log\left(1 - o[i][j]\right) \right)$
When the true value of a certain label bit is 1, the left-hand term of the plus sign in the above formula takes effect, and when the true value is 0, the right-hand term takes effect, so that the supervision information of a sample under all labels can be learned.
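Assuming the classification vector has already been passed through a sigmoid so that each entry is a probability in (0, 1), the formula above corresponds to the standard binary cross-entropy, sketched below; torch.nn.BCELoss with its default mean reduction is an equivalent built-in:

import torch

def multilabel_bce_loss(pred_probs, true_labels, eps=1e-7):
    """Binary cross-entropy over all labels: for a 1 bit the left term is active,
    for a 0 bit the right term is active, as described above."""
    pred_probs = pred_probs.clamp(eps, 1 - eps)   # avoid log(0)
    loss = -(true_labels * torch.log(pred_probs)
             + (1 - true_labels) * torch.log(1 - pred_probs))
    return loss.mean()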
In the embodiment of the application, the FC classification learning constrains the classification embedding to carry classification information, which is very important for downstream tasks such as similar-category image retrieval. On the other hand, since the classification task overfits easily and would leave the embedding with little discriminative power, the FC layer is set to use a learning rate 10 times larger than that of the other layers during training, and the gradient propagated back from the FC layer to the semantic embedding is scaled to 0.1 of its original value, so as to avoid the embedding overfitting to the classification task.
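The 10x learning rate for the FC layer and the 0.1 gradient scaling toward the semantic embedding are not spelled out in code in this application; one possible sketch, assuming PyTorch parameter groups and a custom autograd function, and with placeholder names throughout, is:

import torch

class GradScale(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient flowing back by `scale`."""
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * ctx.scale, None

# in the model's forward pass, the FC classifier would consume a gradient-gated
# copy of the semantic embedding:
#   logits = fc_layer(GradScale.apply(semantic_embedding, 0.1))

# and the optimizer would give the FC parameters a 10x larger learning rate:
#   optimizer = torch.optim.SGD(
#       [{"params": backbone_params, "lr": 0.01},
#        {"params": fc_params, "lr": 0.1}],
#       momentum=0.9)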
S904: and the server acquires a classification entropy loss function based on the classification vector corresponding to each training sample.
The classification entropy loss function is the Entropy loss in fig. 4. In the embodiment of the application, in order to avoid overfitting of the classification prediction, an entropy-based loss is introduced: the entropy is calculated for the prediction result of the FC layer and used as a distribution loss. During model learning, a larger entropy is preferred, so that the model prediction may have a large predicted value for a certain classification label while the predicted values of several other labels are also allowed to be relatively large, with limited influence on the classification effect.
Specifically, for all 2n images in the batch (assuming there are n sample pairs in the batch), the following loss can be calculated for each image:
$L_{Entropy} = \sum_{j} p_{ij}\,\log\left(p_{ij}\right)$

where $p_i$ is the predicted output of the FC layer for sample i, and $p_{ij}$ is the prediction of sample i for the j-th class (i.e., the target class).
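Under the reading above (the loss is the negative entropy of the predicted label distribution, so minimizing it enlarges the entropy), a sketch might be:

import torch

def entropy_loss(pred_probs, eps=1e-7):
    """Sum of p * log(p) over labels (negative entropy), averaged over the batch.

    pred_probs: (batch, num_labels) predicted probabilities from the FC layer.
    """
    p = pred_probs.clamp(min=eps)
    return (p * torch.log(p)).sum(dim=1).mean()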
S905: and the server performs weighted summation on the first triple loss function, the second triple loss function, the multi-label loss function and the classification entropy loss function to obtain a target loss function.
Assume that the first triplet loss function is denoted $L_{triplet1}$, the second triplet loss function is denoted $L_{triplet2}$, the multi-label loss function is denoted $L_{bce}$, and the classification entropy loss function is denoted $L_{Entropy}$, with corresponding weights $w_1, w_2, w_3, w_4$. Accordingly, the target loss function is:

$L_{total} = w_1 L_{triplet1} + w_2 L_{triplet2} + w_3 L_{bce} + w_4 L_{Entropy}$
it should be noted that the specific calculation order of the above loss functions is not limited specifically, and the above is only an example.
In summary, the embodiment of the application provides a joint learning network structure that supports extracting the global representation and the semantic embedding of an image at the same time; semantic extraction is realized by adding a deep residual structure on top of the underlying image features. By learning the underlying embedding representation and the semantic embedding representation of the image simultaneously, time consumption in application is reduced and the effects of both the underlying-feature extraction and the semantic embedding are improved. In addition, learning strategies such as the anti-overfitting losses and the pre-training initialization method for newly added parameter layers are provided, so that effective parameter learning is achieved, and the overall network parameters (for example, the added Conv6) are initialized and learned more effectively through the design of the losses and the learning method.
Based on the same inventive concept, the embodiment of the application also provides an image recognition device. As shown in fig. 10, which is a schematic structural diagram of the image recognition apparatus 1000, the image recognition apparatus may include:
a first feature extraction unit 1001, configured to extract a first shallow global feature of an image to be detected based on a metric learning embedding module in a trained image classification combined model, where the first shallow global feature represents global basic information of the image to be detected; and
the second feature extraction unit 1002 is configured to extract a first deep semantic feature of the image to be detected based on a depth semantic embedding module in the image classification combination model, where the first deep semantic feature represents image semantic information of the image to be detected;
the classification prediction unit 1003 is used for performing feature fusion on the first shallow layer global feature and the first deep layer semantic feature based on a semantic prediction module in the image classification combined model, performing multi-label classification on the image to be detected based on the obtained first fusion feature, and obtaining a classification result of the image to be detected;
the image classification combined model is obtained by performing combined training on a metric learning embedding module and a depth semantic embedding module.
Optionally, the apparatus further comprises:
an image retrieval unit 1004, configured to extract second shallow global features of each candidate image in the image library respectively based on the metric learning embedding module, and to extract second deep semantic features of each candidate image respectively based on the deep semantic embedding module;
respectively performing feature fusion on each second shallow layer global feature and the corresponding second deep layer semantic feature based on a semantic prediction module to obtain a second fusion feature corresponding to each candidate image;
determining the similarity of each candidate image and the image to be detected respectively based on the first fusion features and the second fusion features corresponding to each candidate image;
and determining a similar image corresponding to the image to be detected based on each similarity.
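By way of illustration only, the similarity computation and similar-image selection performed by the image retrieval unit could look like the following nearest-neighbour sketch over fused features; cosine similarity is assumed here as one possible choice, since the application does not fix a particular similarity measure:

import numpy as np

def retrieve_similar(query_feature, candidate_features, top_k=5):
    """Rank candidate images by similarity of their fused features to the query.

    query_feature: (d,) first fusion feature of the image to be detected.
    candidate_features: (n, d) second fusion features of the candidate images.
    """
    q = query_feature / (np.linalg.norm(query_feature) + 1e-12)
    c = candidate_features / (np.linalg.norm(candidate_features, axis=1, keepdims=True) + 1e-12)
    sims = c @ q                          # similarity of each candidate to the query
    order = np.argsort(-sims)[:top_k]     # most similar candidates first
    return order, sims[order]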
Optionally, the image classification joint model further includes: a base convolution module; the apparatus further includes:
a basic processing unit 1005, configured to, before the first feature extraction unit 1001 extracts the first shallow global feature of the image to be detected based on the metric learning embedding module in the trained image classification combined model, input the image to be detected into a basic convolution module in the image classification combined model, perform global feature extraction on the image to be detected based on the basic convolution module, and obtain a global feature map corresponding to the image to be detected;
the first feature extraction unit 1001 is specifically configured to:
and based on a metric learning embedding module, carrying out embedding processing on the global feature map to obtain a first shallow global feature corresponding to the image to be detected.
Optionally, the depth semantic embedding module includes a convolution layer and an embedding layer composed of residual structures; the second feature extraction unit 1002 is specifically configured to:
and extracting the features of the semantic information in the global feature map based on the convolutional layer in the deep semantic embedding module, and embedding the extracted semantic information based on the embedding layer in the deep semantic embedding module to obtain a first deep semantic feature corresponding to the image to be detected.
Optionally, the learning rate of the semantic prediction module is higher than that of other modules, and the other modules include a basic convolution module, a metric learning embedding module and a deep semantic embedding module.
Optionally, the base convolution module includes a plurality of convolution layers; network parameters corresponding to the convolutional layers in the basic convolutional module are obtained based on the pre-trained parameter initialization of the specified sample library, and network parameters corresponding to the convolutional layers in the deep semantic embedding module are obtained based on random initialization.
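A rough sketch of this initialization scheme follows; torchvision is assumed for the ImageNet-pretrained backbone, and the ResNet choice as well as the layer sizes of the extra convolution block are illustrative assumptions, not taken from this application:

import torch
import torchvision

# basic convolution module: convolution layers initialized from ImageNet-pretrained parameters
backbone = torchvision.models.resnet101(
    weights=torchvision.models.ResNet101_Weights.IMAGENET1K_V1)

# convolution layer of the deep semantic embedding module: random initialization
extra_conv = torch.nn.Sequential(
    torch.nn.Conv2d(2048, 512, kernel_size=3, padding=1),
    torch.nn.BatchNorm2d(512),
    torch.nn.ReLU(inplace=True),
)
for m in extra_conv.modules():
    if isinstance(m, torch.nn.Conv2d):
        torch.nn.init.kaiming_normal_(m.weight)   # random (He) initialization
        if m.bias is not None:
            torch.nn.init.zeros_(m.bias)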
Optionally, the apparatus further comprises:
a training unit 1006, configured to train to obtain an image classification joint model by:
acquiring a training sample data set, and selecting a training sample group from the training sample data set;
inputting the selected training sample group into the image classification combined model to be trained, and acquiring, based on the image classification combined model, a third shallow global feature output by the metric learning embedding module, a third deep semantic feature output by the deep semantic embedding module and a classification vector output by the semantic prediction module;
and constructing a target loss function based on the third shallow global feature, the third deep semantic feature and the classification vector, adjusting network parameters of the image classification combined model for multiple times based on the target loss function until the image classification combined model is converged, and outputting the trained image classification combined model.
Optionally, the training unit 1006 is specifically configured to:
constructing a first triple loss function based on the third shallow global features corresponding to the training samples in the training sample group; constructing a second triple loss function based on the third deep semantic features corresponding to the training samples;
constructing a multi-label loss function based on the classification vector and the corresponding multi-label vector corresponding to each training sample, wherein the classification vector represents the prediction probability of the training sample corresponding to each classification label, and the multi-label vector represents the real probability of the training sample corresponding to each classification label;
obtaining a classification entropy loss function based on the classification vector corresponding to each training sample;
and carrying out weighted summation on the first triple loss function, the second triple loss function, the multi-label loss function and the classification entropy loss function to obtain a target loss function.
Optionally, each training sample set includes three training samples: an anchor sample, a positive sample and a negative sample; the training unit 1006 is specifically configured to:
determining a first distance between a third shallow global feature corresponding to an anchor sample in a training sample group and a third shallow global feature corresponding to a positive sample, and a second distance between the third shallow global feature corresponding to the anchor sample and a third shallow global feature corresponding to a negative sample;
taking the sum of the difference between the first distance and the second distance and a specified boundary value as a first target value, wherein the specified boundary value is used for representing a difference value boundary of the similarity between the positive sample and the negative sample;
and taking the maximum of the first target value and the reference parameter as the first triplet loss function.
Optionally, each training sample set includes three training samples: an anchor sample, a positive sample and a negative sample; the training unit 1006 is specifically configured to:
determining a third distance between a third deep semantic feature corresponding to the anchor sample in the training sample group and a third deep semantic feature corresponding to the positive sample, and a fourth distance between the third deep semantic feature corresponding to the anchor sample and a third deep semantic feature corresponding to the negative sample;
taking the sum of the difference between the third distance and the fourth distance and a specified boundary value as a second target value, wherein the specified boundary value is used for representing a difference value boundary of the similarity between the positive sample and the negative sample;
and taking the maximum of the second target value and the reference parameter as the second triplet loss function.
Optionally, the training unit 1006 is further configured to:
before the network parameters of the image classification combined model are adjusted multiple times based on the target loss function, performing weighted summation on the first triple loss function, the multi-label loss function and the classification entropy loss function to obtain an intermediate loss function;

and adjusting the network parameters of the deep semantic module in the image classification combined model a specified number of times based on the intermediate loss function.
Optionally, the apparatus comprises:
a sample construction unit 1007, configured to obtain a training sample group in a training sample data set by the following method:
according to the similarity between the collected sample images, forming a positive sample pair by every two sample images with the similarity higher than a specified threshold value;
performing sample recombination based on each positive sample pair to obtain a plurality of training sample groups for forming the training sample data set, wherein each training sample group comprises three training samples: an anchor sample, a positive sample and a negative sample, the anchor sample and the positive sample being sample images with a similarity higher than the specified threshold, and the negative sample and the positive sample being sample images with a similarity not higher than the specified threshold.
Optionally, the sample construction unit 1007 is specifically configured to:
selecting one sample image in a positive sample pair as a target sample, and respectively selecting one sample image from other sample pairs as a candidate sample;
respectively determining the distance between the target sample and each candidate sample;
sorting all the candidate samples according to the respective corresponding distances, and selecting at least one candidate sample with a designated sorting position as a negative sample corresponding to the target sample;
and respectively combining each selected negative sample with each selected positive sample pair to form a training sample group, wherein the target sample in the positive sample pair is a positive sample in the training sample group, and the other sample in the positive sample pair is an anchor point sample in the training sample group.
According to the image classification method and device of the present application, a model structure conducive to learning both kinds of features is obtained through the metric learning embedding module for extracting shallow global features and the deep semantic embedding module for extracting deep semantic features. When the image to be detected is classified based on the image classification joint model, the image can be characterized jointly by the underlying image representation and the deep semantic representation, which enhances their joint learning effect and reduces the time consumed by feature extraction; performing image classification based on the obtained fused image features yields a more accurate classification result, thereby improving the accuracy and efficiency of image classification.
For convenience of description, the above parts are separately described as modules (or units) according to functional division. Of course, the functionality of the various modules (or units) may be implemented in the same one or more pieces of software or hardware when implementing the present application.
Having described the image recognition method and apparatus according to an exemplary embodiment of the present application, next, an image recognition apparatus according to another exemplary embodiment of the present application will be described.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module" or "system."
In some possible embodiments, an image recognition apparatus according to the present application may include at least a processor and a memory. Wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the image recognition method according to various exemplary embodiments of the present application described in the specification. For example, the processor may perform the steps as shown in fig. 2.
Based on the same inventive concept as the method embodiments, an embodiment of the present application further provides an electronic device. In one embodiment, the electronic device may be a server, such as the server 120 shown in FIG. 1. In this embodiment, the structure of the electronic device may be as shown in fig. 11, including a memory 1101, a communication module 1103, and one or more processors 1102.
A memory 1101 for storing computer programs executed by the processor 1102. The memory 1101 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a program required for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 1101 may be a volatile memory (volatile memory), such as a random-access memory (RAM); the memory 1101 may also be a non-volatile memory (non-volatile memory), such as a read-only memory (rom), a flash memory (flash memory), a hard disk (HDD) or a solid-state drive (SSD); or the memory 1101 is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory 1101 may be a combination of the above memories.
The processor 1102 may include one or more Central Processing Units (CPUs), a digital processing unit, and the like. The processor 1102 is configured to implement the image recognition method when the computer program stored in the memory 1101 is called.
The communication module 1103 is used for communicating with the terminal device and other servers.
In the embodiment of the present application, the specific connection medium among the memory 1101, the communication module 1103, and the processor 1102 is not limited. In fig. 11, the memory 1101 and the processor 1102 are connected through a bus 1104, which is depicted with a thick line; the connection manner between the other components is merely illustrative and not limiting. The bus 1104 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one thick line is depicted in FIG. 11, but this does not mean that there is only one bus or only one type of bus.
The memory 1101 stores a computer storage medium, and the computer storage medium stores computer-executable instructions for implementing the image recognition method according to the embodiment of the present application. The processor 1102 is configured to perform the image recognition method described above, as shown in FIG. 2.
In another embodiment, the electronic device may also be other electronic devices, such as the terminal device 110 shown in fig. 1. In this embodiment, the structure of the electronic device may be as shown in fig. 12, including: communications assembly 1210, memory 1220, display unit 1230, camera 1240, sensors 1250, audio circuitry 1260, bluetooth module 1270, processor 1280, and the like.
The communication component 1210 is configured to communicate with a server. In some embodiments, a Wireless Fidelity (WiFi) module may be included, the WiFi module being a short-range Wireless transmission technology, through which the electronic device may help the user to transmit and receive information.
The memory 1220 may be used for storing software programs and data. Processor 1280 performs various functions of terminal device 110 and data processing by executing software programs or data stored in memory 1220. The memory 1220 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Memory 1220 stores an operating system that enables terminal device 110 to operate. The memory 1220 may store an operating system and various application programs, and may also store codes for performing the image recognition method according to the embodiment of the present application.
The display unit 1230 may also be used to display information input by the user or information provided to the user and a Graphical User Interface (GUI) of various menus of the terminal apparatus 110. Specifically, the display unit 1230 may include a display screen 1232 disposed on the front surface of the terminal device 110. The display 1232 may be configured in the form of a liquid crystal display, a light emitting diode, or the like. The display unit 1230 may be used to display the image classification result, the image retrieval result, and the like in the embodiment of the present application.
The display unit 1230 may be further configured to receive input numeric or character information and generate signal input related to user settings and function control of the terminal device 110, and specifically, the display unit 1230 may include a touch screen 1231 disposed on the front surface of the terminal device 110 and configured to collect touch operations of a user thereon or nearby, such as clicking a button, dragging a scroll box, and the like.
The touch screen 1231 may cover the display screen 1232, or the touch screen 1231 and the display screen 1232 may be integrated to implement the input and output functions of the terminal device 110, and after the integration, the touch screen may be referred to as a touch display screen for short. The display unit 1230 in this application can display the application programs and the corresponding operation steps.
The camera 1240 may be used to capture still images and a user may post comments on the images taken by the camera 1240 through an application. The number of the cameras 1240 may be one or plural. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing elements convert the light signals into electrical signals, which are then passed to a processor 1280 for conversion into digital image signals.
The terminal device may further comprise at least one sensor 1250, such as an acceleration sensor 1251, a distance sensor 1252, a fingerprint sensor 1253, a temperature sensor 1254. The terminal device may also be configured with other sensors such as a gyroscope, barometer, hygrometer, thermometer, infrared sensor, light sensor, motion sensor, and the like.
Audio circuit 1260, speaker 1261, microphone 1262 may provide an audio interface between a user and terminal device 110. The audio circuit 1260 may transmit the received electrical signal converted from the audio data to the speaker 1261, and the audio signal is converted into a sound signal by the speaker 1261 and output. Terminal device 110 may also be configured with a volume button for adjusting the volume of the sound signal. On the other hand, the microphone 1262 converts the collected sound signals into electrical signals, which are received by the audio circuit 1260 and converted into audio data, which are output to the communication module 1210 for transmission to, for example, another terminal device 110, or to the memory 1220 for further processing.
The bluetooth module 1270 is used for information interaction with other bluetooth devices having bluetooth modules through a bluetooth protocol. For example, the terminal device may establish a bluetooth connection with a wearable electronic device (e.g., a smart watch) that is also equipped with a bluetooth module through the bluetooth module 1270, so as to perform data interaction.
The processor 1280 is a control center of the terminal device, connects various parts of the entire terminal device using various interfaces and lines, and performs various functions of the terminal device and processes data by running or executing software programs stored in the memory 1220 and calling data stored in the memory 1220. In some embodiments, processor 1280 may include one or more processing units; the processor 1280 may also integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a baseband processor, which primarily handles wireless communications. It is to be appreciated that the baseband processor described above may not be integrated into the processor 1280. In the application, the processor 1280 may run an operating system, an application program, a user interface display and a touch response, and the image recognition method according to the embodiment of the application. Additionally, processor 1280 is coupled with display unit 1230.
In some possible embodiments, the aspects of the image recognition method provided by the present application may also be implemented in the form of a program product, which includes program code for causing an electronic device to perform the steps in the image recognition method according to various exemplary embodiments of the present application described above in this specification when the program product is run on the electronic device, for example, the electronic device may perform the steps as shown in fig. 2.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user equipment, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functions of two or more units described above may be embodied in one unit, according to embodiments of the application. Conversely, the features and functions of one unit described above may be further divided into embodiments by a plurality of units.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (17)

1. An image recognition method, comprising:
extracting a first shallow global feature of an image to be detected based on a metric learning embedded module in a trained image classification combined model, wherein the first shallow global feature represents global basic information of the image to be detected; and
extracting a first deep semantic feature of the image to be detected based on a depth semantic embedding module in the image classification combined model, wherein the first deep semantic feature represents image semantic information of the image to be detected;
based on a semantic prediction module in the image classification combined model, performing feature fusion on the first shallow layer global feature and the first deep layer semantic feature, and performing multi-label classification on the image to be detected based on the obtained first fusion feature to obtain a classification result of the image to be detected;
the image classification joint model is obtained by performing joint training on the metric learning embedding module and the depth semantic embedding module.
2. The method of claim 1, wherein said multi-label classifying said image to be detected based on said obtained first fused feature further comprises:
respectively extracting second shallow global features of each candidate image in an image library based on the metric learning embedding module; respectively extracting second deep semantic features of each candidate image based on the deep semantic embedding module;
respectively performing feature fusion on each second shallow layer global feature and the corresponding second deep layer semantic feature based on the semantic prediction module to obtain second fusion features corresponding to each candidate image;
determining the similarity of each candidate image and the image to be detected respectively based on the first fusion features and the second fusion features corresponding to each candidate image;
and determining a similar image corresponding to the image to be detected based on each similarity.
3. The method of claim 1, wherein the image classification joint model further comprises: a base convolution module; before extracting the first shallow global feature of the image to be detected based on the metric learning embedding module in the trained image classification joint model, the method further comprises:
inputting the image to be detected into a basic convolution module in the image classification combination model, and performing global feature extraction on the image to be detected based on the basic convolution module to obtain a global feature map corresponding to the image to be detected;
the extracting a first shallow global feature of the image to be detected based on the metric learning embedding module in the trained image classification joint model comprises:
and based on the metric learning embedding module, embedding the global feature map to obtain a first shallow global feature corresponding to the image to be detected.
4. The method of claim 3, wherein the depth semantic embedding module includes a convolutional layer and an embedding layer of residual structure;
the extracting a first deep semantic feature of the image to be detected based on the depth semantic embedding module in the image classification joint model comprises:
and extracting the features of the semantic information in the global feature map based on the convolutional layer in the deep semantic embedding module, and embedding the extracted semantic information based on the embedding layer in the deep semantic embedding module to obtain a first deep semantic feature corresponding to the image to be detected.
5. The method of claim 3, wherein the semantic prediction module has a learning rate that is higher than learning rates of other modules, including the base convolution module, the metric learning embedding module, and the deep semantic embedding module.
6. The method of claim 4, wherein the base convolution module includes a plurality of convolution layers; network parameters corresponding to the convolutional layers in the basic convolutional module are obtained based on the pre-trained parameter initialization of the specified sample library, and network parameters corresponding to the convolutional layers in the deep semantic embedding module are obtained based on random initialization.
7. The method of claim 1, wherein the joint model of image classification is trained by:
acquiring a training sample data set, and selecting a training sample group from the training sample data set;
inputting the selected training sample group into the image classification joint model to be trained, and acquiring a third shallow global feature output by the metric learning embedding module of the image classification joint model, a third deep semantic feature output by the deep semantic embedding module, and a classification vector output by the semantic prediction module, wherein the classification vector represents prediction probabilities of the training sample for the classification labels;
and constructing a target loss function based on the third shallow global feature, the third deep semantic feature and the classification vector, adjusting network parameters of the image classification joint model multiple times based on the target loss function until the image classification joint model converges, and outputting the trained image classification joint model.
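A schematic training step for claim 7. Everything here is an assumption used for illustration: the model is assumed to return a (shallow feature, deep feature, class logits) triple per image, and the loss construction of claim 8 is passed in as a callable.

```python
import torch

def train_step(model, optimizer, anchor, positive, negative, labels, build_target_loss):
    """One parameter update of the joint model on a single training sample group."""
    optimizer.zero_grad()
    # Forward each of the three samples through the joint model.
    outputs = [model(x) for x in (anchor, positive, negative)]
    loss = build_target_loss(outputs, labels)  # weighted loss of claim 8
    loss.backward()
    optimizer.step()
    return loss.item()
```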
8. The method of claim 7, wherein the constructing a target loss function based on the third shallow global feature, the third deep semantic feature and the classification vector comprises:
constructing a first triplet loss function based on the third shallow global features corresponding to the training samples in the training sample group; constructing a second triplet loss function based on the third deep semantic features corresponding to the training samples;
constructing a multi-label loss function based on the classification vector corresponding to each training sample and the corresponding multi-label vector, wherein the multi-label vector represents ground-truth probabilities of the training sample for the classification labels;
obtaining a classification entropy loss function based on the classification vector corresponding to each training sample;
and performing weighted summation on the first triplet loss function, the second triplet loss function, the multi-label loss function and the classification entropy loss function, to obtain the target loss function.
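Written out, the weighted summation of claim 8 amounts to the objective below; the weights λ are not fixed by the claim and are treated here as tunable hyperparameters.

```latex
\mathcal{L}_{\mathrm{target}}
  = \lambda_1 \, \mathcal{L}_{\mathrm{triplet}}^{\mathrm{shallow}}
  + \lambda_2 \, \mathcal{L}_{\mathrm{triplet}}^{\mathrm{deep}}
  + \lambda_3 \, \mathcal{L}_{\mathrm{multi\text{-}label}}
  + \lambda_4 \, \mathcal{L}_{\mathrm{entropy}}
```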
9. The method of claim 8, wherein each training sample group comprises three training samples: an anchor sample, a positive sample and a negative sample; and the constructing a first triplet loss function based on the third shallow global features corresponding to the training samples in the training sample group comprises:
determining a first distance between the third shallow global feature corresponding to the anchor sample in the training sample group and the third shallow global feature corresponding to the positive sample, and a second distance between the third shallow global feature corresponding to the anchor sample and the third shallow global feature corresponding to the negative sample;
taking the sum of a specified boundary value and the difference between the first distance and the second distance as a first target value, wherein the specified boundary value represents a margin on the similarity difference between the positive sample and the negative sample;
and taking the maximum of the first target value and a reference parameter as the first triplet loss function.
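This is the standard triplet (hinge) loss. A minimal PyTorch sketch of claim 9 follows, assuming Euclidean distance, a margin of 0.2 and a reference parameter of 0, none of which are fixed by the claim.

```python
import torch
import torch.nn.functional as F

def shallow_triplet_loss(anchor, positive, negative, margin: float = 0.2, reference: float = 0.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, reference), averaged over a batch."""
    d_ap = F.pairwise_distance(anchor, positive)  # first distance
    d_an = F.pairwise_distance(anchor, negative)  # second distance
    target = d_ap - d_an + margin                 # first target value
    return torch.clamp(target, min=reference).mean()
```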
10. The method of claim 8, wherein each training sample group comprises three training samples: an anchor sample, a positive sample and a negative sample; and the constructing a second triplet loss function based on the third deep semantic features corresponding to the training samples comprises:
determining a third distance between the third deep semantic feature corresponding to the anchor sample in the training sample group and the third deep semantic feature corresponding to the positive sample, and a fourth distance between the third deep semantic feature corresponding to the anchor sample and the third deep semantic feature corresponding to the negative sample;
taking the sum of a specified boundary value and the difference between the third distance and the fourth distance as a second target value, wherein the specified boundary value represents a margin on the similarity difference between the positive sample and the negative sample;
and taking the maximum of the second target value and a reference parameter as the second triplet loss function.
11. The method of claim 8, wherein before the adjusting network parameters of the image classification joint model multiple times based on the target loss function, the method further comprises:
performing weighted summation on the first triplet loss function, the multi-label loss function and the classification entropy loss function, to obtain an intermediate loss function;
and adjusting network parameters of the deep semantic embedding module in the image classification joint model a specified number of times based on the intermediate loss function.
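A hedged sketch of the warm-up phase in claim 11: for a specified number of steps, only the deep semantic embedding module's parameters are updated, driven by the intermediate loss. The optimizer choice and the loss callable are assumptions.

```python
import torch

def warm_up(deep_semantic_params, data_iter, num_steps, build_intermediate_loss, lr=1e-3):
    """Update only the deep semantic embedding module for `num_steps` batches."""
    optimizer = torch.optim.SGD(deep_semantic_params, lr=lr, momentum=0.9)
    for _, batch in zip(range(num_steps), data_iter):
        optimizer.zero_grad()
        loss = build_intermediate_loss(batch)  # weighted sum of claim 11's three terms
        loss.backward()
        optimizer.step()
```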
12. The method according to any one of claims 1 to 11, wherein each training sample group in the training sample data set is obtained by:
forming, according to similarities between collected sample images, a positive sample pair from every two sample images whose similarity is higher than a specified threshold;
performing sample recombination based on each positive sample pair, to obtain a plurality of training sample groups constituting the training sample data set, each training sample group comprising three training samples: an anchor sample, a positive sample and a negative sample, wherein the anchor sample and the positive sample are sample images whose similarity is higher than the specified threshold, and the negative sample and the positive sample are sample images whose similarity is not higher than the specified threshold.
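Claim 12's pair construction is a simple thresholding over pairwise similarities. The sketch below assumes a caller-supplied similarity function; nothing about its form comes from the patent.

```python
import itertools

def build_positive_pairs(images, similarity, threshold):
    """Pair every two sample images whose similarity exceeds the specified threshold."""
    pairs = []
    for a, b in itertools.combinations(range(len(images)), 2):
        if similarity(images[a], images[b]) > threshold:
            pairs.append((a, b))  # indices of a positive sample pair
    return pairs
```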
13. The method of claim 12, wherein the performing sample recombination based on each positive sample pair, to obtain a plurality of training sample groups constituting the training sample data set comprises:
selecting one sample image in a positive sample pair as a target sample, and selecting one sample image from each of the other positive sample pairs as a candidate sample;
determining a distance between the target sample and each candidate sample;
sorting the candidate samples according to their respective distances, and selecting at least one candidate sample at a designated ranking position as a negative sample corresponding to the target sample;
and combining each selected negative sample with the positive sample pair to form a training sample group, wherein the target sample in the positive sample pair is the positive sample in the training sample group, and the other sample image in the positive sample pair is the anchor sample in the training sample group.
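Put together, claims 12 and 13 describe distance-based negative mining over the positive pairs. The sketch below assumes a caller-supplied distance function and that the "designated ranking positions" are the closest candidates (a common hard-negative choice the patent does not mandate).

```python
def mine_triplets(pairs, features, distance, num_negatives=1):
    """Build (anchor, positive, negative) index triplets from positive pairs."""
    triplets = []
    for i, (anchor_idx, target_idx) in enumerate(pairs):
        # One candidate sample taken from each of the other positive pairs.
        candidates = [other[0] for j, other in enumerate(pairs) if j != i]
        # Sort candidates by their distance to the target sample.
        candidates.sort(key=lambda c: distance(features[target_idx], features[c]))
        for neg_idx in candidates[:num_negatives]:  # assumed "designated ranking positions"
            # The target sample serves as the positive; the other pair member as the anchor.
            triplets.append((anchor_idx, target_idx, neg_idx))
    return triplets
```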
14. An image recognition apparatus, comprising:
a first feature extraction unit, configured to extract a first shallow global feature of an image to be detected based on a metric learning embedding module in a trained image classification joint model, wherein the first shallow global feature represents global basic information of the image to be detected;
a second feature extraction unit, configured to extract a first deep semantic feature of the image to be detected based on a deep semantic embedding module in the image classification joint model, wherein the first deep semantic feature represents image semantic information of the image to be detected; and
a classification prediction unit, configured to perform feature fusion on the first shallow global feature and the first deep semantic feature based on a semantic prediction module in the image classification joint model, and perform multi-label classification on the image to be detected based on the obtained first fusion feature, to obtain a classification result of the image to be detected;
wherein the image classification joint model is obtained by performing joint training on the metric learning embedding module and the deep semantic embedding module.
15. An electronic device, comprising a processor and a memory, wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 13.
16. A computer-readable storage medium, comprising program code which, when run on an electronic device, causes the electronic device to carry out the steps of the method according to any one of claims 1 to 13.
17. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 13 when executed by a processor.
CN202111089047.6A 2021-09-16 2021-09-16 Image identification method and device, electronic equipment and storage medium Pending CN114283316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111089047.6A CN114283316A (en) 2021-09-16 2021-09-16 Image identification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111089047.6A CN114283316A (en) 2021-09-16 2021-09-16 Image identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114283316A true CN114283316A (en) 2022-04-05

Family

ID=80868603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111089047.6A Pending CN114283316A (en) 2021-09-16 2021-09-16 Image identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114283316A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114693995A (en) * 2022-04-14 2022-07-01 北京百度网讯科技有限公司 Model training method applied to image processing, image processing method and device
CN114913339A (en) * 2022-04-21 2022-08-16 北京百度网讯科技有限公司 Training method and device of feature map extraction model
CN114913339B (en) * 2022-04-21 2023-12-05 北京百度网讯科技有限公司 Training method and device for feature map extraction model
CN114898201A (en) * 2022-07-11 2022-08-12 浙江大华技术股份有限公司 Target detection method, device, equipment and medium
CN114898201B (en) * 2022-07-11 2022-10-28 浙江大华技术股份有限公司 Target detection method, device, equipment and medium
CN115205555A (en) * 2022-07-12 2022-10-18 北京百度网讯科技有限公司 Method for determining similar images, training method, information determination method and equipment
CN115205555B (en) * 2022-07-12 2023-05-26 北京百度网讯科技有限公司 Method for determining similar images, training method, information determining method and equipment
CN115147873A (en) * 2022-09-01 2022-10-04 汉斯夫(杭州)医学科技有限公司 Method, equipment and medium for automatically classifying dental images based on dual-label cascade
CN116883673A (en) * 2023-09-08 2023-10-13 腾讯科技(深圳)有限公司 Semantic segmentation model training method, device, equipment and storage medium
CN116883673B (en) * 2023-09-08 2023-12-26 腾讯科技(深圳)有限公司 Semantic segmentation model training method, device, equipment and storage medium
CN117409285A (en) * 2023-12-14 2024-01-16 先临三维科技股份有限公司 Image detection method and device and electronic equipment
CN117409285B (en) * 2023-12-14 2024-04-05 先临三维科技股份有限公司 Image detection method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN114283316A (en) Image identification method and device, electronic equipment and storage medium
WO2021227726A1 (en) Methods and apparatuses for training face detection and image detection neural networks, and device
JP2022505775A (en) Image classification model training methods, image processing methods and their equipment, and computer programs
WO2019100723A1 (en) Method and device for training multi-label classification model
US11816149B2 (en) Electronic device and control method thereof
US20130346346A1 (en) Semi-supervised random decision forests for machine learning
CN112396106B (en) Content recognition method, content recognition model training method, and storage medium
CN113393474B (en) Feature fusion based three-dimensional point cloud classification and segmentation method
Kim et al. Fast pedestrian detection in surveillance video based on soft target training of shallow random forest
CN111382868A (en) Neural network structure search method and neural network structure search device
WO2021218470A1 (en) Neural network optimization method and device
CN110222718B (en) Image processing method and device
JP2022553252A (en) IMAGE PROCESSING METHOD, IMAGE PROCESSING APPARATUS, SERVER, AND COMPUTER PROGRAM
CN113807399A (en) Neural network training method, neural network detection method and neural network detection device
CN113868497A (en) Data classification method and device and storage medium
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN112446888A (en) Processing method and processing device for image segmentation model
CN114238690A (en) Video classification method, device and storage medium
CN113011568A (en) Model training method, data processing method and equipment
CN115131604A (en) Multi-label image classification method and device, electronic equipment and storage medium
CN116235209A (en) Sparse optical flow estimation
CN114298122A (en) Data classification method, device, equipment, storage medium and computer program product
CN113033507B (en) Scene recognition method and device, computer equipment and storage medium
Kuriakose et al. SceneRecog: a deep learning scene recognition model for assisting blind and visually impaired navigate using smartphones

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination