CN115690530A - Object recognition model training method and device, electronic equipment and storage medium


Info

Publication number: CN115690530A
Application number: CN202211222074.0A
Authority: CN (China)
Legal status: Pending
Prior art keywords: feature map, frequency domain, semantic, recognition model, object recognition
Other languages: Chinese (zh)
Inventors: Niu Xuesong (牛雪松), Gu Jili (谷继力)
Applicant and assignee: Beijing Dajia Internet Information Technology Co Ltd

Landscapes

  • Image Analysis (AREA)

Abstract

The present disclosure relates to a method, an apparatus, an electronic device and a storage medium for training an object recognition model. The method comprises: extracting semantic features of a sample image at a target semantic depth based on an object recognition model to be trained, to obtain an initial semantic feature map of the target semantic depth; performing scale compression processing on the initial semantic feature map to obtain a first semantic feature map; extracting semantic features of the sample image at the target semantic depth based on a teacher recognition model, to obtain a second semantic feature map; performing frequency domain enhancement processing based on the frequency domain features corresponding to the first semantic feature map to obtain a first frequency domain enhanced feature map, and performing frequency domain enhancement processing based on the frequency domain features corresponding to the second semantic feature map to obtain a second frequency domain enhanced feature map; and training the object recognition model based on the difference between the first frequency domain enhanced feature map and the second frequency domain enhanced feature map until a preset training end condition is reached. The method reduces the model's consumption of computing resources while preserving the recognition accuracy of the object recognition model.

Description

Object recognition model training method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for training an object recognition model, an electronic device, and a storage medium.
Background
With the development of computer technology, object recognition models based on convolutional neural networks have been widely applied in the field of object recognition (for example, commodity recognition). However, because convolutional neural networks consume considerable computing resources during inference, deploying object recognition models on mobile phones or embedded devices poses significant challenges.
In the related art, targeted training methods have been adopted for the above problem. Although these methods can reduce, to a certain extent, the computing resources consumed by the trained object recognition model during inference, they cannot guarantee the recognition accuracy of the model.
Disclosure of Invention
The present disclosure provides a method and an apparatus for training an object recognition model, an electronic device, and a storage medium, so as to at least solve the problem in the related art that a model cannot maintain high recognition accuracy while the computing resources consumed during inference of the object recognition model are effectively reduced. The technical solution of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided an object recognition model training method, including:
extracting semantic features of a sample image at a target semantic depth based on an object recognition model to be trained to obtain an initial semantic feature map of the target semantic depth; the object identification model is used for identifying a target object in the sample image;
carrying out scale compression processing on the initial semantic feature map of the target semantic depth to obtain a first semantic feature map of the target semantic depth;
extracting semantic features of the sample image at the target semantic depth based on a pre-trained teacher recognition model to obtain a second semantic feature map of the target semantic depth; the teacher identification model is used for identifying a target object in the sample image;
performing frequency domain enhancement processing on the first semantic feature map based on the frequency domain feature corresponding to the first semantic feature map to obtain a first frequency domain enhancement feature map of the target semantic depth; performing frequency domain enhancement processing on the second semantic feature map based on the frequency domain feature corresponding to the second semantic feature map to obtain a second frequency domain enhancement feature map of the target semantic depth;
and training the object recognition model based on the difference between the first frequency domain enhanced feature map of the target semantic depth and the second frequency domain enhanced feature map of the target semantic depth until a preset training end condition is reached, so as to obtain the trained object recognition model.
In an exemplary embodiment, the sample image corresponds to a reference category label; the training the object recognition model based on the difference between the first frequency-domain enhanced feature map and the second frequency-domain enhanced feature map until reaching a preset training end condition includes:
determining a first loss value based on a difference between the first frequency-domain enhancement feature map and the second frequency-domain enhancement feature map;
performing object recognition processing on the sample image based on the first semantic feature map through the object recognition model to obtain a first recognition result; and determining a second loss value based on the first recognition result and the reference category label corresponding to the sample image;
performing object recognition processing on the sample image based on the second semantic feature map through the teacher recognition model to obtain a second recognition result;
determining a third loss value based on the first recognition result and the second recognition result;
and adjusting the model parameters of the object recognition model based on the first loss value, the second loss value and the third loss value until a preset training end condition is reached.
In an exemplary embodiment, the extracting semantic features of a sample image at a target semantic depth based on an object recognition model to be trained to obtain an initial semantic feature map of the target semantic depth includes:
inputting the sample image into an object recognition model to be trained, and performing convolution processing on the sample image through a shallow convolution network of the object recognition model to obtain a feature map output by the shallow convolution network; the shallow convolutional network comprises a preset number of convolutional layers close to the input layer in the object recognition model;
and obtaining an initial semantic feature map of the target semantic depth based on the feature map output by the shallow convolutional network.
In an exemplary embodiment, the obtaining an initial semantic feature map of the target semantic depth based on the feature map output by the shallow convolutional network includes:
carrying out scale compression processing on the feature map output by the shallow layer convolution network, inputting the feature map subjected to scale compression processing into a middle layer convolution network connected with the shallow layer convolution network for convolution processing, and obtaining the feature map output by the middle layer convolution network;
and taking the feature map output by the middle layer convolution network and/or the feature map output by the shallow layer convolution network as the initial semantic feature map of the target semantic depth.
In an exemplary embodiment, the determining a first loss value based on a difference between the first frequency-domain enhanced feature map of the target semantic depth and the second frequency-domain enhanced feature map of the target semantic depth includes:
determining a first sub-loss value based on a difference between a first frequency-domain enhancement feature map corresponding to the shallow convolutional network and a second frequency-domain enhancement feature map corresponding to the shallow convolutional network;
determining a second sub-loss value based on a difference between a first frequency domain enhancement feature map corresponding to the middle layer convolutional network and a second frequency domain enhancement feature map corresponding to the middle layer convolutional network;
determining the first loss value based on the first sub-loss value and the second sub-loss value.
In an exemplary embodiment, the determining the first loss value based on the first sub-loss value and the second sub-loss value includes:
determining a third sub-loss value based on a difference between a frequency domain enhanced feature map corresponding to a higher-level convolutional network in the object recognition model and a frequency domain enhanced feature map corresponding to a higher-level convolutional network in the teacher recognition model;
determining the first loss value based on the first, second, and third sub-loss values;
the frequency domain enhanced feature map corresponding to the high-level convolutional network is obtained based on the feature map output by the high-level convolutional network, and the feature map output by the high-level convolutional network is obtained by performing scale compression processing on the feature map output by the middle-level convolutional network to which it is connected.
In an exemplary embodiment, the performing, on the basis of the frequency-domain feature corresponding to the first semantic feature map, frequency-domain enhancement processing on the first semantic feature map to obtain a first frequency-domain enhanced feature map includes:
performing Fourier transform on the first semantic feature map to obtain a first frequency domain feature map corresponding to the first semantic feature map;
determining a first enhanced frequency domain map based on the first frequency domain feature map and a first mask map corresponding to the first frequency domain feature map; and performing inverse Fourier transform on the first enhanced frequency domain map to obtain the first frequency domain enhanced feature map;
the performing frequency domain enhancement processing on the second semantic feature map based on the frequency domain feature corresponding to the second semantic feature map to obtain a second frequency domain enhanced feature map includes:
performing Fourier transform on the second semantic feature map to obtain a second frequency domain feature map corresponding to the second semantic feature map;
determining a second enhanced frequency domain map based on the second frequency domain feature map and a second mask map corresponding to the second frequency domain feature map; and performing inverse Fourier transform on the second enhanced frequency domain map to obtain the second frequency domain enhanced feature map.
According to a second aspect of the embodiments of the present disclosure, there is provided an object recognition model training apparatus, including:
the first feature extraction unit is configured to extract semantic features of a sample image at a target semantic depth based on an object recognition model to be trained to obtain an initial semantic feature map of the target semantic depth; the object identification model is used for identifying a target object in the sample image;
the scale compression unit is configured to perform scale compression processing on the initial semantic feature map of the target semantic depth to obtain a first semantic feature map of the target semantic depth;
the second feature extraction unit is configured to extract semantic features of the sample image at the target semantic depth based on a pre-trained teacher identification model to obtain a second semantic feature map of the target semantic depth; the teacher identification model is used for identifying a target object in the sample image;
a frequency domain enhancement processing unit configured to perform frequency domain enhancement processing on the first semantic feature map based on a frequency domain feature corresponding to the first semantic feature map to obtain a first frequency domain enhancement feature map of the target semantic depth; performing frequency domain enhancement processing on the second semantic feature map based on the frequency domain features corresponding to the second semantic feature map to obtain a second frequency domain enhanced feature map of the target semantic depth;
and a training unit configured to train the object recognition model based on the difference between the first frequency domain enhanced feature map of the target semantic depth and the second frequency domain enhanced feature map of the target semantic depth until a preset training end condition is reached, to obtain the trained object recognition model.
In an exemplary embodiment, the sample image corresponds to a reference category label; the training unit comprises:
a first loss determination unit configured to perform determining a first loss value based on a difference between the first frequency-domain enhancement feature map and the second frequency-domain enhancement feature map;
a second loss determination unit configured to perform object recognition processing on the sample image based on the first semantic feature map through the object recognition model to obtain a first recognition result, and to determine a second loss value based on the first recognition result and the reference category label corresponding to the sample image;
a first identification unit configured to perform object identification processing on the sample image based on the second semantic feature map by the teacher identification model to obtain a second identification result;
a third loss determination unit configured to perform determination of a third loss value based on the first recognition result and the second recognition result;
a parameter adjusting unit configured to perform adjusting model parameters of the object recognition model based on the first loss value, the second loss value, and the third loss value until a preset training end condition is reached.
In an exemplary embodiment, the first feature extraction unit includes:
the first convolution unit is configured to input the sample image into an object recognition model to be trained, and carry out convolution processing on the sample image through a shallow convolution network of the object recognition model to obtain a feature map output by the shallow convolution network; the shallow convolutional network comprises a preset number of convolutional layers close to the input layer in the object recognition model;
and the feature extraction subunit is configured to execute a feature map output based on the shallow convolutional network to obtain an initial semantic feature map of the target semantic depth.
In an exemplary embodiment, the feature extraction subunit is specifically configured to perform scale compression processing on the feature map output by the shallow convolutional network, and input the feature map after the scale compression processing to a middle convolutional network connected to the shallow convolutional network for convolution processing, so as to obtain a feature map output by the middle convolutional network; and taking the feature map output by the middle layer convolution network and/or the feature map output by the shallow layer convolution network as the initial semantic feature map of the target semantic depth.
In an exemplary embodiment, the first loss determining unit includes:
a first sub-loss determination unit configured to perform determining a first sub-loss value based on a difference between a first frequency-domain enhancement feature map corresponding to the shallow convolutional network and a second frequency-domain enhancement feature map corresponding to the shallow convolutional network;
a second sub-loss determination unit configured to perform determining a second sub-loss value based on a difference between a first frequency-domain enhancement feature map corresponding to the middle layer convolutional network and a second frequency-domain enhancement feature map corresponding to the middle layer convolutional network;
a fourth loss determination unit configured to perform determining the first loss value based on the first sub-loss value and the second sub-loss value.
In an exemplary embodiment, the fourth loss determining unit includes:
a third sub-loss determination unit configured to perform determination of a third sub-loss value based on a difference between a frequency-domain enhancement feature map corresponding to a higher-level convolutional network in the object recognition model and a frequency-domain enhancement feature map corresponding to a higher-level convolutional network in the teacher recognition model;
a fifth loss determination unit configured to perform determining the first loss value based on the first sub-loss value, the second sub-loss value, and the third sub-loss value;
the frequency domain enhanced feature map corresponding to the high-level convolutional network is obtained based on the feature map output by the high-level convolutional network, and the feature map output by the high-level convolutional network is obtained by performing scale compression processing on the feature map output by the middle-level convolutional network to which it is connected.
In an exemplary embodiment, the frequency domain enhancement processing unit includes:
the first frequency domain enhancement processing unit is configured to perform Fourier transform on the first semantic feature map to obtain a first frequency domain feature map corresponding to the first semantic feature map; determining a first enhanced frequency domain map based on the first frequency domain feature map and a first mask map corresponding to the first frequency domain feature map; performing inverse Fourier transform on the first frequency domain enhancement image to obtain a first frequency domain enhancement characteristic image;
the second frequency domain enhancement processing unit is used for carrying out Fourier transform on the second semantic feature map to obtain a second frequency domain feature map corresponding to the second semantic feature map; determining a second enhanced frequency domain map based on the second frequency domain feature map and a second mask map corresponding to the second frequency domain feature map; and performing inverse Fourier transform on the second enhanced frequency domain image to obtain a second frequency domain enhanced characteristic image.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the object recognition model training method of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the object recognition model training method of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the object recognition model training method of the first aspect described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the semantic features of the target semantic depth of the sample image are extracted through the object recognition model, the calculation resource consumption in the recognition process of the object recognition model is reduced through scale compression, the feature transfer between the teacher recognition model and the object recognition model to be trained is realized through the frequency domain enhanced feature map, so that the recognition result of the object recognition model is basically consistent with the accuracy of the teacher recognition model, and the effective reduction of the model calculation resource consumption is realized while the recognition accuracy of the trained object recognition model is ensured.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram of an application environment of a method for training an object recognition model according to an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a method of object recognition model training in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating the structure of an object recognition model to be trained in accordance with an exemplary embodiment;
FIG. 4 is a flow diagram illustrating another method of object recognition model training in accordance with an exemplary embodiment;
FIG. 5 is a flow diagram illustrating another method of object recognition model training in accordance with an exemplary embodiment;
FIG. 6 is a schematic diagram illustrating a first information delivery module in accordance with an exemplary embodiment;
FIG. 7 is a block diagram illustrating an architecture of an object recognition model training apparatus in accordance with an exemplary embodiment;
FIG. 8 is a block diagram of an electronic device shown in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should also be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are both information and data that are authorized by the user or sufficiently authorized by various parties.
Referring to fig. 1, a schematic diagram of an application environment of an object recognition model training method according to an exemplary embodiment is shown, where the application environment may include a terminal 110 and a server 120, and the terminal 110 and the server 120 may be connected through a wired network or a wireless network.
The terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The terminal 110 may have installed therein client software providing an image processing function, such as an application (App), which may be a stand-alone application or a sub-program of another application. Illustratively, the applications may include short video applications, live streaming applications, and the like. The image processing function may be identifying a target object in an image, for example, identifying an article such as a commodity in the image.
Specifically, the terminal 110 may perform object recognition processing on the image to be recognized based on a pre-trained object recognition model to obtain a recognition result, where the pre-trained object recognition model may be a lightweight convolutional neural network model capable of operating in the terminal 110.
The server 120 may provide background services for the application programs in the terminal 110, where the background services may include an object recognition model training service, and the server 120 may issue the object recognition model to the terminal 110 after training the object recognition model. The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like.
Fig. 2 is a flowchart illustrating an object recognition model training method according to an exemplary embodiment, which includes the following steps, as shown in fig. 2.
In step S201, semantic features of the sample image at the target semantic depth are extracted based on the object recognition model to be trained, so as to obtain an initial semantic feature map at the target semantic depth.
The object recognition model to be trained is used for recognizing a target object in the sample image, and the target object may be set according to an actual application scenario, for example, in a commodity recognition scenario, the target object may be a commodity, and in a face recognition scenario, the target object may be a face, and so on.
The sample images may be original images of the target object in the training data set, and each sample image corresponds to labeling information indicating a reference category of the target object in the sample image, for example, when the target object is a commodity, the reference category may include a primary classification category (e.g., jacket, trousers, etc.), and may further include a secondary classification category (e.g., shirt, T-shirt, etc.).
The object recognition model to be trained may have a convolutional neural network structure. A convolutional neural network (CNN) is a deep neural network with a convolutional architecture; it includes a feature extractor consisting of convolutional layers and sub-sampling layers, which can be regarded as a filter. A convolutional layer is a layer of neurons that performs convolution processing on an input signal; within it, a neuron may be connected to only a portion of the neurons in the adjacent layer. A convolutional layer usually contains several feature planes, each of which may be composed of neural units arranged in a rectangular grid. Neural units of the same feature plane share weights, and the shared weights are the convolution kernels. Weight sharing can be understood to mean that the way image information is extracted is independent of location. A convolution kernel can be initialized as a matrix of random size and learns reasonable weights during training of the convolutional neural network. In addition, weight sharing has the direct benefit of reducing the connections between layers of the network while also reducing the risk of overfitting.
For example, the object recognition model to be trained may be a deep convolutional network such as a residual network (ResNet). The residual network, proposed in 2015, is easier to optimize than a conventional convolutional neural network and can gain accuracy from considerably increased depth. Its core idea is to overcome the side effect (the degradation problem) caused by increasing depth, so that network performance can be improved simply by making the network deeper. A residual network generally consists of many sub-modules with the same structure, and a number is usually appended to the name to indicate how many times the sub-modules are repeated, such as ResNet38.
Fig. 3 is a schematic structural diagram of an object recognition model to be trained according to an embodiment of the present disclosure. As shown in fig. 3, the model includes five convolutional network layers connected in sequence between input and output; the last convolutional network layer is connected to an average pooling layer, and the average pooling layer is connected to a classifier. The classifier includes a fully connected layer and an output layer; the number of neurons in the fully connected layer matches the number of reference categories (for example, 1000 neurons for 1000 reference categories), and the output layer may employ a Softmax classification function. The convolutional network layers of the model are described below, taking the 224x224 input image size shown in fig. 3 as an example.
The output size of the first convolutional network layer is 112x112; the size of the convolution kernels in the first convolutional network layer is 7x7 and their number is 64. The output size of the second convolutional network layer is 56x56; it comprises a maximum pooling layer and three first convolution modules, where the pooling kernel of the maximum pooling layer is 3x3 with 64 channels, and each first convolution module comprises three convolutional layers (kernel sizes 1x1, 3x3 and 1x1 in sequence; kernel numbers 64, 64 and 256 in sequence). The output size of the third convolutional network layer is 28x28; it comprises a second convolution module and three third convolution modules, where the second convolution module comprises three convolutional layers (kernel sizes 1x1, 3x3 and 1x1 in sequence; kernel numbers 128, 128 and 512 in sequence; strides 1, 2 and 1 in sequence) and each third convolution module comprises three convolutional layers (kernel sizes 1x1, 3x3 and 1x1 in sequence; kernel numbers 128, 128 and 512 in sequence; stride 1). The output size of the fourth convolutional network layer is 14x14; it comprises a fourth convolution module and five fifth convolution modules, where the fourth convolution module comprises three convolutional layers (kernel sizes 1x1, 3x3 and 1x1 in sequence; kernel numbers 256, 256 and 1024 in sequence; strides 1, 2 and 1 in sequence) and each fifth convolution module comprises three convolutional layers (kernel sizes 1x1, 3x3 and 1x1 in sequence; kernel numbers 256, 256 and 1024 in sequence; stride 1). The output size of the fifth convolutional network layer is 7x7; it comprises a sixth convolution module and two seventh convolution modules, where the sixth convolution module comprises three convolutional layers (kernel sizes 1x1, 3x3 and 1x1 in sequence; kernel numbers 512, 512 and 2048 in sequence; strides 1, 2 and 1 in sequence) and each seventh convolution module comprises three convolutional layers (kernel sizes 1x1, 3x3 and 1x1 in sequence; kernel numbers 512, 512 and 2048 in sequence; stride 1).
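For illustration only (this sketch is not part of the patent text), the five-layer layout described above matches a standard ResNet-50 backbone. The following Python sketch, assuming the PyTorch/torchvision libraries, traces the stage output sizes for a 224x224 input:

```python
# Hypothetical sketch: the five convolutional network layers described above
# correspond to the conv1..conv5 stages of a standard ResNet-50.
# PyTorch/torchvision are assumed frameworks, not named in the patent.
import torch
from torchvision.models import resnet50

model = resnet50(num_classes=1000)
x = torch.randn(1, 3, 224, 224)  # 224x224 input image

# Layer 1: 7x7 conv, 64 kernels -> 112x112
x = model.relu(model.bn1(model.conv1(x)))
print("layer1:", x.shape)   # (1, 64, 112, 112)
# Layer 2: 3x3 max pooling + three bottleneck modules -> 56x56
x = model.layer1(model.maxpool(x))
print("layer2:", x.shape)   # (1, 256, 56, 56)
# Layer 3: four bottleneck modules -> 28x28
x = model.layer2(x)
print("layer3:", x.shape)   # (1, 512, 28, 28)
# Layer 4: six bottleneck modules -> 14x14
x = model.layer3(x)
print("layer4:", x.shape)   # (1, 1024, 14, 14)
# Layer 5: three bottleneck modules -> 7x7
x = model.layer4(x)
print("layer5:", x.shape)   # (1, 2048, 7, 7)
```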
The target semantic depth may be set according to actual needs. For example, the target semantic depth may include a low semantic depth and a medium semantic depth, where the semantic depth increases from the former to the latter.
In practical applications, a convolutional neural network structure may generally include, in order from input to output, a shallow convolutional network, a middle convolutional network, and a high convolutional network. The shallow convolutional network is a preset number of convolutional layers close to the input layer of the model, and the preset number may be set based on the total number of convolutional layers in the model; for example, if the total number of convolutional layers in the model is 3, the first convolutional layer may be set as the shallow convolutional network. The middle convolutional network comprises convolutional layers connected to the shallow convolutional network, and the high convolutional network comprises convolutional layers connected to the middle convolutional network. Accordingly, the semantic features extracted by the shallow convolutional network may be defined as semantic features of low semantic depth (also referred to as a low-level semantic feature map), the semantic features extracted by the middle convolutional network as semantic features of medium semantic depth (also referred to as a middle-level semantic feature map), and the semantic features extracted by the high convolutional network as semantic features of high semantic depth (also referred to as a high-level semantic feature map).
Taking the model structure shown in fig. 3 as an example, the first convolutional network layer and the second convolutional network layer may be regarded as a shallow convolutional network, the third convolutional network layer and the fourth convolutional network layer may be regarded as a middle convolutional network, and the fifth convolutional network layer may be regarded as a high convolutional network, then the semantic features with low semantic depth may be features output by the first convolutional network layer and/or the second convolutional network layer, the semantic features with high semantic depth may be features output by the fifth convolutional network layer, and then the features output by the third convolutional network layer and/or the fourth convolutional network layer may both be regarded as semantic features with medium semantic depth.
It is to be understood that the above division of the shallow convolutional network, the middle convolutional network, and the high convolutional network is only an example, and does not constitute a specific limitation to the embodiments of the present disclosure.
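As a hedged sketch of this division (PyTorch assumed; the grouping of stages into shallow, middle and high networks follows the example above and is only one of many possible choices), the feature maps at each semantic depth could be collected as follows:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DepthwiseFeatureExtractor(nn.Module):
    """Hypothetical helper: returns the shallow / middle / high semantic
    feature maps of a ResNet-50-style backbone, following the example
    grouping above (layers 1-2 shallow, 3-4 middle, 5 high)."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, x: torch.Tensor):
        b = self.backbone
        x = b.relu(b.bn1(b.conv1(x)))
        shallow = b.layer1(b.maxpool(x))      # low semantic depth
        middle = b.layer3(b.layer2(shallow))  # medium semantic depth
        high = b.layer4(middle)               # high semantic depth
        return shallow, middle, high

extractor = DepthwiseFeatureExtractor(resnet50())
low, mid, high = extractor(torch.randn(1, 3, 224, 224))
```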
In step S203, scale compression processing is performed on the initial semantic feature map of the target semantic depth to obtain a first semantic feature map of the target semantic depth.
The scale compression process reduces the resolution of a feature map so as to reduce its spatial size. In a specific implementation, down-sampling processing may be performed on the initial semantic feature map of the target semantic depth, and the degree of scale compression may be set according to actual needs, so as to compress the size of the feature map while avoiding the loss of useful information as much as possible.
By carrying out scale compression processing on the initial semantic feature map of the target semantic depth, the spatial scale of the compressed first semantic feature map can be reduced, the waste of computing resources caused by redundant spatial scales during subsequent model processing is avoided, and the consumption of the computing resources of the model during the subsequent processing is reduced.
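A minimal sketch of the scale compression step, assuming bilinear down-sampling in PyTorch (the patent does not fix the down-sampling method or compression ratio, so both are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def scale_compress(feature_map: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    """Reduce the spatial resolution of an (N, C, H, W) feature map.
    The 0.5 ratio and bilinear interpolation are illustrative assumptions."""
    return F.interpolate(feature_map, scale_factor=ratio,
                         mode="bilinear", align_corners=False)

initial = torch.randn(1, 256, 56, 56)  # initial semantic feature map
first = scale_compress(initial)        # first semantic feature map, 28x28
```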
In practical applications, feature maps of lower semantic depth contain more redundant information; performing scale compression on such feature maps is therefore more beneficial for reducing the computing resources consumed in model processing. For this reason, the target semantic depth in the embodiments of the present disclosure includes a low semantic depth. Based on this, in an exemplary embodiment, step S201 may include:
inputting the sample image into an object recognition model to be trained, and carrying out convolution processing on the sample image through a shallow convolution network of the object recognition model to obtain a characteristic diagram output by the shallow convolution network;
and obtaining an initial semantic feature map of the target semantic depth based on the feature map output by the shallow convolutional network.
Specifically, when obtaining the initial semantic feature map of the target semantic depth from the feature map output by the shallow convolutional network, the feature map output by the shallow convolutional network may be directly used as the initial semantic feature map of the target semantic depth.
In this embodiment, the sample image is convolved through the shallow convolutional network of the object recognition model, and the initial semantic feature map of the target semantic depth is obtained from the feature map output by the shallow convolutional network, so that the target semantic depth includes the low semantic depth. Scale compression processing can then be performed on the semantic feature map of low semantic depth, preventing the redundant information at low semantic depth from occupying computing resources in subsequent processing.
In an exemplary embodiment, the obtaining the initial semantic feature map of the target semantic depth based on the feature map output by the shallow convolutional network may include: carrying out scale compression processing on the feature map output by the shallow layer convolution network, inputting the feature map subjected to scale compression processing into a middle layer convolution network connected with the shallow layer convolution network for convolution processing, and obtaining the feature map output by the middle layer convolution network;
and taking the feature map output by the middle layer convolution network and/or the feature map output by the shallow layer convolution network as an initial semantic feature map of the target semantic depth.
In the embodiment, the feature map output by the middle-layer convolutional network is used as the initial semantic feature map of the target semantic depth, so that the spatial scale of the semantic feature map of the middle semantic depth can be reduced subsequently, and the waste of computing resources caused by redundant information of the middle semantic depth is reduced.
In addition, in order to reduce the consumption of computing resources in model processing to a greater extent, the target semantic depth may include both the low semantic depth and the medium semantic depth; that is, the feature map output by the shallow convolutional network and the feature map output by the middle convolutional network are both obtained and used as initial semantic feature maps of the target semantic depth. This reduces the waste of computing resources caused by redundant information at both the low and the medium semantic depths, and can therefore reduce the computing resources consumed in the model's recognition process to a greater extent.
In step S205, semantic features of the sample image at the target semantic depth are extracted based on the pre-trained teacher identification model, so as to obtain a second semantic feature map at the target semantic depth.
Wherein the pre-trained teacher identification model is used to identify the target object in the sample image.
The pre-trained teacher recognition model is a recognition model obtained by training on an object recognition task with a large amount of training data, and generally has high recognition accuracy. The number of model parameters in the pre-trained teacher recognition model may be greater than or equal to that of the object recognition model to be trained; that is, the object recognition model to be trained may be regarded as a student model of the pre-trained teacher recognition model.
In a specific implementation, when the target semantic depth includes a low semantic depth, the sample image may be input to a pre-trained teacher identification model, the sample image is convolved by a shallow convolutional network of the teacher identification model to obtain a feature map output by the shallow convolutional network of the teacher identification model, and the feature map output by the shallow convolutional network of the teacher identification model may be used as a second semantic feature map.
When the target semantic depth comprises a middle semantic depth, the feature graph output by the shallow convolutional network can be convolved through a middle convolutional network connected with the shallow convolutional network in the teacher identification model to obtain the feature graph output by the middle convolutional network of the teacher identification model, and the feature graph output by the middle convolutional network of the teacher identification model can be used as a second semantic feature graph.
When the target semantic depth simultaneously comprises the low semantic depth and the medium semantic depth, the feature map output by the shallow convolutional network of the teacher identification model and the feature map output by the medium convolutional network can be obtained, and both the feature map output by the shallow convolutional network and the feature map output by the medium convolutional network can be used as a second semantic feature map of the target semantic depth.
In step S207, performing frequency domain enhancement processing on the first semantic feature map based on the frequency domain feature corresponding to the first semantic feature map of the target semantic depth to obtain a first frequency domain enhanced feature map of the target semantic depth; and performing frequency domain enhancement processing on the second semantic feature map based on the frequency domain features corresponding to the second semantic feature map of the target semantic depth to obtain a second frequency domain enhanced feature map of the target semantic depth.
Wherein, the frequency domain feature can be obtained by transforming the semantic feature map to a frequency domain space.
In an exemplary embodiment, the frequency-domain enhancement processing is performed on the first semantic feature map based on the frequency-domain feature corresponding to the first semantic feature map, and obtaining the first frequency-domain enhanced feature map of the target semantic depth may include:
performing Fourier transform on the first semantic feature map to obtain a first frequency domain feature map corresponding to the first semantic feature map;
determining a first enhanced frequency domain map based on the first frequency domain feature map and a first mask map corresponding to the first frequency domain feature map;
and performing inverse Fourier transform on the first enhanced frequency domain graph to obtain a first frequency domain enhanced characteristic graph.
Wherein the Fourier transform may be a two-dimensional discrete Fourier transform, thereby transforming the image from image space to frequency domain space. Illustratively, the Fourier transform of the semantic feature map may be implemented by the following formula (1):

F(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\, e^{-j 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right)} \quad (1)
wherein, (x, y) represents a point on the semantic feature map; f (x, y) represents a semantic feature map; m represents the maximum value corresponding to the x dimension of the semantic feature map in the image coordinate system; n represents the maximum value corresponding to the y dimension of the semantic feature map in the image coordinate system; (u, v) represents a point in a frequency domain coordinate system; f (u, v) represents a frequency domain feature map.
The first mask map corresponding to the first frequency domain feature map may be obtained by performing mask processing on the first frequency domain feature map.
For example, the masking process may be implemented based on the following formula (2) (the original formula is published as an image; the piecewise form shown here is reconstructed from the variable definitions below, with the exact region condition inferred):

M_{F(u,v)} = \begin{cases} 1, & u \le r\,U \ \text{and} \ v \le r\,V \\ \alpha, & \text{otherwise} \end{cases} \quad (2)
where M_{F(u,v)} denotes the mask map corresponding to F(u,v); V represents the maximum value of the v dimension of the frequency domain feature map in the frequency domain coordinate system; U represents the maximum value of the u dimension of the frequency domain feature map in the frequency domain coordinate system; r represents an adjustment coefficient, which can be set according to actual needs and may be, for example, 0.5 or 1; and α is a mask value other than 1, which can also be set as needed.
In a specific implementation, the enhanced frequency domain graph may be obtained by calculating a product of the frequency domain feature graph and a mask graph corresponding to the frequency domain feature graph, and an exemplary calculation process may be expressed as the following formula (3):
F'(u,v) = F(u,v) \times M_{F(u,v)} \quad (3)
where F' (u, v) denotes an enhanced frequency domain map.
The inverse fourier transform is the inverse of the fourier transform, by which the transformation from the frequency domain space to the image space is possible. Illustratively, the inverse fourier transform may be a two-dimensional inverse discrete fourier transform, and in particular, based on the foregoing equations (1) to (3), the inverse fourier transform may be implemented by the following equation (4):
f'(x,y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F'(u,v)\, e^{j 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right)} \quad (4)
where f'(x, y) represents the frequency domain enhanced feature map.
Then, when f (x, y) in the above formula (1) is the first semantic feature map, the first frequency-domain enhanced feature map can be obtained by combining the above formulas (1) to (4).
Similarly, when f (x, y) in the above formula (1) is the second semantic feature map, the second frequency-domain enhanced feature map can be obtained by combining the above formulas (1) to (4). That is, the performing, in step S207, a frequency-domain enhancement process on the second semantic feature map based on the frequency-domain feature corresponding to the second semantic feature map may include:
performing Fourier transform on the second semantic feature map to obtain a second frequency domain feature map corresponding to the second semantic feature map;
determining a second enhanced frequency domain map based on the second frequency domain feature map and a second mask map corresponding to the second frequency domain feature map;
and performing inverse Fourier transform on the second enhanced frequency domain image to obtain a second frequency domain enhanced characteristic image.
In the above embodiment, a semantic feature map (the first semantic feature map or the second semantic feature map) is transformed into frequency domain features by Fourier transform, a mask map is generated for each semantic feature map, the product of the frequency domain features and the mask map is taken as the enhanced frequency domain features, and the frequency domain enhanced feature map is obtained by performing inverse Fourier transform on the enhanced frequency domain features.
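Putting formulas (1) to (4) together, a hedged Python sketch (PyTorch assumed; the mask region condition, r, and α follow the inferred form of formula (2) and are illustrative choices):

```python
import torch

def frequency_domain_enhance(fmap: torch.Tensor, r: float = 0.5,
                             alpha: float = 0.1) -> torch.Tensor:
    """Sketch of formulas (1)-(4): FFT -> mask -> inverse FFT.
    fmap is an (N, C, H, W) semantic feature map; r and alpha are
    illustrative values, and the mask region of formula (2) is inferred."""
    freq = torch.fft.fft2(fmap)                       # formula (1)
    H, W = fmap.shape[-2:]
    u = torch.arange(H).view(-1, 1).expand(H, W)
    v = torch.arange(W).view(1, -1).expand(H, W)
    # formula (2): value 1 inside the region controlled by r, alpha outside
    mask = torch.where((u <= r * (H - 1)) & (v <= r * (W - 1)),
                       torch.tensor(1.0), torch.tensor(alpha))
    enhanced_freq = freq * mask                       # formula (3)
    return torch.fft.ifft2(enhanced_freq).real        # formula (4)

first_enhanced = frequency_domain_enhance(torch.randn(1, 256, 28, 28))
```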
In step S209, the object recognition model is trained until a preset training end condition is reached based on a difference between the first frequency-domain enhanced feature map of the target semantic depth and the second frequency-domain enhanced feature map of the target semantic depth.
Specifically, using a preset loss function, a first loss value may be determined based on the difference between the first frequency domain enhanced feature map of the target semantic depth and the second frequency domain enhanced feature map of the target semantic depth. The model parameters of the object recognition model to be trained are adjusted based on the first loss value, and iterative training continues with the adjusted model parameters until a preset training end condition is reached; the object recognition model corresponding to the model parameters at the end of training may be used as the final object recognition model deployed online.
The difference between the first frequency domain enhanced feature map of the target semantic depth and the second frequency domain enhanced feature map of the target semantic depth can be embodied by the distance between the first frequency domain enhanced feature map of the target semantic depth and the second frequency domain enhanced feature map of the target semantic depth, the difference is larger when the distance is larger, and the difference is smaller when the distance is smaller. Wherein the distance may be a euclidean distance, a manhattan distance, or the like.
The preset loss function for calculating the first loss value may be a minimum absolute value deviation loss (i.e., L1 loss function) or a minimum squared error loss (i.e., L2 norm loss function). The L1 loss function is determined by minimizing a sum of absolute differences between the first frequency-domain enhanced feature map of the target semantic depth and the second frequency-domain enhanced feature map of the target semantic depth. The L2 norm loss function is calculated by minimizing a sum of squares of differences between a first frequency-domain enhancement feature map of the target semantic depth and a second frequency-domain enhancement feature map of the target semantic depth.
Illustratively, the first loss value L_1 can be expressed by the following formula (5) (the original formula is published as an image; the L2 form is shown here):

L_1 = \frac{1}{n} \sum_{i=1}^{n} \big\| f'_1(x,y) - f'_2(x,y) \big\|_2^2 \quad (5)
where n represents the number of sample images; f'_1(x, y) represents the first frequency domain enhanced feature map of the target semantic depth; and f'_2(x, y) represents the second frequency domain enhanced feature map of the target semantic depth.
In the embodiment of the present disclosure, when the model parameter of the object recognition model to be trained is adjusted based on the first loss value, the model parameter of the object recognition model to be trained may be updated in a random gradient descent manner. The preset training end condition may be that the first loss value reaches a preset minimum loss value, and the preset minimum loss value may be set according to actual experience; of course, the preset training end condition may also be that the iteration number reaches a preset iteration number threshold, and the preset iteration number threshold may also be set according to practical experience, for example, may be 100, and so on.
In practical applications, the dimension of the second semantic feature map of the target semantic depth obtained from the teacher recognition model and the dimension of the first semantic feature map of the target semantic depth obtained from the object recognition model may be the same or different. When they differ, the dimension of the teacher recognition model's second semantic feature map is usually greater than that of the object recognition model's first semantic feature map. Therefore, when determining the first loss value, dimension adjustment may be performed first to bring the first frequency domain enhanced feature map and the second frequency domain enhanced feature map of the target semantic depth to the same dimension, and the preset loss function is then used to determine the first loss value based on the difference between the two. The dimension adjustment may be implemented by performing convolution processing on the feature map through convolution layers of corresponding sizes.
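A hedged sketch of this dimension adjustment and of computing the first loss value (the 1x1 convolution adapter, the bilinear spatial alignment, and the MSE form are illustrative assumptions; the patent only requires that the two maps be brought to the same dimension and compared with an L1 or L2 loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative channel counts: student map 128 channels, teacher map 256.
adapter = nn.Conv2d(128, 256, kernel_size=1)  # assumed 1x1-conv dimension adapter

def first_loss(student_enhanced: torch.Tensor,
               teacher_enhanced: torch.Tensor) -> torch.Tensor:
    """Formula (5) in its L2 form; F.l1_loss would give the L1 form."""
    aligned = adapter(student_enhanced)
    # Spatially align the teacher map to the (compressed) student map.
    teacher_resized = F.interpolate(teacher_enhanced, size=aligned.shape[-2:],
                                    mode="bilinear", align_corners=False)
    return F.mse_loss(aligned, teacher_resized)

loss1 = first_loss(torch.randn(4, 128, 28, 28), torch.randn(4, 256, 56, 56))
```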
According to the technical solution of the embodiments of the present disclosure, the semantic features of the sample image at the target semantic depth are extracted through the object recognition model, and scale compression processing is performed on them to obtain the first semantic feature map, which reduces the computing resources consumed by the object recognition model in subsequent recognition processing. To ensure that the trained object recognition model retains high recognition accuracy, the semantic features of the sample image at the target semantic depth are also extracted with the pre-trained teacher recognition model to obtain the second semantic feature map; frequency domain enhancement processing is performed on the first semantic feature map and the second semantic feature map respectively, and the object recognition model is trained based on the difference between the resulting first and second frequency domain enhanced feature maps. In this way, the object recognition model learns the enhanced frequency domain information in the feature maps extracted by the teacher recognition model, narrowing the distance between the object recognition model and the teacher recognition model. The consumption of computing resources is thus effectively reduced on the premise of ensuring model recognition accuracy, which facilitates deploying the trained object recognition model on embedded devices and in other resource-constrained scenarios.
In an exemplary embodiment, in order to further improve the recognition accuracy of the trained object recognition model, as shown in fig. 4, the step S209 may include:
in step S401, a first loss value is determined based on a difference between a first frequency-domain enhanced feature map of a target semantic depth and a second frequency-domain enhanced feature map of the target semantic depth.
In step S403, object recognition processing is performed on the sample image based on the first semantic feature map through the object recognition model to obtain a first recognition result, and a second loss value is determined based on the first recognition result and the reference category label corresponding to the sample image.
in step S405, the teacher identification model performs object identification processing on the sample image based on the second semantic feature map, so as to obtain a second identification result.
In step S407, a third loss value is determined based on the first recognition result and the second recognition result.
In step S409, based on the first loss value, the second loss value, and the third loss value, a model parameter of the object recognition model is adjusted until a preset training end condition is reached.
The preset loss function for calculating the first loss value may be a minimum absolute deviation loss (i.e., an L1 loss function) or a minimum squared error loss (i.e., an L2 norm loss function). The L1 loss function minimizes the sum of absolute differences between the first frequency domain enhanced feature map of the target semantic depth and the second frequency domain enhanced feature map of the target semantic depth, while the L2 norm loss function minimizes the sum of squared differences between the two. In a specific implementation, the first loss value L_1 may be calculated based on the foregoing formula (5).
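As an illustration, both loss variants can be expressed in a few lines of PyTorch; the tensor shapes below are placeholders, and the two maps are assumed to have already been adjusted to the same dimension:

```python
import torch
import torch.nn.functional as F

# Placeholder tensors standing in for the first and second frequency domain
# enhanced feature maps at the same semantic depth.
student_map = torch.randn(8, 256, 28, 28)
teacher_map = torch.randn(8, 256, 28, 28)

loss_l1 = F.l1_loss(student_map, teacher_map)   # minimum absolute deviation (L1) variant
loss_l2 = F.mse_loss(student_map, teacher_map)  # minimum squared error (L2) variant
```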
The first recognition result indicates the probability that the sample image predicted by the object recognition model belongs to each reference class, and the loss function used for calculating the second loss value may be a cross-entropy loss function. Exemplarily, the second loss value L_2 can be obtained by the following equation (6):
L_2 = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{N} y_{i,c} \log\left(p_{i,c}\right)    (6)
where N denotes the number of reference categories, n denotes the number of sample images, y_{i,c} denotes the reference category label of the i-th sample image for category c, and p_{i,c} denotes the probability predicted by the object recognition model that the i-th sample image belongs to category c (the first recognition result).
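For illustration, equation (6) corresponds to a standard cross-entropy loss, which might be computed as in the following sketch; the batch size, number of categories and tensor values are assumptions:

```python
import torch
import torch.nn.functional as F

n, N = 8, 10                           # assumed: n sample images, N reference categories
logits = torch.randn(n, N)             # object recognition model outputs before softmax
labels = torch.randint(0, N, (n,))     # reference category labels

# F.cross_entropy applies log-softmax and averages over the n images,
# matching L_2 = -(1/n) * sum_i sum_c y_{i,c} * log(p_{i,c}).
loss_2 = F.cross_entropy(logits, labels)
```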
The second recognition result indicates the probability that the sample image predicted by the teacher recognition model belongs to each reference class, and the loss function used for calculating the third loss value may be a KL divergence function. Exemplarily, the third loss value L_3 can be obtained by the following formula (7):
L_3 = \frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{N} p'_{i,c} \log\frac{p'_{i,c}}{p_{i,c}}    (7)
where p'_{i,c} denotes the probability predicted by the teacher recognition model that the i-th sample image belongs to category c (the second recognition result), and p_{i,c} denotes the corresponding probability predicted by the object recognition model (the first recognition result).
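Similarly, a hedged sketch of formula (7) using PyTorch's KL divergence; the logits are placeholders:

```python
import torch
import torch.nn.functional as F

n, N = 8, 10
student_logits = torch.randn(n, N)     # object recognition model outputs
teacher_logits = torch.randn(n, N)     # teacher recognition model outputs

# F.kl_div expects log-probabilities for the student and probabilities for
# the teacher; 'batchmean' divides by n, matching the averaged form of (7).
loss_3 = F.kl_div(F.log_softmax(student_logits, dim=1),
                  F.softmax(teacher_logits, dim=1),
                  reduction="batchmean")
```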
When adjusting the model parameters of the object recognition model based on the first loss value, the second loss value and the third loss value until the preset training end condition is reached, weight values may be assigned to the three loss values respectively; a total loss value is then obtained as the weighted sum of the three loss values and their corresponding weights, and the model parameters of the object recognition model are adjusted based on the total loss value until the preset training end condition is reached.
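A minimal sketch of this weighted combination follows; the weight values and scalar losses are placeholders, since the disclosure leaves the concrete weights open:

```python
import torch

# Placeholder scalars standing in for the first, second and third loss values
# computed in the sketches above.
loss_1 = torch.tensor(0.8, requires_grad=True)
loss_2 = torch.tensor(1.2, requires_grad=True)
loss_3 = torch.tensor(0.4, requires_grad=True)

w1, w2, w3 = 1.0, 1.0, 0.5             # assumed weight values

total_loss = w1 * loss_1 + w2 * loss_2 + w3 * loss_3
total_loss.backward()                  # gradients then drive the parameter adjustment
```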
In the above embodiment, the first loss value narrows the distance between the frequency domain enhanced features of the teacher recognition model and those of the object recognition model, realizing information transfer; meanwhile, the second and third loss values narrow the distance between the final classification features of the teacher recognition model and those of the object recognition model, realizing more effective information transfer. The object recognition model trained on the first, second and third loss values thus attains a recognition accuracy close to that of the teacher recognition model, further improving the recognition accuracy of the object recognition model while reducing its consumption of computing resources.
In an exemplary embodiment, determining the first loss value based on the difference between the first frequency domain enhanced feature map of the target semantic depth and the second frequency domain enhanced feature map of the target semantic depth may include:
determining a first sub-loss value based on a difference between a first frequency domain enhancement feature map corresponding to the shallow convolutional network and a second frequency domain enhancement feature map corresponding to the shallow convolutional network;
determining a second sub-loss value based on a difference between a first frequency domain enhancement feature map corresponding to the middle layer convolutional network and a second frequency domain enhancement feature map corresponding to the middle layer convolutional network;
determining the first loss value based on the first sub-loss value and the second sub-loss value.
Specifically, a preset loss function may be used to determine a first sub-loss value based on the difference between the first frequency domain enhanced feature map corresponding to the shallow convolutional network in the object recognition model and the second frequency domain enhanced feature map corresponding to the shallow convolutional network in the teacher recognition model, and to determine a second sub-loss value based on the difference between the first frequency domain enhanced feature map corresponding to the middle-layer convolutional network in the object recognition model and the second frequency domain enhanced feature map corresponding to the middle-layer convolutional network in the teacher recognition model. The first loss value is then determined by combining the first sub-loss value and the second sub-loss value; for example, the sum of the two sub-loss values may be taken as the first loss value.
The preset loss function for calculating the first sub-loss value and the second sub-loss value may be a minimum absolute value deviation loss (i.e., L1 loss function) or a minimum square error loss (i.e., L2 norm loss function).
In the above embodiment, by compressing the spatial scale of the shallow and middle-layer features and transferring feature information through the frequency domain enhanced features corresponding to those layers, the recognition accuracy of the trained object recognition model is improved while the consumption of computing resources by the object recognition model is reduced to a greater extent.
To further improve the recognition accuracy of the object recognition model while compressing the consumption of computing resources by the trained model during recognition processing, in an exemplary embodiment, determining the first loss value based on the first sub-loss value and the second sub-loss value may further include:
determining a third sub-loss value based on the difference between the frequency domain enhancement feature map corresponding to the higher-layer convolutional network in the object recognition model and the frequency domain enhancement feature map corresponding to the higher-layer convolutional network in the teacher recognition model;
determining a first loss value based on the first sub-loss value, the second sub-loss value, and the third sub-loss value;
the frequency domain enhanced feature map corresponding to the high-layer convolutional network is obtained based on the feature map output by the high-layer convolutional network, and the feature map output by the high-layer convolutional network is obtained by performing scale compression processing on the feature map output by the connected middle-layer convolutional network.
Specifically, after the sample image is input into the object recognition model, feature extraction is first performed through the shallow convolutional network of the object recognition model to obtain the feature map output by the shallow convolutional network. Scale compression processing is applied to this feature map, and the compressed feature map serves as the input of the middle-layer convolutional network of the object recognition model, which extracts features to produce the feature map output by the middle-layer convolutional network. Scale compression processing is applied again, and the compressed feature map serves as the input of the high-layer convolutional network of the object recognition model, which extracts features to produce the feature map output by the high-layer convolutional network. Frequency domain enhancement processing is then performed on this feature map to obtain the frequency domain enhanced feature map corresponding to the high-layer convolutional network in the object recognition model.
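The following sketch illustrates this staged pipeline with scale compression between stages; the layer configurations are assumptions, and max pooling stands in for the scale compression operator, whose concrete form is not fixed here:

```python
import torch
import torch.nn as nn

class StudentBackbone(nn.Module):
    """Illustrative three-stage backbone for the object recognition model."""
    def __init__(self):
        super().__init__()
        self.shallow = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
        self.middle = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        self.high = nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.ReLU())
        self.compress = nn.MaxPool2d(2)  # stand-in for scale compression processing

    def forward(self, x):
        f_shallow = self.shallow(x)                       # feature map output by the shallow network
        f_middle = self.middle(self.compress(f_shallow))  # compressed, then middle-layer features
        f_high = self.high(self.compress(f_middle))       # compressed, then high-layer features
        return f_shallow, f_middle, f_high

feats = StudentBackbone()(torch.randn(1, 3, 224, 224))
```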
Likewise, after the sample image is input into the teacher recognition model, feature extraction is performed through the shallow convolutional network of the teacher recognition model to obtain the feature map output by the shallow convolutional network; this feature map serves as the input of the middle-layer convolutional network of the teacher recognition model, which extracts features to produce the feature map output by the middle-layer convolutional network; that feature map in turn serves as the input of the high-layer convolutional network, which extracts features to produce the feature map output by the high-layer convolutional network of the teacher recognition model (unlike in the object recognition model, no scale compression is applied between these stages). Frequency domain enhancement processing is then performed on this feature map to obtain the frequency domain enhanced feature map corresponding to the high-layer convolutional network in the teacher recognition model.
For performing the frequency domain enhancement processing on the feature map output by the high-level convolutional network, reference may be made to the related description of step S207 in the embodiment of the method shown in fig. 2, which is not described herein again.
Wherein, the preset loss function for calculating the third sub-loss value can be a minimum absolute value deviation loss (i.e. L1 loss function) or a minimum square error loss (i.e. L2 norm loss function).
The first loss value may be determined based on the first, second and third sub-loss values by taking their sum as the first loss value.
In the above embodiment, information transfer through the frequency domain enhanced features corresponding to the high-layer convolutional network is added on top of the shallow and middle-layer features, which can further improve the recognition accuracy of the object recognition model while reducing the consumption of computing resources by the trained model during recognition processing.
In order to more clearly understand the technical solution of the embodiment of the present disclosure, a specific example is described below with reference to fig. 5.
As shown in fig. 5, the sample image is subjected to object recognition processing by the object recognition model to be trained and by the teacher recognition model, respectively. For the object recognition model to be trained, a low-level semantic feature map f_ls of the sample image is extracted through the shallow convolutional network layer of the object recognition model, and scale compression processing is applied to it to obtain a compressed low-level semantic feature map f_lps. The compressed map f_lps is then input to the middle-layer convolutional network layer of the object recognition model for feature extraction, yielding a middle-layer semantic feature map f_ms, which is scale-compressed to obtain a compressed middle-layer semantic feature map f_mps. The compressed map f_mps is further input to the high-layer convolutional network layer of the object recognition model for feature extraction, yielding a high-level semantic feature map f_hs, which is finally input to the classifier of the object recognition model for class prediction, producing the first recognition result P_s corresponding to the sample image.
For the teacher recognition model, a low-level semantic feature map f_lt of the sample image is extracted through the shallow convolutional network layer of the teacher recognition model; f_lt is input to the middle-layer convolutional network layer of the teacher recognition model for feature extraction, yielding a middle-layer semantic feature map f_mt; f_mt is input to the high-layer convolutional network layer of the teacher recognition model for feature extraction, yielding a high-level semantic feature map f_ht, which is finally input to the classifier of the teacher recognition model for class prediction, producing the second recognition result P_t corresponding to the sample image.
In addition, during training, information transfer between the teacher recognition model and the object recognition model is realized through a first information transfer module and a second information transfer module.
The first information transfer module is used to narrow the distance between the frequency domain enhanced features of the teacher recognition model and those of the object recognition model. As shown in fig. 5, first information transfer modules may be provided for the shallow convolutional network layer, the middle-layer convolutional network layer and the high-layer convolutional network layer, respectively. The input of the first information transfer module corresponding to the shallow convolutional network layer comprises the compressed low-level semantic feature map and the low-level semantic feature map of the teacher recognition model; the input of the module corresponding to the middle-layer convolutional network layer comprises the compressed middle-layer semantic feature map and the middle-layer semantic feature map of the teacher recognition model; and the input of the module corresponding to the high-layer convolutional network layer comprises the high-level semantic feature map of the object recognition model and the high-level semantic feature map of the teacher recognition model.
Specifically, fig. 6 is a schematic diagram of the first information transfer module. A semantic feature map f_s of the object recognition model is transformed by DFT (discrete Fourier transform) into the corresponding frequency domain feature map f_fs; f_fs is multiplied by its corresponding mask map M_fs to obtain an enhanced frequency domain map f_wfs; f_wfs is transformed by IDFT (inverse discrete Fourier transform) into the corresponding frequency domain enhanced feature map f_fes; and f_fes is then upsampled (which may be implemented by convolution) to extend its scale to a dimension matching the frequency domain enhanced feature map of the teacher recognition model, yielding f_fes_up. Similarly, a semantic feature map f_t of the teacher recognition model is transformed by DFT into the corresponding frequency domain feature map f_ft; f_ft is multiplied by its corresponding mask map M_ft to obtain an enhanced frequency domain map f_wft; and f_wft is transformed by IDFT into the corresponding frequency domain enhanced feature map f_fet. Finally, the loss value of the first information transfer module is determined with an L1 loss function based on the difference between the object recognition model's frequency domain enhanced feature map f_fes_up and the teacher recognition model's frequency domain enhanced feature map f_fet.
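The following is a minimal PyTorch sketch of the transfer path in fig. 6; the tensor shapes and mask values are placeholders, and a transposed convolution is used as one possible realization of the upsampling-by-convolution step:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def frequency_enhance(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """DFT, elementwise mask multiplication, then inverse DFT; how the mask
    map is produced is not reproduced in this sketch."""
    freq = torch.fft.fft2(feat)             # f_fs / f_ft: frequency domain feature map
    enhanced = freq * mask                  # f_wfs / f_wft: enhanced frequency domain map
    return torch.fft.ifft2(enhanced).real   # f_fes / f_fet: frequency domain enhanced feature map

f_s = torch.randn(1, 128, 14, 14)           # student semantic feature map (assumed shape)
f_t = torch.randn(1, 256, 28, 28)           # teacher semantic feature map (assumed shape)
m_s = torch.rand(1, 128, 14, 14)            # placeholder mask maps
m_t = torch.rand(1, 256, 28, 28)

f_fes = frequency_enhance(f_s, m_s)
f_fet = frequency_enhance(f_t, m_t)

# Upsampling by convolution: extend the student's scale to the teacher's.
upsample = nn.ConvTranspose2d(128, 256, kernel_size=2, stride=2)
f_fes_up = upsample(f_fes)                  # now 1 x 256 x 28 x 28

loss_first_module = F.l1_loss(f_fes_up, f_fet)
```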
The second information transfer module is used to narrow the distance between the final classification features of the teacher recognition model and those of the object recognition model. In a specific implementation, the second information transfer module may be realized by classification knowledge distillation: its loss value is determined from the difference between the first recognition result of the object recognition model and the second recognition result of the teacher recognition model, together with the difference between the first recognition result of the object recognition model and the reference class label of the corresponding sample image.
The total loss for training the object recognition model is then the sum of the loss values of all first information transfer modules and the loss values of all second information transfer modules. Model parameters of the object recognition model are updated by gradient descent based on the total loss, and the object recognition model is trained iteratively to obtain the finally trained object recognition model.
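Sketched as one training loop, with a dummy model and a placeholder loss standing in for the components described above:

```python
import torch
import torch.nn as nn

student_model = nn.Conv2d(3, 8, 3)  # placeholder for the object recognition model
optimizer = torch.optim.SGD(student_model.parameters(), lr=0.01)

for step in range(100):             # iterate until the preset end condition is met
    images = torch.randn(4, 3, 32, 32)
    out = student_model(images)
    # Placeholder total loss: in the scheme above this is the sum of the loss
    # values of all first and second information transfer modules.
    total_loss = out.pow(2).mean()
    optimizer.zero_grad()
    total_loss.backward()           # gradient descent update of the model parameters
    optimizer.step()
```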
It can be understood that the trained object recognition model can be deployed in an embedded device such as a terminal, so as to recognize a target object in an image to be processed by using the object recognition model.
In a specific implementation, the image to be processed may be input into the object recognition model. Feature extraction is performed on the image through the shallow convolutional network layer of the object recognition model to obtain a low-level semantic feature map, which is scale-compressed to obtain a compressed low-level semantic feature map. The compressed low-level semantic feature map is input into the middle-layer convolutional network layer for feature extraction to obtain a middle-layer semantic feature map, which is scale-compressed to obtain a compressed middle-layer semantic feature map. The compressed middle-layer semantic feature map is input into the high-layer convolutional network layer for feature extraction to obtain a high-level semantic feature map, which is finally input into the classifier for classification to obtain the recognition result corresponding to the image to be processed, the recognition result indicating the category of the target object in the image.
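Deployment-time inference might then look like the following sketch, reusing the illustrative StudentBackbone defined above; the classifier head and class count are likewise assumptions:

```python
import torch
import torch.nn as nn

backbone = StudentBackbone()        # illustrative backbone from the earlier sketch
classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 10))

image = torch.randn(1, 3, 224, 224)           # image to be processed
with torch.no_grad():
    _, _, f_high = backbone(image)            # high-level semantic feature map
    logits = classifier(f_high)
    category = logits.argmax(dim=1)           # category of the target object
```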
Fig. 7 is a block diagram illustrating an architecture of an object recognition model training apparatus according to an exemplary embodiment. Referring to fig. 7, the object recognition model training apparatus 700 includes:
a first feature extraction unit 710, configured to extract semantic features of the sample image at a target semantic depth based on an object recognition model to be trained, to obtain an initial semantic feature map of the target semantic depth; the object identification model is used for identifying a target object in the sample image;
the scale compression unit 720 is configured to perform scale compression processing on the initial semantic feature map of the target semantic depth to obtain a first semantic feature map of the target semantic depth;
the second feature extraction unit 730 is configured to execute extraction of semantic features of the sample image at the target semantic depth based on a pre-trained teacher recognition model to obtain a second semantic feature map of the target semantic depth; the teacher identification model is used for identifying a target object in the sample image;
a frequency domain enhancement processing unit 740 configured to perform frequency domain enhancement processing on the first semantic feature map based on the frequency domain feature corresponding to the first semantic feature map, so as to obtain a first frequency domain enhancement feature map of the target semantic depth; performing frequency domain enhancement processing on the second semantic feature map based on the frequency domain feature corresponding to the second semantic feature map to obtain a second frequency domain enhanced feature map of the target semantic depth;
a training unit 750 configured to train the object recognition model based on a difference between the first frequency domain enhanced feature map of the target semantic depth and the second frequency domain enhanced feature map of the target semantic depth until a preset training end condition is reached, so as to obtain a trained object recognition model.
In an exemplary embodiment, the sample image corresponds to a reference category label; the training unit 750 includes:
a first loss determination unit configured to perform determining a first loss value based on a difference between the first frequency-domain enhancement feature map and the second frequency-domain enhancement feature map;
a second loss determination unit configured to perform object recognition processing on the sample image based on the first semantic feature map through the object recognition model to obtain a first recognition result, and determine a second loss value based on the first recognition result and the reference class label corresponding to the sample image;
a first identification unit configured to perform object identification processing on the sample image based on the second semantic feature map by the teacher identification model, resulting in a second identification result;
a third loss determination unit configured to perform determination of a third loss value based on the first recognition result and the second recognition result;
a parameter adjusting unit configured to perform adjusting model parameters of the object recognition model based on the first loss value, the second loss value and the third loss value until a preset training end condition is reached.
In an exemplary embodiment, the first feature extraction unit includes:
the first convolution unit is configured to input the sample image into an object recognition model to be trained, and carry out convolution processing on the sample image through a shallow convolution network of the object recognition model to obtain a feature map output by the shallow convolution network;
the characteristic extraction subunit is configured to execute a characteristic graph output based on the shallow convolutional network to obtain an initial semantic characteristic graph of the target semantic depth; the shallow convolutional network comprises a preset number of convolutional layers close to an input layer in the object recognition model.
In an exemplary embodiment, the feature extraction subunit is specifically configured to perform scale compression processing on the feature map output by the shallow convolutional network, and input the feature map after the scale compression processing into a middle convolutional network connected to the shallow convolutional network for convolution processing, so as to obtain a feature map output by the middle convolutional network; and taking the feature map output by the middle layer convolution network and/or the feature map output by the shallow layer convolution network as the initial semantic feature map of the target semantic depth.
In an exemplary embodiment, the first loss determining unit includes:
a first sub-loss determination unit configured to perform determining a first sub-loss value based on a difference between a first frequency-domain enhancement feature map corresponding to the shallow convolutional network and a second frequency-domain enhancement feature map corresponding to the shallow convolutional network;
a second sub-loss determination unit configured to perform determining a second sub-loss value based on a difference between a first frequency-domain enhancement feature map corresponding to the middle layer convolutional network and a second frequency-domain enhancement feature map corresponding to the middle layer convolutional network;
a fourth loss determination unit configured to perform determining the first loss value based on the first sub-loss value and the second sub-loss value.
In an exemplary embodiment, the fourth loss determining unit includes:
a third sub-loss determination unit configured to perform determination of a third sub-loss value based on a difference between a frequency-domain enhancement feature map corresponding to a higher-level convolutional network in the object recognition model and a frequency-domain enhancement feature map corresponding to a higher-level convolutional network in the teacher recognition model;
a fifth loss determination unit configured to perform determining the first loss value based on the first sub-loss value, the second sub-loss value, and the third sub-loss value;
the frequency domain enhancement characteristic diagram corresponding to the high-level convolutional network is obtained based on the characteristic diagram output by the high-level convolutional network, and the characteristic diagram output by the high-level convolutional network is obtained by carrying out scale compression processing on the connected characteristic diagram output by the middle-level convolutional network.
In an exemplary embodiment, the frequency domain enhancement processing unit 740 includes:
the first frequency domain enhancement processing unit is configured to perform Fourier transform on the first semantic feature map to obtain a first frequency domain feature map corresponding to the first semantic feature map; determining a first enhanced frequency domain map based on the first frequency domain feature map and a first mask map corresponding to the first frequency domain feature map; carrying out inverse Fourier transform on the first frequency domain enhancement image to obtain a first frequency domain enhancement characteristic image;
the second frequency domain enhancement processing unit is used for carrying out Fourier transform on the second semantic feature map to obtain a second frequency domain feature map corresponding to the second semantic feature map; determining a second enhanced frequency domain map based on the second frequency domain feature map and a second mask map corresponding to the second frequency domain feature map; and performing inverse Fourier transform on the second enhanced frequency domain image to obtain a second frequency domain enhanced characteristic image.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
In an exemplary embodiment, there is also provided an electronic device, comprising a processor; a memory for storing processor-executable instructions; the processor is configured to implement the object recognition model training method provided in any of the above embodiments when executing the instructions stored in the memory.
The electronic device may be a terminal, a server or a similar computing device. Taking a server as an example, fig. 8 is a block diagram of an electronic device for object recognition model training according to an exemplary embodiment. As shown in fig. 8, the server 800 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 810 (the processor 810 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 830 for storing data, and one or more storage media 820 (e.g., one or more mass storage devices) for storing an application 823 or data 822. The memory 830 and the storage medium 820 may be transitory or persistent storage. The program stored in the storage medium 820 may include one or more modules, each of which may include a series of instruction operations for the server. Further, the central processor 810 may be configured to communicate with the storage medium 820 to execute the series of instruction operations in the storage medium 820 on the server 800. The server 800 may also include one or more power supplies 860, one or more wired or wireless network interfaces 850, one or more input/output interfaces 840, and/or one or more operating systems 821, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so forth.
The input/output interface 840 may be used to receive or transmit data via a network. A specific example of such a network is a wireless network provided by a communication provider of the server 800. In one example, the input/output interface 840 includes a network interface controller (NIC) that may be connected to other network devices via a base station so as to communicate with the internet. In one example, the input/output interface 840 may be a radio frequency (RF) module, which is used to communicate with the internet wirelessly.
It will be understood by those skilled in the art that the structure shown in fig. 8 is only an illustration and is not intended to limit the structure of the electronic device. For example, server 800 may also include more or fewer components than shown in FIG. 8, or have a different configuration than shown in FIG. 8.
In an exemplary embodiment, there is also provided a computer-readable storage medium comprising instructions, such as the memory 830 comprising instructions, executable by the processor 810 of the server 800 to perform the method described above. Alternatively, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product comprising a computer program/instructions which, when executed by a processor, implement the object recognition model training method provided in any of the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An object recognition model training method, comprising:
extracting semantic features of a sample image at a target semantic depth based on an object recognition model to be trained to obtain an initial semantic feature map of the target semantic depth; the object identification model is used for identifying a target object in the sample image;
carrying out scale compression processing on the initial semantic feature map of the target semantic depth to obtain a first semantic feature map of the target semantic depth;
extracting semantic features of the sample image at the target semantic depth based on a pre-trained teacher recognition model to obtain a second semantic feature map of the target semantic depth; the teacher identification model is used for identifying a target object in the sample image;
performing frequency domain enhancement processing on the first semantic feature map based on the frequency domain feature corresponding to the first semantic feature map to obtain a first frequency domain enhancement feature map of the target semantic depth; performing frequency domain enhancement processing on the second semantic feature map based on the frequency domain feature corresponding to the second semantic feature map to obtain a second frequency domain enhancement feature map of the target semantic depth;
training the object recognition model based on the difference between the first frequency domain enhanced feature map of the target semantic depth and the second frequency domain enhanced feature map of the target semantic depth until a preset training end condition is reached, and obtaining a trained object recognition model.
2. The method of claim 1, wherein the sample image corresponds to a reference category label; and the training the object recognition model based on the difference between the first frequency domain enhanced feature map and the second frequency domain enhanced feature map until reaching a preset training end condition comprises:
determining a first loss value based on a difference between the first frequency-domain enhancement feature map and the second frequency-domain enhancement feature map;
performing object recognition processing on the sample image based on the first semantic feature map through the object recognition model to obtain a first recognition result; determining a second loss value based on the first recognition result and the reference class label corresponding to the sample image;
performing object recognition processing on the sample image based on the second semantic feature map through the teacher recognition model to obtain a second recognition result;
determining a third loss value based on the first recognition result and the second recognition result;
and adjusting the model parameters of the object recognition model based on the first loss value, the second loss value and the third loss value until a preset training end condition is reached.
3. The method according to claim 2, wherein the extracting semantic features of the sample image at a target semantic depth based on the object recognition model to be trained to obtain an initial semantic feature map of the target semantic depth comprises:
inputting the sample image into an object recognition model to be trained, and carrying out convolution processing on the sample image through a shallow convolution network of the object recognition model to obtain a feature map output by the shallow convolution network; the shallow convolutional network comprises a preset number of convolutional layers close to the input layer in the object recognition model;
and obtaining an initial semantic feature map of the target semantic depth based on the feature map output by the shallow convolutional network.
4. The method according to claim 3, wherein the obtaining an initial semantic feature map of the target semantic depth based on the feature map output by the shallow convolutional network comprises:
carrying out scale compression processing on the feature map output by the shallow layer convolution network, inputting the feature map subjected to scale compression processing into a middle layer convolution network connected with the shallow layer convolution network for convolution processing, and obtaining the feature map output by the middle layer convolution network;
and taking the feature map output by the middle layer convolution network and/or the feature map output by the shallow layer convolution network as the initial semantic feature map of the target semantic depth.
5. The method of claim 4, wherein determining a first loss value based on a difference between the first frequency-domain enhanced feature map of the target semantic depth and the second frequency-domain enhanced feature map of the target semantic depth comprises:
determining a first sub-loss value based on a difference between a first frequency domain enhancement feature map corresponding to the shallow convolutional network and a second frequency domain enhancement feature map corresponding to the shallow convolutional network;
determining a second sub-loss value based on a difference between a first frequency domain enhancement feature map corresponding to the middle layer convolutional network and a second frequency domain enhancement feature map corresponding to the middle layer convolutional network;
determining the first penalty value based on the first sub-penalty value and the second sub-penalty value.
6. The method of claim 5, wherein determining the first penalty value based on the first sub-penalty value and the second sub-penalty value comprises:
determining a third sub-loss value based on a difference between a frequency domain enhanced feature map corresponding to a higher-level convolutional network in the object recognition model and a frequency domain enhanced feature map corresponding to a higher-level convolutional network in the teacher recognition model;
determining the first loss value based on the first, second, and third sub-loss values;
wherein the frequency domain enhanced feature map corresponding to the high-level convolutional network is obtained based on the feature map output by the high-level convolutional network, and the feature map output by the high-level convolutional network is obtained by performing scale compression processing on the feature map output by the connected middle-level convolutional network.
7. The method according to any one of claims 1 to 6, wherein the performing frequency domain enhancement processing on the first semantic feature map based on the frequency domain feature corresponding to the first semantic feature map to obtain a first frequency domain enhanced feature map comprises:
performing Fourier transform on the first semantic feature map to obtain a first frequency domain feature map corresponding to the first semantic feature map;
determining a first enhanced frequency domain map based on the first frequency domain feature map and a first mask map corresponding to the first frequency domain feature map; and performing inverse Fourier transform on the first enhanced frequency domain map to obtain the first frequency domain enhanced feature map;
and the performing frequency domain enhancement processing on the second semantic feature map based on the frequency domain feature corresponding to the second semantic feature map to obtain a second frequency domain enhanced feature map comprises:
performing Fourier transform on the second semantic feature map to obtain a second frequency domain feature map corresponding to the second semantic feature map;
determining a second enhanced frequency domain map based on the second frequency domain feature map and a second mask map corresponding to the second frequency domain feature map; and performing inverse Fourier transform on the second enhanced frequency domain map to obtain the second frequency domain enhanced feature map.
8. An object recognition model training apparatus, comprising:
the first feature extraction unit is configured to extract semantic features of a sample image at a target semantic depth based on an object recognition model to be trained to obtain an initial semantic feature map of the target semantic depth; the object identification model is used for identifying a target object in the sample image;
the scale compression unit is configured to perform scale compression processing on the initial semantic feature map of the target semantic depth to obtain a first semantic feature map of the target semantic depth;
the second feature extraction unit is configured to extract semantic features of the sample image at the target semantic depth based on a pre-trained teacher identification model to obtain a second semantic feature map of the target semantic depth; the teacher identification model is used for identifying a target object in the sample image;
a frequency domain enhancement processing unit configured to perform frequency domain enhancement processing on the first semantic feature map based on a frequency domain feature corresponding to the first semantic feature map to obtain a first frequency domain enhancement feature map of the target semantic depth; performing frequency domain enhancement processing on the second semantic feature map based on the frequency domain features corresponding to the second semantic feature map to obtain a second frequency domain enhanced feature map of the target semantic depth;
and the training unit is configured to execute a difference between a first frequency domain enhanced feature map based on the target semantic depth and a second frequency domain enhanced feature map based on the target semantic depth, train the object recognition model until a preset training end condition is reached, and obtain a trained object recognition model.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the object recognition model training method of any one of claims 1 to 7.
10. A computer-readable storage medium whose instructions, when executed by a processor of an electronic device, enable the electronic device to perform the object recognition model training method of any one of claims 1-7.
CN202211222074.0A 2022-10-08 2022-10-08 Object recognition model training method and device, electronic equipment and storage medium Pending CN115690530A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211222074.0A CN115690530A (en) 2022-10-08 2022-10-08 Object recognition model training method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211222074.0A CN115690530A (en) 2022-10-08 2022-10-08 Object recognition model training method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115690530A true CN115690530A (en) 2023-02-03

Family

ID=85065536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211222074.0A Pending CN115690530A (en) 2022-10-08 2022-10-08 Object recognition model training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115690530A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination