CN113706550A - Image scene recognition and model training method and device and computer equipment - Google Patents

Image scene recognition and model training method and device and computer equipment

Info

Publication number: CN113706550A
Application number: CN202110255557.XA
Authority: CN (China)
Prior art keywords: foreground, attention, background, self, initial
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 郭卉
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by: Tencent Technology Shenzhen Co Ltd

Classifications

    • G06T 7/11 Region-based segmentation
    • G06F 18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T 7/136 Segmentation; edge detection involving thresholding
    • G06T 7/194 Segmentation; edge detection involving foreground-background segmentation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to an image scene recognition method, an image scene recognition device, a computer device and a storage medium. The method comprises the following steps: acquiring an image to be identified; extracting a foreground area and a background area in an image to be identified; performing self-attention weight calculation based on foreground features corresponding to the foreground regions to obtain self-attention foreground weights, and adjusting the foreground features through the self-attention foreground weights to obtain self-attention foreground features; performing self-attention weight calculation based on the background features corresponding to the background area to obtain a self-attention background weight, and adjusting the background features through the self-attention background weight to obtain self-attention background features; and performing feature fusion on the self-attention background feature and the self-attention foreground feature to obtain a fusion feature, and performing scene recognition based on the fusion feature to obtain an image scene recognition result corresponding to the image to be recognized. By adopting the method, the accuracy of image scene identification can be improved.

Description

Image scene recognition and model training method and device and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for image scene recognition and model training, a computer device, and a storage medium.
Background
With the development of image processing technology, image recognition technology has emerged. Image recognition refers to the technology of processing, analyzing and understanding images with a computer in order to recognize targets and objects of various patterns, and it is a practical application of deep learning algorithms. Currently, image recognition can recognize the scene in an image by detecting the objects in the image and then determining the scene of the image from those objects. However, with this method of detecting objects before performing scene recognition, the accuracy of image scene recognition is low, because an image may contain no detectable objects at all, for example in scenes such as seashores or large forests.
Disclosure of Invention
In view of the above, it is necessary to provide an image scene recognition and model training method, apparatus, computer device and storage medium capable of improving the accuracy of image scene recognition.
A method of image scene recognition, the method comprising:
acquiring an image to be identified;
extracting a foreground area and a background area in an image to be identified;
performing self-attention weight calculation based on foreground features corresponding to the foreground regions to obtain self-attention foreground weights, and adjusting the foreground features through the self-attention foreground weights to obtain self-attention foreground features;
performing self-attention weight calculation based on the background features corresponding to the background area to obtain a self-attention background weight, and adjusting the background features through the self-attention background weight to obtain self-attention background features;
and performing feature fusion on the self-attention background feature and the self-attention foreground feature to obtain a fusion feature, and performing scene recognition based on the fusion feature to obtain an image scene recognition result corresponding to the image to be recognized.
In one embodiment, the region division is performed based on the image features to be recognized to obtain a foreground region and a background region, and the method includes:
calculating a mean value corresponding to the characteristic value in the image characteristic to be identified;
performing binary division on the image features to be identified based on the mean value to obtain a foreground mask;
calculating the product of the foreground mask and the pixel value of the image to be identified to obtain a foreground area in the image to be identified;
and (4) negating the foreground mask to obtain a background mask, and calculating the product of the background mask and the pixel value of the image to be identified to obtain a background area.
An image scene recognition device, the device comprising:
the image acquisition module is used for acquiring an image to be identified;
the region extraction module is used for extracting a foreground region and a background region in the image to be identified;
the foreground feature extraction module is used for carrying out self-attention weight calculation on the basis of foreground features corresponding to the foreground region to obtain self-attention foreground weights, and adjusting the foreground features through the self-attention foreground weights to obtain self-attention foreground features;
the background feature extraction module is used for performing self-attention weight calculation based on the background features corresponding to the background area to obtain a self-attention background weight, and adjusting the background features through the self-attention background weight to obtain self-attention background features;
and the scene recognition module is used for carrying out feature fusion on the self-attention background feature and the self-attention foreground feature to obtain a fusion feature, carrying out scene recognition based on the fusion feature and obtaining an image scene recognition result corresponding to the image to be recognized.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring an image to be identified;
extracting a foreground area and a background area in an image to be identified;
performing self-attention weight calculation based on foreground features corresponding to the foreground regions to obtain self-attention foreground weights, and adjusting the foreground features through the self-attention foreground weights to obtain self-attention foreground features;
performing self-attention weight calculation based on the background features corresponding to the background area to obtain a self-attention background weight, and adjusting the background features through the self-attention background weight to obtain self-attention background features;
and performing feature fusion on the self-attention background feature and the self-attention foreground feature to obtain a fusion feature, and performing scene recognition based on the fusion feature to obtain an image scene recognition result corresponding to the image to be recognized.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring an image to be identified;
extracting a foreground area and a background area in an image to be identified;
performing self-attention weight calculation based on foreground features corresponding to the foreground regions to obtain self-attention foreground weights, and adjusting the foreground features through the self-attention foreground weights to obtain self-attention foreground features;
performing self-attention weight calculation based on the background features corresponding to the background area to obtain a self-attention background weight, and adjusting the background features through the self-attention background weight to obtain self-attention background features;
and performing feature fusion on the self-attention background feature and the self-attention foreground feature to obtain a fusion feature, and performing scene recognition based on the fusion feature to obtain an image scene recognition result corresponding to the image to be recognized.
According to the image scene identification method, the image scene identification device, the computer equipment and the storage medium, the foreground area and the background area in the image to be identified are extracted, then self-attention foreground weight calculation is carried out based on the foreground features corresponding to the foreground area, self-attention foreground weight is obtained, and the foreground features are adjusted through the self-attention foreground weight, so that the self-attention foreground features are obtained. The method comprises the steps of calculating self-attention weight based on background features corresponding to a background region to obtain self-attention background weight, adjusting the background features through the self-attention background weight to obtain self-attention background features, fusing the self-attention foreground features and the self-attention background features, and then carrying out scene recognition, namely recognizing an image scene through the combined action of the background region and the foreground region, so that sufficient scene recognition features can be extracted, and the accuracy of image scene recognition is improved.
A method of training an image scene recognition model, the method comprising:
acquiring a training image and a corresponding training scene label, and inputting the training image into an initial image scene recognition model;
extracting an initial foreground training area and an initial background training area in a training image by using an initial image scene recognition model, inputting the initial foreground area into an initial foreground branch network, and inputting the initial background area into an initial background branch network;
the initial foreground branch network carries out self-attention weight calculation based on initial foreground features corresponding to the initial foreground training area to obtain initial self-attention foreground weights, and the initial foreground features are adjusted through the initial self-attention foreground weights to obtain initial self-attention foreground features;
the initial background branch network carries out self-attention weight calculation based on initial background features corresponding to the initial background training area to obtain initial self-attention background weights, and the initial background features are adjusted through the initial self-attention background weights to obtain initial self-attention background features;
the initial image scene recognition model performs feature fusion on the initial self-attention background features and the initial self-attention foreground features to obtain initial fusion features, and performs scene recognition based on the initial fusion features to obtain initial image scene recognition results;
and calculating the initial image scene recognition result and the loss information of the training scene label, updating the initial image scene recognition model based on the loss information, and returning to the step of inputting the training image into the initial image scene recognition model for iterative execution until the training completion condition is reached to obtain the trained image scene recognition model.
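To illustrate the training procedure described above, the following PyTorch-style sketch outlines one possible training loop. It is only an illustrative outline under assumed interfaces: the dual-branch model, the data loader, the optimizer choice and the hyper-parameters are placeholders rather than the implementation disclosed in this application.

    import torch
    import torch.nn as nn

    # Illustrative sketch only: the dual-branch model, the data loader and the
    # hyper-parameters below are assumptions, not the disclosed implementation.
    def train_scene_model(model, train_loader, num_epochs=10, lr=1e-3, device="cpu"):
        model.to(device)
        criterion = nn.CrossEntropyLoss()   # loss between recognition result and scene label
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        for epoch in range(num_epochs):     # iterate until the training completion condition
            for images, scene_labels in train_loader:
                images, scene_labels = images.to(device), scene_labels.to(device)
                logits = model(images)      # foreground/background branches and fusion inside
                loss = criterion(logits, scene_labels)
                optimizer.zero_grad()
                loss.backward()             # update the model based on the loss information
                optimizer.step()
        return model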
An image scene recognition model training apparatus, the apparatus comprising:
the training data acquisition module is used for acquiring a training image and a corresponding training scene label and inputting the training image into the initial image scene recognition model;
the model processing module is used for extracting an initial foreground training area and an initial background training area in a training image by an initial image scene recognition model, inputting the initial foreground area into an initial foreground branch network, and inputting the initial background area into an initial background branch network;
the foreground network processing module is used for calculating the self-attention foreground weight of the initial foreground branch network based on the initial foreground features corresponding to the initial foreground training area to obtain an initial self-attention foreground weight, and adjusting the initial foreground features through the initial self-attention foreground weight to obtain initial self-attention foreground features;
the background network processing module is used for calculating the self-attention background weight of the initial background branch network based on the initial background features corresponding to the initial background training area to obtain an initial self-attention background weight, and adjusting the initial background features according to the initial self-attention background weight to obtain initial self-attention background features;
the model identification module is used for carrying out feature fusion on the initial self-attention background feature and the initial self-attention foreground feature by the initial image scene identification model to obtain an initial fusion feature, and carrying out scene identification based on the initial fusion feature to obtain an initial image scene identification result;
and the iteration module is used for calculating the initial image scene recognition result and the loss information of the training scene label, updating the initial image scene recognition model based on the loss information, and returning to the step of inputting the training image into the initial image scene recognition model for iterative execution until the training completion condition is reached to obtain the trained image scene recognition model.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a training image and a corresponding training scene label, and inputting the training image into an initial image scene recognition model;
extracting an initial foreground training area and an initial background training area in a training image by using an initial image scene recognition model, inputting the initial foreground area into an initial foreground branch network, and inputting the initial background area into an initial background branch network;
the initial foreground branch network carries out self-attention weight calculation based on initial foreground features corresponding to the initial foreground training area to obtain initial self-attention foreground weights, and the initial foreground features are adjusted through the initial self-attention foreground weights to obtain initial self-attention foreground features;
the initial background branch network carries out self-attention weight calculation based on initial background features corresponding to the initial background training area to obtain initial self-attention background weights, and the initial background features are adjusted through the initial self-attention background weights to obtain initial self-attention background features;
the initial image scene recognition model performs feature fusion on the initial self-attention background features and the initial self-attention foreground features to obtain initial fusion features, and performs scene recognition based on the initial fusion features to obtain initial image scene recognition results;
and calculating the initial image scene recognition result and the loss information of the training scene label, updating the initial image scene recognition model based on the loss information, and returning to the step of inputting the training image into the initial image scene recognition model for iterative execution until the training completion condition is reached to obtain the trained image scene recognition model.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a training image and a corresponding training scene label, and inputting the training image into an initial image scene recognition model;
extracting an initial foreground training area and an initial background training area in a training image by using an initial image scene recognition model, inputting the initial foreground area into an initial foreground branch network, and inputting the initial background area into an initial background branch network;
the initial foreground branch network carries out self-attention weight calculation based on initial foreground features corresponding to the initial foreground training area to obtain initial self-attention foreground weights, and the initial foreground features are adjusted through the initial self-attention foreground weights to obtain initial self-attention foreground features;
the initial background branch network carries out self-attention weight calculation based on initial background features corresponding to the initial background training area to obtain initial self-attention background weights, and the initial background features are adjusted through the initial self-attention background weights to obtain initial self-attention background features;
the initial image scene recognition model performs feature fusion on the initial self-attention background features and the initial self-attention foreground features to obtain initial fusion features, and performs scene recognition based on the initial fusion features to obtain initial image scene recognition results;
and calculating the initial image scene recognition result and the loss information of the training scene label, updating the initial image scene recognition model based on the loss information, and returning to the step of inputting the training image into the initial image scene recognition model for iterative execution until the training completion condition is reached to obtain the trained image scene recognition model.
According to the image scene recognition model training method, apparatus, computer device and storage medium, a training image and its corresponding training scene label are acquired and the training image is input into the initial image scene recognition model. The initial image scene recognition model extracts the initial self-attention foreground features through the initial foreground branch network and the initial self-attention background features through the initial background branch network, performs feature fusion on the initial self-attention background features and the initial self-attention foreground features to obtain initial fusion features, and performs scene recognition based on the initial fusion features to obtain an initial image scene recognition result. Loss information between the initial image scene recognition result and the training scene label is then calculated, the initial image scene recognition model is updated based on the loss information, and training iterates until the training completion condition is reached, yielding the trained image scene recognition model. Because the self-attention foreground features and the self-attention background features are extracted separately by the foreground branch network and the background branch network and are then fused before recognition, the trained image scene recognition model can improve the accuracy of image scene recognition.
Drawings
FIG. 1 is a diagram of an exemplary environment in which an image scene recognition method may be implemented;
FIG. 2 is a flow diagram illustrating an exemplary method for identifying an image scene;
FIG. 3 is a schematic flow chart illustrating model identification in one embodiment;
FIG. 4 is a structural diagram of a block (module) in one embodiment;
FIG. 5 is a flow diagram illustrating obtaining a background region in one embodiment;
FIG. 6 is a diagram of a background mask obtained in one embodiment;
FIG. 7 is a schematic flow chart illustrating obtaining self-attention foreground features in one embodiment;
FIG. 8 is a flow diagram illustrating a process for deriving self-attention background features in one embodiment;
FIG. 9 is a flowchart illustrating a method for training an image scene recognition model according to an embodiment;
FIG. 10 is a schematic flow chart of obtaining an image scene recognition model in one embodiment;
FIG. 11 is a schematic flow chart of pre-training in one embodiment;
FIG. 12 is a schematic flow chart of image scene recognition in one embodiment;
FIG. 13 is a block diagram of an image scene recognition model in an embodiment;
FIG. 14 is a diagram illustrating an exemplary application of the image scene recognition method in an embodiment;
FIG. 15 is a diagram illustrating a picture to be recognized according to the embodiment of FIG. 14;
FIG. 16 is a block diagram showing the structure of an image scene recognition apparatus according to an embodiment;
FIG. 17 is a block diagram showing the construction of an image scene recognition model training apparatus according to an embodiment;
FIG. 18 is a diagram showing an internal structure of a computer device in one embodiment;
fig. 19 is an internal configuration diagram of a computer device in another embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Computer Vision (CV) is the science of how to make machines "see". More specifically, it refers to using cameras and computers in place of human eyes to perform machine vision tasks such as recognizing, tracking and measuring targets, and to further process the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
The solution provided by the embodiments of this application relates to technologies such as artificial-intelligence image recognition, which are specifically explained by the following embodiments:
the image scene recognition method provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The server 104 acquires an image to be identified uploaded by the terminal 102; extracting a foreground area and a background area in an image to be identified; the server 104 performs self-attention weight calculation based on the foreground features corresponding to the foreground regions to obtain self-attention foreground weights, and adjusts the foreground features according to the self-attention foreground weights to obtain self-attention foreground features; the server 104 performs self-attention weight calculation based on the background features corresponding to the background area to obtain a self-attention background weight, and adjusts the background features according to the self-attention background weight to obtain self-attention background features; the server 104 performs feature fusion on the self-attention background feature and the self-attention foreground feature to obtain a fusion feature, performs scene recognition based on the fusion feature to obtain an image scene recognition result corresponding to the image to be recognized, and the server 104 may return the image scene recognition result to the terminal 102 for display. The terminal 102 may be, but not limited to, a notebook computer, a smart phone, a tablet computer, a desktop computer, a smart television, and a portable wearable device, and the server 104 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, a cloud database, cloud computing, a cloud function, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN, and a big data and artificial intelligence platform. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
In an embodiment, as shown in fig. 2, an image scene recognition method is provided. The method is described using its application to the server in fig. 1 as an example; it is understood that the method may also be applied to a terminal, or executed by the terminal and the server in cooperation. In this embodiment, the method includes the following steps:
step 202, acquiring an image to be identified.
And step 204, extracting a foreground area and a background area in the image to be identified.
The image to be recognized refers to an image on which scene recognition needs to be performed. The foreground region is the partial region of the image to be recognized where the foreground is located; the foreground is the scenery or person in the picture that lies in front of the subject or close to the camera lens, and it conveys a certain spatial or interpersonal relationship. The background region refers to the partial region of the image to be recognized where the background is located; the background lies behind the subject and away from the camera, and is an important component of the environment.
Specifically, the server acquires the image to be identified. The image to be identified may be uploaded to the server by a terminal, acquired by the server from a database, or sent by a service server that handles image-related services. The server segments the image to be recognized and extracts the foreground region and the background region in it. The server may segment the image to be recognized using a threshold-based segmentation algorithm to obtain the foreground region and the background region. The server may also perform segmentation using pixel clustering, use a maximum entropy algorithm to segment the image to be identified, or segment the image to be recognized based on a deep neural network algorithm to obtain the foreground region and the background region. In one embodiment, the server may extract image features corresponding to the image to be recognized and then use those image features to segment the image into the foreground region and the background region; the image features corresponding to the image to be recognized may be extracted by a deep neural network. In a specific embodiment, the server may also obtain the image to be recognized through an application (APP) or a client on the terminal.
And step 206, performing self-attention weight calculation based on the foreground features corresponding to the foreground regions to obtain self-attention foreground weights, and adjusting the foreground features according to the self-attention foreground weights to obtain the self-attention foreground features.
The foreground features are the regional features corresponding to the foreground region, and the self-attention foreground weight is the weight corresponding to the foreground features calculated through a self-attention mechanism: for each feature element, self-attention computes its corresponding attention weight. The self-attention foreground feature refers to the feature obtained by weighting the foreground features with the self-attention foreground weight.
Specifically, the server extracts foreground features corresponding to the foreground region, and the features of the foreground region can be extracted through the deep neural network to obtain the foreground features. The features obtained by the feature extraction algorithm may also be extracted, for example, the features may be color features, texture features, shape features, spatial relationship features, and the like. And then, performing self-attention weight calculation by using the foreground features to obtain the self-attention foreground weight. Wherein the self-attention weight calculation can be carried out through a neural network established by a self-attention mechanism. And the server uses the self-attention foreground weight to weight the foreground features to obtain the self-attention foreground features.
And 208, performing self-attention weight calculation based on the background features corresponding to the background area to obtain a self-attention background weight, and adjusting the background features according to the self-attention background weight to obtain self-attention background features.
The background features are the features corresponding to the background region, and the self-attention background weight is the weight corresponding to the background features calculated through a self-attention mechanism. The self-attention background feature refers to the feature obtained by weighting the background features with the self-attention background weight.
Specifically, the server extracts the background features corresponding to the background area, and may extract the background features corresponding to the background area through the deep neural network. And then, self-attention weight calculation is carried out by using the background features, and self-attention background weight is obtained. The self-attention background feature is obtained by weighting the background feature by using the self-attention background weight.
And step 210, performing feature fusion on the self-attention background feature and the self-attention foreground feature to obtain a fusion feature, and performing scene recognition based on the fusion feature to obtain an image scene recognition result corresponding to the image to be recognized.
The fusion feature refers to a feature obtained by fusing the self-attention background feature and the self-attention foreground feature. The image scene recognition result refers to a specific scene category corresponding to the image to be recognized, where the scene category may be a scene name, a scene label, or a scene number, and the like, and for example, the image scene recognition result may be a scene such as a city street, a highway, a park, a coffee shop, an office, a restaurant, and the like.
Specifically, the server performs feature fusion on the self-attention background feature and the self-attention foreground feature to obtain a fusion feature, where the fusion feature may be a feature obtained by directly splicing the self-attention background feature and the self-attention foreground feature. The fusion feature may be a feature obtained by performing vector operation on a vector corresponding to the feature. For example, a vector sum, a vector product, and the like corresponding to the self-attention background feature and the self-attention foreground feature may be calculated to obtain the fusion feature. And then, performing scene recognition by using the fusion features to obtain an image scene recognition result corresponding to the image to be recognized, for example, performing scene recognition on the fusion features by using a convolutional neural network to obtain an image scene recognition result corresponding to the image to be recognized.
In the image scene identification method, the self-attention foreground weight is obtained by extracting the foreground region and the background region in the image to be identified, then performing self-attention weight calculation based on the foreground features corresponding to the foreground region, and the self-attention foreground features are obtained by adjusting the foreground features according to the self-attention foreground weight. The method comprises the steps of calculating self-attention weight based on background features corresponding to a background region to obtain self-attention background weight, adjusting the background features through the self-attention background weight to obtain self-attention background features, fusing the self-attention foreground features and the self-attention background features, and then carrying out scene recognition, namely recognizing an image scene through the combined action of the background region and the foreground region, so that sufficient scene recognition features can be extracted, and the accuracy of image scene recognition is improved.
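As a rough illustration of steps 202 to 210, the sketch below strings the stages together in PyTorch. The branch modules, the mask-based region split and the feature dimension of 2048 are assumptions used for illustration only, not the definitive implementation.

    import torch
    import torch.nn as nn

    # Hedged sketch of the recognition flow; the branch networks are assumed to map an
    # image region to a 2048-dimensional self-attention feature vector.
    class SceneRecognitionSketch(nn.Module):
        def __init__(self, foreground_branch, background_branch, num_classes, feat_dim=2048):
            super().__init__()
            self.foreground_branch = foreground_branch  # produces the self-attention foreground feature
            self.background_branch = background_branch  # produces the self-attention background feature
            self.classifier = nn.Linear(2 * feat_dim, num_classes)

        def forward(self, image, fg_mask):
            # Extract the foreground and background regions of the image to be recognized.
            foreground_region = image * fg_mask
            background_region = image * (1.0 - fg_mask)
            # Each branch computes self-attention weights and reweights its own features.
            fg_feat = self.foreground_branch(foreground_region)
            bg_feat = self.background_branch(background_region)
            # Fuse by concatenation and classify into a scene category.
            fused = torch.cat([fg_feat, bg_feat], dim=1)
            return self.classifier(fused)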
In one embodiment, as shown in fig. 3, an image scene recognition method includes:
step 302, inputting an image to be recognized into an image scene recognition model.
Step 304, the image scene recognition model extracts a foreground region and a background region in the image to be recognized, inputs the foreground region into a foreground branch network, and inputs the background region into a background branch network.
The image scene recognition model is a model, trained with training data through a neural network algorithm, that is used for image scene recognition. The image scene recognition model comprises two branch networks, namely a foreground branch network and a background branch network: the foreground branch network is used for extracting the self-attention foreground features, the background branch network is used for extracting the self-attention background features, and both branch networks are networks with a self-attention mechanism. In one embodiment, the foreground and background branch networks have the same network structure but different network parameters. In another embodiment, both the network structures and the network parameters of the foreground branch network and the background branch network are different; the network structure of each branch network can be set as needed, and the network parameters are obtained through training.
Specifically, the server trains the obtained image scene recognition model in advance, and deploys the image scene recognition model to the server for use. When the server acquires the image to be recognized, the image to be recognized can be directly input into the image scene recognition model, the image scene recognition model receives the input image to be recognized, image features corresponding to the image to be recognized are extracted, region division is carried out on the basis of the image features, and a foreground region and a background region are obtained. And then inputting the foreground area into a foreground branch network of the image scene recognition model, and simultaneously inputting the background area into a background branch network of the image scene recognition model.
And step 306, extracting foreground features corresponding to the foreground region by the foreground branch network, performing self-attention weight calculation by using the foreground features to obtain self-attention foreground weights, and weighting the foreground features by using the self-attention foreground weights to obtain the self-attention foreground features.
Specifically, the foreground branch network may extract the foreground features corresponding to the foreground region through a foreground feature extraction network, which may be a convolutional neural network, a recurrent neural network, a long short-term memory network, a feed-forward neural network, or the like. The foreground features are then subjected to self-attention weight calculation to obtain the self-attention foreground weight; for example, the foreground features may be pooled and then compressed to extract the important information in them, and the compressed features may then be subjected to weight mapping to obtain the self-attention foreground weight. The foreground features are then weighted by the self-attention foreground weight to obtain the self-attention foreground features.
And 308, extracting the background features corresponding to the background area by the background branch network, performing self-attention weight calculation by using the background features to obtain self-attention background weights, and weighting the background features by using the self-attention background weights to obtain the self-attention background features.
Specifically, the background branch network may also extract the background features corresponding to the background region through a background feature extraction network, which may be a convolutional neural network, a recurrent neural network, a long short-term memory network, a feed-forward neural network, or the like; the background feature extraction network has the same network structure as the foreground feature extraction network but different network parameters. In one embodiment, the network structures of the background feature extraction network and the foreground feature extraction network may also be different. Having obtained the background features, the background branch network performs self-attention weight calculation based on them; for example, the background branch network may pool and then compress the background features to extract the important information in the background features, and then perform weight mapping on the compressed features to obtain the self-attention background weight. The background features are then weighted by the self-attention background weight to obtain the self-attention background features.
And 310, the image scene recognition model performs feature fusion on the self-attention background feature and the self-attention foreground feature to obtain a fusion feature, and performs scene recognition based on the fusion feature to obtain an image scene recognition result.
Specifically, the image scene recognition model may splice the self-attention background features and the self-attention foreground features, that is, connect them end to end, to obtain the fusion feature, perform scene recognition on the fusion feature to obtain the image scene recognition result, and then output the image scene recognition result. In an embodiment, the image scene recognition model may instead perform a vector operation on the self-attention background features and the self-attention foreground features, for example a vector product, a vector sum, or a scalar (dot) product, to obtain the fusion feature.
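The fusion alternatives mentioned here (splicing versus vector operations) can be written, for example, as follows; the feature dimensions are illustrative assumptions.

    import torch

    # fg_feat / bg_feat stand in for the self-attention foreground and background features.
    fg_feat = torch.randn(1, 2048)
    bg_feat = torch.randn(1, 2048)

    fused_concat = torch.cat([fg_feat, bg_feat], dim=1)  # end-to-end splicing, shape 1 x 4096
    fused_sum = fg_feat + bg_feat                        # vector sum, shape 1 x 2048
    fused_product = fg_feat * bg_feat                    # element-wise vector product, shape 1 x 2048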
In the above embodiment, scene recognition is performed on the image to be recognized through the trained image scene recognition model: the foreground region and the background region in the image to be recognized are extracted, the foreground region is input into the foreground branch network and the background region into the background branch network, the foreground branch network extracts the self-attention foreground features and the background branch network extracts the self-attention background features, the self-attention background features and the self-attention foreground features are fused to obtain the fusion feature, and scene recognition is performed based on the fusion feature to obtain the output image scene recognition result. Because the image scene recognition model extracts the self-attention background features and the self-attention foreground features from the background region and the foreground region through the dual-branch network, and then obtains the image scene recognition result from both of them, the accuracy of the image scene recognition result can be improved.
In one embodiment, the image scene recognition model includes an image feature extraction network; extracting a foreground region and a background region in an image to be identified, comprising:
inputting an image to be identified into an image feature extraction network for feature extraction to obtain the image features to be identified; and carrying out region division based on the image features to be identified to obtain a foreground region and a background region.
The image feature extraction network is used for extracting features of an image to be recognized, and the image feature extraction network can be a deep neural network, such as a convolutional neural network. The image features to be recognized are used for representing the image to be recognized, are image high-dimensional features extracted through a deep neural network, and are generally used for representing the foreground of the image to be recognized.
Specifically, the image scene recognition model inputs the image to be recognized into the image feature extraction network for feature extraction to obtain the output image features to be recognized, and these features are used to perform region division to obtain the foreground region and the background region. The region division may be performed according to the size of the feature values in the image features to be identified. The image to be recognized can also be divided according to the image activation degree in the image features to be recognized, where the image activation degree represents the importance of the corresponding image feature, so as to obtain the foreground region and the background region. In a specific embodiment, the image feature extraction network shown in Table 1 below is used to perform feature extraction on the image to be recognized to obtain the output image features to be recognized. The image feature extraction network is a ResNet-101 (residual network); it comprises five convolution layers, takes the image to be identified as input, and its fifth convolution layer outputs a 7 × 7 × 2048 feature map.
Table 1: Image feature extraction network structure (ResNet-101 structure table; the layer configuration is provided as an image in the original publication)
As shown in fig. 4, which is a structural diagram of a block (module), the 256-dimensional input is reduced to 64 dimensions by a 1×1 convolution and finally restored by another 1×1 convolution, with ReLU (Rectified Linear Unit) used as the activation function. This structure can alleviate model degradation and vanishing gradients to a certain extent.
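A minimal sketch of such a bottleneck block is given below, assuming the standard ResNet layout (1×1 reduction, 3×3 convolution, 1×1 restoration, ReLU activations and a residual connection); the channel sizes follow the 256-to-64 example above and are otherwise illustrative, not the exact block of Table 1.

    import torch
    import torch.nn as nn

    # Generic ResNet-style bottleneck sketch; channel sizes are illustrative.
    class Bottleneck(nn.Module):
        def __init__(self, channels=256, reduced=64):
            super().__init__()
            self.reduce = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)
            self.conv3x3 = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False)
            self.restore = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.relu(self.reduce(x))
            out = self.relu(self.conv3x3(out))
            out = self.restore(out)
            return self.relu(out + x)  # residual connection eases degradation and vanishing gradients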
In one embodiment, as shown in fig. 5, performing region division based on the image features to be recognized to obtain a foreground region and a background region includes:
and 502, calculating a mean value corresponding to the characteristic value in the image characteristic to be identified.
Specifically, the server calculates the sum of the feature values in the image features to be identified and the number of feature values, and then computes their ratio to obtain the feature mean corresponding to the image features to be identified. This feature mean is used as the threshold for dividing the image features to be identified into the foreground region and the background region. In an embodiment, the median, mode, a quantile, the variance, the standard deviation, or the like of the feature values in the image features to be recognized may instead be used as this threshold. In one embodiment, the division threshold corresponding to the image features to be recognized may also be calculated by a gray-histogram algorithm. In one embodiment, the division threshold may be calculated by a maximum inter-class variance algorithm.
And step 504, performing binary division on the image features to be identified based on the mean value to obtain a foreground mask.
Specifically, the server uses the mean value to perform binary division of the image features to be recognized: feature values exceeding the mean are replaced by 1 and feature values not exceeding the mean are replaced by 0, yielding the foreground mask. Positions exceeding the mean represent the foreground region, and positions not exceeding the mean represent the background region.
Step 506, calculating the product of the foreground mask and the pixel value of the image to be identified to obtain the foreground area in the image to be identified.
Specifically, the server calculates the product of the foreground mask and the pixel values in the image to be identified to obtain a binarized image, and then obtains the foreground region of the image to be identified from the binarized image. In a specific embodiment, as shown in fig. 6, foreground mask extraction is performed on three different pictures to obtain the foreground mask schematic diagrams. Viewing the figure from top to bottom: foreground masking is performed on the first picture, of a puppy, and the extracted foreground region is the region where the puppy is located; foreground masking is performed on the second picture, of a street, and the extracted foreground region is the region where the billboard is located; foreground masking is performed on the third picture, of a house, and the extracted foreground region is the region where the house is located.
And step 508, negating the foreground mask to obtain a background mask, and calculating the product of the background mask and the pixel value of the image to be identified to obtain a background area.
Specifically, the server negates the foreground mask, that is, the value of the foreground mask is subtracted from 1 to obtain a background mask, then the product of the background mask and the pixel value of the image to be identified is calculated to obtain a binarized image, and the background region in the image to be identified is obtained from the binarized image.
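A hedged sketch of this division is shown below; averaging the feature map over channels before thresholding and the nearest-neighbour upsampling of the mask to image resolution are assumptions made for illustration only.

    import torch
    import torch.nn.functional as F

    # Illustrative sketch of the mean-threshold foreground/background split.
    # `feature_map` is assumed to be a C x h x w feature of the image; `image` is 3 x H x W.
    def split_foreground_background(image, feature_map):
        activation = feature_map.mean(dim=0)           # h x w activation map
        threshold = activation.mean()                  # mean of the feature values
        fg_mask = (activation > threshold).float()     # binary foreground mask
        fg_mask = F.interpolate(fg_mask[None, None], size=image.shape[-2:], mode="nearest")[0]
        foreground = image * fg_mask                   # product of foreground mask and pixel values
        background = image * (1.0 - fg_mask)           # inverted mask gives the background region
        return foreground, background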
In the embodiment, the background area and the foreground area are obtained by dividing the image to be recognized by using the image feature to be recognized, so that the accuracy of the obtained background area and foreground area can be improved.
In one embodiment, the foreground branching network comprises a foreground feature extraction network and a foreground attention feature extraction network; performing self-attention weight calculation based on foreground features corresponding to the foreground regions to obtain self-attention foreground weights, and adjusting the foreground features through the self-attention foreground weights to obtain self-attention foreground features, wherein the self-attention foreground features comprise:
inputting the foreground area into a foreground feature extraction network for feature extraction to obtain foreground features corresponding to the foreground area; and inputting the foreground features into a foreground attention weight feature network for attention weight calculation to obtain self-attention foreground weights, and weighting the foreground features by using the self-attention foreground weights to obtain the self-attention foreground features.
The foreground feature extraction network is a network for extracting features from a foreground region, and the foreground attention feature extraction network is a network for extracting self-attention features from the foreground features.
Specifically, each branch network in the image scene recognition model has a corresponding feature extraction network and attention feature extraction network, that is, each branch network may perform feature extraction on an input image region and then perform attention feature extraction. The foreground region can be input into a foreground feature extraction network for feature extraction to obtain a foreground feature map corresponding to the foreground region, then the foreground feature is input into a foreground attention weight feature network for attention weight calculation to obtain a self-attention foreground weight, and the foreground feature is weighted by using the self-attention foreground weight to obtain the self-attention foreground feature. In a specific embodiment, the network structure of the foreground feature extraction network may be the network structure shown in table 1, and the network parameters are obtained by training.
In one embodiment, as shown in fig. 7, inputting the foreground features into a foreground attention weight feature network to perform attention weight calculation, so as to obtain a self-attention foreground weight, and weighting the foreground features by using the self-attention foreground weight, so as to obtain the self-attention foreground features, where the method includes:
and 702, performing mean pooling on the foreground features through a mean pooling layer in the foreground attention feature extraction network to obtain foreground pooling features.
The mean pooling layer is used for performing mean pooling on the foreground features, that is, for reducing their dimensionality. The foreground pooling features are the features obtained by mean pooling of the foreground features.
Specifically, the server performs mean pooling on the foreground features through the mean pooling layer in the foreground attention feature extraction network to obtain the foreground pooled features. For example, if the input foreground features form a 7 × 7 × 2048 feature map, the foreground pooled features obtained through the mean pooling layer are 1 × 2048-dimensional, that is, a 1 × 2048 vector representing the activation mean of the 2048 different channels of the deep learning network layer over the foreground region. In one embodiment, the foreground pooling features may also be obtained by performing maximum pooling through a maximum pooling layer in the foreground attention feature extraction network.
And 704, performing nonlinear compression on the foreground pooling features by using a nonlinear compression layer in the foreground attention feature extraction network to obtain foreground compression features.
The nonlinear compression layer performs nonlinear compression, which extracts the important information in the foreground features. The foreground compression features are the features obtained after the foreground pooling features are nonlinearly compressed.
Specifically, the server may perform nonlinear compression on the foreground pooled features through a nonlinear compression layer, for example, compress a foreground vector of 1 × 2048 dimensions to 64 dimensions through the nonlinear compression, so as to refine important information in the foreground features.
And step 706, activating the foreground compression features through an activation function layer in the foreground attention feature extraction network to obtain foreground activation features.
Specifically, the activation function layer is used to activate the foreground compression features. The foreground activation feature refers to the feature obtained after activation through an activation function. The activation may be performed through a ReLU (Rectified Linear Unit) activation function, a sigmoid activation function, or a Tanh (hyperbolic tangent) activation function. For example, the 64-dimensional foreground compression features may be activated using the ReLU function, resulting in 64-dimensional foreground activation features.
Step 708, performing weight mapping on the foreground activation features through a weight mapping layer in the foreground attention feature extraction network to obtain a self-attention foreground weight.
The weight mapping layer is used for self-attention weight mapping, that is, it maps the activated features back to the channel dimension so that the resulting vector can be used as the weight vector.
Specifically, the server inputs the foreground activation features into the weight mapping layer for weight mapping to obtain the self-attention foreground weight. For example, the server inputs the 64-dimensional foreground activation features into the weight mapping layer and obtains an output 2048-dimensional self-attention foreground weight vector, whose entries are the weights corresponding to the 2048 different channels of the deep learning network layer.
Step 710, weighting the feature values in the foreground features by using the self-attention foreground weight to obtain weighted foreground features, and performing maximum pooling on the weighted foreground features through a maximum pooling layer in the foreground attention feature extraction network to obtain the self-attention foreground features.
Specifically, the server uses the self-attention foreground weight to weight the feature values in the foreground features to obtain weighted foreground features, and inputs the weighted foreground features into a maximum pooling layer for maximum pooling to obtain the self-attention foreground features. That is, the server uses the 2048-dimensional self-attention foreground weight vector to weight each channel in the foreground features to obtain a self-attention feature map, and then performs maximum pooling on the self-attention feature map to obtain a 2048-dimensional self-attention foreground feature vector. In one particular embodiment, the network structure of the foreground attention feature extraction network is shown in table 2 below.
Table 2 network architecture for self-attention feature extraction network
(The table content is provided as an image in the original publication.)
In the above embodiment, the self-attention foreground features extracted by the foreground feature extraction network and the foreground attention weight extraction network in the foreground branch network can be more accurate.
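To make steps 702 to 710 concrete, the following is a minimal sketch, assuming a PyTorch implementation (the patent does not specify a framework): the foreground feature map is mean pooled, nonlinearly compressed to 64 dimensions, activated with ReLU, mapped back to 2048 channel weights, used to weight the feature map channel by channel, and finally max pooled into the self-attention foreground feature. The 2048/64 sizes follow the examples in the text; the class name, layer types, and all other details are assumptions.

```python
import torch
import torch.nn as nn

class SelfAttentionBranch(nn.Module):
    """Illustrative self-attention weight branch (steps 702-710)."""

    def __init__(self, channels: int = 2048, compressed: int = 64):
        super().__init__()
        self.mean_pool = nn.AdaptiveAvgPool2d(1)           # mean pooling layer
        self.compress = nn.Linear(channels, compressed)    # nonlinear compression layer
        self.activate = nn.ReLU()                          # activation function layer
        self.weight_map = nn.Linear(compressed, channels)  # weight mapping layer
        self.max_pool = nn.AdaptiveMaxPool2d(1)            # maximum pooling layer

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, 2048, 7, 7) foreground (or background) feature map
        pooled = self.mean_pool(feat).flatten(1)                          # (B, 2048) pooled feature
        weights = self.weight_map(self.activate(self.compress(pooled)))  # (B, 2048) self-attention weights
        weighted = feat * weights.unsqueeze(-1).unsqueeze(-1)            # channel-wise weighting
        return self.max_pool(weighted).flatten(1)                        # (B, 2048) self-attention feature
```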
In one embodiment, the background branching network comprises a background feature extraction network and a background attention weight extraction network;
performing self-attention weight calculation based on the background features corresponding to the background area to obtain a self-attention background weight, and adjusting the background features through the self-attention background weight to obtain self-attention background features, includes:
inputting the background area into a background feature extraction network for feature extraction to obtain background features corresponding to the background area; and inputting the background features into a background attention weight feature network to calculate attention weights to obtain self-attention background weights, and weighting the background features by using the self-attention background weights to obtain the self-attention background features.
The background feature extraction network is a network for extracting features corresponding to a background area. The background attention weight feature network refers to a network that performs self-attention feature extraction on background features.
Specifically, the background branch network of the image scene recognition model inputs the background area into the background feature extraction network for feature extraction to obtain the background features corresponding to the background area. In a specific embodiment, a background feature extraction network trained with the network structure shown in table 1 can be used. The background features are then input into the background attention weight feature network for attention weight calculation to obtain a self-attention background weight, and the product of the self-attention background weight and the background features is calculated to obtain the self-attention background features.
In one embodiment, as shown in fig. 8, inputting the background feature into a background attention weight feature network for attention weight calculation to obtain a self-attention background weight, and weighting the background feature by using the self-attention background weight to obtain a self-attention background feature, includes:
Step 802, performing mean pooling on the background features through a mean pooling layer in the background attention feature extraction network to obtain background pooling features.
Wherein, the mean pooling layer in the background attention feature extraction network is used for mean pooling the background features. The background pooling feature is the feature obtained after the background features are subjected to mean pooling.
Specifically, the server inputs the background features into the mean pooling layer in the background attention feature extraction network to perform mean pooling, so as to obtain background pooling features. For example, the 7 × 7 × 2048-dimensional background features are input into the mean pooling layer in the background attention feature extraction network to obtain an output 1 × 2048-dimensional feature vector. This vector represents the activation mean of the 2048 different channels of the deep learning network layer over the background region. In one embodiment, the background pooling feature may also be obtained by maximum pooling through a maximum pooling layer in the background attention feature extraction network.
Step 804, performing nonlinear compression on the background pooling features by using a nonlinear compression layer in the background attention feature extraction network to obtain background compression features.
Wherein, the nonlinear compression layer in the background attention feature extraction network is used for performing nonlinear compression. The background compression feature refers to the feature obtained after the background pooling feature is nonlinearly compressed.
Specifically, the server inputs the background pooling features into a non-linear compression layer in the background attention feature extraction network to obtain output background compression features, namely, background vectors of 1 × 2048 dimensions are compressed to 64 dimensions through non-linear compression, so that extraction of important information in the background features is realized.
Step 806, activating the background compression features through an activation function layer in the background attention feature extraction network to obtain background activation features.
Wherein, the activation function layer in the background attention feature extraction network is used for activation through an activation function, which may be a ReLU function, a sigmoid activation function, a Tanh activation function, or the like. The background activation feature refers to the feature obtained after activation of the background compression feature.
Specifically, the server inputs the background compression features into the activation function layer in the background attention feature extraction network for activation, so as to obtain background activation features. For example, the 64-dimensional background compression features are activated using the ReLU activation function to obtain the background activation features.
Step 808, performing weight mapping on the background activation features through a weight mapping layer in the background attention feature extraction network to obtain a self-attention background weight.
Wherein, the weight mapping layer in the background attention feature extraction network is used for performing weight mapping on the background activation features.
Specifically, the server inputs the background activation features into the weight mapping layer in the background attention feature extraction network for weight mapping, so as to obtain a self-attention background weight. For example, the server inputs the 64-dimensional background activation features into the weight mapping layer and obtains an output 2048-dimensional self-attention background weight vector, whose entries are the weights corresponding to the 2048 different channels of the deep learning network layer.
Step 810, weighting the feature values in the background features by using the self-attention background weight to obtain weighted background features, and performing maximum pooling on the weighted background features through a maximum pooling layer in the background attention feature extraction network to obtain the self-attention background features.
Specifically, the server calculates the product of the self-attention background weight and the background features to obtain weighted background features, and then performs maximum pooling on the weighted background features through a maximum pooling layer in the background attention feature extraction network to obtain the self-attention background features. That is, the server uses the 2048-dimensional self-attention background weight vector to weight each channel in the background features to obtain a self-attention background feature map, and then performs maximum pooling on the self-attention background feature map to obtain a 2048-dimensional self-attention background feature vector. In one particular embodiment, a network structure such as that shown in Table 2 may be used to train and obtain the background attention feature extraction network.
In the above embodiment, the self-attention background features extracted by the background feature extraction network and the background attention weight extraction network in the background branch network can be more accurate.
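Because the background branch mirrors the foreground branch, the same attention structure can simply be instantiated twice. The snippet below is purely illustrative and reuses the SelfAttentionBranch class from the earlier sketch; the dummy tensors stand in for the feature maps produced by the two feature extraction networks.

```python
import torch

# dummy (B, 2048, 7, 7) feature maps standing in for the outputs of the
# foreground and background feature extraction networks
foreground_features = torch.randn(1, 2048, 7, 7)
background_features = torch.randn(1, 2048, 7, 7)

foreground_attention = SelfAttentionBranch()   # class from the sketch above
background_attention = SelfAttentionBranch()

self_attention_foreground = foreground_attention(foreground_features)  # (1, 2048)
self_attention_background = background_attention(background_features)  # (1, 2048)
```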
In one embodiment, the image scene recognition model includes a fusion output network; performing feature fusion on the self-attention background feature and the self-attention foreground feature to obtain a fusion feature, and performing scene recognition based on the fusion feature to obtain an image scene recognition result, includes:
splicing the self-attention background feature and the self-attention foreground feature through a fusion layer in a fusion output network to obtain a splicing feature; and inputting the splicing characteristics into a full-connection layer in the fusion output network for scene recognition to obtain an image scene recognition result.
The fusion output network is used for fusing features and identifying image scenes. And a fusion layer in the fusion output network is used for fusing the features, and a full connection layer in the fusion output network is used for carrying out image scene recognition and outputting an image scene recognition result.
Specifically, the server splices the self-attention background feature and the self-attention foreground feature end to end through a fusion layer in the fusion output network to obtain a splicing feature. The server then inputs the splicing feature into a full-connection layer in the fusion output network to perform multi-classification scene recognition, obtains a probability for each image scene category, and takes the scene category with the highest probability as the output image scene recognition result.
In a specific embodiment, the network structure of the fusion output network is as shown in table 3 below.
Table 3 network structure of the fusion output network
(The table content is provided as an image in the original publication.)
Where N represents the number of categories of the image scene.
In the embodiment, the self-attention background feature and the self-attention foreground feature are fused through the fusion output network, and then the fused features are used for multi-classification scene recognition, so that the accuracy of image scene recognition is improved.
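Under the same PyTorch assumption, the fusion output network can be sketched as an end-to-end splice of the two 2048-dimensional self-attention features followed by a fully connected layer with N outputs, one per scene category (compare Table 3). The class name and the default sizes are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class FusionOutput(nn.Module):
    """Illustrative fusion output network: splice features, then classify."""

    def __init__(self, feat_dim: int = 2048, num_classes: int = 10):
        super().__init__()
        # fully connected layer over the spliced 2 x 2048 = 4096-dimensional feature
        self.classifier = nn.Linear(feat_dim * 2, num_classes)

    def forward(self, fg_feat: torch.Tensor, bg_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([bg_feat, fg_feat], dim=1)   # fusion layer: end-to-end splice
        return self.classifier(fused)                  # one score per image scene category
```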
In an embodiment, as shown in fig. 9, an image scene recognition model training method is provided, which is described by taking its application to the server in fig. 1 as an example. It is understood that the method can also be applied to a terminal, or to a system including the terminal and the server and implemented through interaction between the terminal and the server. The method includes the following steps:
step 902, acquiring a training image and a corresponding training scene label, and inputting the training image to an initial image scene recognition model.
The training images are images with training scene labels during training, and the training scene labels are labels of specific scene categories corresponding to the training images. The initial image scene recognition model refers to an image scene recognition model with initialized model parameters.
Specifically, the server may directly obtain the training images and the corresponding training scene labels from a database, may collect them from the internet, or may obtain them from a service provider providing data services. The training image is input into the initial image scene recognition model for image scene recognition. The initial image scene recognition model, whose model parameters are initialized, is established in advance; the initialization may be random initialization, zero initialization, Gaussian initialization, and the like. For example, the image scene recognition model may be initialized with a Gaussian distribution with a variance of 0.01 and a mean of 0. In one embodiment, the feature extraction parameters in the initial image scene recognition model may be pre-trained, and the other parameters may be initialized using a Gaussian distribution.
Step 904, the initial image scene recognition model extracts an initial foreground training area and an initial background training area in the training image, inputs the initial foreground area into an initial foreground branch network, and inputs the initial background area into an initial background branch network.
The initial foreground training area refers to a foreground area in a training image extracted by an initial image scene recognition model. The initial background training area refers to a background area in a training image extracted by an initial image scene recognition model. The initial foreground branched network refers to a foreground branched network with initialized parameters. The initial background branching network refers to a parameter initialized background branching network.
Specifically, the initial image scene recognition model in the server may extract initial image features in the training image through an initial image feature extraction network, extract an initial foreground training region and an initial background training region in the training image according to the initial image features, then input the initial foreground region into an initial foreground branch network, and simultaneously input the initial background region into an initial background branch network. The initial image feature extraction network is an image feature extraction network with initialized parameters and is used for extracting features of images. The initialization parameters of the initial image feature extraction network may also be obtained through pre-training.
Step 906, the initial foreground branch network performs self-attention weight calculation based on the initial foreground features corresponding to the initial foreground training area to obtain initial self-attention foreground weights, and adjusts the initial foreground features according to the initial self-attention foreground weights to obtain initial self-attention foreground features.
The initial foreground features refer to features corresponding to the extracted initial foreground training area. The initial self-attention foreground weight refers to a self-attention foreground weight corresponding to the initial foreground training area. The initial self-attention foreground feature refers to the self-attention foreground feature corresponding to the initial foreground training area.
Specifically, the initial foreground branch network inputs the initial foreground training area into the initial foreground feature extraction network for feature extraction, so as to obtain an initial foreground feature, where initialization parameters of the initial foreground feature extraction network may be obtained by pre-training. And then inputting the initial foreground features into an initial foreground attention feature extraction network for self-attention weight calculation to obtain initial self-attention foreground weights, and weighting the initial foreground features through the initial self-attention foreground weights to obtain the initial self-attention foreground features, wherein the initialization parameters of the initial foreground attention feature extraction network can be parameters obtained by using Gaussian distribution.
Step 908, the initial background branch network performs self-attention weight calculation based on the initial background features corresponding to the initial background training area to obtain initial self-attention background weights, and adjusts the initial background features according to the initial self-attention background weights to obtain initial self-attention background features.
The initial background feature refers to a background feature corresponding to the initial background training area. The initial self-attention background weight refers to the self-attention background weight corresponding to the initial background training area. The initial self-attention background feature refers to a self-attention background feature corresponding to the initial background training area.
Specifically, the initial background branch network inputs the initial background training area into an initial background area feature extraction network for feature extraction, so as to obtain an initial background feature, where the initial background area feature extraction network is a network for feature extraction of the initial background training area, and initialization parameters of the initial background area feature extraction network may be obtained by pre-training or by initialization. Inputting the initial background features into an initial background attention feature extraction network for self-attention weight calculation to obtain initial self-attention background weights, and weighting the initial background features through the initial self-attention background weights to obtain initial self-attention background features. Wherein the initialization parameter in the initial background attention feature extraction network may be a parameter obtained using a gaussian distribution.
Step 910, the initial image scene recognition model performs feature fusion on the initial self-attention background feature and the initial self-attention foreground feature to obtain an initial fusion feature, and performs scene recognition based on the initial fusion feature to obtain an initial image scene recognition result.
The initial fusion characteristic refers to a characteristic obtained after the initial image scene recognition model performs characteristic fusion, and the initial image scene recognition result refers to an image scene recognition result output by the initial image scene recognition model.
Specifically, the initial image scene recognition model performs head-to-tail splicing on initial self-attention background features and initial self-attention foreground features to obtain initial fusion features, and inputs the initial fusion features into an initial full-connection network to perform scene recognition to obtain an initial image scene recognition result. The initial fully-connected network is a multi-class fully-connected network, and initial parameters of the initial fully-connected network are obtained by initializing through Gaussian distribution.
Step 912, calculating loss information between the initial image scene recognition result and the training scene label, and updating the initial image scene recognition model based on the loss information.
Specifically, the server calculates an error between the initial image scene recognition result and the training scene label by using a loss function, and obtains loss information.
The server then calculates the gradient using the loss information and back-propagates it with a gradient descent algorithm to update the parameters in the initial image scene recognition model, obtaining an updated image scene recognition model.
Step 914, judging whether a training completion condition is reached; step 916 is executed when the training completion condition is reached, and when it is not reached, the process returns to step 902 for iterative execution, that is, returns to the step of inputting the training image into the initial image scene recognition model.
Step 916, obtaining the trained image scene recognition model.
The training completion condition refers to the condition for completing training of the image scene recognition model, and includes at least one of the following: the loss information obtained through training meets a preset loss threshold, the number of training iterations reaches the maximum number of iterations, or the model parameters no longer change significantly.
Specifically, the server judges whether the model training reaches a training completion condition, and when the training completion condition is not reached, the step of inputting the training image into the initial image scene recognition model is returned for iterative execution until the training completion condition is reached, and the image scene recognition model reaching the training completion condition is used as the image scene recognition model after the training is completed.
The training method of the image scene recognition model comprises the steps of inputting a training image into an initial image scene recognition model by obtaining the training image and a corresponding training scene label, extracting a self-attention foreground characteristic from the initial image scene recognition model through an initial foreground branch network, extracting a self-attention background characteristic from the initial background branch network, performing characteristic fusion on the initial self-attention background characteristic and the initial self-attention foreground characteristic to obtain an initial fusion characteristic, performing scene recognition based on the initial fusion characteristic to obtain an initial image scene recognition result, calculating the initial image scene recognition result and loss information of the training scene label, updating the initial image scene recognition model based on the loss information, and obtaining the trained image scene recognition model until a training completion condition is reached. The self-attention foreground features and the self-attention background features are respectively extracted through the foreground branch network and the background branch network, and then the initial self-attention background features and the initial self-attention foreground features are subjected to feature fusion to obtain an image scene recognition result through recognition, so that the accuracy of the image scene recognition can be improved through the trained image scene recognition model.
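An illustrative training loop for steps 902 to 916, assuming PyTorch, a cross entropy loss, and an SGD optimizer; `model` stands for the initial image scene recognition model and `loader` for a data loader yielding training images with their training scene labels. The fixed epoch count is a stand-in for the training completion condition described above.

```python
import torch
import torch.nn.functional as F

def train_scene_model(model, loader, epochs: int = 10, lr: float = 0.01):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):                         # stand-in for the training completion condition
        for images, labels in loader:               # training images and training scene labels
            logits = model(images)                  # initial image scene recognition result
            loss = F.cross_entropy(logits, labels)  # loss information against the label
            optimizer.zero_grad()
            loss.backward()                         # gradient computed from the loss
            optimizer.step()                        # update the model parameters
    return model
```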
In one embodiment, as shown in fig. 10, calculating the initial image scene recognition result and the loss information of the training scene label, updating the initial image scene recognition model based on the loss information, and returning to the step of inputting the training image into the initial image scene recognition model to be iteratively executed until a training completion condition is reached, so as to obtain a trained image scene recognition model, including:
Step 1002, calculating an error between the initial image scene recognition result and the training scene label by using a cross entropy loss function to obtain loss information;
Specifically, the error may be calculated using a cross entropy loss function, that is, the loss information may be calculated using equation (1) shown below:

$L = -\sum_{i} y_i \log \hat{y}_i$    (1)

where $L$ represents the loss information, $y$ represents the training scene label, i.e., the true scene category label of the image, and $\hat{y}$ represents the initial image scene recognition result, i.e., the predicted scene category.
Step 1004, when the loss information does not exceed a preset loss threshold, calculating a gradient based on the loss information, and updating the initial image scene recognition model by using the gradient to obtain an updated image scene recognition model.
The preset loss threshold refers to a preset threshold for the loss information. The gradient is a vector indicating the direction along which the directional derivative of a function at a point takes its maximum value, that is, the direction in which the function changes most rapidly at that point. The updated image scene recognition model is the image scene recognition model with updated parameters.
Specifically, when the server judges that the loss information does not exceed the preset loss threshold, it calculates the gradient based on the loss information and uses the gradient to update each parameter in the initial image scene recognition model, that is, the parameters of the initial background branch network, the initial foreground branch network, the initial fusion output network, the initial image feature extraction network, and so on, thereby obtaining the updated image scene recognition model.
Step 1006, taking the updated scene recognition model as the initial scene recognition model and returning to the step of inputting the training image into the initial image scene recognition model for iterative execution, until the loss information exceeds the preset loss threshold, and taking the initial image scene recognition model whose loss information exceeds the preset loss threshold as the trained image scene recognition model.
Specifically, the server takes the updated scene recognition model as the initial scene recognition model and returns to the step of inputting the training image into the initial image scene recognition model for iterative execution; when the loss information exceeds the preset loss threshold, the initial image scene recognition model at that point is taken as the trained image scene recognition model. Training the image scene recognition model with the cross entropy loss function gives the trained model better performance.
In one embodiment, the initial image scene recognition model comprises an initial image feature extraction network, an initial foreground feature extraction network and an initial background feature extraction network;
as shown in fig. 11, before step 902, that is, before acquiring the training images and the corresponding training scene labels, the method further includes:
Step 1102, acquiring a pre-training image and a pre-training scene label.
The pre-training image is an image used for pre-training. The pre-training scene label is a scene category label corresponding to a pre-training image during pre-training, and each pre-training image has a corresponding pre-training scene label which is a real scene category label.
Specifically, the server may acquire the stored pre-training images and pre-training scene labels from the database, or acquire the pre-training images from the internet, and then acquire the pre-training scene labels corresponding to the pre-training images. Training data, i.e., pre-training images and pre-training scene labels, may also be obtained from a service provider providing data services.
Step 1104, inputting the pre-training image into a pre-training scene recognition model, performing feature extraction on the pre-training image through a feature extraction network of the pre-training scene recognition model to obtain pre-training image features, and performing scene recognition based on the pre-training image features to obtain a pre-training image scene recognition result.
The pre-training scene recognition model is a scene recognition model during pre-training, the pre-training scene recognition model can be a model established by using a deep neural network, and model parameters of the pre-training scene recognition model are obtained by random initialization. The pre-training scene recognition model comprises a feature extraction network, and the feature extraction network is used for extracting features of the image. The pre-training image features refer to image features obtained by extracting pre-training images. The pre-training image scene recognition result refers to an image scene category obtained by prediction corresponding to the pre-training image.
Specifically, the server inputs a pre-training image into a pre-training scene recognition model, the pre-training scene recognition model performs feature extraction on the pre-training image through a feature extraction network to obtain pre-training image features, and scene recognition is performed through a full-connection network in the pre-training scene recognition model based on the pre-training image features to obtain a pre-training image scene recognition result.
Step 1106, calculating pre-training loss information based on the pre-training scene recognition result and the pre-training scene label, and updating the pre-training scene recognition model based on the pre-training loss information.
The pre-training loss information refers to loss information obtained in the pre-training process.
Specifically, the server may calculate pre-training loss information between the pre-training scene recognition result and the pre-training scene label using a classification loss function, and then reversely update the parameters in the pre-training scene recognition model with the pre-training loss information based on a gradient descent algorithm. The classification loss function may be a cross entropy loss function, an exponential loss function, a logistic (sigmoid) loss function, or the like.
Step 1108, judging whether the pre-training is finished, returning to the step of inputting the pre-training image into the pre-training scene recognition model for iterative execution when the pre-training is not finished, and executing step 1110 when the pre-training is finished.
Step 1110, obtaining an initial image feature extraction network, an initial foreground feature extraction network and an initial background feature extraction network in the initial image scene recognition model based on the pre-trained feature extraction network.
Specifically, the server determines whether pre-training is completed, that is, determines whether a pre-training completion condition is reached, where the pre-training completion condition includes at least one of a condition that pre-training loss information reaches a preset pre-training loss threshold, a condition that pre-training iteration times reaches a maximum iteration times, and a condition that model parameters obtained by pre-training do not significantly change. And when the pre-training is finished, obtaining a pre-training image scene recognition model after the pre-training is finished, and taking a feature extraction network in the pre-training image scene recognition model after the pre-training as an initial image feature extraction network, an initial foreground feature extraction network and an initial background feature extraction network in the initial image scene recognition model. The network structure and network parameters of the feature extraction network in the pre-training image scene recognition model are the same as those of the initial image feature extraction network, the network structure and network parameters of the feature extraction network in the pre-training image scene recognition model are the same as those of the initial foreground feature extraction network, and the network structure and network parameters of the feature extraction network in the pre-training image scene recognition model are the same as those of the initial background feature extraction network. And establishing an initial image scene recognition model by using an initial image feature extraction network, an initial foreground feature extraction network and an initial background feature extraction network, and then training the established initial image scene recognition model to obtain an image scene recognition model.
In the embodiment, the initial image feature extraction network, the initial foreground feature extraction network and the initial background feature extraction network are obtained through pre-training, and then the initial image scene recognition model is established for training to obtain the image scene recognition model, so that the convergence speed during training of the image scene recognition model can be increased, and the training efficiency and accuracy are improved. In a specific embodiment, the initial network parameters of the initial image feature extraction network, the initial foreground feature extraction network and the initial background feature extraction network may use parameters of ResNet101 pre-trained by the ImageNet data set, and the parameters of newly added networks, such as the self-attention feature extraction network and the fusion output network, are initialized with a gaussian distribution with a variance of 0.01 and a mean of 0.
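The following is a sketch of this initialisation scheme, assuming a recent torchvision for the ImageNet-pre-trained ResNet101 backbone; the function names and the exact layer split are assumptions rather than the patent's specification.

```python
import torch.nn as nn
import torchvision

def build_backbone() -> nn.Module:
    # ImageNet-pre-trained ResNet101; keep the convolutional stages and drop the
    # final pooling and classification layers so the output is a 2048-channel map
    resnet = torchvision.models.resnet101(weights="IMAGENET1K_V1")
    return nn.Sequential(*list(resnet.children())[:-2])

def init_new_layers(module: nn.Module) -> None:
    # Gaussian initialisation with mean 0 and variance 0.01 (std = 0.1) for the
    # newly added layers (self-attention feature extraction, fusion output)
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.normal_(module.weight, mean=0.0, std=0.1)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
```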
In a specific embodiment, as shown in fig. 12, there is provided an image scene recognition method, executed by a server, specifically including the following steps:
step 1202, the server acquires an image to be recognized from the terminal and inputs the image to be recognized into the image scene recognition model.
Step 1204, the image scene recognition model inputs the image to be recognized into an image feature extraction network for feature extraction, so as to obtain the image features to be recognized.
Step 1206, the image scene recognition model calculates a mean value corresponding to the feature values in the image features to be recognized, and performs binary division on the image features to be recognized based on the mean value to obtain a foreground mask. The product of the foreground mask and the pixel values of the image to be recognized is calculated to obtain the foreground region in the image to be recognized.
Step 1208, the image scene recognition model negates the foreground mask to obtain a background mask, and calculates the product of the background mask and the pixel values of the image to be recognized to obtain the background region. The foreground region is then input into the foreground branch network and the background region into the background branch network.
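A sketch of steps 1206 and 1208, again under the PyTorch assumption: the deep feature activations are thresholded at their mean to obtain the foreground mask, the mask is negated for the background mask, and each mask is multiplied with the image pixels. Summarising the channels by their mean and resizing the mask to the image resolution are assumed details not spelled out in the text.

```python
import torch
import torch.nn.functional as F

def split_regions(image: torch.Tensor, feature_map: torch.Tensor):
    # image: (3, H, W) image to be recognized; feature_map: (C, h, w) deep features of the image
    activation = feature_map.mean(dim=0, keepdim=True)   # (1, h, w) per-location activation (assumed channel mean)
    mask = (activation > activation.mean()).float()      # binary foreground mask from the mean threshold
    mask = F.interpolate(mask[None], size=image.shape[-2:], mode="nearest")[0]  # resize to image resolution
    foreground = image * mask                             # foreground mask x pixel values
    background = image * (1.0 - mask)                     # negated mask gives the background region
    return foreground, background
```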
Step 1210, the foreground branch network inputs the foreground region into the foreground feature extraction network for feature extraction to obtain foreground features corresponding to the foreground region, inputs the foreground features into the foreground attention weight feature network for attention weight calculation to obtain self-attention foreground weights, and weights the foreground features by using the self-attention foreground weights to obtain the self-attention foreground features.
Step 1212, the background branch network inputs the background region into the background feature extraction network for feature extraction to obtain a background feature corresponding to the background region, inputs the background feature into the background attention weight feature network for attention weight calculation to obtain a self-attention background weight, and weights the background feature by using the self-attention background weight to obtain the self-attention background feature.
Step 1214, the image scene recognition model splices the self-attention background feature and the self-attention foreground feature through the fusion layer in the fusion output network to obtain a splicing feature, and inputs the splicing feature into the full-connection layer in the fusion output network to perform scene recognition, obtaining an image scene recognition result; the server returns the image scene recognition result to the terminal for display.
In a specific embodiment, as shown in fig. 13, an architecture diagram of an image scene recognition model is provided, where the image scene recognition model is a two-branch self-attention recognition model, specifically:
the method comprises the steps of obtaining an image to be identified, inputting the image into an image scene identification model, carrying out image depth feature extraction on the image scene identification model to obtain image depth features, and then carrying out foreground extraction and background extraction by using the image depth features to obtain a foreground image and a background image. And then, respectively extracting the foreground image and the background image through a double-branch network to obtain foreground image characteristics and background image characteristics. And based on the foreground image features, extracting a self-attention weight 2 through a self-attention network, and weighting the foreground image features by using the self-attention weight 2 to obtain foreground classification features. And extracting a self-attention weight 1 through a self-attention network based on the background image features, and weighting the background image features by using the self-attention weight 1 to obtain background classification features. And then carrying out feature fusion on the background classification features and the foreground classification features in an end-to-end manner, and then carrying out image scene recognition through the fused features to obtain an image scene recognition result, namely the scene classification result.
The application also provides an application scene, and the application of the image scene identification method in the application scene is as follows:
as shown in fig. 14, it is a schematic view of an application scene of image scene recognition, specifically, a user inputs a picture to be recognized through a terminal a, for example, the picture may be the picture shown in fig. 15. The terminal a uploads the picture input by the user to the server, the server is deployed with an image scene recognition model, the image scene recognition model performs image scene recognition on the picture input by the user to obtain an image scene recognition result, for example, the recognition result in fig. 15 may be a seaside scene, and then the image scene recognition result is sent to the terminal B for display.
It should be understood that although the various steps in the flow charts of figs. 2-12 are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in figs. 2-12 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some sub-steps or stages of other steps.
In one embodiment, as shown in fig. 16, there is provided an image scene recognition apparatus 1600, which may be a part of a computer device by using a software module or a hardware module, or a combination of the two, and specifically includes: an image acquisition module 1602, a region extraction module 1604, a foreground feature extraction module 1606, a background feature extraction module 1608, and a scene recognition module 1610, wherein:
an image obtaining module 1602, configured to obtain an image to be identified;
the region extraction module 1604 is configured to extract a foreground region and a background region in the image to be identified;
a foreground feature extraction module 1606, configured to perform self-attention weight calculation based on foreground features corresponding to the foreground region to obtain a self-attention foreground weight, and adjust the foreground features according to the self-attention foreground weight to obtain self-attention foreground features;
a background feature extraction module 1608, configured to perform self-attention weight calculation based on the background features corresponding to the background region to obtain a self-attention background weight, and adjust the background features according to the self-attention background weight to obtain self-attention background features;
the scene recognition module 1610 is configured to perform feature fusion on the self-attention background feature and the self-attention foreground feature to obtain a fusion feature, perform scene recognition based on the fusion feature, and obtain an image scene recognition result corresponding to the image to be recognized.
In one embodiment, an image scene recognition apparatus 1600 includes:
the image input module is used for inputting an image to be recognized into the image scene recognition model;
the branch input module is used for extracting a foreground area and a background area in an image to be identified by the image scene identification model, inputting the foreground area into a foreground branch network, and inputting the background area into a background branch network;
the foreground identification module is used for extracting foreground features corresponding to the foreground area by the foreground branch network, performing self-attention weight calculation by using the foreground features to obtain self-attention foreground weights, and weighting the foreground features by the self-attention foreground weights to obtain the self-attention foreground features;
the background recognition module is used for extracting background features corresponding to the background area by the background branch network, performing self-attention weight calculation by using the background features to obtain self-attention background weights, and weighting the background features by using the self-attention background weights to obtain the self-attention background features;
and the image recognition module is used for carrying out feature fusion on the self-attention background feature and the self-attention foreground feature by the image scene recognition model to obtain a fusion feature, and carrying out scene recognition based on the fusion feature to obtain an image scene recognition result.
In one embodiment, the image scene recognition model includes an image feature extraction network; the branch input module is also used for inputting the image to be identified into an image feature extraction network for feature extraction to obtain the image features to be identified; and carrying out region division based on the image features to be identified to obtain a foreground region and a background region.
In one embodiment, the branch input module is further configured to calculate a mean value corresponding to the feature values in the image features to be identified; perform binary division on the image features to be identified based on the mean value to obtain a foreground mask; calculate the product of the foreground mask and the pixel values of the image to be identified to obtain the foreground region in the image to be identified; and negate the foreground mask to obtain a background mask and calculate the product of the background mask and the pixel values of the image to be identified to obtain the background region.
In one embodiment, the foreground branching network comprises a foreground feature extraction network and a foreground attention feature extraction network; the foreground identification module is also used for inputting the foreground area into a foreground feature extraction network for feature extraction to obtain foreground features corresponding to the foreground area; and inputting the foreground features into a foreground attention weight feature network for attention weight calculation to obtain self-attention foreground weights, and weighting the foreground features by using the self-attention foreground weights to obtain the self-attention foreground features.
In one embodiment, the foreground identification module is further configured to perform mean pooling on the foreground features through a mean pooling layer in the foreground attention feature extraction network to obtain foreground pooling features; perform nonlinear compression on the foreground pooling features by using a nonlinear compression layer in the foreground attention feature extraction network to obtain foreground compression features; activate the foreground compression features through an activation function layer in the foreground attention feature extraction network to obtain foreground activation features; perform weight mapping on the foreground activation features through a weight mapping layer in the foreground attention feature extraction network to obtain the self-attention foreground weight; and weight the feature values in the foreground features by using the self-attention foreground weight to obtain weighted foreground features, and perform maximum pooling on the weighted foreground features through a maximum pooling layer in the foreground attention feature extraction network to obtain the self-attention foreground features.
In one embodiment, the background branching network comprises a background feature extraction network and a background attention weight extraction network; the background identification module is also used for inputting the background area into a background feature extraction network for feature extraction to obtain background features corresponding to the background area; and inputting the background features into a background attention weight feature network to calculate attention weights to obtain self-attention background weights, and weighting the background features by using the self-attention background weights to obtain the self-attention background features.
In one embodiment, the background identification module is further configured to perform mean pooling on the background features through a mean pooling layer in the background attention feature extraction network to obtain background pooling features; perform nonlinear compression on the background pooling features by using a nonlinear compression layer in the background attention feature extraction network to obtain background compression features; activate the background compression features through an activation function layer in the background attention feature extraction network to obtain background activation features; perform weight mapping on the background activation features through a weight mapping layer in the background attention feature extraction network to obtain the self-attention background weight; and weight the feature values in the background features by using the self-attention background weight to obtain weighted background features, and perform maximum pooling on the weighted background features through a maximum pooling layer in the background attention feature extraction network to obtain the self-attention background features.
In one embodiment, the image scene recognition model includes a fused output network; the image identification module is also used for splicing the self-attention background feature and the self-attention foreground feature through a fusion layer in the fusion output network to obtain a splicing feature; and inputting the splicing characteristics into a full-connection layer in the fusion output network for scene recognition to obtain an image scene recognition result.
In one embodiment, as shown in fig. 17, an image scene recognition model training apparatus 1700 is provided, which may be a part of a computer device by using a software module or a hardware module, or a combination of the two modules, and specifically includes: a training data obtaining module 1702, a model processing module 1704, a foreground network processing module 1706, a background network processing module 1708, a model identifying module 1710, and an iterating module 1712, wherein:
a training data obtaining module 1702, configured to obtain a training image and a corresponding training scene label, and input the training image to the initial image scene recognition model;
a model processing module 1704, configured to extract an initial foreground training region and an initial background training region in a training image by using an initial image scene recognition model, input the initial foreground region into an initial foreground branch network, and input the initial background region into an initial background branch network;
a foreground network processing module 1706, configured to perform self-attention weight calculation on the basis of an initial foreground feature corresponding to the initial foreground training area by using the initial foreground branch network to obtain an initial self-attention foreground weight, and adjust the initial foreground feature by using the initial self-attention foreground weight to obtain an initial self-attention foreground feature;
a background network processing module 1708, configured to perform self-attention weight calculation on the basis of an initial background feature corresponding to the initial background training area by using the initial background branch network to obtain an initial self-attention background weight, and adjust the initial background feature by using the initial self-attention background weight to obtain an initial self-attention background feature;
a model identification module 1710, configured to perform feature fusion on the initial self-attention background feature and the initial self-attention foreground feature by using the initial image scene identification model to obtain an initial fusion feature, and perform scene identification based on the initial fusion feature to obtain an initial image scene identification result;
and the iteration module 1712 is configured to calculate an initial image scene recognition result and loss information of the training scene labels, update the initial image scene recognition model based on the loss information, and perform iteration by returning to the step of inputting the training image into the initial image scene recognition model until a training completion condition is met, so as to obtain a trained image scene recognition model.
In one embodiment, the iteration module 1712 is further configured to calculate an error between the initial image scene recognition result and the training scene label using a cross entropy loss function, to obtain loss information; when the loss information does not exceed a preset loss threshold value, calculating a gradient based on the loss information, and updating the initial image scene identification model by using the gradient to obtain an updated image scene identification model; and taking the updated scene recognition model as an initial scene recognition model, returning the step of inputting the training image into the initial image scene recognition model for iterative execution, and taking the initial image scene recognition model exceeding the preset loss threshold value as the trained image scene recognition model when the loss information exceeds the preset loss threshold value.
In one embodiment, the initial image scene recognition model comprises an initial image feature extraction network, an initial foreground feature extraction network and an initial background feature extraction network; the image scene recognition model training apparatus 1700 further includes:
the pre-training module is used for acquiring a pre-training image and a pre-training scene label; inputting a pre-training image into a pre-training scene recognition model, performing feature extraction on the pre-training image through a feature extraction network by the pre-training scene recognition model to obtain pre-training image features, and performing scene recognition based on the pre-training image features to obtain a pre-training image scene recognition result; calculating pre-training loss information based on a pre-training scene recognition result and a pre-training scene label, updating a pre-training scene recognition model based on the pre-training loss information, and returning to the step of inputting a pre-training image into the pre-training scene recognition model for iterative execution until pre-training is completed, and obtaining an initial image feature extraction network, an initial foreground feature extraction network and an initial background feature extraction network in the initial image scene recognition model based on a pre-training completed feature extraction network.
For specific limitations of the image scene recognition apparatus and the image scene recognition model training apparatus, reference may be made to the above limitations of the image scene recognition method and the image scene recognition model training method, which are not repeated here. Each module in the image scene recognition apparatus and the image scene recognition model training apparatus can be realized wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or independent of, a processor in the computer device, or stored in software form in a memory in the computer device, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 18. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing training images and image data to be recognized. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an image scene recognition method and an image scene recognition model training method.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 19. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement an image scene recognition method and an image scene recognition model training method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the configurations shown in fig. 18 and 19 are only block diagrams of partial configurations relevant to the present application and do not constitute a limitation on the computer device to which the present application is applied; a particular computer device may include more or fewer components than shown in the drawings, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the above method embodiments when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the above method embodiments.
In one embodiment, a computer program product or computer program is provided, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps in the above method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination contains no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but this should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. An image scene recognition method, characterized in that the method comprises:
acquiring an image to be identified;
extracting a foreground area and a background area in the image to be identified;
performing self-attention weight calculation based on the foreground features corresponding to the foreground regions to obtain self-attention foreground weights, and adjusting the foreground features according to the self-attention foreground weights to obtain self-attention foreground features;
performing self-attention weight calculation based on the background features corresponding to the background area to obtain a self-attention background weight, and adjusting the background features according to the self-attention background weight to obtain self-attention background features;
and performing feature fusion on the self-attention background feature and the self-attention foreground feature to obtain a fusion feature, and performing scene recognition based on the fusion feature to obtain an image scene recognition result corresponding to the image to be recognized.
2. The method of claim 1, comprising:
inputting the image to be recognized into an image scene recognition model;
the image scene recognition model is used for extracting a foreground region and a background region in the image to be recognized, inputting the foreground region into a foreground branch network, and inputting the background region into a background branch network;
the foreground branch network is used for extracting foreground features corresponding to the foreground area, performing self-attention weight calculation by using the foreground features to obtain self-attention foreground weights, and weighting the foreground features by using the self-attention foreground weights to obtain the self-attention foreground features;
the background branch network is used for extracting background features corresponding to the background area, performing self-attention weight calculation by using the background features to obtain self-attention background weights, and weighting the background features by using the self-attention background weights to obtain self-attention background features;
the image scene recognition model is further used for carrying out feature fusion on the self-attention background feature and the self-attention foreground feature to obtain fusion features, and carrying out scene recognition based on the fusion features to obtain an image scene recognition result.
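As a rough, non-authoritative sketch of how the model of claim 2 could be wired in PyTorch, the snippet below delegates region extraction to a callable and fuses the two branch outputs by concatenation; SceneRecognitionModel, region_splitter, feat_dim and the branch modules are illustrative names and choices (region splitting is sketched under claim 3 and the branch attention block under claim 5).

```python
import torch
import torch.nn as nn


class SceneRecognitionModel(nn.Module):
    """Two-branch scene recognition model (illustrative wiring of claim 2)."""

    def __init__(self, region_splitter, foreground_branch, background_branch, feat_dim, num_scenes):
        super().__init__()
        self.region_splitter = region_splitter         # e.g. the split_regions sketch under claim 3
        self.foreground_branch = foreground_branch     # foreground branch network
        self.background_branch = background_branch     # background branch network
        self.fc = nn.Linear(2 * feat_dim, num_scenes)  # fusion output network (see claim 8)

    def forward(self, image):
        fg_region, bg_region = self.region_splitter(image)
        fg = self.foreground_branch(fg_region)         # self-attention foreground features
        bg = self.background_branch(bg_region)         # self-attention background features
        fused = torch.cat([fg, bg], dim=1)             # feature fusion by splicing (concatenation)
        return self.fc(fused)                          # image scene recognition result
```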
3. The method of claim 2, wherein the image scene recognition model comprises an image feature extraction network;
the extracting of the foreground region and the background region in the image to be recognized includes:
inputting the image to be recognized into the image feature extraction network for feature extraction to obtain the image feature to be recognized;
and carrying out region division based on the image features to be identified to obtain a foreground region and a background region.
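Claim 3 does not fix the division rule; one plausible reading, consistent with threshold-based foreground/background segmentation, is to threshold a saliency map derived from the image features. The function below is an assumed example only: split_regions, the channel-mean saliency and the 0.5 threshold are illustrative, and feature_map is the spatial feature map produced by the image feature extraction network before any global pooling.

```python
import torch.nn.functional as F


def split_regions(image, feature_map, threshold=0.5):
    """image: (N, 3, H, W); feature_map: (N, C, h, w) spatial output of the image feature extraction network."""
    # Channel-wise mean as a coarse saliency map, normalised to [0, 1] per image.
    saliency = feature_map.mean(dim=1, keepdim=True)
    lo = saliency.amin(dim=(2, 3), keepdim=True)
    hi = saliency.amax(dim=(2, 3), keepdim=True)
    saliency = (saliency - lo) / (hi - lo + 1e-6)
    # Upsample to image resolution and threshold into a binary foreground mask.
    mask = F.interpolate(saliency, size=image.shape[-2:], mode="bilinear", align_corners=False)
    mask = (mask > threshold).float()
    foreground_region = image * mask           # salient pixels form the foreground region
    background_region = image * (1.0 - mask)   # the remainder forms the background region
    return foreground_region, background_region
```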
4. The method of claim 2, wherein the foreground branch network comprises a foreground feature extraction network and a foreground attention feature extraction network;
the performing self-attention weight calculation based on the foreground features corresponding to the foreground regions to obtain self-attention foreground weights, and adjusting the foreground features according to the self-attention foreground weights to obtain the self-attention foreground features comprises:
inputting the foreground area into the foreground feature extraction network for feature extraction to obtain foreground features corresponding to the foreground area;
and inputting the foreground features into the foreground attention feature extraction network for attention weight calculation to obtain the self-attention foreground weights, and weighting the foreground features by using the self-attention foreground weights to obtain the self-attention foreground features.
5. The method of claim 4, wherein the inputting the foreground features into the foreground attention feature extraction network for attention weight calculation to obtain the self-attention foreground weight, and weighting the foreground features with the self-attention foreground weight to obtain the self-attention foreground features comprises:
performing mean pooling on the foreground features through a mean pooling layer in the foreground attention feature extraction network to obtain foreground pooling features;
performing nonlinear compression on the foreground pooling features by using a nonlinear compression layer in the foreground attention feature extraction network to obtain foreground compression features;
activating the foreground compression features through an activation function layer in the foreground attention feature extraction network to obtain foreground activation features;
performing weight mapping on the foreground activation features based on a weight mapping layer in the foreground attention feature extraction network to obtain the self-attention foreground weight;
weighting the feature values in the foreground features by using the self-attention foreground weight to obtain weighted foreground features, and performing maximum pooling on the weighted foreground features through a maximum pooling layer in the foreground attention feature extraction network to obtain the self-attention foreground features.
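The sequence in claim 5 (mean pooling, nonlinear compression, activation, weight mapping, weighting, maximum pooling) can be read as a squeeze-and-excitation style block followed by global max pooling. The sketch below assumes that reading; the reduction ratio and the use of fully-connected layers for the compression and weight-mapping steps are illustrative choices. Since claim 7 describes the identical sequence for the background features, the same block could serve as the attention part of either branch, composed with a convolutional feature extraction network in each branch of the claim-2 sketch above.

```python
import torch.nn as nn


class AttentionFeatureExtractor(nn.Module):
    """Attention feature extraction network of claim 5 (and claim 7), read as an SE-style block."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mean_pool = nn.AdaptiveAvgPool2d(1)                    # mean pooling layer
        self.compress = nn.Linear(channels, channels // reduction)  # nonlinear compression layer
        self.activate = nn.ReLU(inplace=True)                       # activation function layer
        self.weight_map = nn.Sequential(                            # weight mapping layer
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        self.max_pool = nn.AdaptiveMaxPool2d(1)                     # maximum pooling layer

    def forward(self, features):
        n, c, _, _ = features.shape
        pooled = self.mean_pool(features).view(n, c)                # pooling features
        compressed = self.compress(pooled)                          # compression features
        activated = self.activate(compressed)                       # activation features
        weights = self.weight_map(activated).view(n, c, 1, 1)       # self-attention weights
        weighted = features * weights                               # weighted features
        return self.max_pool(weighted).view(n, c)                   # self-attention features
```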
6. The method of claim 2, wherein the background branch network comprises a background feature extraction network and a background attention feature extraction network;
the performing self-attention weight calculation based on the background features corresponding to the background area to obtain a self-attention background weight, and adjusting the background features according to the self-attention background weight to obtain self-attention background features comprises:
inputting the background area into the background feature extraction network for feature extraction to obtain a background feature corresponding to the background area;
inputting the background features into the background attention feature extraction network for attention weight calculation to obtain the self-attention background weight, and weighting the background features by using the self-attention background weight to obtain the self-attention background features.
7. The method of claim 6, wherein the inputting the background feature into the background attention feature extraction network for attention weight calculation to obtain the self-attention background weight, and weighting the background feature using the self-attention background weight to obtain the self-attention background feature comprises:
performing mean pooling on the background features through a mean pooling layer in the background attention feature extraction network to obtain background pooling features;
performing nonlinear compression on the background pooling features by using a nonlinear compression layer in the background attention feature extraction network to obtain background compression features;
activating the background compression features through an activation function layer in the background attention feature extraction network to obtain background activation features;
carrying out weight mapping on the background activation features based on a weight mapping layer in the background attention feature extraction network to obtain the self-attention background weight;
weighting the feature values in the background features by using the self-attention background weight to obtain weighted background features, and performing maximum pooling through a maximum pooling layer in the background attention feature extraction network on the basis of the weighted background features to obtain the self-attention background features.
8. The method of claim 2, wherein the image scene recognition model comprises a fused output network;
the performing feature fusion on the self-attention background feature and the self-attention foreground feature to obtain a fusion feature, and performing scene recognition based on the fusion feature to obtain an image scene recognition result corresponding to the image to be recognized comprises:
splicing the self-attention background feature and the self-attention foreground feature through a fusion layer in the fusion output network to obtain a splicing feature;
and inputting the splicing feature into a fully-connected layer in the fusion output network for scene recognition to obtain the image scene recognition result.
9. An image scene recognition model training method is characterized by comprising the following steps:
acquiring a training image and a corresponding training scene label, and inputting the training image into an initial image scene recognition model;
the initial image scene recognition model extracts an initial foreground training area and an initial background training area in the training image, inputs the initial foreground training area into an initial foreground branch network, and inputs the initial background training area into an initial background branch network;
the initial foreground branch network carries out self-attention weight calculation based on initial foreground features corresponding to the initial foreground training area to obtain initial self-attention foreground weights, and the initial foreground features are adjusted through the initial self-attention foreground weights to obtain initial self-attention foreground features;
the initial background branch network carries out self-attention weight calculation based on the initial background features corresponding to the initial background training area to obtain initial self-attention background weights, and the initial background features are adjusted through the initial self-attention background weights to obtain initial self-attention background features;
the initial image scene recognition model performs feature fusion on the initial self-attention background features and the initial self-attention foreground features to obtain initial fusion features, and performs scene recognition based on the initial fusion features to obtain initial image scene recognition results;
and calculating loss information between the initial image scene recognition result and the training scene label, updating the initial image scene recognition model based on the loss information, and returning to the step of inputting the training image into the initial image scene recognition model for iterative execution until a training completion condition is reached, so as to obtain a trained image scene recognition model.
10. The method of claim 9, wherein the calculating loss information between the initial image scene recognition result and the training scene label, updating the initial image scene recognition model based on the loss information, and returning to the step of inputting the training image into the initial image scene recognition model for iterative execution until a training completion condition is reached to obtain a trained image scene recognition model comprises:
calculating an error between the initial image scene recognition result and the training scene label by using a cross entropy loss function to obtain loss information;
when the loss information exceeds a preset loss threshold value, calculating a gradient based on the loss information, and updating the initial image scene recognition model by using the gradient to obtain an updated image scene recognition model;
and taking the updated image scene recognition model as the initial image scene recognition model, returning to the step of inputting the training image into the initial image scene recognition model for iterative execution, and taking the current initial image scene recognition model as the trained image scene recognition model when the loss information does not exceed the preset loss threshold value.
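A minimal training-loop sketch for claim 10 follows, assuming training continues while the average cross-entropy loss still exceeds the preset threshold and stops once it no longer does; the optimiser, batching, epoch cap and the concrete threshold value are assumptions, and the model is the SceneRecognitionModel sketched under claim 2 above.

```python
import torch
import torch.nn as nn


def train_scene_model(model, loader, loss_threshold=0.05, lr=1e-3, max_epochs=100, device="cpu"):
    criterion = nn.CrossEntropyLoss()                       # cross entropy loss function
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.to(device).train()
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for images, labels in loader:                       # training images and training scene labels
            logits = model(images.to(device))               # initial image scene recognition result
            loss = criterion(logits, labels.to(device))     # loss information
            optimizer.zero_grad()
            loss.backward()                                 # gradient based on the loss information
            optimizer.step()                                # update the model using the gradient
            epoch_loss += loss.item()
        if epoch_loss / len(loader) <= loss_threshold:      # training completion condition reached
            break
    return model                                            # trained image scene recognition model
```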
11. The method of claim 9, wherein the initial image scene recognition model comprises an initial image feature extraction network, an initial foreground feature extraction network, and an initial background feature extraction network;
before the acquiring the training images and the corresponding training scene labels, further comprising:
acquiring a pre-training image and a pre-training scene label;
inputting the pre-training image into a pre-training scene recognition model, performing feature extraction on the pre-training image through a pre-training feature extraction network by the pre-training scene recognition model to obtain pre-training image features, and performing scene recognition based on the pre-training image features to obtain a pre-training image scene recognition result;
calculating pre-training loss information based on the pre-training scene recognition result and the pre-training scene label, updating the pre-training scene recognition model based on the pre-training loss information, and returning to the step of inputting the pre-training image into the pre-training scene recognition model for iterative execution until pre-training is completed; and obtaining the initial image feature extraction network, the initial foreground feature extraction network and the initial background feature extraction network in the initial image scene recognition model based on the pre-training feature extraction network obtained when pre-training is completed.
12. An image scene recognition apparatus, characterized in that the apparatus comprises:
the image acquisition module is used for acquiring an image to be identified;
the region extraction module is used for extracting a foreground region and a background region in the image to be identified;
the foreground feature extraction module is used for performing self-attention weight calculation on the basis of foreground features corresponding to the foreground region to obtain self-attention foreground weights, and adjusting the foreground features according to the self-attention foreground weights to obtain self-attention foreground features;
the background feature extraction module is used for performing self-attention weight calculation based on the background features corresponding to the background area to obtain a self-attention background weight, and adjusting the background features according to the self-attention background weight to obtain self-attention background features;
and the scene recognition module is used for performing feature fusion on the self-attention background feature and the self-attention foreground feature to obtain a fusion feature, and performing scene recognition based on the fusion feature to obtain an image scene recognition result corresponding to the image to be recognized.
13. An image scene recognition model training apparatus, characterized in that the apparatus comprises:
the training data acquisition module is used for acquiring a training image and a corresponding training scene label and inputting the training image into an initial image scene recognition model;
the model processing module is used for extracting an initial foreground training area and an initial background training area in the training image through the initial image scene recognition model, inputting the initial foreground training area into an initial foreground branch network, and inputting the initial background training area into an initial background branch network;
a foreground network processing module, configured to perform self-attention weight calculation on the basis of an initial foreground feature corresponding to the initial foreground training region by using the initial foreground branch network to obtain an initial self-attention foreground weight, and adjust the initial foreground feature by using the initial self-attention foreground weight to obtain an initial self-attention foreground feature;
a background network processing module, configured to perform self-attention weight calculation on the basis of an initial background feature corresponding to the initial background training area by using the initial background branch network to obtain an initial self-attention background weight, and adjust the initial background feature by using the initial self-attention background weight to obtain an initial self-attention background feature;
the model identification module is used for carrying out feature fusion on the initial self-attention background feature and the initial self-attention foreground feature by the initial image scene identification model to obtain an initial fusion feature, and carrying out scene identification based on the initial fusion feature to obtain an initial image scene identification result;
and the iteration module is used for calculating loss information between the initial image scene recognition result and the training scene label, updating the initial image scene recognition model based on the loss information, and returning to the step of inputting the training image into the initial image scene recognition model for iterative execution until a training completion condition is reached, so as to obtain a trained image scene recognition model.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 11 when executing the computer program.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 11.
CN202110255557.XA 2021-03-09 2021-03-09 Image scene recognition and model training method and device and computer equipment Pending CN113706550A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110255557.XA CN113706550A (en) 2021-03-09 2021-03-09 Image scene recognition and model training method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110255557.XA CN113706550A (en) 2021-03-09 2021-03-09 Image scene recognition and model training method and device and computer equipment

Publications (1)

Publication Number Publication Date
CN113706550A true CN113706550A (en) 2021-11-26

Family

ID=78647775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110255557.XA Pending CN113706550A (en) 2021-03-09 2021-03-09 Image scene recognition and model training method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN113706550A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023138188A1 (en) * 2022-01-24 2023-07-27 腾讯科技(深圳)有限公司 Feature fusion model training method and apparatus, sample retrieval method and apparatus, and computer device
CN116229379A (en) * 2023-05-06 2023-06-06 浙江大华技术股份有限公司 Road attribute identification method and device, electronic equipment and storage medium
CN116229379B (en) * 2023-05-06 2024-02-02 浙江大华技术股份有限公司 Road attribute identification method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111738231B (en) Target object detection method and device, computer equipment and storage medium
CN111754396B (en) Face image processing method, device, computer equipment and storage medium
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
CN111444744A (en) Living body detection method, living body detection device, and storage medium
CN110765882B (en) Video tag determination method, device, server and storage medium
CN106991364B (en) Face recognition processing method and device and mobile terminal
CN112132099A (en) Identity recognition method, palm print key point detection model training method and device
CN113705290A (en) Image processing method, image processing device, computer equipment and storage medium
CN113570684A (en) Image processing method, image processing device, computer equipment and storage medium
CN112446322B (en) Eyeball characteristic detection method, device, equipment and computer readable storage medium
CN111739027A (en) Image processing method, device and equipment and readable storage medium
CN110866469A (en) Human face facial features recognition method, device, equipment and medium
CN113706550A (en) Image scene recognition and model training method and device and computer equipment
WO2023279799A1 (en) Object identification method and apparatus, and electronic system
CN112101195A (en) Crowd density estimation method and device, computer equipment and storage medium
CN112052771A (en) Object re-identification method and device
CN112818995A (en) Image classification method and device, electronic equipment and storage medium
CN114943937A (en) Pedestrian re-identification method and device, storage medium and electronic equipment
CN113033507B (en) Scene recognition method and device, computer equipment and storage medium
CN111833360A (en) Image processing method, device, equipment and computer readable storage medium
CN114519863A (en) Human body weight recognition method, human body weight recognition apparatus, computer device, and medium
CN114764870A (en) Object positioning model processing method, object positioning device and computer equipment
CN111626212B (en) Method and device for identifying object in picture, storage medium and electronic device
CN115115552B (en) Image correction model training method, image correction device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination