CN113762382B - Model training and scene recognition method, device, equipment and medium - Google Patents

Model training and scene recognition method, device, equipment and medium

Info

Publication number
CN113762382B
CN113762382B
Authority
CN
China
Prior art keywords
scene
sample
image
determining
classification layer
Prior art date
Legal status
Active
Application number
CN202111043952.8A
Other languages
Chinese (zh)
Other versions
CN113762382A (en)
Inventor
温偲
项伟
陈德健
Current Assignee
Bigo Technology Singapore Pte Ltd
Original Assignee
Bigo Technology Singapore Pte Ltd
Priority date
Application filed by Bigo Technology Singapore Pte Ltd
Priority to CN202111043952.8A
Publication of CN113762382A
Application granted
Publication of CN113762382B
Legal status: Active

Classifications

    • G06F18/2155 Pattern recognition: generating training patterns; bootstrap methods characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G06F18/24 Pattern recognition: classification techniques
    • G06F18/2431 Pattern recognition: classification techniques relating to multiple classes
    • G06F18/253 Pattern recognition: fusion techniques of extracted features
    • G06N3/045 Neural networks: architectures, combinations of networks
    • G06N3/084 Neural networks: learning methods, backpropagation, e.g. using gradient descent
    • G06N3/088 Neural networks: learning methods, non-supervised learning, e.g. competitive learning

Abstract

The invention discloses a model training and scene recognition method, apparatus, device, and medium for accurately recognizing the scene to which an image belongs. When the scene to which an image to be recognized belongs is determined through the pre-trained scene recognition model, each classification layer in the model determines the scene to which the image belongs, among the scenes of the level corresponding to that classification layer, from the acquired feature map and the fusion feature vector output by the previous classification layer. The scene to which the image belongs is thus determined accurately according to the correlations among scenes of different levels, which improves the accuracy of scene recognition.

Description

Model training and scene recognition method, device, equipment and medium
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to a method, an apparatus, a device, and a medium for training a model and identifying a scene.
Background
With the development of information technology, users increasingly convey information through images, such as images corresponding to video frames and photos taken by the users themselves. These images can involve a wide variety of scenes, such as food scenes, portrait scenes, landscape scenes, and cartoon scenes. Accurately identifying the scene to which an image belongs is of great significance in fields such as image content analysis and image retrieval.
How to accurately identify the scene to which an image belongs has therefore drawn increasing attention in recent years.
Disclosure of Invention
Embodiments of the invention provide a model training and scene recognition method, apparatus, device, and medium for accurately recognizing the scene to which an image belongs.
The embodiment of the invention provides a scene recognition method, which comprises the following steps:
acquiring a feature map of an image to be recognized through a feature extraction layer in a pre-trained scene recognition model;
for each of at least two first classification layers in the scene recognition model: determining, through a first sub-network in the first classification layer, a first feature vector based on the feature map and the fusion feature vector output by the previous first classification layer; determining, through a second sub-network in the first classification layer, a second feature vector corresponding to the first feature vector; determining a fusion feature vector corresponding to the first classification layer based on the first feature vector and the second feature vector and outputting it to the next first classification layer; and determining, based on the second feature vector, the scene to which the image to be recognized belongs at the level corresponding to the first classification layer, wherein the scenes contained in different levels are different.
An embodiment of the invention provides a scene recognition apparatus, comprising:
an acquisition unit configured to acquire an image to be recognized;
a processing unit configured to acquire a feature map of the image to be recognized through a feature extraction layer in a pre-trained scene recognition model, and, for each of at least two first classification layers in the scene recognition model: determine, through a first sub-network in the first classification layer, a first feature vector based on the feature map and the fusion feature vector output by the previous first classification layer; determine, through a second sub-network in the first classification layer, a second feature vector corresponding to the first feature vector; determine a fusion feature vector corresponding to the first classification layer based on the first feature vector and the second feature vector and output it to the next first classification layer; and determine, based on the second feature vector, the scene to which the image to be recognized belongs at the level corresponding to the first classification layer, wherein the scenes contained in different levels are different.
An embodiment of the invention provides an electronic device comprising a processor for implementing the steps of the method described above when executing a computer program stored in a memory.
Embodiments of the present invention provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of a method as described above.
When the scene to which an image to be recognized belongs is determined through the pre-trained scene recognition model, each classification layer in the model determines the scene to which the image belongs, among the scenes of the level corresponding to that classification layer, from the acquired feature map and the fusion feature vector output by the previous classification layer. The scene to which the image belongs is thus determined accurately according to the correlations among scenes of different levels, which improves the accuracy of scene recognition.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and other drawings can be derived from them by those of ordinary skill in the art without inventive effort.
Fig. 1 is a schematic diagram of a scene recognition process according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a scene recognition model according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the multi-level hierarchical layers in a scene recognition model according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a scene recognition apparatus according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments are obviously only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the protection scope of the present invention.
In one possible application scenario, the live broadcast content (including video and audio) of a host may be monitored during live streaming in order to maintain a good, civilized network environment. To monitor illegal content in a live video in as targeted a manner as possible, the scene to which each image contained in the live video belongs can be identified, and the live content can then be supervised according to the identified scene and the supervision policy preset for that scene. Achieving accurate scene classification of images is thus a long-standing, fundamental, and challenging problem in computer vision.
With the advent of large-scale data sets, scene classification techniques have become more and more widely applied; through them, the scene to which an image contained in a video belongs can be identified, thereby determining the predefined scene of the image. In a specific implementation, after the image to be recognized is acquired, it may be input into a pre-trained scene recognition model, for example a model with a hierarchical multi-label classification (HMC) structure. A feature map of the image to be recognized is obtained through a feature extraction layer in the model, and the scene labels of the image are then obtained from the feature map through the scene classification layers of the model, where the scene label produced by any scene classification layer identifies the scene to which the image belongs among the scenes of the level corresponding to that classification layer.
However, among the scenes that such a model assigns to the image at the various levels, some scenes may be mutually exclusive or their interdependence may be ignored. For example, the model may decide that the image belongs to an outdoor scene at the first level and to a bedroom scene at the second level, even though outdoor and bedroom scenes cannot coexist: an image of an outdoor scene is generally not a bedroom scene. Because no constraint or correlation is enforced between the scenes of different levels to which the image is assigned, the obtained scene recognition result can be inaccurate.
To solve the above problems, embodiments of the present invention provide a model training and scene recognition method, apparatus, device, and medium. A scene recognition model is trained in advance; it comprises a feature extraction layer and at least two first classification layers, where the feature extraction layer is connected to each first classification layer and the first classification layers are connected in sequence. A feature map of the image to be recognized is obtained through the feature extraction layer. Then, for each first classification layer in the scene recognition model, a first sub-network determines a first feature vector based on the obtained feature map and the fusion feature vector determined by the previous first classification layer; a second sub-network then determines a second feature vector corresponding to the first feature vector, the fusion feature vector corresponding to the current first classification layer is determined from the first feature vector and the second feature vector, and the scene to which the image to be recognized belongs at the level corresponding to the first classification layer is determined based on the second feature vector. Because each first classification layer considers not only the feature map output by the feature extraction layer but also the fusion feature vector extracted by the previous first classification layer, that is, the scene to which the image belongs at the previous level, the constraints and correlations between scenes of different levels are strengthened. This avoids assigning the image simultaneously to scenes that cannot coexist, or failing to assign it simultaneously to interdependent scenes, and improves the accuracy of recognizing the scene to which the image to be recognized belongs.
It should be noted that the application scenarios described above are merely examples given for convenience of description and do not limit the applicability of the scene recognition method, apparatus, device, and medium provided in the embodiments of the present invention. Those skilled in the art will appreciate that they can be applied to any application scene that requires scene recognition, for example target recognition and target detection applications.
Example 1:
Fig. 1 is a schematic diagram of a scene recognition process according to an embodiment of the present invention, and the process includes:
s101: and acquiring a feature map of the image to be identified through a feature extraction layer in the pre-trained scene identification model.
The scene recognition method provided by the embodiment of the invention is applied to an electronic device, which may be an intelligent device such as a mobile terminal, or a server.
In one possible application scenario, taking supervision of video content in live broadcast as an example, in order to better analyze the video content, the electronic device first performs scene recognition on the images included in the video, and then supervises the video according to the supervision policy preset for the scene to which it belongs.
In a possible implementation, when the electronic device receives a processing request for scene recognition of an image in a certain video, it determines that image to be the image to be recognized and processes it with the scene recognition method provided by the embodiment of the invention.
The electronic device for scene recognition receives a processing request for scene recognition of an image in a certain video mainly in at least one of the following cases:
In the first case, when a user needs to perform scene recognition, a service processing request for scene recognition can be input to the intelligent device; after receiving the service processing request, the intelligent device sends a processing request for scene recognition of the images in the video to the electronic device for scene recognition.
In the second case, after the intelligent device determines that a video has been recorded, it generates a processing request for scene recognition of the images in the recorded video and sends it to the electronic device for scene recognition.
In the third case, when a user needs to perform scene recognition on a specific video, a service processing request for scene recognition of that video can be input to the intelligent device; after receiving the service processing request, the intelligent device sends a processing request for scene recognition of the images in the video to the electronic device for scene recognition.
Note that the electronic device for performing scene recognition may be the same as or different from the intelligent device.
As a possible implementation, scene recognition conditions may be preset, for example: recognizing the scenes of the images in a video when the video sent by the intelligent device is received; recognizing the scenes of a preset number of frame images when that number of frame images of a certain video sent by the intelligent device is received; or recognizing the scenes of the images in the currently acquired video according to a preset period. When the electronic device determines that the current moment satisfies a preset scene recognition condition, it performs scene recognition on the images in the video.
In the embodiment of the invention, when acquiring the images in a video, some of the video frames may be extracted according to a preset frame extraction strategy and converted into corresponding images, or all video frames in the video may be converted into corresponding images in a full-frame extraction mode.
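The following sketch (an illustration, not part of the patent; the sampling interval every_n is an assumed parameter) shows one way such a frame extraction strategy could be implemented with OpenCV:

    import cv2

    def extract_frames(video_path: str, every_n: int = 30):
        """Extract every n-th frame of a video as an image; every_n=1
        corresponds to the full-frame extraction mode."""
        frames = []
        cap = cv2.VideoCapture(video_path)
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:  # end of the video stream
                break
            if index % every_n == 0:
                frames.append(frame)  # one BGR image per kept frame
            index += 1
        cap.release()
        return frames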
In order to accurately determine the scene to which an image belongs, a scene recognition model is trained in advance. When the electronic device for scene recognition needs to perform scene recognition on an image to be recognized, the image can be input into the pre-trained scene recognition model, so that the scene to which it belongs is determined through the model.
In the embodiment of the invention, in order to determine the scene conveniently and accurately, the pre-trained scene recognition model includes a feature extraction layer. After the image to be recognized is input into the model, the feature extraction layer extracts its features and obtains a feature map of the image. This reduces the amount of computation of the subsequent network layers in the model and helps them accurately recognize, based on the feature map, the scene to which the image belongs.
In one possible implementation, the feature extraction layer in the scene recognition model may be implemented by a deep residual network (ResNet). A ResNet contains a plurality of residual blocks, and any residual block can be expressed by the following formula:
y=F(x)+x
wherein F (-) represents convolution transformation, x is an input feature map of a residual block, y is an output feature map of the residual block, the input feature map x of the residual block can be reused through the residual block, and the training difficulty of the neural network with the residual block can be reduced in the back propagation process of parameter optimization.
S102: for at least two first classification layers in the scene recognition model, determining a first feature vector based on the feature map and a fusion feature vector output by the last first classification layer through a first sub-network in the first classification layers; determining a second feature vector corresponding to the first feature vector through a second sub-network in the first classification layer, determining a fusion feature vector corresponding to the first classification layer based on the second feature vector and the second feature vector, outputting the fusion feature vector to the next first classification layer, and determining a scene of the image to be identified at a level corresponding to the first classification layer based on the second feature vector; wherein the scenes contained in different levels are different.
In order to accurately recognize the scene to which the image to be recognized belongs, the pre-trained scene recognition model also includes at least two first classification layers. Each first classification layer is connected to the feature extraction layer, and the first classification layers are connected in series in a preset order. Any first classification layer can process the feature map output by the feature extraction layer and obtain the scene to which the image to be recognized belongs at the level corresponding to that first classification layer.
In one possible implementation, the scenes contained in the level corresponding to any first classification layer are a further refinement of the scenes contained in the level corresponding to the previous first classification layer. For example, the scenes in the level corresponding to first classification layer A include games, eating, and talent performances, and the scenes in the level corresponding to first classification layer B include scenes of various specific game items (such as game 1, game 2, and game 3), scenes of eating various foods (such as eating hot pot and eating barbecue), and scenes of various talent items (such as singing and dancing).
In the embodiment of the invention, when the scenes contained in the level corresponding to each first classification layer are set, the scenes contained in different levels are different, and the scenes contained in the same level are also different from one another.
When the scene to which the image to be recognized belongs is recognized, the scene recognized at a certain level and the scenes recognized at other levels may fail to coexist, so that there is no constraint between the scenes of different levels; likewise, scenes at other levels that are strongly correlated with the scene recognized at a certain level may not be recognized, so that there is no correlation between the scenes of different levels. Therefore, in order to recognize the scene more accurately, in the embodiment of the present invention each first classification layer of the scene recognition model includes a first sub-network and a second sub-network, with the first sub-network connected to the second sub-network. When the feature map is processed through the first sub-network, the output of the previous first classification layer is combined to determine the first feature vector. In this way, when the scene to which the image belongs at the level corresponding to the current first classification layer is determined through the second sub-network, the influence of the scene to which the image belongs at the level corresponding to the previous classification layer is taken into account, which strengthens the constraints and correlations between scenes of different levels and improves recognition accuracy. The second sub-network in the first classification layer then performs further processing based on the first feature vector output by the first sub-network and determines the scene to which the image to be recognized belongs at the level corresponding to the first classification layer.
In a specific implementation, after the feature map of the image to be recognized output by the feature extraction layer is obtained as described above, the first sub-network of the first (topmost) first classification layer of the scene recognition model processes the feature map, for example by convolution, to determine the first feature vector corresponding to the feature map.
After the first feature vector is obtained through the first sub-network, the second sub-network further processes it, for example by convolution, to determine the second feature vector corresponding to the first feature vector. The second feature vector can be understood as a higher-dimensional, more abstract feature extracted from the first feature vector; the dimensions of the first and second feature vectors may be the same or different. The fusion feature vector corresponding to the first classification layer is then determined from the first feature vector and the second feature vector and output to the next first classification layer, and, based on the second feature vector, corresponding processing such as convolution is performed to determine the scene to which the image to be recognized belongs at the level corresponding to the first classification layer.
In one possible implementation, the fusion feature vector corresponding to the first classification layer is determined from the first feature vector and the second feature vector by concatenation: the two vectors are concatenated, and the fusion feature vector corresponding to the first classification layer is determined based on the concatenated feature vector, for example by applying convolution processing to the concatenated vector and determining the fusion feature vector from the result.
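As an illustrative sketch (an assumption about one way to realize the layer just described, not the patent's reference implementation), a first classification layer with its two sub-networks and concatenation-based fusion might look as follows in PyTorch; all dimensions, and the use of linear layers instead of convolutions, are hypothetical simplifications:

    import torch
    import torch.nn as nn

    class FirstClassificationLayer(nn.Module):
        """One level of the hierarchy: consumes the shared feature map plus
        the previous level's fusion vector, and emits this level's scene
        scores together with a fusion vector for the next level."""
        def __init__(self, feat_dim: int, fuse_dim: int, num_scenes: int):
            super().__init__()
            # first sub-network: feature map + previous fusion vector -> first feature vector
            self.subnet1 = nn.Linear(feat_dim + fuse_dim, fuse_dim)
            # second sub-network: first feature vector -> second feature vector
            self.subnet2 = nn.Linear(fuse_dim, fuse_dim)
            # fusion of the concatenated first and second feature vectors
            self.fuse = nn.Linear(2 * fuse_dim, fuse_dim)
            self.classify = nn.Linear(fuse_dim, num_scenes)  # per-scene logits

        def forward(self, feat: torch.Tensor, prev_fused: torch.Tensor):
            v1 = torch.relu(self.subnet1(torch.cat([feat, prev_fused], dim=1)))
            v2 = torch.relu(self.subnet2(v1))
            fused = torch.relu(self.fuse(torch.cat([v1, v2], dim=1)))
            return self.classify(v2), fused  # scene scores, fusion vector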
For each first classification layer located after the first one in the scene recognition model, the first sub-network of the layer processes the feature map in combination with the fusion feature vector output by the previous first classification layer to determine the first feature vector. The second sub-network of the layer then further processes the first feature vector to determine the corresponding second feature vector; the fusion feature vector corresponding to the layer is determined based on the first feature vector and the second feature vector and output to the next first classification layer; and the scene to which the image to be recognized belongs at the level corresponding to the layer is determined based on the second feature vector.
In one possible implementation, for the last first classification layer in the scene recognition model, since no further first classification layer is connected after it, the fusion feature vector corresponding to it may not be determined at all, or may be determined but not output to a next layer, when the scene to which the image to be recognized belongs at the corresponding level is determined. For example, the first sub-network of the last first classification layer determines a first feature vector based on the feature map output by the feature extraction layer and the fusion feature vector output by the previous first classification layer; the second sub-network then determines the second feature vector corresponding to the first feature vector, the fusion feature vector corresponding to the layer is optionally determined based on the first and second feature vectors, and the scene to which the image to be recognized belongs at the level corresponding to the last first classification layer is determined based on the second feature vector.
In another possible implementation, in order to ensure the diversity and accuracy of the recognized scenes, the pre-trained scene recognition model may further include a second classification layer connected to the feature extraction layer and to the last first classification layer, so that a further scene of the image to be recognized can be determined on the basis of the scene recognized at the level corresponding to the last first classification layer. Specifically, through the second classification layer of the scene recognition model, the scene to which the image to be recognized belongs at the level corresponding to the second classification layer is determined based on the feature map and the fusion feature vector output by the last first classification layer. The scenes contained in the level corresponding to the second classification layer are different from the scenes contained in the levels corresponding to each of the at least two first classification layers.
In one possible implementation, the scenes contained in the level corresponding to the second classification layer are a further refinement of the scenes contained in the level corresponding to the last first classification layer. For example, the scenes contained in the level corresponding to the last first classification layer include indoor and outdoor, and the scenes contained in the level corresponding to the second classification layer include bedrooms, kitchens, bathrooms, shopping malls, outdoor football fields, outdoor basketball courts, deserts, and the like.
Because, in the process of recognizing the scene to which the image to be recognized belongs, each first classification layer in the scene recognition model considers not only the feature map output by the feature extraction layer but also the fusion feature vector extracted by the previous first classification layer, that is, the scene to which the image belongs at the previous level, the constraints and correlations between scenes of different levels are strengthened. This avoids assigning the image simultaneously to scenes that cannot coexist, or failing to assign it simultaneously to interdependent scenes, and improves the accuracy of recognizing the scene to which the image to be recognized belongs.
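To make the data flow concrete, the following sketch (an illustrative assumption, not the patent's code) chains a feature extractor with several instances of the FirstClassificationLayer sketched above; the all-zero initial fusion vector is an assumed convention for the first level:

    import torch
    import torch.nn as nn

    class SceneRecognitionModel(nn.Module):
        """Feature extraction layer followed by chained first classification layers."""
        def __init__(self, backbone: nn.Module, feat_dim: int, fuse_dim: int,
                     scenes_per_level: list):
            super().__init__()
            self.backbone = backbone  # e.g., a ResNet built from residual blocks
            self.fuse_dim = fuse_dim
            self.levels = nn.ModuleList(
                FirstClassificationLayer(feat_dim, fuse_dim, c)
                for c in scenes_per_level
            )

        def forward(self, image: torch.Tensor):
            feat = self.backbone(image).flatten(1)  # pooled feature map
            fused = feat.new_zeros(feat.size(0), self.fuse_dim)  # no previous level yet
            outputs = []
            for level in self.levels:
                logits, fused = level(feat, fused)  # the fusion vector chains the levels
                outputs.append(logits)  # one set of scene scores per level
            return outputs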
Example 2:
In order to accurately determine the scene to which the image to be recognized belongs, on the basis of the above embodiment, in the embodiment of the present invention the scene recognition model is trained in the following manner:
any sample image in a sample set is acquired, where the sample image has a sample scene label for each preset level, the sample scene label of any preset level identifies the scene to which the sample image belongs at that level, at least one of the sample scene labels differs from the others, and the preset levels include the levels corresponding to each of the at least two first classification layers;
through the original scene recognition model, the scene probability vector of the sample image at each preset level is determined based on the sample image, where the scene probability vector of any preset level contains the probability values that the sample image belongs to each scene of that level; and
the scene recognition model is trained according to the scene probability vector of each preset level and the probability values corresponding to the sample scene label of that level.
In order to accurately determine the scene to which the image to be recognized belongs, the scene recognition model needs to be trained on the sample images in a pre-acquired sample set, where any sample image in the sample set is obtained as follows: an acquired original sample image is determined as a sample image; and/or the pixel values of the pixel points in an acquired original sample image are adjusted and the adjusted image is determined as a sample image.
It should be noted that, to facilitate training of the scene recognition model, any sample image in the sample set corresponds to a sample scene label at each preset level; the sample scene label of any preset level identifies the scene to which the sample image belongs at that level; at least one of the sample scene labels corresponding to any sample image differs from the others; and the preset levels include the levels corresponding to each first classification layer.
Alternatively, the electronic device for training the scene recognition model may be the same as or different from the electronic device for performing scene recognition.
As a possible implementation, if the sample set contains a sufficient number of sample images, that is, a large number of original sample images acquired under different circumstances, the original scene recognition model may be trained directly on the sample images in the sample set.
As another possible implementation, in order to ensure the diversity of the sample images and improve the accuracy of the scene recognition model, the pixel values of the pixel points in the original sample images may be adjusted, for example by blurring, sharpening, or contrast processing, so as to obtain a large number of adjusted images, which are determined as sample images for training the original scene recognition model.
According to statistics, the more common image quality problems in images acquired in the working scenarios of electronic devices include blur, over-exposure, over-darkness, too-low contrast, and noise in the picture; for example, in a live broadcast scenario, the acquired images may have exposure problems. In order to ensure the diversity of the sample images and improve the accuracy of the scene recognition model, the image quality of the acquired original sample images can be adjusted in advance to reflect the image quality problems likely to occur in the working scenario. Adjusting the pixel values of the pixel points in an acquired original sample image may include the following modes:
Mode 1: adjust the pixel values of the pixel points in the original sample image with a preset convolution kernel.
Mode 2: adjust the contrast of the pixel values of the pixel points in the original sample image.
Mode 3: adjust the brightness of the pixel values of the pixel points in the original sample image.
Mode 4: add noise to the pixel values of the pixel points in the original sample image.
For example, if it is desired to obtain adjusted images with different noise, noise can be added to the pixel values of the pixel points in the original sample image, that is, noise is randomly added to the original sample image. In this process, as many noise types as possible should be used, for example white noise, salt-and-pepper noise, and Gaussian noise, so that the sample images in the sample set are more diverse and the accuracy and robustness of the scene recognition model are improved.
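As an illustrative sketch (the noise parameters are assumptions, not specified by the patent), such random noise augmentation could be implemented with NumPy as follows:

    import numpy as np

    def add_random_noise(image: np.ndarray) -> np.ndarray:
        """Randomly add one of several noise types to a uint8 image (H, W, C)."""
        img = image.astype(np.float32)
        kind = np.random.choice(["gaussian", "salt_pepper", "uniform"])
        if kind == "gaussian":
            img += np.random.normal(0.0, 10.0, img.shape)  # Gaussian noise
        elif kind == "salt_pepper":
            mask = np.random.rand(*img.shape[:2])
            img[mask < 0.02] = 0.0    # pepper: random black pixels
            img[mask > 0.98] = 255.0  # salt: random white pixels
        else:
            img += np.random.uniform(-20.0, 20.0, img.shape)  # uniform noise
        return np.clip(img, 0, 255).astype(np.uint8)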
It should be noted that the specific processing of the pixel values of the pixel points in the original sample image belongs to the prior art and is not described in detail here.
In the above manner, the number of sample images in the sample set can be multiplied, so that a large number of sample images are acquired quickly and the difficulty, cost, and resources consumed in acquiring sample images are reduced. The original scene recognition model can then be trained on more sample images, improving the accuracy and robustness of the scene recognition model.
As still another possible implementation, both the acquired original sample images and the adjusted images obtained by adjusting the pixel values of their pixel points may be determined as sample images, and the original scene recognition model is trained on the original sample images and the adjusted images in the sample set.
In the implementation process, any sample image is input into the original scene recognition model, and the scene probability vector of the sample image at each preset level is obtained through the model, where the scene probability vector of any preset level contains the probability values that the sample image belongs to each scene of that level. A loss value is determined according to the scene probability vector of the sample image at each preset level and the sample scene label of that level, and the original scene recognition model is trained according to the loss value so as to adjust its parameter values.
In order to train the original scene recognition model accurately, a loss function for calculating the loss value is preconfigured. According to the loss value determined by the loss function, it can be determined whether the scene recognition model trained in the current iteration meets a preset termination condition, and the parameter values in the model can be adjusted according to the loss value.
In one possible implementation, since one first classification layer is configured for each preset level in the scene recognition model, the scenes recognizable by the different first classification layers are interrelated and mutually constrained. Therefore, in the embodiment of the invention, a first loss value is determined with a multi-level binary cross-entropy loss function (Multi-Level BCE loss, MLBloss) according to the scene probability vector of the sample image at each preset level and the sample scene label of that level. The parameter values in the scene recognition model trained in the current iteration are adjusted according to the first loss value, so that each first classification layer can accurately recognize the scene to which an image belongs while the scenes recognizable by the different first classification layers keep their relations of correlation and mutual constraint.
For example, take the loss function to be a multi-level binary cross-entropy loss function. Suppose the scene recognition model is configured with N first classification layers, denoted a1, a2, ..., aN, that is, the first classification layers of the first level, the second level, ..., and the Nth level, where N is a positive integer greater than or equal to 2. Suppose the sample set contains K sample images and the first classification layer of each level can recognize C scenes, where K and C are positive integers greater than or equal to 1. The probability value that sample image k actually belongs to the ith scene at the nth level is denoted $y_i^k$, where k is a positive integer not greater than K and i is a positive integer not greater than C. The first loss value determined with the multi-level binary cross-entropy loss function from the scene probability vectors of the sample images at the preset levels and the corresponding sample scene labels is then

$$\mathrm{MLBloss}=\sum_{n=1}^{N} L_{\mathrm{BCE}}^{\,n},$$

where MLBloss is the first loss value, $L_{\mathrm{BCE}}^{\,n}$ is the binary cross-entropy sub-loss value of the first classification layer of the nth level, and $\hat{y}_i^{\,k,n}$ (used below) is the probability value, determined by the first classification layer of the nth level of the scene recognition model, that sample image k belongs to the ith scene at that level.
In one possible implementation, in order to better constrain the inclusion relations between the scenes recognizable by the first classification layers, the binary cross-entropy sub-loss value of the first classification layer of each level may be determined from the scene probability vector of that layer and the scene probability vector of the first classification layer of its parent level, so that the scene recognition result of the first classification layer of the parent level constrains the scene recognition result of the first classification layer of the child level. The first classification layer of the parent level of a given level is the first classification layer that inputs its fusion feature vector to the first classification layer of that level; continuing the example above, a1 is the first classification layer of the parent level of a2, and a2 is the classification layer of a child level of a1.
It should be noted that, since the first level has no parent level, when the sub-loss value of the first classification layer of the first level is calculated, a preset scene probability vector may be used as the scene probability vector of the parent level of the first level. The preset scene probability vector has the same dimension as the scene probability vector of the first level, and every probability value it contains is 1.
For example, the binary cross-entropy sub-loss value of the first classification layer of the nth level may be determined by the following formula:

$$L_{\mathrm{BCE}}^{\,n}=-\sum_{i=1}^{C}\hat{y}_{I}^{\,n-1}\Big[y_i^k\log\hat{y}_i^{\,k,n}+\big(1-y_i^k\big)\log\big(1-\hat{y}_i^{\,k,n}\big)\Big],$$

where $L_{\mathrm{BCE}}^{\,n}$ is the binary cross-entropy sub-loss value of the first classification layer of the nth level, the parent level of the nth level is the (n-1)th level, $\hat{y}_{I}^{\,n-1}$ is the probability value corresponding to the Ith scene in the scene probability vector determined by the first classification layer of the (n-1)th level (the ith scene of the nth level belonging to the Ith scene of its parent level), $\hat{y}_i^{\,k,n}$ is the probability value, determined by the first classification layer of the nth level of the scene recognition model, that sample image k belongs to the ith scene at that level, $y_i^k$ is the probability value that sample image k actually belongs to the ith scene at the nth level, and C is the number of scenes contained in the nth level.
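The following sketch (an illustrative assumption about how these formulas could be coded, not the patent's implementation) computes the parent-gated binary cross-entropy per level and sums the sub-losses into MLBloss; parent_index_per_level, which maps each child scene to its parent scene at the previous level, is a hypothetical helper structure:

    import torch
    import torch.nn.functional as F

    def mlb_loss(logits_per_level, labels_per_level, parent_index_per_level):
        """Multi-level BCE: each level's BCE term is weighted by the parent
        level's predicted probability for the corresponding parent scene."""
        total = 0.0
        parent_probs = None
        for n, (logits, labels) in enumerate(zip(logits_per_level, labels_per_level)):
            probs = torch.sigmoid(logits)  # scene probability vector, (batch, C_n)
            bce = F.binary_cross_entropy(probs, labels.float(), reduction="none")
            if parent_probs is None:
                gate = torch.ones_like(bce)  # first level: preset all-ones vector
            else:
                # parent-level probability of each child scene's parent scene
                gate = parent_probs[:, parent_index_per_level[n]]
            # stopping gradients through the gate is an assumed design choice
            total = total + (gate.detach() * bce).sum(dim=1).mean()
            parent_probs = probs
        return total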
Since the sample images in a sample set generally come from images in daily life, the scenes to which an image belongs at each level are uncertain, and manual labeling makes the quality of the sample scene labels of the sample images unstable. The sample scene labels may therefore suffer from problems such as an extremely unbalanced distribution, which affects the accuracy of a scene recognition model trained on the sample images and their sample scene labels. Existing scene recognition techniques find it difficult to solve these problems simultaneously and thereby effectively improve the precision of scene recognition.
For example, even when a scene recognition model with the HMC structure is used, only the hierarchical structure among the sample scene labels of different levels is exploited during training. This reduces the difficulty of training the model to recognize the scenes of an image, but the accuracy of the model still depends excessively on factors such as whether the sample scene labels of the sample images are balanced; the imbalance in the sample scene labels is not actually remedied, so it remains difficult to improve the accuracy with which the model recognizes the scenes of an image.
As another example, resampling the sample images is used to alleviate the imbalance of the sample scene labels: the ratios between the numbers of sample images contained in the sample set under different scenes are calculated, and the sample images are resampled (for example, undersampled) so as to adjust the numbers of sample images contained in the different scenes. However, this method does not consider that one sample image may belong to multiple scenes: when a sample image in a certain scene is resampled, the numbers of samples in the other scenes to which it belongs also change. Simple resampling of the sample images therefore cannot drive the sample scene label distribution to equilibrium, and it may add noisy sample images or discard important ones.
To alleviate the label imbalance problem, in the embodiment of the present invention, for each scene contained in any level, a re-balance weight value corresponding to the scene may be determined from the number of sample images in the sample set that belong to the scene, the number of scenes contained in the level, and the numbers of sample images corresponding to the different sample images belonging to the scene. The binary cross-entropy sub-loss value of the first classification layer of the level is then determined from the scene probability vector of that layer, the scene probability vector of the first classification layer of its parent level, and the re-balance weight values of all scenes contained in the level. This reduces the influence of the unbalanced sample scene label distribution on the sub-loss values, so that training the original scene recognition model with the loss value determined from the sub-loss values of the first classification layers improves the accuracy of the trained scene recognition model.
In one possible implementation, for each scene contained in any level, the first sampling frequency (class-level) of the scene may be determined from the number of sample images in the sample set belonging to the scene and the number of scenes contained in the level; the second sampling frequency (instance-level) of the scene may be determined from the number of scenes contained in the level and the numbers of sample images corresponding to the different sample images belonging to the scene; and the re-balance weight value of the scene is determined from the first sampling frequency and the second sampling frequency.
For example, let $z_i$ be the number of sample images in the sample set that belong to scene i. From $z_i$ and the number C of scenes contained in the level, the first sampling frequency corresponding to the scene is determined as

$$P_C^{\,i}=\frac{1}{C}\cdot\frac{1}{z_i},$$

where $P_C^{\,i}$ denotes the first sampling frequency corresponding to the ith scene.

From the number C of scenes contained in the level and the numbers of sample images corresponding to the different sample images belonging to the ith scene, the second sampling frequency (instance-level) corresponding to the scene is determined as

$$P_S^{\,k}=\frac{1}{C}\sum_{j\in Y^k}\frac{1}{z_j},$$

where $P_S^{\,k}$ denotes the second sampling frequency corresponding to the kth sample image belonging to the ith scene, $Y^k$ is the set of scenes of the level to which the kth sample image belongs, and $z_j$ is the number of sample images in the sample set belonging to scene j.

From the first sampling frequency and the second sampling frequency, the re-balance weight value corresponding to the scene is determined as

$$r_i^{\,k}=\frac{P_C^{\,i}}{P_S^{\,k}}+\varepsilon,$$

where $r_i^{\,k}$ is the re-balance weight value corresponding to the ith scene, and ε is a preset constant value that can be flexibly adjusted according to actual requirements.
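Under the reconstruction above (the exact formula details are assumptions), the re-balance weights for one level could be computed as follows:

    import numpy as np

    def rebalance_weights(label_matrix: np.ndarray, eps: float = 0.05) -> np.ndarray:
        """label_matrix: (K, C) binary multi-label matrix for one level.
        Returns a (K, C) matrix of re-balance weights for the positive labels."""
        K, C = label_matrix.shape
        z = np.maximum(label_matrix.sum(axis=0), 1)  # z_i: images per scene
        p_class = 1.0 / (C * z)                      # first (class-level) frequency
        # second (instance-level) frequency: mean of 1/z_j over the scenes
        # each sample image belongs to
        p_inst = (label_matrix / z).sum(axis=1, keepdims=True) / C
        r = p_class[None, :] / np.maximum(p_inst, 1e-12) + eps
        return r * label_matrix                      # weights only where label = 1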
After the re-balance weight value of each scene is determined as above, for each level, the binary cross-entropy sub-loss value of the first classification layer of the level is determined from the scene probability vector of that layer, the scene probability vector of the first classification layer of its parent level, and the re-balance weight values of all scenes contained in the level.
In one possible implementation, still taking the binary cross-entropy sub-loss value as an example, the sub-loss value of the first classification layer of the nth level, determined from the scene probability vector of that layer, the scene probability vector of the first classification layer of its parent level, and the re-balance weight values of all scenes contained in the level, can be expressed by the following formula:

$$L_{\mathrm{BCE}}^{\,n}=-\sum_{i=1}^{C} r_i\,\hat{y}_{I}^{\,n-1}\Big[y_i^k\log\hat{y}_i^{\,k,n}+\big(1-y_i^k\big)\log\big(1-\hat{y}_i^{\,k,n}\big)\Big],$$

where $L_{\mathrm{BCE}}^{\,n}$ is the binary cross-entropy sub-loss value of the first classification layer of the nth level, the parent level of the nth level is the (n-1)th level, $\hat{y}_{I}^{\,n-1}$ is the probability value corresponding to the Ith scene (the parent scene of the ith scene) in the scene probability vector determined by the first classification layer of the (n-1)th level, C is the number of scenes contained in the nth level, $y_i^k$ is the probability value that sample image k actually belongs to the ith scene at the nth level, $\hat{y}_i^{\,k,n}$ is the probability value, determined by the first classification layer of the nth level of the scene recognition model, that sample image k belongs to the ith scene at that level, and $r_i$ is the re-balance weight value corresponding to the ith scene.
At present, when the scenes to which the sample images in a sample set belong are labeled, mislabeling and missed labels may occur; that is, the labeled sample scene labels may contain noise labels, and annotators can determine neither which labels are noisy nor the true labels behind them. When the parameter values in the scene recognition model are adjusted with a loss value determined by the above cross-entropy loss function, the noise labels thus distort the adjustment and reduce the precision of the model. To mitigate the impact of noise labels on the scene recognition model, at least one sample image combination may be determined from at least one original sample image in the sample set and its corresponding adjusted image. For each sample image combination, the images in the combination are taken as positive samples and the images in the remaining combinations as negative samples, and the sub-loss value corresponding to the combination is determined from the currently determined positive samples, the negative samples, and the feature maps of the sample images acquired through the feature extraction layer of the scene recognition model. A second loss value is then determined from the sum of the sub-loss values of the at least one sample image combination.
In one possible implementation, suppose there are T sample images, let the tth sample image be any one of them, let j(t) be a randomly augmented version of the tth sample image, so that the tth and j(t)th sample images are both positive samples, and let Q(t) denote the set of negative samples other than the positive samples among the T sample images. The second loss value can then be determined by the following formula:

$$loss_{self}=\sum_{t=1}^{T} loss^{\,t},\qquad loss^{\,t}=-\log\frac{\exp\big(x_t\cdot x_{j(t)}/\tau\big)}{\exp\big(x_t\cdot x_{j(t)}/\tau\big)+\sum_{q\in Q(t)}\exp\big(x_t\cdot x_q/\tau\big)},$$

where $loss^{\,t}$ is the sub-loss value corresponding to the sample image combination containing the tth sample image, $loss_{self}$ is the second loss value, $x_t$, $x_q$, and $x_{j(t)}$ are the feature maps of the tth, qth, and j(t)th sample images obtained through the feature extraction layer of the scene recognition model, · denotes dot multiplication, and τ is a preset numerical value (a temperature).
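A compact sketch of this contrastive term (an NT-Xent-style implementation under the assumptions above, not the patent's code):

    import torch
    import torch.nn.functional as F

    def self_supervised_loss(feats: torch.Tensor, feats_aug: torch.Tensor,
                             tau: float = 0.1) -> torch.Tensor:
        """feats, feats_aug: (T, D) features of the T original sample images and
        of their randomly augmented versions; each pair (t, j(t)) is positive."""
        z = F.normalize(torch.cat([feats, feats_aug], dim=0), dim=1)  # (2T, D)
        sim = z @ z.t() / tau               # pairwise dot products over tau
        n = z.size(0)
        sim.fill_diagonal_(float("-inf"))   # a sample is not its own negative
        # row t's positive partner is row t + T, and vice versa
        pos = torch.arange(n, device=z.device).roll(n // 2)
        return F.cross_entropy(sim, pos)    # -log softmax at the positive entry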
By determining a contrastive learning loss value in this self-supervised contrastive learning manner and adjusting the parameter values in the scene recognition model according to it, the model is trained directly on the features it extracts. The training process is thus freed from the constraint of the precision of the sample scene labels, which relieves the influence of noise labels on the precision of the scene recognition model.
In one possible implementation manner, after the first loss value and the second loss value are obtained based on the foregoing embodiments, a comprehensive loss value may be determined according to the first loss value and its corresponding first weight value and the second loss value and its corresponding second weight value, and the parameter values in the scene recognition model are adjusted according to the comprehensive loss value.
For example, the comprehensive loss value may be determined from the first loss value with its corresponding first weight value and the second loss value with its corresponding second weight value by the following formula:

Loss = w1 * MLBloss + w2 * loss_self

wherein Loss represents the comprehensive loss value, MLBloss represents the first loss value, loss_self represents the second loss value, w1 is the first weight value corresponding to the first loss value, and w2 is the second weight value corresponding to the second loss value.
After the comprehensive loss value is determined based on the above embodiment, the parameter values in the scene recognition model can be trained according to it: a gradient descent algorithm may be adopted to back-propagate the gradients of the parameters in the scene recognition model, thereby updating the parameter values in the scene recognition model.
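A single training step might then look as follows; compute_first_loss and compute_second_loss are hypothetical helpers standing in for the two loss computations above, and the choice of optimizer is an assumption:

```python
import torch

def train_step(model, optimizer, batch, w1=1.0, w2=1.0):
    mlb_loss = compute_first_loss(model, batch)    # hierarchical balanced BCE (hypothetical helper)
    self_loss = compute_second_loss(model, batch)  # self-supervised contrastive loss (hypothetical helper)
    loss = w1 * mlb_loss + w2 * self_loss          # comprehensive loss value
    optimizer.zero_grad()
    loss.backward()                                # back-propagate gradients through the model
    optimizer.step()                               # gradient-descent update of the parameter values
    return loss.item()
```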
The sample set for training the scene recognition model contains a large number of sample images; the above operations are carried out for each sample image, and when a preset convergence condition is met, training of the scene recognition model is completed.
The preset convergence condition may be, for example, that the comprehensive loss value determined for the current iteration is smaller than a preset threshold value, or that the number of iterations for training the original scene recognition model reaches a set maximum number of iterations. It may be flexibly set according to the implementation and is not particularly limited herein.
As a possible implementation manner, when training the original scene recognition model, the sample images in the sample set can be divided into training samples and test samples; the original scene recognition model is trained based on the training samples, and the reliability of the trained scene recognition model is then verified based on the test samples.
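A minimal sketch of such a split, with the 80/20 ratio purely as an illustrative assumption:

```python
import random

def split_samples(sample_set, train_ratio=0.8):
    samples = list(sample_set)
    random.shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]  # training samples, test samples
```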
The scene recognition method provided by the embodiment of the present invention is explained below through a specific embodiment. Fig. 2 is a schematic structural diagram of a scene recognition model according to an embodiment of the present invention. As shown in fig. 2, after an image to be identified is obtained, it is input into a pre-trained scene recognition model. Features of the image to be identified are extracted through a feature extraction layer in the scene recognition model to obtain a feature map of the image. Then, through at least two first classification layers in the scene recognition model, the scene to which the image to be identified belongs at the level corresponding to each first classification layer is determined based on the feature map.
Fig. 3 is a schematic structural diagram of the multi-level classification layers in a scene recognition model according to an embodiment of the present invention. As shown in fig. 3, the multi-level classification layers in the scene recognition model include at least two first classification layers and one second classification layer. After the feature map is obtained based on the above embodiment, for each of the at least two first classification layers in the scene recognition model, a first sub-network in the first classification layer determines a first feature vector A_m^G based on the feature map and the fusion feature vector output by the previous first classification layer; a second sub-network in the first classification layer determines a second feature vector A_m^L corresponding to the first feature vector; a fusion feature vector corresponding to the first classification layer is determined based on the first feature vector A_m^G and the second feature vector A_m^L and output to the next first classification layer; and the scene P_m^L to which the image to be identified belongs at the level corresponding to the first classification layer is determined based on the second feature vector A_m^L. Here, m denotes the m-th first classification layer. Through the second classification layer in the scene recognition model, the scene P^G to which the image to be identified belongs at the level corresponding to the second classification layer is determined based on the feature map and the fusion feature vector output by the last first classification layer. FC shown in fig. 3 represents processing performed by a fully connected layer, and concat represents a concatenation (splicing) operation.
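For concreteness, one first classification layer from fig. 3 might be sketched as below, assuming PyTorch; the use of single fully connected layers for each sub-network and the dimension names are assumptions, since the text only specifies the FC and concat operations:

```python
import torch
import torch.nn as nn

class FirstClassificationLayer(nn.Module):
    def __init__(self, feat_dim, fusion_dim, num_scenes):
        super().__init__()
        self.first_subnet = nn.Linear(feat_dim + fusion_dim, fusion_dim)  # feature map + previous fusion -> A_m^G
        self.second_subnet = nn.Linear(fusion_dim, fusion_dim)            # A_m^G -> A_m^L
        self.fuse = nn.Linear(2 * fusion_dim, fusion_dim)                 # concat(A_m^G, A_m^L) -> fusion vector
        self.classifier = nn.Linear(fusion_dim, num_scenes)               # A_m^L -> per-scene scores

    def forward(self, feat, prev_fusion):
        # For the first layer in the chain, prev_fusion may simply be a zero vector.
        a_g = torch.relu(self.first_subnet(torch.cat([feat, prev_fusion], dim=-1)))
        a_l = torch.relu(self.second_subnet(a_g))
        fusion = self.fuse(torch.cat([a_g, a_l], dim=-1))  # passed on to the next first classification layer
        scene_probs = torch.sigmoid(self.classifier(a_l))  # scene P_m^L at this layer's level
        return fusion, scene_probs
```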
Example 3:
fig. 4 is a schematic structural diagram of a scene recognition device according to an embodiment of the present invention, where the embodiment of the present invention provides a scene recognition device, including:
an acquisition unit 41 for acquiring an image to be recognized;
a processing unit 42, configured to obtain a feature map of an image to be identified through a feature extraction layer in a pre-trained scene recognition model; for at least two first classification layers in the scene recognition model, determining a first feature vector based on the feature map and a fusion feature vector output by the last first classification layer through a first sub-network in the first classification layers; determining a second feature vector corresponding to the first feature vector through a second sub-network in the first classification layer, determining a fusion feature vector corresponding to the first classification layer based on the first feature vector and the second feature vector, outputting the fusion feature vector to the next first classification layer, and determining a scene of the image to be identified at a level corresponding to the first classification layer based on the second feature vector; wherein the scenes contained in different levels are different.
In some possible implementations, the processing unit 42 is further configured to determine, by the second classification layer in the scene recognition model, a scene to which the image to be recognized belongs at a level corresponding to the second classification layer based on the feature map and the fused feature vector output by the last first classification layer; the scenes contained in the hierarchy corresponding to the second classification layer are different from the scenes contained in the hierarchy corresponding to the at least two first classification layers respectively.
In some possible embodiments, the apparatus further comprises: a training unit;
the training unit is used for acquiring any sample image in the sample set; the sample image is provided with sample scene labels corresponding to each preset level, the sample scene label of any preset level is used for identifying the scene of the sample image in the preset level, at least one of the sample scene labels is different, and each preset level includes the levels respectively corresponding to the at least two first classification layers; determining scene probability vectors corresponding to the sample images at each preset level respectively based on the sample images through an original scene recognition model; the scene probability vector corresponding to any preset level comprises probability values of each scene of the level to which the sample image belongs respectively; and training the scene recognition model according to the scene probability vector corresponding to each preset level and the probability value corresponding to the sample scene label corresponding to the preset level.
In some possible implementations, the training unit is specifically configured to determine a first loss value according to a scene probability vector corresponding to the sample image at each preset level and a sample scene label corresponding to the preset level; and adjusting the parameter value in the scene recognition model according to the first loss value.
In some possible embodiments, the training unit is specifically configured to determine, according to a scene probability vector corresponding to the sample image at each preset level and a sample scene label corresponding to the preset level, a multi-classification cross entropy sub-loss value corresponding to each preset level; and determining the first loss value according to the sum of multi-classification cross entropy sub-loss values corresponding to each preset level.
In some possible implementations, the training unit is specifically configured to determine, for the first classification layer of each preset level, a multi-classification cross entropy sub-loss value corresponding to the level according to a scene probability vector corresponding to the first classification layer of the level and a scene probability vector corresponding to the first classification layer of a parent level of the level.
In some possible embodiments, the training unit is specifically configured to determine, if the multi-class cross entropy sub-loss value is a two-class cross entropy sub-loss value, a multi-class cross entropy sub-loss value corresponding to the level according to the following formula:
$$L_{BCE}^{n} = -\sum_{i=1}^{C} \hat{p}_{I}^{\,n-1}\left[\, y_i^k \log \hat{y}_i^{\,k,n} + \left(1-y_i^k\right)\log\!\left(1-\hat{y}_i^{\,k,n}\right)\right]$$

wherein $L_{BCE}^{n}$ represents the two-classification cross entropy sub-loss value of the first classification layer of the n-th hierarchy, the parent hierarchy of the n-th hierarchy being the (n-1)-th hierarchy; $\hat{p}_{I}^{\,n-1}$ represents the probability value corresponding to the I-th scene, to which the i-th scene belongs, in the scene probability vector determined by the first classification layer of the (n-1)-th hierarchy; $\hat{y}_i^{\,k,n}$ represents the probability value, determined by the first classification layer of the n-th hierarchy in the scene recognition model, that sample image k belongs to the i-th scene; $y_i^k$ represents the probability value that sample image k actually belongs to the i-th scene at the n-th hierarchy; and C represents the number of scenes contained in the n-th hierarchy.
In some possible implementations, the training unit is specifically configured to determine, for each scene included in the hierarchy, a first sampling frequency corresponding to the scene according to the number of all sample images belonging to the scene in the sample set and the number of scenes included in the hierarchy; determine a second sampling frequency corresponding to the scene according to the total number of the scenes contained in the hierarchy and the number corresponding to the different sample images belonging to the scene in the sample set; determine a balance weight value corresponding to the scene according to the first sampling frequency and the second sampling frequency; and determine a multi-classification cross entropy sub-loss value corresponding to the first classification layer of the hierarchy according to the scene probability vector corresponding to the first classification layer of the hierarchy, the scene probability vector corresponding to the first classification layer of the parent hierarchy of the hierarchy, and the balance weight values corresponding to all scenes contained in the hierarchy.
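The text does not spell out the exact frequency formulas, so the following is only a hedged sketch of one plausible reading; every formula in it, and the argument names, are assumptions rather than the patent's definitive computation:

```python
def balance_weight(scene_image_count, num_scenes, distinct_image_count):
    """scene_image_count:    number of all sample images in the sample set belonging to the scene
       num_scenes:           number of scenes contained in the hierarchy
       distinct_image_count: number of distinct sample images belonging to the scene"""
    first = 1.0 / (num_scenes * scene_image_count)      # first sampling frequency (assumed form)
    second = 1.0 / (num_scenes * distinct_image_count)  # second sampling frequency (assumed form)
    return first / second                               # balance weight as the ratio of the two
```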
In some possible embodiments, the training unit is specifically configured to determine, if the multi-class cross entropy sub-loss value is a two-class cross entropy sub-loss value, the multi-class cross entropy sub-loss value corresponding to the first classification layer of the hierarchy according to the following formula:

$$L_{BCE}^{n} = -\sum_{i=1}^{C} \hat{p}_{I}^{\,n-1}\, w_i \left[\, y_i^k \log \hat{y}_i^{\,k,n} + \left(1-y_i^k\right)\log\!\left(1-\hat{y}_i^{\,k,n}\right)\right]$$

wherein $L_{BCE}^{n}$ represents the two-classification cross entropy sub-loss value of the first classification layer of the n-th hierarchy, the parent hierarchy of the n-th hierarchy being the (n-1)-th hierarchy; $\hat{p}_{I}^{\,n-1}$ represents the probability value corresponding to the I-th scene, to which the i-th scene belongs, in the scene probability vector determined by the first classification layer of the (n-1)-th hierarchy; C represents the number of scenes contained in the n-th hierarchy; $y_i^k$ represents the probability value that sample image k actually belongs to the i-th scene at the n-th hierarchy; $\hat{y}_i^{\,k,n}$ represents the probability value, determined by the first classification layer of the n-th hierarchy in the scene recognition model, that sample image k belongs to the i-th scene; and $w_i$ represents the balance weight value corresponding to the i-th scene.
In some possible embodiments, the training unit is specifically configured to obtain adjusted sample images corresponding to at least one original sample image in the sample set respectively; determining at least one sample image combination according to the at least one original sample image and the adjusted sample images corresponding to the at least one original sample image respectively; for the at least one sample image combination, determining an original sample image in the sample image combination and a corresponding adjusted sample image thereof as positive samples, and determining sample images in other sample image combinations except the sample image combination as negative samples; determining a sub-loss value corresponding to the sample image combination according to each positive sample, each negative sample and the obtained feature images of each sample image in the at least one sample image combination through a feature extraction layer in the scene recognition model; determining a second loss value according to the sum of the sub-loss values respectively corresponding to the at least one sample image combination; and adjusting the parameter value in the scene recognition model according to the second loss value and the first loss value.
In some possible embodiments, the training unit is specifically configured to determine the second loss value by the following formula:
$$loss_t = -\log\frac{\exp\!\left(x_t \cdot x_{j(t)}/\tau\right)}{\exp\!\left(x_t \cdot x_{j(t)}/\tau\right)+\sum_{q\in Q(t)}\exp\!\left(x_t\cdot x_q/\tau\right)}, \qquad loss_{self}=\sum_{t=1}^{T} loss_t$$

wherein the t-th sample image is any one of the T sample images, j(t) is the adjusted sample image corresponding to the t-th sample image, Q(t) represents the set of negative samples other than the positive samples among the T sample images, $loss_t$ is the sub-loss value corresponding to the sample image combination in which the t-th sample image is located, $loss_{self}$ is the second loss value, $x_t$ is the feature map of the t-th sample image obtained through the feature extraction layer in the scene recognition model, $x_q$ is the feature map of the q-th sample image obtained through the feature extraction layer, $x_{j(t)}$ is the feature map of the j(t)-th sample image obtained through the feature extraction layer, $\cdot$ represents the dot product, and $\tau$ is a preset value.
In some possible embodiments, the training unit is specifically configured to determine a comprehensive loss value according to the first loss value and the first weight value corresponding thereto, and the second loss value and the second weight value corresponding thereto; and adjusting the parameter value in the scene recognition model according to the comprehensive loss value.
When the scene to which the image to be identified belongs is determined through the pre-trained scene recognition model, each classification layer in the scene recognition model determines, among the scenes of the level corresponding to that classification layer, the scene to which the image belongs according to the acquired feature map and the fusion feature vector output by the previous classification layer. The scene to which the image to be identified belongs is thereby determined accurately according to the relevance among the scenes of different levels, and the accuracy of scene recognition is improved.
Example 4:
Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. On the basis of the foregoing embodiments, an embodiment of the present disclosure further provides an electronic device, as shown in fig. 5, including: a processor 51, a communication interface 52, a memory 53 and a communication bus 54, wherein the processor 51, the communication interface 52 and the memory 53 communicate with each other through the communication bus 54;
the memory 53 has stored therein a computer program which, when executed by the processor 51, causes the processor 51 to perform the steps of:
acquiring a feature map of an image to be identified through a feature extraction layer in a pre-trained scene identification model;
For at least two first classification layers in the scene recognition model, determining a first feature vector based on the feature map and a fusion feature vector output by the last first classification layer through a first sub-network in the first classification layers; determining a second feature vector corresponding to the first feature vector through a second sub-network in the first classification layer, determining a fusion feature vector corresponding to the first classification layer based on the first feature vector and the second feature vector, outputting the fusion feature vector to the next first classification layer, and determining a scene of the image to be identified at a level corresponding to the first classification layer based on the second feature vector; wherein the scenes contained in different levels are different.
Because the principle of solving the problem of the electronic device is similar to that of the scene recognition method, the implementation of the electronic device can refer to the implementation of the method, and the repetition is omitted.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus and the like. For ease of illustration, the bus is represented by only one bold line in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface 52 is used for communication between the above-described electronic device and other devices.
The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit, a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
When the scene to which the image to be identified belongs is determined through the pre-trained scene recognition model, each classification layer in the scene recognition model determines, among the scenes of the level corresponding to that classification layer, the scene to which the image belongs according to the acquired feature map and the fusion feature vector output by the previous classification layer. The scene to which the image to be identified belongs is thereby determined accurately according to the relevance among the scenes of different levels, and the accuracy of scene recognition is improved.
Example 5:
on the basis of the above embodiments, the present disclosure further provides a computer readable storage medium having stored therein a computer program executable by a processor, which when run on the processor, causes the processor to perform the steps of:
acquiring a feature map of an image to be identified through a feature extraction layer in a pre-trained scene identification model;
for at least two first classification layers in the scene recognition model, determining a first feature vector based on the feature map and a fusion feature vector output by the last first classification layer through a first sub-network in the first classification layers; determining a second feature vector corresponding to the first feature vector through a second sub-network in the first classification layer, determining a fusion feature vector corresponding to the first classification layer based on the first feature vector and the second feature vector, outputting the fusion feature vector to the next first classification layer, and determining a scene of the image to be identified at a level corresponding to the first classification layer based on the second feature vector; wherein the scenes contained in different levels are different.
Since the principle of solving the problem by using the computer readable storage medium is similar to that of the above-mentioned scene recognition method, the specific implementation can be referred to the implementation of the scene recognition method, and the repetition is omitted.
When the scene to which the image to be identified belongs is determined through the pre-trained scene recognition model, each classification layer in the scene recognition model determines, among the scenes of the level corresponding to that classification layer, the scene to which the image belongs according to the acquired feature map and the fusion feature vector output by the previous classification layer. The scene to which the image to be identified belongs is thereby determined accurately according to the relevance among the scenes of different levels, and the accuracy of scene recognition is improved.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (12)

1. A method of scene recognition, the method comprising:
acquiring a feature map of an image to be identified through a feature extraction layer in a pre-trained scene identification model;
for at least two first classification layers in the scene recognition model, determining a first feature vector based on the feature map and a fusion feature vector output by the last first classification layer through a first sub-network in the first classification layers; determining a second feature vector corresponding to the first feature vector through a second sub-network in the first classification layer, determining a fusion feature vector corresponding to the first classification layer based on the first feature vector and the second feature vector, outputting the fusion feature vector to the next first classification layer, and determining a scene of the image to be identified at a level corresponding to the first classification layer based on the second feature vector; wherein the scenes contained in different levels are different.
2. The method according to claim 1, wherein the method further comprises:
determining a scene to which the image to be identified belongs at a level corresponding to the second classification layer based on the feature map and the fusion feature vector output by the last first classification layer through the second classification layer in the scene identification model; the scenes contained in the hierarchy corresponding to the second classification layer are different from the scenes contained in the hierarchy corresponding to the at least two first classification layers respectively.
3. The method according to claim 1 or 2, wherein the scene recognition model is trained by:
acquiring any sample image in a sample set; the sample image is provided with sample scene labels corresponding to each preset level, the sample scene label of any preset level is used for identifying the scene of the sample image in the preset level, at least one of the sample scene labels is different, and each preset level comprises the levels respectively corresponding to the at least two first classification layers;
determining scene probability vectors corresponding to the sample images at each preset level respectively based on the sample images through an original scene recognition model; the scene probability vector corresponding to any preset level comprises probability values of each scene of the level to which the sample image belongs respectively;
And training the scene recognition model according to the scene probability vector corresponding to each preset level and the probability value corresponding to the sample scene label corresponding to the preset level.
4. The method of claim 3, wherein training the scene recognition model based on each of the scene probability vectors and the probability value corresponding to the corresponding sample scene label comprises:
determining a first loss value according to the scene probability vector corresponding to each preset level of the sample image and the sample scene label corresponding to the preset level;
and adjusting the parameter value in the scene recognition model according to the first loss value.
5. The method of claim 4, wherein determining the first loss value according to the scene probability vector and the sample scene label of the corresponding preset level of the sample image at each preset level comprises:
determining multi-classification cross entropy sub-loss values respectively corresponding to the preset levels according to the scene probability vectors respectively corresponding to the sample images at the preset levels and the sample scene labels corresponding to the preset levels;
And determining the first loss value according to the sum of multi-classification cross entropy sub-loss values corresponding to each preset level.
6. The method according to claim 5, wherein determining the multi-classification cross entropy sub-loss value corresponding to each preset level according to the scene probability vector corresponding to the sample image at each preset level and the sample scene label corresponding to the preset level, respectively, comprises:
and for the first classification layer of each preset level, determining a multi-classification cross entropy sub-loss value corresponding to the level according to the scene probability vector corresponding to the first classification layer of the level and the scene probability vector corresponding to the first classification layer of the father level of the level.
7. The method of claim 6, wherein determining the multi-class cross entropy sub-loss value for the hierarchy based on the scene probability vector for the first classification layer of the hierarchy and the scene probability vector for the first classification layer of the parent hierarchy of the hierarchy comprises:
for each scene contained in the hierarchy, determining a first sampling frequency corresponding to the scene according to the number of all sample images belonging to the scene in the sample set and the number of the scenes contained in the hierarchy; determining a second sampling frequency corresponding to the scene according to the total number of the scenes contained in the hierarchy and the number corresponding to the different sample images belonging to the scene in the sample set; determining a balance weight value corresponding to the scene according to the first sampling frequency and the second sampling frequency;
and determining a multi-classification cross entropy sub-loss value corresponding to the first classification layer of the hierarchy according to the scene probability vector corresponding to the first classification layer of the hierarchy, the scene probability vector corresponding to the first classification layer of the parent hierarchy of the hierarchy, and the balance weight values corresponding to all scenes contained in the hierarchy.
8. The method of claim 4, wherein adjusting the parameter values in the scene recognition model based on the first loss value comprises:
acquiring adjusted sample images respectively corresponding to at least one original sample image in the sample set;
determining at least one sample image combination according to the at least one original sample image and the adjusted sample images corresponding to the at least one original sample image respectively;
for the at least one sample image combination, determining an original sample image in the sample image combination and a corresponding adjusted sample image thereof as positive samples, and determining sample images in other sample image combinations except the sample image combination as negative samples; determining a sub-loss value corresponding to the sample image combination according to each positive sample, each negative sample and the obtained feature images of each sample image in the at least one sample image combination through a feature extraction layer in the scene recognition model;
Determining a second loss value according to the sum of the sub-loss values respectively corresponding to the at least one sample image combination;
and adjusting the parameter value in the scene recognition model according to the second loss value and the first loss value.
9. The method of claim 8, wherein adjusting the parameter values in the scene recognition model based on the second loss value and the first loss value comprises:
determining a comprehensive loss value according to the first loss value and the corresponding first weight value thereof and the second loss value and the corresponding second weight value thereof;
and adjusting the parameter value in the scene recognition model according to the comprehensive loss value.
10. A scene recognition device, the device comprising:
the acquisition unit is used for acquiring the image to be identified;
the processing unit is used for acquiring a feature map of the image to be identified through a feature extraction layer in the pre-trained scene identification model; for at least two first classification layers in the scene recognition model, determining a first feature vector based on the feature map and a fusion feature vector output by the last first classification layer through a first sub-network in the first classification layers; determining a second feature vector corresponding to the first feature vector through a second sub-network in the first classification layer, determining a fusion feature vector corresponding to the first classification layer based on the first feature vector and the second feature vector, outputting the fusion feature vector to the next first classification layer, and determining a scene of the image to be identified at a level corresponding to the first classification layer based on the second feature vector; wherein the scenes contained in different levels are different.
11. An electronic device, characterized in that it comprises a processor for implementing the steps of the method according to any of claims 1-9 when executing a computer program stored in a memory.
12. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1-9.