CN112990378B - Scene recognition method and device based on artificial intelligence and electronic equipment - Google Patents

Scene recognition method and device based on artificial intelligence and electronic equipment

Info

Publication number
CN112990378B
CN112990378B (application CN202110501512.6A)
Authority
CN
China
Prior art keywords
image sample
image
feature
style
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110501512.6A
Other languages
Chinese (zh)
Other versions
CN112990378A (en)
Inventor
郭卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110501512.6A
Publication of CN112990378A
Application granted
Publication of CN112990378B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/23 - Clustering techniques
    • G06F 18/24 - Classification techniques
    • G06F 18/243 - Classification techniques relating to the number of classes
    • G06F 18/2431 - Multiple classes
    • G06F 18/25 - Fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a scene recognition method and device based on artificial intelligence, an electronic device, and a computer-readable storage medium; the method relates to artificial intelligence technology and blockchain technology, and comprises the following steps: acquiring a first image sample without a scene category label in a first field; performing style transformation processing on a second image sample with a scene category label in a second field through the first image sample to obtain a third image sample with a scene category label in the first field, wherein the third image sample has the same scene category label as the second image sample; training a scene recognition model based on the first image sample, the second image sample and the third image sample; and carrying out scene recognition processing on a fourth image sample in the first field through the trained scene recognition model to obtain the scene category of the fourth image sample. By the method and the device, the accuracy of scene recognition in a specific field can be improved.

Description

Scene recognition method and device based on artificial intelligence and electronic equipment
Technical Field
The present application relates to artificial intelligence technologies, and in particular, to a scene recognition method and apparatus based on artificial intelligence, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results.
Image recognition in the related art is mainly aimed at extracting features from images of a real style; as a result, high-accuracy scene recognition cannot be effectively performed on images of a special style.
Disclosure of Invention
The embodiment of the application provides a scene recognition method and device based on artificial intelligence, an electronic device and a computer-readable storage medium, and the accuracy of scene recognition in a specific field can be improved.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a scene identification method based on artificial intelligence, which comprises the following steps:
acquiring a first image sample without scene category marking in a first field;
performing style transformation processing on a second image sample with scene type labels in a second field through the first image sample to obtain a third image sample with scene type labels in the first field;
wherein the third image sample has the same scene class label as the second image sample;
training a scene recognition model based on the first image sample, the second image sample and the third image sample;
and carrying out scene recognition processing on a fourth image sample in the first field through the trained scene recognition model to obtain a scene type of the fourth image sample.
The embodiment of the application provides a scene recognition device based on artificial intelligence, including:
the acquisition module is used for acquiring a first image sample without scene category marking in a first field;
the style module is used for carrying out style transformation processing on a second image sample with scene type labels in a second field through the first image sample to obtain a third image sample with scene type labels in the first field;
wherein the third image sample has the same scene class label as the second image sample;
a training module for training a scene recognition model based on the first image sample, the second image sample and the third image sample;
and the recognition module is used for carrying out scene recognition processing on the fourth image sample in the first field through the trained scene recognition model to obtain the scene category of the fourth image sample.
In the foregoing solution, the obtaining module is further configured to: acquiring a candidate first image sample set without scene category labels in a first field; extracting a first style image feature of each candidate first image sample in the candidate first image sample set; performing clustering processing according to the first style image characteristics of each candidate first image sample to obtain a plurality of clusters corresponding to the candidate first image sample set; and acquiring a plurality of candidate first image samples for representing the plurality of clusters in a one-to-one correspondence manner from the candidate first image sample set, and taking the plurality of candidate first image samples for representing the plurality of clusters in a one-to-one correspondence manner as a plurality of first image samples without scene category labels of the first field.
In the foregoing solution, the obtaining module is further configured to: randomly selecting N candidate first image samples from the set of candidate first image samples, taking the first style image features corresponding to the N candidate first image samples as the initial centroids of the plurality of clusters, and removing the N candidate first image samples from the set of candidate first image samples, wherein N is an integral multiple of the number of scene category labels of the scene recognition model; initializing the number of iterations of the clustering process to M, and establishing an empty set corresponding to each cluster, wherein M is an integer greater than or equal to 2; in each iteration of the clustering process, updating the set of each cluster, performing centroid generation processing based on the update result to obtain a new centroid of each cluster, and when the new centroid is different from the initial centroid, adding the candidate first image sample corresponding to the initial centroid back to the set of candidate first image samples and updating the initial centroid based on the new centroid; and determining the set of each cluster obtained after M iterations as the clustering result, or determining the set of each cluster obtained after m iterations as the clustering result when the centroids of the clusters obtained after the m-th iteration are the same as the centroids obtained after the (m-1)-th iteration, wherein m is an integer variable and 2 ≤ m ≤ M.
In the foregoing solution, the obtaining module is further configured to: for each of the candidate first image samples of the set of candidate first image samples: determining a similarity between a first-style image feature of the candidate first image sample and an initial centroid of each of the clusters; determining an initial centroid corresponding to the maximum similarity as belonging to the same cluster as the candidate first image sample, and transferring the candidate first image sample to a set of clusters corresponding to the maximum similarity initial centroid, wherein the maximum similarity initial centroid is the initial centroid corresponding to the maximum similarity; and averaging the first style image characteristics of each candidate first image sample in each cluster set to obtain a new centroid of each cluster.
In the foregoing solution, the obtaining module is further configured to: performing the following for each of the clusters in the plurality of clusters: averaging the first style image characteristics of each candidate first image sample in each cluster to obtain the centroid of each cluster; determining feature distances between first-style image features of the plurality of candidate first image samples and the centroid of the cluster; and determining the candidate first image sample corresponding to the minimum characteristic distance as the candidate first image sample for characterizing the cluster.
In the foregoing solution, the style module is further configured to: performing feature coding processing on the second image sample to obtain a first object feature of the second image sample; performing feature coding processing on the first image sample to obtain a first to-be-migrated style feature of the first image sample; and performing style migration processing on the first object feature of the second image sample to the first style feature to be migrated to obtain the third image sample.
In the foregoing solution, the style module is further configured to: extracting the mean value and the variance of the first object feature of the second image sample, and extracting the mean value and the variance of the first to-be-migrated style feature of the first image sample; mapping the first object feature based on the mean and variance of the first object feature and the mean and variance of the first to-be-migrated style feature to obtain a first migrated feature; and performing decoding restoration processing on the first migrated feature to obtain the third image sample.
In the above scheme, the style transformation processing is implemented by a style generation network, where the style generation network includes a coding network and a style migration network, and the style migration network includes a style migration layer and a decoding layer; before the second image sample with the scene category label in the second field is subjected to style transformation processing through the first image sample to obtain a third image sample with the scene category label in the first field, the training module is further configured to: respectively carrying out feature coding processing on a fifth image sample and a sixth image sample through the coding network to obtain a second object feature of the sixth image sample and a second to-be-migrated style feature of the fifth image sample; extracting the mean value and the variance of the second object feature and the mean value and the variance of the second to-be-migrated style feature through the style migration layer; mapping the second object feature based on the mean and variance of the second object feature and the mean and variance of the second to-be-migrated style feature to obtain a second migrated feature; decoding and restoring the second migrated feature through the decoding layer to obtain a seventh image sample; determining a style loss and a content loss based on the seventh image sample, the fifth image sample, and the second migrated feature; and fixing the parameters of the coding network and the style migration layer, and updating the parameters of the decoding layer according to the style loss and the content loss.
In the foregoing solution, the training module is further configured to: performing feature coding processing on the seventh image sample through the coding network to obtain image features of the seventh image sample; extracting the mean value and the variance of the image features; determining the style loss based on the mean and variance of the image features and the mean and variance of the second to-be-migrated style features; determining the content loss based on the image feature and the second migrated feature.
In the foregoing solution, the training module is further configured to: performing data enhancement processing on the first image sample to obtain an enhanced image sample corresponding to the first image sample; carrying out forward propagation on the third image sample and the second image sample in the scene recognition model to obtain a first forward propagation result; carrying out forward propagation on the first image sample and the enhanced image sample in a feature extraction network of the scene recognition model to obtain a second forward propagation result; and updating the scene recognition model according to the first forward propagation result and the second forward propagation result.
In the foregoing solution, the training module is further configured to: performing at least one of the following processes for the first image sample and determining a processing result as an enhanced image sample corresponding to the first image sample: performing tone transformation processing on the first image sample; performing cropping processing on the first image sample; performing Gaussian blur processing on the first image sample; and carrying out random drawing processing on the first image sample.
In the above solution, the scene recognition model further includes a second classification network corresponding to the second domain and a first classification network corresponding to the first domain; the training module is further configured to: performing feature extraction processing on the second image sample through the feature extraction network to obtain classification features of the second image sample, and mapping the classification features of the second image sample to a first prediction probability that the second image sample belongs to a pre-labeled scene category through a second classification network; performing feature extraction processing on the third image sample through the feature extraction network to obtain a classification feature of the third image sample, and mapping the classification feature of the third image sample to a second prediction probability that the third image sample belongs to the pre-labeled scene category through a first classification network; and combining the first prediction probability, the second prediction probability, the classification feature of the second image sample and the classification feature of the third image sample to obtain the first forward propagation result.
In the foregoing solution, the training module is further configured to: performing feature extraction processing on the first image sample through the feature extraction network to obtain a classification feature of the first image sample; performing feature extraction processing on the enhanced image sample through the feature extraction network to obtain the classification features of the enhanced image sample; and combining the classification features of the first image sample and the classification features of the enhanced image sample to obtain the second forward propagation result.
In the foregoing solution, the training module is further configured to: determining a first classification loss according to the first prediction probability and the pre-marked scene category; determining a second classification loss according to the second prediction probability and the pre-marked scene category; determining a first consistency loss according to divergence between the classification features of the second image sample and the classification features of the third image sample; determining a second consistency loss according to divergence between the classification features of the first image sample and the classification features of the enhanced image sample; performing fusion processing on the first classification loss, the second classification loss, the first consistency loss and the second consistency loss to obtain fusion loss; determining a fitting parameter of the scene recognition model when the fusion loss takes a minimum value, so as to update the scene recognition model based on the fitting parameter.
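The fusion of the two classification losses and the two consistency losses can be illustrated with a minimal sketch. It assumes PyTorch, that the classification networks output logits, a KL-divergence form for the divergence terms, and illustrative fusion weights; none of these are fixed by the embodiment.

```python
import torch.nn.functional as F

def fused_loss(logits_real, logits_acg, scene_label,
               feat_second, feat_third, feat_first, feat_enhanced,
               w_cls=1.0, w_cons=0.5):
    # First / second classification losses against the pre-labelled scene category
    # (assumes the classification networks output logits).
    loss_cls_1 = F.cross_entropy(logits_real, scene_label)   # second image sample, second-domain head
    loss_cls_2 = F.cross_entropy(logits_acg, scene_label)    # third image sample, first-domain head
    # Consistency losses: divergence between classification features of the paired samples.
    loss_cons_1 = F.kl_div(F.log_softmax(feat_third, dim=-1),
                           F.softmax(feat_second, dim=-1), reduction="batchmean")
    loss_cons_2 = F.kl_div(F.log_softmax(feat_enhanced, dim=-1),
                           F.softmax(feat_first, dim=-1), reduction="batchmean")
    # Fusion of the four losses (the weights w_cls / w_cons are illustrative).
    return w_cls * (loss_cls_1 + loss_cls_2) + w_cons * (loss_cons_1 + loss_cons_2)
```

The fitting parameters minimizing this fused loss would then be used to update the scene recognition model, as described in the paragraph above.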
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the scene recognition method based on artificial intelligence provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the application provides a computer-readable storage medium, which stores executable instructions for causing a processor to execute, so as to implement the artificial intelligence-based scene recognition method provided by the embodiment of the application.
The embodiment of the application has the following beneficial effects:
the method comprises the steps of generating an image (a third image sample) of a labeled first field by means of an image (a first image sample) of a non-labeled first field and an image (a second image sample) of a labeled second field, so that the capacity of a training sample of a scene recognition model for executing a scene recognition task of the first field is expanded, the identification capability of the scene recognition model in the first field is effectively improved, feature transfer between the second field and the first field is effectively realized by simultaneously utilizing the second image sample and the third image sample during training, the scene recognition model can effectively act on the first field recognition by utilizing the first image sample during training, and finally the scene recognition accuracy after the scene recognition task of the second field is transferred to the first field is improved.
Drawings
Fig. 1 is a logic diagram of a scene generation method in the related art;
fig. 2A is a schematic structural diagram of an artificial intelligence based scene recognition system provided in an embodiment of the present application;
fig. 2B is a schematic structural diagram of a scene identification system based on a blockchain network according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 4A is a flowchart illustrating an artificial intelligence based scene recognition method according to an embodiment of the present disclosure;
FIG. 4B is a flowchart illustrating an artificial intelligence based scene recognition method according to an embodiment of the present disclosure;
FIG. 4C is a flowchart illustrating an artificial intelligence based scene recognition method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an architecture of a scene recognition method based on artificial intelligence provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of style migration of a scene recognition method based on artificial intelligence provided in an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating image selection of a scene recognition method based on artificial intelligence according to an embodiment of the present application;
FIG. 8 is a schematic diagram of image generation of a scene recognition method based on artificial intelligence provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a residual error network of a scene identification method based on artificial intelligence provided in an embodiment of the present application;
fig. 10 is a processing flow chart of a scene recognition method based on artificial intelligence provided by an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings, the described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first \ second \ third" are only used to distinguish similar objects and do not denote a particular order; it is to be understood that "first \ second \ third" may be interchanged in a specific order or sequence, where permitted, so that the embodiments of the application described herein can be implemented in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Image recognition: a technology for classifying an image at a specific level. In general, image recognition considers only the class of an object, regardless of its specific instance, and gives the class to which the object belongs; for example, images are classified into person, dog, cat, bird, and so on. A model trained on the large-scale general object recognition source data set ImageNet can recognize which of 1000 classes an object belongs to.
2) Multi-label recognition task of images: identifying whether an image carries multiple attribute labels; for example, when an image has several attribute labels, the multi-label recognition task judges which attribute labels the image has.
3) Noise recognition: training the image recognition task based on noise samples, including samples with incorrect class labels and samples with inaccurate class labels, e.g. the image does not correspond exactly to its class label, the concepts of two class labels partially overlap, or the image has the properties of two class labels but carries only one of them.
4) ImageNet: a large-scale general object recognition source data set.
5) ImageNet pre-training model: a deep learning network model trained on the large-scale general object recognition source data set ImageNet; the trained deep learning network model is an ImageNet pre-training model.
6) GAN: a Generative Adversarial Network (GAN) is a deep learning model that mainly includes a discriminative model and a generative model. The discriminative model takes input variables and makes predictions through a model, while the generative model randomly generates observation data given some hidden information; for example, a discriminative model determines the animal species in a given image, and a generative model generates a new image containing a cat based on a given set of cat images.
7) ACG: an abbreviation of Animation, Comic and Game, the collective name for animation, comics and games.
The embodiments of the present application mainly aim to recognize the scenes in which the plot of a video takes place, and address the style-field problem of movie and television scene recognition: conventional scene training data are images of real scenes, namely real images, whereas some movies and television series are of the ACG (two-dimensional animation) type, that is, they belong to a different style field.
In the related art, a method is provided for generating an image of a non-raining scene from an image of a raining scene. Referring to fig. 1, fig. 1 is a logic schematic diagram of a scene generation method in the related art. A target image is generated by a generator that takes the image of the raining scene as input; the generator comprises a plurality of dense modules (for example, dense module 1 to dense module 8) and a transition module, and the target image is produced through the learning process of the generator. A discriminator identifies whether the target image was generated by the generator; the discriminator comprises a plurality of sampling modules (performing several rounds of sampling at different dimensions), processes the real image and the target image simultaneously, and discriminates whether they are the same, judging a target image with the same characteristics to be real (output 1) and one with different characteristics to be false (output 0). Meanwhile, the generator determines the loss of the generated target image through pixel-by-pixel calculation. Images of rainy scenes and images of non-rainy scenes are collected on a large scale for training, so that the generator can generate an image of one scene based on an image of another scene.
The generator model of the related art is over-customized: it can only complete the conversion of a specific scene, and the scene types of data in a given style field are limited, so the data collection requirements for the various scenes of that style field cannot be met. Collecting model training data is difficult, because a large amount of data is required to train each specific scene; if this is extended to the recognition tasks of all scenes in a given style field, the data volume becomes too large, which makes data labeling for data collection difficult and provides no direct help for learning the scene recognition task of that style field. Therefore, the related art cannot effectively solve the problem of migrating scene recognition to a new field quickly and without labeling.
Aiming at the technical problem of rapid field migration of a scene recognition task without labeled data, the embodiments of the application provide a scene recognition method and device based on artificial intelligence, an electronic device, and a computer-readable storage medium, which can rapidly generate a large number of image samples of the first field, effectively improve the recognition capability of the scene recognition model in the first field, and thereby effectively improve scene recognition accuracy.
The scene recognition method provided by the embodiment of the application can be implemented by various electronic devices, for example, can be implemented by a terminal or a server alone, or can be implemented by the terminal and the server in a cooperation manner.
Referring to fig. 2A, fig. 2A is a schematic structural diagram of a scene recognition system based on artificial intelligence according to an embodiment of the present application, a terminal 400 is connected to a server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two.
In some embodiments, the function of the artificial intelligence based scene recognition system is implemented based on the server 200, during the process of using the terminal 400 by a user, the terminal 400 collects a first image sample and a second image sample and sends the first image sample and the second image sample to the server 200, so that the server 200 generates a third image sample which has the same field as the first image sample and has the same label as the second image sample, so as to train the scene recognition model based on a plurality of loss functions, integrate the trained scene recognition model in the server 200, in response to the terminal 400 receiving the fourth image sample, the terminal 400 sends the fourth image sample to the server 200, and the server 200 determines a scene classification result of the fourth image sample through the scene recognition model and sends the scene classification result to the terminal 400, so that the terminal 400 directly presents the scene classification result.
In some embodiments, when the scene recognition system is applied to a video recommendation scene, the terminal 400 receives a video to be uploaded (a video in a first field), the terminal 400 sends the video to the server 200, the server 200 determines a scene classification result of a video frame in the video through a scene recognition model to serve as a scene classification result of the video, and sends the scene classification result to the terminal 400, so that the terminal 400 directly presents the scene classification result of the corresponding video in a video recommendation home page, and the terminal uploading the video and the terminal presenting the scene classification result may be the same or different.
In other embodiments, when the scene recognition method provided by the embodiment of the present application is implemented by a terminal alone, in the various application scenarios described above, the terminal may run the scene recognition model to determine a scene classification result of the image or video in the first domain, and directly present the scene classification result.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal 400 may be a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, a smart television, a smart car device, and the like, and the terminal 400 may be provided with a client, for example, but not limited to, a video client, a browser client, an information flow client, an image capturing client, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited.
In some embodiments, referring to fig. 2B, fig. 2B is a schematic structural diagram of a scene identification system based on a blockchain network provided in an embodiment of the present application, and an exemplary application of the blockchain network based on the embodiment of the present application is described below. Referring to fig. 2B, the blockchain network 600 (which exemplarily shows the node 610-1 and the node 610-2 included in the blockchain network 600), the server 200, and the terminal 400 are included, which are respectively described below.
The server 200 (mapped as node 610-2) and the terminal 400 (mapped as node 610-1) may each join the blockchain network 600 as a node therein, and the mapping of the terminal 400 as node 610-1 of the blockchain network 600 is exemplarily shown in fig. 2B, where each node (e.g., node 610-1, node 610-2) has a consensus function and an accounting (i.e., maintaining a state database, such as a key-value database) function.
The state database of each node (e.g., the node 610-1) records a fourth image sample of the terminal 400 and a scene classification result corresponding to the fourth image sample, so that the terminal 400 can query the fourth image sample recorded in the state database and the scene classification result corresponding to the fourth image sample.
In some embodiments, in response to receiving the image, a plurality of servers 200 (each server mapped to a node in the blockchain network) determine a scene classification result of the fourth image sample; when, for a candidate scene classification result, the number of nodes passing the consensus exceeds a node-number threshold, the consensus is determined to pass. The server 200 (mapped to node 610-2) sends the candidate scene classification result that passed the consensus to the terminal 400 (mapped to node 610-1), where it is presented in the human-computer interaction interface of the terminal 400, and stores the fourth image sample and the corresponding scene classification result on the chain. Because the scene classification result is obtained after consensus among the plurality of servers, the reliability of the scene classification result of the fourth image sample can be effectively improved, and because the blockchain network is difficult to tamper with, the fourth image sample and the corresponding scene classification result stored on the chain cannot be maliciously tampered with.
Next, a structure of an electronic device for implementing an artificial intelligence based scene recognition method according to an embodiment of the present application is described, and as described above, the electronic device according to an embodiment of the present application may be the server 200 or the terminal 400 in fig. 2A. Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device provided in the embodiment of the present application, and the electronic device is taken as a server 200 for example. The server 200 shown in fig. 3 includes: at least one processor 210, memory 250, at least one network interface 220. The various components in server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among the components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 240 in fig. 3.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 250 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks; a network communication module 252 for communicating to other electronic devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), among others.
In some embodiments, the artificial intelligence based scene recognition device provided by the embodiments of the present application may be implemented in software, and fig. 3 illustrates an artificial intelligence based scene recognition device 255 stored in a memory 250, which may be software in the form of programs and plug-ins, and the like, and includes the following software modules: an acquisition module 2551, a style module 2552, a training module 2553 and a recognition module 2554, which are logical and thus can be arbitrarily combined or further split according to the implemented functions, which will be described below.
Referring to fig. 5, fig. 5 is a schematic diagram of the architecture of the scene recognition method based on artificial intelligence provided in an embodiment of the present application. The inputs of fig. 5 include: annotation data of the real style field, unlabeled ACG data, and generated labeled ACG data. The annotation data of the real style field is an image of a seaside character scene in the real style field (a second image sample), and the corresponding scene category label is the seaside category; the unlabeled ACG data are images of various ACG style fields without scene category labels (first image samples); the generated labeled ACG data are third image samples generated based on the first image samples and the second image sample, which theoretically belong to the ACG style field and have the same content as the second image sample, so that each third image sample has the same scene category label as the second image sample, namely the seaside category. The depth features of the first image sample, the second image sample, the third image sample and the enhanced image sample (obtained based on the first image sample) are extracted through a depth feature network; based on the obtained depth features, the classification features (namely, the embedding feature corresponding to each image) of the first image sample, the second image sample, the third image sample and the enhanced image sample are respectively extracted through a pooling network; finally, recognition of multiple types of scenes in the real style field is carried out through the second classification network, and recognition of multiple types of scenes in the ACG style field is carried out through the first classification network.
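A minimal sketch of this layout follows, assuming a PyTorch/torchvision implementation; the use of ResNet-101 as the depth feature network, the 512-dimensional embedding, and all class and parameter names are illustrative assumptions rather than details fixed by Fig. 5.

```python
import torch.nn as nn
from torchvision import models

class SceneRecognitionModel(nn.Module):
    """Shared backbone and pooling with one classification head per style field (cf. Fig. 5)."""
    def __init__(self, num_scene_classes: int, embed_dim: int = 512):
        super().__init__()
        backbone = models.resnet101(weights=None)
        # Depth feature network: all convolutional stages of the backbone.
        self.depth_features = nn.Sequential(*list(backbone.children())[:-2])
        # Pooling network producing the embedding (classification) feature of each image.
        self.pooling = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(2048, embed_dim))
        self.second_classifier = nn.Linear(embed_dim, num_scene_classes)  # real style field
        self.first_classifier = nn.Linear(embed_dim, num_scene_classes)   # ACG style field

    def forward(self, images, field="real"):
        embedding = self.pooling(self.depth_features(images))
        head = self.second_classifier if field == "real" else self.first_classifier
        return embedding, head(embedding)
```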
The scene recognition method provided by the embodiment of the application can be implemented by various electronic devices, for example, can be implemented by a terminal or a server alone, or can be implemented by the terminal and the server in a cooperation manner.
Referring to fig. 4A, fig. 4A is a schematic flowchart of a scene recognition method based on artificial intelligence according to an embodiment of the present application, which will be described with reference to steps 101 and 104 shown in fig. 4A.
In step 101, a first image sample without scene class labels of a first domain is obtained.
As an example, the first domain may be an Animation, Comic and Game (ACG) style domain. The number of first image samples is one or more: when there is only one first image sample, only one ACG style domain exists; when there are multiple first image samples, multiple ACG style domains may exist, so that the ACG style domain is richer. The first domain may also be another style domain, such as a plaster style domain, and the like.
In some embodiments, referring to fig. 4B, fig. 4B is a flowchart illustrating a step 101 of the scene identification method based on artificial intelligence provided in the embodiment of the present application, where the obtaining of the first image sample without scene category annotation in the first domain in the step 101 may be implemented by steps 1011 and 1014 of fig. 4B.
In step 1011, a candidate first set of image samples of the first domain without scene class annotation is obtained.
As an example, continuing to take the ACG style field as the first field: when there are multiple candidate first image samples, the multiple candidate first image samples may be subjected to clustering processing. The clustering processing is an unsupervised classification process and is determined according to the number of scene category labels of the scene recognition model, so that a style image adapted to each scene category label is obtained from the multiple candidate first image samples, thereby improving the effect of subsequent style migration and further improving the accuracy of the scene recognition model in the first field.
In step 1012, a first style image feature is extracted for each candidate first image sample in the set of candidate first image samples.
As an example, the deep learning network model ResNet-101 is trained based on the large-scale general object recognition source data set ImageNet; referring to table 1 for the structure of its convolutional layers, the first style image feature of each candidate first image sample in the plurality of candidate first image samples is extracted through the feature extraction network of ResNet-101.
TABLE 1 Convolutional layer structure of ResNet-101

Layer      Output size    Configuration
conv1      112 × 112      7 × 7, 64, stride 2
conv2_x    56 × 56        3 × 3 max pool, stride 2; [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3
conv3_x    28 × 28        [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4
conv4_x    14 × 14        [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 23
conv5_x    7 × 7          [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3
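As an illustration of this step, the following sketch extracts the first style image feature of a candidate sample with an ImageNet-pretrained ResNet-101; the use of PyTorch/torchvision, the preprocessing parameters and the function names are assumptions, not details given by the embodiment.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# ImageNet-pretrained ResNet-101 with the 1000-way classifier removed, so the
# 2048-dimensional pooled output serves as the first style image feature.
extractor = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
extractor.fc = torch.nn.Identity()
extractor.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def first_style_image_feature(image_path: str) -> torch.Tensor:
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return extractor(image).squeeze(0)  # one feature vector per candidate first image sample
```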
In step 1013, a clustering process is performed according to the first style image features of each candidate first image sample to obtain a plurality of clusters corresponding to the candidate first image sample set.
In some embodiments, in step 1013, the clustering process performed according to the first style image feature of each candidate first image sample to obtain a plurality of clusters corresponding to the candidate first image sample set may be implemented by the following technical solution: randomly selecting N candidate first image samples from the candidate first image sample set, taking the first style image features corresponding to the N candidate first image samples as the initial centroids of the plurality of clusters, and removing the N candidate first image samples from the candidate first image sample set, wherein N is an integral multiple of the number of scene category labels of the scene recognition model; initializing the number of iterations of the clustering process to M, and establishing an empty set corresponding to each cluster, wherein M is an integer greater than or equal to 2; in each iteration of the clustering process, updating the set of each cluster, performing centroid generation processing based on the update result to obtain a new centroid of each cluster, and when the new centroid is different from the initial centroid, adding the candidate first image sample corresponding to the initial centroid back to the candidate first image sample set and updating the initial centroid based on the new centroid; determining the set of each cluster obtained after M iterations as the clustering result, or determining the set of each cluster obtained after m iterations as the clustering result when the centroids of the clusters obtained after the m-th iteration are the same as the centroids obtained after the (m-1)-th iteration, wherein m is an integer variable and 2 ≤ m ≤ M.
In some embodiments, the updating process is performed on the set of each cluster, and the centroid generation process is performed based on the result of the updating process to obtain a new centroid of each cluster, which may be implemented by the following technical solutions: for each candidate first image sample of the set of candidate first image samples: determining a similarity between a first style image feature of the candidate first image sample and an initial centroid of each cluster; determining the initial centroid corresponding to the maximum similarity as belonging to the same cluster as the candidate first image sample, and transferring the candidate first image sample to a set of clusters corresponding to the maximum similarity initial centroid, wherein the maximum similarity initial centroid is the initial centroid corresponding to the maximum similarity; and averaging the first style image characteristics of each candidate first image sample in each cluster set to obtain a new centroid of each cluster.
As an example, M is a constant obtained by initialization and m is an integer variable. Suppose there are 10 candidate first image samples and the number of scene category labels of the scene recognition model is 2; the objective of the clustering process is then to divide the 10 candidate first image samples into two clusters, each cluster having a corresponding set that includes the candidate first image samples belonging to that cluster. First, the first style features of 2 candidate first image samples are randomly selected as the initial centroids of the two clusters. For each of the remaining 8 candidate first image samples, the similarity between the sample and the 2 initial centroids is calculated, for example using the L2 distance; for candidate first image sample A, whose first style image feature is closer to initial centroid a, the sample is assigned to the set of the cluster corresponding to initial centroid a. After the assignment operation has been performed on the 8 candidate first image samples, a new centroid is recalculated for each cluster. If the new centroids of the two clusters are unchanged, or the difference between each new centroid and the corresponding initial centroid is smaller than a threshold, each set can be directly determined as the clustering result; otherwise, the initial centroids are updated with the new centroids and the assignment operation is performed again on the candidate first image samples other than those corresponding to the centroids, until the new centroids of the two clusters remain unchanged, or the difference between the new centroid and the initial centroid is smaller than the threshold, or the specified number of iterations is reached.
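The procedure above behaves like a k-means clustering over the first style image features. The sketch below, assuming PyTorch tensors, is a simplified illustration that re-assigns every sample in each iteration and stops early when the centroids no longer move; it does not reproduce the exact bookkeeping (removing and re-adding the centroid samples) of the embodiment.

```python
import torch

def cluster_style_features(features: torch.Tensor, num_clusters: int, max_iters: int = 20):
    """features: (num_samples, feature_dim) first style image features of the candidates."""
    init = torch.randperm(features.size(0))[:num_clusters]
    centroids = features[init].clone()                      # randomly chosen initial centroids
    assignment = torch.zeros(features.size(0), dtype=torch.long)
    for _ in range(max_iters):
        distances = torch.cdist(features, centroids)        # L2 distance to every centroid
        assignment = distances.argmin(dim=1)                 # move each sample to its nearest cluster
        new_centroids = centroids.clone()
        for k in range(num_clusters):
            members = features[assignment == k]
            if len(members) > 0:                             # keep the old centroid for an empty cluster
                new_centroids[k] = members.mean(dim=0)
        if torch.allclose(new_centroids, centroids):         # centroids unchanged: stop early
            break
        centroids = new_centroids
    return assignment, centroids
```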
In step 1014, a plurality of candidate first image samples for representing the plurality of clusters in a one-to-one correspondence are obtained from the candidate first image sample set, and the plurality of candidate first image samples for representing the plurality of clusters in a one-to-one correspondence are used as a plurality of first image samples without scene class labels of the first domain.
In some embodiments, the obtaining, in step 1014, a plurality of candidate first image samples for characterizing the plurality of clusters in a one-to-one correspondence from the candidate first image sample set may be implemented by: performing the following for each cluster of the plurality of clusters: averaging the first style image characteristics of each candidate first image sample in each cluster to obtain the centroid of each cluster; determining feature distances between first style image features of the plurality of candidate first image samples and the centroids of the clusters; and determining the candidate first image sample corresponding to the minimum characteristic distance as the candidate first image sample for characterizing the cluster.
As an example, in order to determine first image samples capable of representing a plurality of clusters in a one-to-one correspondence manner from a plurality of candidate first image samples, it is required to perform separate processing for each cluster, for example, for cluster a, average the first style image features of 5 candidate first image samples in cluster a to obtain the centroid of cluster a, and determine the feature distance between the first style image features of 10 candidate first image samples and the centroid of cluster a; and determining the candidate first image sample corresponding to the minimum characteristic distance as the candidate first image sample for representing the cluster A, wherein if the number of the scene category labels is 10, 10 candidate first image samples exist, and 10 clusters are represented respectively.
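Continuing the sketch above, the representative candidate first image sample of each cluster can then be picked as the candidate whose first style image feature has the minimum feature distance to that cluster's centroid; the helper name is illustrative.

```python
import torch

def representative_samples(features: torch.Tensor, centroids: torch.Tensor) -> list:
    """For each cluster, pick the candidate whose first style image feature is closest to its centroid."""
    representatives = []
    for k in range(centroids.size(0)):
        dists = torch.cdist(features, centroids[k:k + 1]).squeeze(1)  # distance of every candidate to centroid k
        representatives.append(int(dists.argmin()))                   # index of the first image sample representing cluster k
    return representatives
```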
In some embodiments, in the clustering processing of step 1013, when the number of scene category labels is 10, the number of clusters may also be set to 100, so that 100 candidate first image samples are determined to represent the 100 clusters respectively; together with the 10 coarse clusters this gives 110 clusters in total, where the 100 clusters share one level of partition granularity and the other 10 clusters share another. Through multiple rounds of fine-grained clustering, the candidate first image samples can be partitioned at a fine granularity, thereby obtaining richer clusters and richer first image samples, which is beneficial to improving the training effect of the subsequent scene recognition model.
In step 102, a second image sample with scene type labels in the second domain is subjected to style transformation processing through the first image sample, so as to obtain a third image sample with scene type labels in the first domain.
As an example, the third image sample has the same scene class label as the second image sample.
In some embodiments, referring to fig. 4C, fig. 4C is a flowchart illustrating the step 102 of the scene recognition method based on artificial intelligence according to the embodiment of the present application, in the step 102, a style transformation process is performed on a second image sample with a scene type label in a second domain through a first image sample, so as to obtain a third image sample with a scene type label in the first domain, which can be implemented through step 1021 and 1023 of fig. 4C.
In step 1021, a feature coding process is performed on the second image sample to obtain a first object feature of the second image sample.
In step 1022, a feature coding process is performed on the first image sample to obtain a first to-be-migrated style feature of the first image sample.
As an example, referring to fig. 6, fig. 6 is a schematic style transition diagram of a scene recognition method based on artificial intelligence provided in an embodiment of the present application, and a content image (second image sample) and a style image (first image sample) are respectively encoded through an encoding network (encoder) to obtain a feature of a content dimension (first object feature of the second image sample) and a feature of a style dimension (first to-be-migrated style feature of the first image sample).
In step 1023, style migration processing is performed on the first object feature of the second image sample to the first style feature to be migrated, so as to obtain a third image sample.
In some embodiments, the style migration processing is performed on the first object feature of the second image sample to the first style feature to be migrated in step 1023 to obtain a third image sample, which may be implemented by the following technical solutions: extracting the mean value and the variance of the first object feature of the second image sample, and extracting the mean value and the variance of the first to-be-migrated style feature of the first image sample; mapping the first object feature based on the mean and variance of the first object feature and the mean and variance of the first to-be-migrated style feature to obtain a first migrated feature; and carrying out decoding restoration processing on the first migrated feature to obtain a third image sample.
As an example, by making the mean and variance of the first object feature consistent with the mean and variance of the first to-be-migrated style feature, the second image sample takes on the style of the first image sample, so that the generated third image sample and the first image sample both belong to the first field. The mean and variance of each channel of the first to-be-migrated style feature are calculated, and the mean and variance of the first object feature are mapped to them by formula (1):
\[
\mathrm{AdaIN}(x,y)=\sigma(y)\left(\frac{x-\mu(x)}{\sigma(x)}\right)+\mu(y) \tag{1}
\]
wherein \(\mathrm{AdaIN}(x,y)\) is the first migrated feature output by the style migration layer, \(\sigma(y)\) is the variance of the first to-be-migrated style feature, \(\mu(y)\) is the mean value of the first to-be-migrated style feature, \(\mu(x)\) is the mean value of the first object feature, \(\sigma(x)\) is the variance of the first object feature, and \(x\) is the first object feature (\(y\) denoting the first to-be-migrated style feature).
After the mapping is completed by the style migration layer, the decoding layer recovers a third image sample based on the first migrated feature output by the style migration layer.
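Formula (1) is an adaptive-instance-normalization style mapping. A minimal PyTorch sketch follows, operating on (N, C, H, W) feature maps; using the per-channel standard deviation as the "variance" statistic and the epsilon value are implementation assumptions.

```python
import torch

def adaptive_style_mapping(content_feat: torch.Tensor, style_feat: torch.Tensor,
                           eps: float = 1e-5) -> torch.Tensor:
    """Formula (1): map the first object (content) feature onto the per-channel statistics
    of the first to-be-migrated style feature. Both tensors are (N, C, H, W)."""
    mu_c = content_feat.mean(dim=(2, 3), keepdim=True)
    sigma_c = content_feat.std(dim=(2, 3), keepdim=True) + eps
    mu_s = style_feat.mean(dim=(2, 3), keepdim=True)
    sigma_s = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return sigma_s * (content_feat - mu_c) / sigma_c + mu_s  # first migrated feature
```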
In some embodiments, the style transformation processing is implemented by a style generation network, the style generation network includes a coding network and a style migration network, the style migration network includes a style migration layer and a decoding layer, in step 102, before a second image sample with a scene type label in a second field is subjected to style transformation processing by a first image sample to obtain a third image sample with a scene type label in the first field, feature coding processing is respectively performed on a fifth image sample and a sixth image sample by the coding network to obtain a second object feature of the sixth image sample and a second to-be-migrated style feature of the fifth image sample; extracting the mean value and the variance of the second object characteristic and the mean value and the variance of the second to-be-migrated style characteristic through the style migration layer; mapping the second object features based on the mean and variance of the second object features and the mean and variance of the second to-be-migrated style features to obtain second migrated features; decoding and restoring the second migrated feature through a decoding layer to obtain a seventh image sample; determining a style loss and a content loss based on the seventh image sample, the fifth image sample, and the second migrated feature; and fixing parameters of the coding network and the style migration layer, and updating the parameters of the decoding layer according to the style loss and the content loss.
In some embodiments, determining the style loss and the content loss based on the seventh image sample, the fifth image sample and the second migrated feature may be implemented by the following technical solution: performing feature coding processing on the seventh image sample through the coding network to obtain the image feature of the seventh image sample; extracting the mean and variance of the image feature; determining the style loss based on the mean and variance of the image feature and the mean and variance of the second to-be-migrated style feature; and determining the content loss based on the image feature and the second migrated feature.
As an example, in the training process of the style generation network, the forward propagation performed before determining the style loss and the content loss is similar to that in steps 1021 to 1023. The style loss Ls, which compares the per-layer mean and variance of the image features of the generated seventh image sample with those of the fifth image sample, is determined by formula (2):

L_s = \sum_{i=1}^{L} \left\| \mu(\phi_i(g)) - \mu(\phi_i(s)) \right\|_2 + \left\| \sigma(\phi_i(g)) - \sigma(\phi_i(s)) \right\|_2   (2);

where L_s is the style loss, L is the number of layers of the coding network, i is the identifier of a layer, g is the seventh image sample, \phi_i(g) is the image feature of the seventh image sample at layer i, \mu(\phi_i(g)) and \sigma(\phi_i(g)) are the mean and variance of the image feature of the seventh image sample, \phi_i(s) is the image feature of the fifth image sample at layer i, and \mu(\phi_i(s)) and \sigma(\phi_i(s)) are the mean and variance of the image feature of the fifth image sample.
As an example, for the content loss, a content loss Lc between the image feature of the generated seventh image sample and the second migrated feature output by the style migration layer is determined, see formula (3):

L_c = \left\| \phi(g) - t \right\|_2   (3);

where L_c is the content loss, t is the second migrated feature output by the style migration layer, g is the seventh image sample, and \phi(g) is the image feature of the seventh image sample output by a particular layer of the coding network.
As an example, after the style loss and the content loss are obtained, the style loss Ls and the content loss Lc are combined as final supervision information, and the decoding layer is updated.
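As an illustrative sketch (not the patent's exact implementation), the style and content losses of formulas (2) and (3) can be computed as follows in PyTorch; mean-squared error stands in for the Euclidean distances, and the per-layer feature list returned by the coding network is an assumed interface:

```python
import torch
import torch.nn.functional as F

def mean_std(feat: torch.Tensor, eps: float = 1e-5):
    # Per-channel statistics over the spatial dimensions of an (N, C, H, W) feature map.
    return feat.mean(dim=(2, 3)), feat.std(dim=(2, 3)) + eps

def style_loss(gen_feats, style_feats):
    """Formula (2): match the per-layer mean and variance of the generated
    (seventh) image sample's features to those of the style (fifth) image sample."""
    loss = torch.zeros(())
    for g, s in zip(gen_feats, style_feats):
        g_mean, g_std = mean_std(g)
        s_mean, s_std = mean_std(s)
        loss = loss + F.mse_loss(g_mean, s_mean) + F.mse_loss(g_std, s_std)
    return loss

def content_loss(gen_top_feat: torch.Tensor, migrated_feat: torch.Tensor):
    """Formula (3): keep the re-encoded generated image close to the second
    migrated feature emitted by the style migration layer."""
    return F.mse_loss(gen_top_feat, migrated_feat)

# Only the decoding layer is updated from style_loss(...) + content_loss(...);
# the coding network and the style migration layer stay fixed.
```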
In step 103, a scene recognition model is trained based on the first image sample, the second image sample and the third image sample.
In some embodiments, the scene recognition model comprises: a feature extraction network; in step 103, training a scene recognition model based on the first image sample, the second image sample and the third image sample, which can be implemented by the following technical scheme: performing data enhancement processing on the first image sample to obtain an enhanced image sample corresponding to the first image sample; carrying out forward propagation on the third image sample and the second image sample in the scene recognition model to obtain a first forward propagation result; carrying out forward propagation on the first image sample and the enhanced image sample in a feature extraction network of the scene recognition model to obtain a second forward propagation result; and updating the scene recognition model according to the first forward propagation result and the second forward propagation result.
As an example, training is performed through multiple iterations until the model converges, and the scene recognition model composed of tables 1 and 2 is solved by using a stochastic gradient descent method, where the scene recognition model may use different network structures, or use different pre-training model weights as an initialization model of the scene recognition model, and before performing the iterations, all parameters of the scene recognition model composed of tables 1 and 2 are set to be in a state to be learned.
TABLE 2 Structural table of the pooling layer and fully connected layers in ResNet-101
[The table content is provided as an image in the original; it covers the pooling layer and the fully connected classification layers (the first fully connected layer Fc1 and the second fully connected layer Fc2) appended to ResNet-101 and referenced below.]
In some embodiments, the above-mentioned performing data enhancement processing on the first image sample to obtain an enhanced image sample corresponding to the first image sample may be implemented by the following technical solutions: performing at least one of the following processes for the first image sample and determining the processing result as an enhanced image sample corresponding to the first image sample: performing tone conversion processing on the first image sample; performing cropping processing on the first image sample; performing Gaussian blur processing on the first image sample; a random drawing process is performed on the first image sample.
As an example, the first image samples belong to unsupervised data. Data enhancement is performed on them by means of cropping, tone transformation, adding text watermarks, Gaussian blurring, adding random strokes, and the like, so that an enhanced image sample is obtained for each first image sample to form a sample pair. Since the sample pairs are subsequently used for consistency learning, a specified number of sample pairs can be sampled in each iteration to participate in training; the subset of first image samples used for consistency learning is obtained from the full set of first image samples by sampling.
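As a sketch of such a data-enhancement pipeline (parameter values are illustrative assumptions, and text watermarking or random strokes would require custom transforms not shown here), using torchvision:

```python
import torchvision.transforms as T

# Augmentations applied to an unlabeled first image sample to build its
# enhanced counterpart for the consistency-learning sample pair.
augment = T.Compose([
    T.RandomResizedCrop(224),                        # cropping
    T.ColorJitter(brightness=0.4, contrast=0.4,
                  saturation=0.4, hue=0.1),          # tone transformation
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)), # Gaussian blur
    T.ToTensor(),
])

# original = T.Compose([T.Resize((224, 224)), T.ToTensor()])(img)
# enhanced = augment(img)   # (original, enhanced) forms one sample pair
```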
In some embodiments, the scene recognition model further comprises a second classification network corresponding to the second domain, and a first classification network corresponding to the first domain; the forward propagation of the third image sample and the second image sample in the scene recognition model to obtain the first forward propagation result can be implemented by the following technical scheme: performing feature extraction processing on the second image sample through a feature extraction network to obtain a classification feature of the second image sample, and mapping the classification feature of the second image sample into a first prediction probability that the second image sample belongs to a pre-labeled scene category through the second classification network; performing feature extraction processing on the third image sample through a feature extraction network to obtain a classification feature of the third image sample, and mapping the classification feature of the third image sample into a second prediction probability that the third image sample belongs to a pre-labeled scene category through a first classification network; and combining the first prediction probability, the second prediction probability, the classification characteristic of the second image sample and the classification characteristic of the third image sample to obtain a first forward propagation result.
As an example, the scene recognition model is composed of tables 1 and 2, and all of its parameters are set to a state requiring learning. Forward propagation is performed first in each iteration. Taking a single sample as an example, a sample combination formed by 1 first image sample, 1 enhanced image sample, 1 second image sample and 1 third image sample participates in the forward propagation. The classification features of the third image sample and of the second image sample are determined through the pooling layers in tables 1 and 2; the classification feature of the third image sample is then mapped to a second prediction probability through the first fully connected layer in table 2 (i.e. the first classification network corresponding to the first domain), and the classification feature of the second image sample is mapped to a first prediction probability through the second fully connected layer in table 2 (i.e. the second classification network corresponding to the second domain). The first forward propagation result comprises the first prediction probability, the second prediction probability, the classification feature of the second image sample and the classification feature of the third image sample.
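A minimal sketch of such a backbone with two classification heads is shown below; the use of torchvision's ResNet-101 and the 2048-dimensional pooled feature are assumptions consistent with tables 1 and 2 rather than an exact reproduction of them:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

class SceneRecognitionModel(nn.Module):
    """ResNet-101 backbone (Conv1-Conv5 + pooling) with two fully connected heads:
    Fc1 for the first (target) domain and Fc2 for the second (original) domain."""
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet101(weights="IMAGENET1K_V1")          # ImageNet pre-trained Conv1-Conv5
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.fc1 = nn.Linear(2048, num_classes)                 # first classification network
        self.fc2 = nn.Linear(2048, num_classes)                 # second classification network

    def forward(self, x: torch.Tensor, domain: str = "first"):
        feat = self.features(x).flatten(1)                      # classification (embedding) feature
        logits = self.fc1(feat) if domain == "first" else self.fc2(feat)
        return feat, logits
```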
In some embodiments, the forward propagation of the first image sample and the enhanced image sample in the feature extraction network of the scene recognition model to obtain the second forward propagation result may be implemented by the following technical solutions: performing feature extraction processing on the first image sample through a feature extraction network to obtain the classification feature of the first image sample; carrying out feature extraction processing on the enhanced image sample through a feature extraction network to obtain the classification features of the enhanced image sample; and combining the classification characteristic of the first image sample and the classification characteristic of the enhanced image sample to obtain a second forward propagation result.
As an example, the classification features of the first image sample and the classification features of the enhanced image sample are determined by the pooling layers in table 1 and table 2, and the classification features of the first image sample and the classification features of the enhanced image sample are used as the basis for the subsequent consistency loss.
In some embodiments, the updating the scene recognition model according to the first forward propagation result and the second forward propagation result may be implemented by the following technical solutions: determining a first classification loss according to the first prediction probability and the pre-marked scene category; determining a second classification loss according to the second prediction probability and the pre-marked scene class; determining a first consistency loss according to divergence between the classification features of the second image sample and the classification features of the third image sample; determining a second consistency loss according to divergence between the classification features of the first image sample and the classification features of the enhanced image sample; performing fusion processing on the first classification loss, the second classification loss, the first consistency loss and the second consistency loss to obtain fusion loss; and determining the fitting parameters of the scene recognition model when the fusion loss is minimum, so as to update the scene recognition model based on the fitting parameters.
As an example, the composition of the fusion loss includes two dimensions: a supervised loss calculated from the third image sample and the second image sample, and an unsupervised loss calculated from the first image sample. Given the scene labeling category y of the second image sample, the second classification loss for the third image sample and the first classification loss for the second image sample are calculated with categorical cross entropy, see formula (4):

L_{cls} = -\sum_{c} y_c \log p_c   (4);

where L_{cls} is either the first classification loss or the second classification loss, p_c is either the first prediction probability or the second prediction probability for scene category c, and y_c is the (one-hot) pre-labeled scene category.
In some embodiments, the training also needs to make classification features with the same content the same, i.e. regardless of the style, the classification features extracted from the same scene need to be as consistent as possible, so the consistency loss between the second image sample and the third image sample is determined by formula (5):

L_{c1} = \frac{1}{N} \sum_{i=1}^{N} \left\| f_{2,i} - f_{3,i} \right\|_2^2   (5);

where L_{c1} is the first consistency loss between the second image sample and the third image sample, f_{2,i} is the classification feature of the i-th second image sample, f_{3,i} is the classification feature of the i-th third image sample, N is the number of second image samples (i.e. the number of third image samples), and i is the identifier of the second image sample (i.e. the identifier of the third image sample).
In some embodiments, the first image sample and the corresponding enhanced image sample are each passed through the scene recognition network, and after the classification feature of the first image sample and the classification feature of the enhanced image sample are output by the pooling layer of the scene recognition network, a second consistency loss that makes the distributions of the two classification features similar is determined, see formula (6):

L_{c2} = \frac{1}{N_u} \sum_{i=1}^{N_u} \left\| f_{1,i} - \hat{f}_{1,i} \right\|_2^2   (6);

where L_{c2} is the second consistency loss, f_{1,i} is the classification feature of the i-th first image sample, \hat{f}_{1,i} is the classification feature of the corresponding enhanced image sample, i indexes the sampled first image samples, and N_u is the number of sampled first image samples. In order to effectively learn classification features of the real ACG style domain, the scene recognition model directly captures, through the second consistency loss, the effective feature expression learned from the first image samples.
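Putting the pieces together, a hedged sketch of the fused objective (the cross-entropy terms of formula (4) plus the two consistency terms of formulas (5) and (6)) might look as follows; the loss weights are illustrative and not specified in the text:

```python
import torch
import torch.nn.functional as F

def fusion_loss(logits_real: torch.Tensor, logits_gen: torch.Tensor, y: torch.Tensor,
                feat_real: torch.Tensor, feat_gen: torch.Tensor,
                feat_unsup: torch.Tensor, feat_aug: torch.Tensor,
                w_cls: float = 1.0, w_con: float = 1.0) -> torch.Tensor:
    l_cls1 = F.cross_entropy(logits_real, y)    # first classification loss (second image sample)
    l_cls2 = F.cross_entropy(logits_gen, y)     # second classification loss (third image sample)
    l_con1 = F.mse_loss(feat_real, feat_gen)    # first consistency loss, formula (5)
    l_con2 = F.mse_loss(feat_unsup, feat_aug)   # second consistency loss, formula (6)
    return w_cls * (l_cls1 + l_cls2) + w_con * (l_con1 + l_con2)
```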
In some embodiments, training the scene recognition model based on the first image sample, the second image sample, and the third image sample in step 103 may be implemented by training in stages, with parameter updates based on different types of losses in different stages.
As an example, performing data enhancement processing on a first image sample to obtain an enhanced image sample corresponding to the first image sample; carrying out forward propagation on the third image sample and the second image sample in the scene recognition model to obtain a first forward propagation result; updating a scene recognition model according to the first forward propagation result; carrying out forward propagation on the first image sample and the enhanced image sample in the updated feature extraction network of the scene recognition model to obtain a second forward propagation result; and updating the scene recognition model according to the second forward propagation result.
In step 104, a scene recognition process is performed on the fourth image sample in the first field through the trained scene recognition model, so as to obtain a scene type of the fourth image sample.
For example, the trained scene recognition model has the capability of recognizing the scene of the fourth image sample in the first domain, and the scene category labels it can recognize are the same as those of the second image samples; for example, if second image samples covering 10 scene category labels participate in steps 102-103, the trained scene recognition model can recognize those 10 scene category labels.
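As a usage illustration continuing the SceneRecognitionModel sketch above (preprocess and scene_labels are hypothetical helpers for image preprocessing and the ordered label list):

```python
import torch

model.eval()
with torch.no_grad():
    # Classify a fourth image sample from the first (ACG) domain with the
    # first-domain head; the predicted index is mapped back to a scene label.
    _, logits = model(preprocess(image).unsqueeze(0), domain="first")
    scene_category = scene_labels[logits.argmax(dim=1).item()]
```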
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
In some embodiments, the scene recognition method based on artificial intelligence provided by the embodiment of the present application is applied to a video recommendation application scene to provide a scene recognition service for video aggregation, a terminal uploads a video of a first field to be published to a server, the server performs scene classification processing on a key video frame of the video to obtain a scene classification result of the key video frame, the scene classification result of the key video frame is used as an attribute tag of the video to be published, and the video is published and recommended to a user whose user portrait matches the attribute tag, and the following description is given by taking the first field as an Animation (ACG) style field and the second field as a real style field as examples, where the first field is a target field and the second field is an original field.
In some embodiments, referring to fig. 5, fig. 5 is an architectural schematic diagram of the artificial-intelligence-based scene recognition method provided in an embodiment of the present application. Because no labeled data exists for the scene recognition task in the first domain, the process implemented in fig. 5 is unsupervised transfer learning, and second-domain data (original-domain data), first-domain data (target-domain data) and generated-domain data are utilized in the learning process. The input of fig. 5 includes: real-style-domain annotated data, ACG unlabeled data, and generated labeled ACG data. The real-style-domain annotated data is an image of a seaside character scene in the real style domain (a second image sample), whose scene category label is the seaside category; the ACG unlabeled data are images of various ACG styles without scene category labels (first image samples); the generated labeled ACG data are third image samples generated from the first image samples and the second image samples, which theoretically belong to the ACG style domain and have the same content as the second image samples, so the third image samples carry the same scene label category as the second image samples, i.e. the seaside category. The depth features of the first image sample, the second image sample, the third image sample and the enhanced image sample (obtained from the first image sample) are extracted through a depth feature network; based on these depth features, the classification features (i.e. the embedding feature of each image) of the first image sample, the second image sample, the third image sample and the enhanced image sample are respectively extracted through a pooling network; finally, multi-class scene recognition in the real style domain is performed through the second classification network (class 2), and multi-class scene recognition in the ACG style domain is performed through the first classification network (class 1).
In some embodiments, the style generation is realized by a GAN so as to perform unsupervised transfer learning. The scene recognition task in the ACG style domain (first domain) is performed based on supervised labeled data of the real style domain (second image samples) and unlabeled data of the ACG style domain (first image samples), and the scene recognition capability in the ACG style domain is obtained through deep learning; if 100 scenes can be recognized in the real style domain, the same 100 scenes can be recognized in the ACG style domain after the scene recognition model is learned. First, simulation data mapping the real style domain to the ACG style domain is generated by the GAN based on the data of the two style domains, so as to obtain generated data (third image samples). The generated data (third image samples) and the original-domain data (second image samples) are jointly learned by the scene recognition model, establishing classification-feature learning from the data of the real style domain to data of a transitional style domain; the transitional style domain simulates the ACG style domain and theoretically belongs to it, but because it is produced by the GAN it can be understood as a transitional style domain. The scene recognition model simultaneously learns the feature expression of data in the ACG style domain, so the features of the third image samples in the transitional style domain have the capability of being transferred to the ACG style domain.
In some embodiments, data of the first domain with scene class labels is generated based on supervised samples and unsupervised samples; the following description takes as an example a first image sample and a second image sample participating in training. Referring to fig. 6, a content image (the second image sample) and a style image (the first image sample) are respectively encoded by the coding network (encoder) to obtain a feature of the content dimension and a feature of the style dimension, and the content image acquires the style of the style image by making the mean and variance of the feature distribution of the content dimension consistent with those of the feature distribution of the style dimension (used as supervision information). The specific process is as follows: the mean and variance of each channel of the style-dimension feature are calculated, and the feature distribution of the content dimension is mapped onto the mean and variance of the feature distribution of the style dimension by formula (7):

t = \sigma(y) \cdot \frac{x - \mu(x)}{\sigma(x)} + \mu(y)   (7);

where t is the migrated feature output by the style migration layer, \sigma(y) is the variance of the style-dimension feature distribution, \mu(y) is the mean of the style-dimension feature distribution, \mu(x) is the mean of the content-dimension feature distribution, \sigma(x) is the variance of the content-dimension feature distribution, and x is the feature distribution of the content dimension.
After the mapping is completed by the style migration layer, the third image sample is restored by the decoding layer, and a style loss Ls comparing the per-layer mean and variance of the image features of the generated third image sample with those of the first image sample is determined, see formula (8):

L_s = \sum_{i=1}^{L} \left\| \mu(\phi_i(g)) - \mu(\phi_i(s)) \right\|_2 + \left\| \sigma(\phi_i(g)) - \sigma(\phi_i(s)) \right\|_2   (8);

where L_s is the style loss, L is the number of layers of the coding network, i is the identifier of a layer, g is the third image sample, \phi_i(g) is the image feature of the third image sample at layer i, \mu(\phi_i(g)) and \sigma(\phi_i(g)) are the mean and variance of the image feature of the third image sample, \phi_i(s) is the image feature of the first image sample at layer i, and \mu(\phi_i(s)) and \sigma(\phi_i(s)) are the mean and variance of the image feature of the first image sample.
After the mapping is completed, the content loss Lc between the image feature of the generated third image sample and the migrated feature output by the style migration layer is determined, see formula (9):

L_c = \left\| \phi(g) - t \right\|_2   (9);

where L_c is the content loss, t is the migrated feature output by the style migration layer, g is the third image sample, and \phi(g) is the image feature of the third image sample output by a particular layer of the coding network.
After the style loss and the content loss are acquired, the style loss Ls and the content loss Lc are combined as the final supervision information, and the coding network, the style migration layer and the decoding layer are updated accordingly.
In some embodiments, a transition image (the third image sample, i.e. an image that renders scene content of the real style domain in the ACG style domain) having the content of the second domain and the style of the first domain is generated by means of the style generation network, establishing a learning bridge from the scene recognition model of the real style domain to that of the ACG style domain: the image content of the real style domain is kept unchanged while an ACG-style image with the same content is generated, so that the scene recognition model learns classification features related to scene content recognition without being interfered by style. Selecting reasonable style images (first image samples) is important, because many different styles actually appear in the ACG style domain and deep learning relies on the feature distribution of big data; if the chosen style images cannot migrate the content of the real style domain toward the rich variety of ACG styles, the learning effect of the scene classification features in the ACG style domain is greatly affected.
In some embodiments, in order to ensure that the style images used for generating the third image samples cover as many styles of the ACG style domain as possible, the style images are obtained by extracting prototype samples from the overall set of candidate first image samples (prototype selection); see fig. 7, which is an image selection schematic diagram of the artificial-intelligence-based scene recognition method provided in the embodiment of the present application. First, the corresponding style image features (embedding features) are extracted from all unsupervised candidate first image samples (ACG unlabeled data): given a Resnet101 model trained on the open-source large-scale general image classification dataset ImageNet, the candidate first image samples are input into the Resnet101 model and forward propagation is performed to obtain the style image features, which are the output of the pooling layer in table 2, and the style image features of all candidate first image samples are saved. The style image features of all candidate first image samples are then clustered, specifically with the k-means clustering method, where the number of centroids equals the number N of scene categories to be recognized (for example, N = 100), and the centroid of each cluster is determined; similarly, k-means clustering is also performed with the number of centroids equal to 10 times the number of scene categories to be recognized (for example, 1000), and the centroid of each cluster is determined, finally yielding N x 11 clusters and corresponding centroids. In other words, besides clustering at the granularity of the scene categories to be recognized, finer-grained clustering at 10 times that granularity is also performed, because each scene category may be expressed in multiple subdivision styles; the finer-grained division makes the first image samples richer, and richer first image samples let the scene recognition model generalize to more comprehensive feature learning of the ACG style domain. The center sample corresponding to each cluster is then used as a first image sample: the N x 11 style image features closest in L2 distance to the N x 11 centroids are determined from the style image features of all candidate first image samples, each centroid corresponding to the one candidate first image sample closest to it (the sample whose embedding feature can represent the cluster centroid), and the candidate first image samples corresponding to these N x 11 style image features are used as the first image samples.
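A hedged sketch of this prototype selection, using scikit-learn's KMeans on the saved pooled features (the two clustering granularities and the nearest-sample lookup follow the description above; the file name and feature shape are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_prototypes(features: np.ndarray, n_scenes: int) -> np.ndarray:
    """Cluster candidate style-image features at two granularities (N and 10*N
    centroids) and keep, for each centroid, the index of the candidate sample
    whose feature is closest in L2 distance."""
    indices = []
    for k in (n_scenes, 10 * n_scenes):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
        for centroid in km.cluster_centers_:
            dists = np.linalg.norm(features - centroid, axis=1)
            indices.append(int(dists.argmin()))
    return np.unique(np.array(indices))        # up to N*11 prototype samples

# style_feats = np.load("acg_pool_features.npy")   # (num_candidates, 2048) pooled features
# prototype_ids = select_prototypes(style_feats, n_scenes=100)
```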
In some embodiments, the clustering method may be adjusted according to the unsupervised data distribution condition, and if it is necessary to consider the head style images occupying most of the data volume, a candidate first image sample representing the head style images may be used as the first image sample, as the case may be.
In some embodiments, referring to fig. 8, fig. 8 is an image generation schematic diagram of the artificial-intelligence-based scene recognition method provided by an embodiment of the present application. Two centroids and the corresponding first image samples are produced by the clustering process described above; in fig. 8 the style of an image is represented by its thick border and the content of an image by its internal pattern. The first image sample that is the sample prototype corresponding to a cluster centroid is extracted, and the first image sample and a second image sample carrying a scene label are input into the style generation network; the second image sample is an image of the real style domain, and a third image sample (simulation image) is generated whose content is the same as that of the real-style-domain image and whose style belongs to the ACG style domain. The shaded internal pattern of the third image sample represents the same content as the second image sample, and its thick border represents the same style domain as the first image sample.
In some embodiments, when the scene recognition model is migrated in an unsupervised manner, the scene recognition model (table 1+ table 2) is trained in a manner of training the Resnet101 model, and the recognition task is N types of scene recognition, where N is an integer greater than or equal to 2.
In some embodiments, data enhancement is applied to the unsupervised data, i.e. the scene-label-free data of the ACG style domain (first image samples); enhancement is performed by means of cropping, tone transformation, text watermarking, Gaussian blurring, adding random strokes, and the like, to obtain an enhanced image sample for each first image sample and thus form sample pairs. Since the unsupervised data are subsequently used for consistency learning, in each training round a specified number of sample pairs are sampled and input into the scene recognition model so that the unsupervised loss can be determined later. Conv1-Conv5 in table 1 adopt the parameters of ResNet101 pre-trained on the ImageNet dataset, and a newly added layer (for example, FC_cr) is initialized with a Gaussian distribution with variance 0.01 and mean 0.
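For instance, such an initialization could be written as follows in PyTorch (the layer dimensions are hypothetical; note that a variance of 0.01 corresponds to a standard deviation of 0.1):

```python
import torch.nn as nn

# Newly added layer (e.g. the FC_cr mentioned above); 2048 -> 256 is an assumed shape.
new_layer = nn.Linear(2048, 256)
nn.init.normal_(new_layer.weight, mean=0.0, std=0.1)  # variance 0.01
nn.init.zeros_(new_layer.bias)
```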
In some embodiments, referring to fig. 9, fig. 9 is a schematic structural diagram of a residual block of the artificial-intelligence-based scene recognition method provided in the embodiments of the present application. A residual block is composed of three convolution layers, a fusion operator and an activation function, and the input of the block is 256-dimensional. After convolution processing by three convolution kernels of different sizes, the convolution result and the block input are fused through the addition operator (the output of a block's convolution path is added to that block's own input) to obtain a fusion result; the fusion result is then activated, with relu denoting the activation function, and the activated output serves as the input of the next residual block (the output of the (n-1)-th residual block feeds the n-th residual block, n being an integer greater than or equal to 2). As the network depth increases, training becomes increasingly difficult, mainly because error signals back-propagated through many layers during stochastic-gradient-descent-based training easily cause gradient vanishing or gradient explosion; the residual structure shown in fig. 9 alleviates the training difficulty brought by network depth, so the network performance (accuracy and precision on the task) is high.
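A sketch of such a residual unit (a standard bottleneck layout is assumed, since the exact kernel sizes are not given in the text):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual unit as in fig. 9: three convolutions, an additive fusion with
    the block input, then ReLU activation."""
    def __init__(self, channels: int = 256, mid: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        self.conv2 = nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False)
        self.conv3 = nn.Conv2d(mid, channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        return self.relu(out + x)   # fusion (addition) followed by activation
```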
In some embodiments, training is performed over multiple iterations until the model converges, and the scene recognition model composed of tables 1 and 2 is solved with the stochastic gradient descent method; the scene recognition model may use a different network structure, or use different pre-trained model weights as its initialization. All parameters of the scene recognition model composed of tables 1 and 2 are set to a state requiring learning. In each iteration, taking a single sample as an example, a sample combination formed by 1 first image sample, 1 enhanced image sample, 1 second image sample and 1 third image sample participates in the forward propagation and backward update processes. During forward propagation, the second prediction probability of the third image sample and the first prediction probability of the second image sample are computed, and the classification features of the third image sample, the second image sample, the first image sample and the enhanced image sample are obtained; the fusion loss of the sample combination is then calculated from the multiple types of losses, and the fusion loss is back-propagated to the scene recognition model, i.e. the gradients are computed and the weight parameters are updated by stochastic gradient descent, realizing one round of weight optimization. The trained scene recognition model is obtained after iterating this process many times. The first fully connected layer Fc1 and the second fully connected layer Fc2 in table 2 adopt different network parameters.
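Tying the earlier sketches together, one training iteration might look as follows (model and fusion_loss refer to the sketches above; the batch tensors x_first, x_aug, x_second, x_third and the label tensor y are assumed to come from the sampled sample combination, and the SGD hyper-parameters are illustrative):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Forward propagation for one sample combination.
feat_real, logits_real = model(x_second, domain="second")   # second image sample
feat_gen, logits_gen = model(x_third, domain="first")        # generated third image sample
feat_unsup, _ = model(x_first, domain="first")               # unlabeled first image sample
feat_aug, _ = model(x_aug, domain="first")                   # its enhanced counterpart

# Fuse the losses and update the weights by stochastic gradient descent.
loss = fusion_loss(logits_real, logits_gen, y, feat_real, feat_gen, feat_unsup, feat_aug)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```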
In some embodiments, the composition of the fusion loss includes two dimensions: a supervised loss calculated from the simulated image (the generated third image sample) and the real image (the second image sample), and an unsupervised loss calculated from the features of the unsupervised image samples (the first image samples). Given the scene labeling category y of the second image sample, and since the third image sample has the content of the second image sample, the third image sample also carries the scene labeling category y; the second classification loss for the third image sample and the first classification loss for the second image sample are calculated with categorical cross entropy, see formula (10):

L_{cls} = -\sum_{c} y_c \log p_c   (10);

where L_{cls} is either the first classification loss or the second classification loss, p_c is either the first prediction probability or the second prediction probability for scene category c, and y_c is the (one-hot) pre-labeled scene category.
In some embodiments, the training also needs to make classification features with the same content the same, i.e. regardless of the style, the classification features extracted from the same scene need to be as consistent as possible, so the consistency loss between the second image sample and the third image sample is determined by formula (11):

L_{c1} = \frac{1}{N} \sum_{i=1}^{N} \left\| f_{2,i} - f_{3,i} \right\|_2^2   (11);

where L_{c1} is the first consistency loss between the second image sample and the third image sample, f_{2,i} is the classification feature of the i-th second image sample, f_{3,i} is the classification feature of the i-th third image sample, N is the number of second image samples (i.e. the number of third image samples), and i is the identifier of the second image sample (i.e. the identifier of the third image sample).
In some embodiments, the unsupervised loss is obtained by passing a sample pair (a first image sample and its corresponding enhanced image sample) through the scene recognition network; after the classification feature of the first image sample and the classification feature of the enhanced image sample are output by the pooling layer of the scene recognition network, a second consistency loss that makes the distributions of the two classification features similar is determined, see formula (12):

L_{c2} = \frac{1}{N_u} \sum_{i=1}^{N_u} \left\| f_{1,i} - \hat{f}_{1,i} \right\|_2^2   (12);

where L_{c2} is the second consistency loss, f_{1,i} is the classification feature of the i-th first image sample, \hat{f}_{1,i} is the classification feature of the corresponding enhanced image sample, i indexes the sampled first image samples, and N_u is the number of sampled first image samples. In order to effectively learn classification features of the real ACG style domain, the scene recognition model directly captures, through the second consistency loss, the effective feature expression learned from the first image samples.
In some embodiments, the scene recognition model may be subjected to noise training, and the scene recognition model obtained by the noise training may be loaded on a cloud server, to provide a scene recognition service, referring to fig. 10, fig. 10 is a flowchart of a process of a scene recognition method based on artificial intelligence according to an embodiment of the present application, where a terminal a receives an image of a first domain input by a user, the image of the first domain may be obtained by shooting, then uploading the image to a server, the server uses the scene recognition model provided by the embodiment of the application to classify the scene of the image input by the user, and outputs the scene classification result to a terminal B and/or a terminal A for corresponding display, wherein the terminal B is a terminal different from the terminal A, for example, the terminal a uploads an image to a server for distribution, and the terminal B receives the image delivered by the server and a scene classification result of the corresponding image.
By means of the artificial-intelligence-based scene recognition method provided by the embodiment of the present application, a large number of unlabeled ACG-style-domain images and labeled real-style-domain images are leveraged: images of the ACG style domain are generated through the GAN, so that the sample capacity of the scene recognition model is expanded along richer domain dimensions and the model's ability to recognize different ACG-style images is improved. Generating scene-related images of the ACG style domain through the GAN gives the scene recognition model the capability to perceive and distinguish the ACG style domain; at the same time, jointly learning the third image samples and the second image samples establishes a transfer bridge of features common to the real style domain and the transitional ACG style domain, so that the feature learning performed on ACG-style-domain images can act directly on scene recognition in the ACG style domain, and scene categories of the real style domain can finally be recognized in images of the ACG style domain.
According to the artificial-intelligence-based scene recognition method of the embodiment of the present application, a scene recognition model of the ACG style domain with generalization capability can be trained quickly on an input dataset that requires no additional manual labeling. A style-imitating transition image (the third image sample) generated from an arbitrary ACG style helps the learning, so that common features are more easily extracted between the two style domains, migration efficiency is improved, and the learning gap between the real style domain and the ACG style domain is reduced; the labeled data of the real style domain and the large amount of unlabeled data of the ACG style domain are thus effectively utilized for mining scene recognition capability in the ACG style domain.
Continuing with the exemplary structure of the artificial intelligence based scene recognition device 255 provided by the embodiments of the present application as software modules, in some embodiments, as shown in fig. 3, the software modules stored in the artificial intelligence based scene recognition device 255 of the memory 250 may include: an obtaining module 2551, configured to obtain a first image sample without scene category labeling in a first domain; the style module 2552 is configured to perform style transformation processing on the second image sample with the scene category label in the second field through the first image sample to obtain a third image sample with the scene category label in the first field; wherein the third image sample and the second image sample have the same scene type label; a training module 2553, configured to train a scene recognition model based on the first image sample, the second image sample, and the third image sample; the identifying module 2554 is configured to perform scene identification processing on the fourth image sample in the first field through the trained scene identification model, so as to obtain a scene category of the fourth image sample.
In some embodiments, the obtaining module 2551 is further configured to: acquiring a candidate first image sample set without scene category labels in a first field; extracting a first style image characteristic of each candidate first image sample in the candidate first image sample set; performing clustering processing according to the first style image characteristics of each candidate first image sample to obtain a plurality of clusters corresponding to the candidate first image sample set; and obtaining a plurality of candidate first image samples for representing a plurality of clusters in a one-to-one correspondence manner from the candidate first image sample set, and taking the plurality of candidate first image samples for representing the plurality of clusters in the one-to-one correspondence manner as a plurality of first image samples without scene category labeling in the first field.
In some embodiments, the obtaining module 2551 is further configured to: randomly select N candidate first image samples from the candidate first image sample set, take the first style image features corresponding to the N candidate first image samples as the initial centroids of the plurality of clusters, and remove the N candidate first image samples from the candidate first image sample set, wherein N is an integral multiple of the number of scene category labels of the scene recognition model; initialize the number of iterations of the clustering process to M, and establish an empty set corresponding to each cluster, wherein M is an integer greater than or equal to 2; in each iteration of the clustering process, update the set of each cluster, execute centroid generation processing based on the update result to obtain a new centroid of each cluster, and when the new centroid is different from the initial centroid, add the candidate first image sample corresponding to the initial centroid back to the candidate first image sample set and update the initial centroid based on the new centroid; determine the set of each cluster obtained after M iterations as the clustering result, or determine the set of each cluster obtained after m iterations as the clustering result, where the centroids of the clusters obtained after m iterations are the same as the centroids obtained after m-1 iterations, m being an integer variable with a value greater than or equal to 2 and less than or equal to M.
In some embodiments, the obtaining module 2551 is further configured to: for each candidate first image sample of the set of candidate first image samples: determining a similarity between a first style image feature of the candidate first image sample and an initial centroid of each cluster; determining the initial centroid corresponding to the maximum similarity as belonging to the same cluster as the candidate first image sample, and transferring the candidate first image sample to a set of clusters corresponding to the maximum similarity initial centroid, wherein the maximum similarity initial centroid is the initial centroid corresponding to the maximum similarity; and averaging the first style image characteristics of each candidate first image sample in each cluster set to obtain a new centroid of each cluster.
In some embodiments, the obtaining module 2551 is further configured to: performing the following for each cluster of the plurality of clusters: averaging the first style image characteristics of each candidate first image sample in each cluster to obtain the centroid of each cluster; determining feature distances between first style image features of the plurality of candidate first image samples and the centroids of the clusters; and determining the candidate first image sample corresponding to the minimum characteristic distance as the candidate first image sample for characterizing the cluster.
In some embodiments, style module 2552 is further configured to: performing feature coding processing on the second image sample to obtain a first object feature of the second image sample; carrying out feature coding processing on the first image sample to obtain a first to-be-migrated style feature of the first image sample; and performing style migration processing on the first object feature of the second image sample to the first style feature to be migrated to obtain a third image sample.
In some embodiments, style module 2552 is further configured to: extracting the mean value and the variance of the first object feature of the second image sample, and extracting the mean value and the variance of the first to-be-migrated style feature of the first image sample; mapping the first object feature based on the mean and variance of the first object feature and the mean and variance of the first to-be-migrated style feature to obtain a first migrated feature; and carrying out decoding restoration processing on the first migrated feature to obtain a third image sample.
In some embodiments, the style transformation process is implemented by a style generation network comprising an encoding network and a style migration network comprising a style migration layer and a decoding layer; before performing style transformation processing on the second image sample with scene category labeling in the second field through the first image sample to obtain a third image sample with scene category labeling in the first field, the training module 2553 is further configured to: respectively carrying out feature coding processing on the fifth image sample and the sixth image sample through a coding network to obtain a second object feature of the sixth image sample and a second to-be-migrated style feature of the fifth image sample; extracting the mean value and the variance of the second object characteristic and the mean value and the variance of the second to-be-migrated style characteristic through the style migration layer; mapping the second object features based on the mean and variance of the second object features and the mean and variance of the second to-be-migrated style features to obtain second migrated features; decoding and restoring the second migrated feature through a decoding layer to obtain a seventh image sample; determining a style loss and a content loss based on the seventh image sample, the fifth image sample, and the second migrated feature; and fixing parameters of the coding network and the style migration layer, and updating the parameters of the decoding layer according to the style loss and the content loss.
In some embodiments, training module 2553 is further configured to: performing feature coding processing on the seventh image sample through a coding network to obtain the image features of the seventh image sample; extracting the mean value and the variance of the image features; determining style loss based on the mean and variance of the image features and the mean and variance of the second style feature to be migrated; based on the image features and the second migrated features, a content loss is determined.
In some embodiments, training module 2553 is further configured to: performing data enhancement processing on the first image sample to obtain an enhanced image sample corresponding to the first image sample; carrying out forward propagation on the third image sample and the second image sample in the scene recognition model to obtain a first forward propagation result; carrying out forward propagation on the first image sample and the enhanced image sample in a feature extraction network of the scene recognition model to obtain a second forward propagation result; and updating the scene recognition model according to the first forward propagation result and the second forward propagation result.
In some embodiments, training module 2553 is further configured to: performing at least one of the following processes for the first image sample and determining the processing result as an enhanced image sample corresponding to the first image sample: performing tone conversion processing on the first image sample; performing cropping processing on the first image sample; performing Gaussian blur processing on the first image sample; a random drawing process is performed on the first image sample.
In some embodiments, the scene recognition model further comprises a second classification network corresponding to the second domain, and a first classification network corresponding to the first domain; a training module 2553, further configured to: performing feature extraction processing on the second image sample through a feature extraction network to obtain a classification feature of the second image sample, and mapping the classification feature of the second image sample into a first prediction probability that the second image sample belongs to a pre-labeled scene category through the second classification network; performing feature extraction processing on the third image sample through a feature extraction network to obtain a classification feature of the third image sample, and mapping the classification feature of the third image sample into a second prediction probability that the third image sample belongs to a pre-labeled scene category through a first classification network; and combining the first prediction probability, the second prediction probability, the classification characteristic of the second image sample and the classification characteristic of the third image sample to obtain a first forward propagation result.
In some embodiments, training module 2553 is further configured to: performing feature extraction processing on the first image sample through a feature extraction network to obtain the classification feature of the first image sample; carrying out feature extraction processing on the enhanced image sample through a feature extraction network to obtain the classification features of the enhanced image sample; and combining the classification characteristic of the first image sample and the classification characteristic of the enhanced image sample to obtain a second forward propagation result.
In some embodiments, training module 2553 is further configured to: determining a first classification loss according to the first prediction probability and the pre-marked scene category; determining a second classification loss according to the second prediction probability and the pre-marked scene class; determining a first consistency loss according to divergence between the classification features of the second image sample and the classification features of the third image sample; determining a second consistency loss according to divergence between the classification features of the first image sample and the classification features of the enhanced image sample; performing fusion processing on the first classification loss, the second classification loss, the first consistency loss and the second consistency loss to obtain fusion loss; and determining the fitting parameters of the scene recognition model when the fusion loss is minimum, so as to update the scene recognition model based on the fitting parameters.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the artificial intelligence based scene recognition method according to the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, perform an artificial intelligence based scene recognition method provided by embodiments of the present application, for example, the artificial intelligence based scene recognition method shown in fig. 4A-4C.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or distributed across multiple sites and interconnected by a communication network.
In summary, according to the embodiment of the present application, an image (a third image sample) of a first field with a label is generated by using an image (a first image sample) of a first field without a label and an image (a second image sample) of a second field with a label, so that a training sample capacity of a scene recognition model for executing a scene recognition task of the first field is expanded, and an identification capability of the scene recognition model in the first field is effectively improved.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (13)

1. A scene recognition method based on artificial intelligence is characterized by comprising the following steps:
acquiring a first image sample without scene category marking in a first field;
performing style transformation processing on a second image sample with scene type labels in a second field through the first image sample to obtain a third image sample with scene type labels in the first field;
wherein the third image sample has the same scene class label as the second image sample;
performing data enhancement processing on the first image sample to obtain an enhanced image sample corresponding to the first image sample;
performing feature extraction processing on the second image sample through a feature extraction network to obtain a classification feature of the second image sample, and mapping the classification feature of the second image sample to a first prediction probability that the second image sample belongs to a pre-labeled scene category through a second classification network;
performing feature extraction processing on the third image sample through the feature extraction network to obtain a classification feature of the third image sample, and mapping the classification feature of the third image sample to a second prediction probability that the third image sample belongs to the pre-labeled scene category through a first classification network;
the scene recognition model comprises the feature extraction network, a second classification network corresponding to the second field and a first classification network corresponding to the first field;
combining the first prediction probability, the second prediction probability, the classification feature of the second image sample and the classification feature of the third image sample to obtain a first forward propagation result;
carrying out forward propagation on the first image sample and the enhanced image sample in a feature extraction network of the scene recognition model to obtain a second forward propagation result;
updating the scene recognition model according to the first forward propagation result and the second forward propagation result;
and carrying out scene recognition processing on a fourth image sample in the first field through the trained scene recognition model to obtain a scene type of the fourth image sample.
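By way of illustration only, and not as part of the claims, the following Python sketch shows one possible arrangement of the feature extraction network and the two classification networks recited in claim 1. All names (SceneRecognitionModel, head_source, head_target) and the toy backbone are hypothetical; a practical system would use a deeper backbone.

import torch
import torch.nn as nn

class SceneRecognitionModel(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int = 512):
        super().__init__()
        # Shared feature extraction network (hypothetical toy backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        # Second classification network, for second-field (labeled) samples.
        self.head_source = nn.Linear(feat_dim, num_classes)
        # First classification network, for first-field (style-transferred) samples.
        self.head_target = nn.Linear(feat_dim, num_classes)

    def forward_labeled(self, second_img, third_img):
        f2 = self.backbone(second_img)          # classification feature of the second sample
        f3 = self.backbone(third_img)           # classification feature of the third sample
        p1 = self.head_source(f2).softmax(-1)   # first prediction probability
        p2 = self.head_target(f3).softmax(-1)   # second prediction probability
        return p1, p2, f2, f3                   # first forward propagation result

    def forward_unlabeled(self, first_img, enhanced_img):
        # Second forward propagation result: only the feature extraction network is used.
        return self.backbone(first_img), self.backbone(enhanced_img)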
2. The method of claim 1, wherein obtaining the scene category label-free first image sample of the first domain comprises:
acquiring a candidate first image sample set without scene category labels in a first field;
extracting a first style image feature of each candidate first image sample in the candidate first image sample set;
performing clustering processing according to the first style image characteristics of each candidate first image sample to obtain a plurality of clusters corresponding to the candidate first image sample set;
and acquiring a plurality of candidate first image samples for representing the plurality of clusters in a one-to-one correspondence manner from the candidate first image sample set, and taking the plurality of candidate first image samples for representing the plurality of clusters in a one-to-one correspondence manner as a plurality of first image samples without scene category labels of the first field.
3. The method according to claim 2, wherein the performing clustering processing according to the first style image feature of each candidate first image sample to obtain a plurality of clusters corresponding to the candidate first image sample set comprises:
randomly selecting N candidate first image samples from the set of candidate first image samples, taking first style image features corresponding to the N candidate first image samples as initial centroids of the plurality of clusters, and removing the N candidate first image samples from the set of candidate first image samples, wherein N is an integral multiple of the number of scene category labels of the scene recognition model;
initializing the number of iterations of the clustering processing to M, and establishing an empty set corresponding to each cluster, wherein M is an integer greater than or equal to 2;
during each iteration of the clustering process: updating the set of each cluster, performing centroid generation processing based on the update result to obtain a new centroid of each cluster, adding the candidate first image sample corresponding to the initial centroid back to the candidate first image sample set when the new centroid is different from the initial centroid, and updating the initial centroid based on the new centroid;
determining each cluster set obtained after M iterations as the clustering processing result; or, when the centroids of the clusters obtained after m iterations are the same as the centroids of the clusters obtained after m-1 iterations, determining each cluster set obtained after m iterations as the clustering processing result, wherein m is an integer variable whose value is greater than or equal to 2 and less than or equal to M.
4. The method of claim 3, wherein the updating each set of clusters and performing a centroid generation process based on the updating process result to obtain a new centroid for each cluster comprises:
for each of the candidate first image samples of the set of candidate first image samples: determining a similarity between a first-style image feature of the candidate first image sample and an initial centroid of each of the clusters;
determining that the initial centroid corresponding to the maximum similarity belongs to the same cluster as the candidate first image sample, and transferring the candidate first image sample to the set of the cluster corresponding to the maximum-similarity initial centroid, wherein the maximum-similarity initial centroid is the initial centroid corresponding to the largest similarity;
and averaging the first style image features of the candidate first image samples in the set of each cluster to obtain a new centroid of each cluster.
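As a non-limiting illustration of the clustering in claims 3 and 4, the sketch below runs a plain k-means-style loop over the first style image features, using cosine similarity for the assignment step and feature averaging for centroid generation. It omits the claimed bookkeeping of removing and re-adding the candidate samples that supplied the initial centroids, and all names are assumptions.

import numpy as np

def cluster_style_features(features, n_clusters, n_iters):
    # features: (num_samples, dim) array of first style image features.
    rng = np.random.default_rng(0)
    # Initial centroids: style features of randomly selected candidate samples.
    centroids = features[rng.choice(len(features), n_clusters, replace=False)]
    assign = np.zeros(len(features), dtype=int)
    for _ in range(n_iters):
        # Assignment: each sample joins the cluster of its most similar centroid.
        sims = (features @ centroids.T) / (
            np.linalg.norm(features, axis=1, keepdims=True)
            * np.linalg.norm(centroids, axis=1) + 1e-8)
        assign = sims.argmax(axis=1)
        # Centroid generation: average the style features in each cluster set.
        new_centroids = np.stack([
            features[assign == k].mean(axis=0) if np.any(assign == k) else centroids[k]
            for k in range(n_clusters)])
        if np.allclose(new_centroids, centroids):   # centroids unchanged: stop early
            break
        centroids = new_centroids
    return assign, centroids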
5. The method of claim 2, wherein obtaining a plurality of candidate first image samples for characterizing the plurality of clusters in a one-to-one correspondence from the set of candidate first image samples comprises:
performing the following for each of the clusters in the plurality of clusters:
averaging the first style image features of the candidate first image samples in each cluster to obtain the centroid of the cluster;
determining feature distances between the first style image features of the candidate first image samples in the cluster and the centroid of the cluster;
and determining the candidate first image sample corresponding to the minimum feature distance as the candidate first image sample for characterizing the cluster.
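A minimal sketch of the representative-sample selection in claim 5, reusing the hypothetical features, assign and centroids produced by the clustering sketch above: for each cluster, the candidate whose first style image feature is closest to the cluster centroid is kept as the first image sample characterizing that cluster.

import numpy as np

def pick_representatives(features, assign, centroids):
    reps = []
    for k, c in enumerate(centroids):
        idx = np.where(assign == k)[0]                      # members of cluster k
        dists = np.linalg.norm(features[idx] - c, axis=1)   # feature distances to the centroid
        reps.append(int(idx[dists.argmin()]))               # minimum feature distance wins
    return reps   # indices of the selected first image samples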
6. The method of claim 1,
the obtaining a third image sample with scene category labels in the first field by performing style transformation processing on a second image sample with scene category labels in a second field through the first image sample comprises:
performing feature coding processing on the second image sample to obtain a first object feature of the second image sample;
performing feature coding processing on the first image sample to obtain a first to-be-migrated style feature of the first image sample;
and performing style migration processing on the first object feature of the second image sample to the first style feature to be migrated to obtain the third image sample.
7. The method of claim 6,
performing style migration processing on the first object feature of the second image sample to the first style feature to be migrated to obtain a third image sample, including:
extracting the mean value and the variance of the first object feature of the second image sample, and extracting the mean value and the variance of the first to-be-migrated style feature of the first image sample;
mapping the first object feature based on the mean and variance of the first object feature and the mean and variance of the first to-be-migrated style feature to obtain a first migrated feature;
and performing decoding restoration processing on the first migrated feature to obtain the third image sample.
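By way of illustration, the style migration in claims 6 and 7 follows the familiar adaptive instance normalization (AdaIN) recipe: the object feature is normalized with its own mean and variance and then rescaled with the mean and variance of the to-be-migrated style feature. The sketch below is a minimal version under that assumption; the function name and arguments are hypothetical.

import torch

def style_migrate(object_feat, style_feat, eps=1e-5):
    # object_feat, style_feat: (batch, channels, height, width) encoder feature maps.
    mu_o = object_feat.mean(dim=(2, 3), keepdim=True)
    var_o = object_feat.var(dim=(2, 3), keepdim=True)
    mu_s = style_feat.mean(dim=(2, 3), keepdim=True)
    var_s = style_feat.var(dim=(2, 3), keepdim=True)
    # Normalize the object feature, then map it onto the style statistics.
    normalized = (object_feat - mu_o) / torch.sqrt(var_o + eps)
    return normalized * torch.sqrt(var_s + eps) + mu_s   # first migrated feature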
8. The method of claim 1, wherein the style transformation process is implemented by a style generation network comprising an encoding network and a style migration network comprising a style migration layer and a decoding layer;
before the style transformation processing is performed on the second image sample with the scene type label in the second field through the first image sample to obtain the third image sample with the scene type label in the first field, the method further includes:
respectively carrying out feature coding processing on a fifth image sample and a sixth image sample through the coding network to obtain a second object feature of the sixth image sample and a second to-be-migrated style feature of the fifth image sample;
extracting the mean value and the variance of the second object feature and the mean value and the variance of the second to-be-migrated style feature through the style migration layer;
mapping the second object feature based on the mean and variance of the second object feature and the mean and variance of the second to-be-migrated style feature to obtain a second migrated feature;
decoding and restoring the second migrated feature through the decoding layer to obtain a seventh image sample;
determining a style loss and a content loss based on the seventh image sample, the fifth image sample, and the second migrated feature;
and fixing the parameters of the coding network and the style migration layer, and updating the parameters of the decoding layer according to the style loss and the content loss.
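A hedged sketch of the decoder update in claim 8: the encoding network and the style migration layer stay fixed (the optimizer is assumed to hold only the decoding layer's parameters), and the decoding layer is updated with a content loss and a style loss in the spirit of the usual AdaIN training recipe. It reuses the hypothetical style_migrate function above; the exact loss formulation here is an assumption, not the claimed one.

import torch
import torch.nn.functional as F

def decoder_step(encoder, decoder, fifth_img, sixth_img, optimizer):
    with torch.no_grad():                        # encoder and style migration layer are fixed
        style_feat = encoder(fifth_img)          # second to-be-migrated style feature
        object_feat = encoder(sixth_img)         # second object feature
        migrated = style_migrate(object_feat, style_feat)
    seventh_img = decoder(migrated)              # decoded, restored seventh image sample
    re_encoded = encoder(seventh_img)
    # Content loss: the restored image should preserve the migrated feature.
    content_loss = F.mse_loss(re_encoded, migrated)
    # Style loss: the restored image should match the style image's feature statistics.
    style_loss = (F.mse_loss(re_encoded.mean(dim=(2, 3)), style_feat.mean(dim=(2, 3)))
                  + F.mse_loss(re_encoded.std(dim=(2, 3)), style_feat.std(dim=(2, 3))))
    loss = content_loss + style_loss
    optimizer.zero_grad()                        # optimizer holds only the decoder parameters
    loss.backward()
    optimizer.step()
    return loss.item()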
9. The method of claim 1, wherein the forward propagating the first image sample and the enhanced image sample in a feature extraction network of the scene recognition model to obtain a second forward propagation result comprises:
performing feature extraction processing on the first image sample through the feature extraction network to obtain a classification feature of the first image sample;
performing feature extraction processing on the enhanced image sample through the feature extraction network to obtain the classification features of the enhanced image sample;
and combining the classification features of the first image sample and the classification features of the enhanced image sample to obtain the second forward propagation result.
10. The method of claim 1,
the updating the scene recognition model according to the first forward propagation result and the second forward propagation result includes:
determining a first classification loss according to the first prediction probability and the pre-marked scene category;
determining a second classification loss according to the second prediction probability and the pre-marked scene category;
determining a first consistency loss according to divergence between the classification features of the second image sample and the classification features of the third image sample;
determining a second consistency loss according to divergence between the classification features of the first image sample and the classification features of the enhanced image sample;
performing fusion processing on the first classification loss, the second classification loss, the first consistency loss and the second consistency loss to obtain fusion loss;
determining a fitting parameter of the scene recognition model when the fusion loss takes a minimum value, so as to update the scene recognition model based on the fitting parameter.
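As a non-authoritative illustration of the fusion in claim 10, the sketch below combines the two classification losses with two Kullback-Leibler consistency losses computed over softmax-normalized classification features, under assumed weights w_cls and w_cons; p1 and p2 are the prediction probabilities and f1, f1_aug, f2, f3 the classification features named in claim 1.

import torch
import torch.nn.functional as F

def fused_loss(p1, p2, f2, f3, f1, f1_aug, labels, w_cls=1.0, w_cons=0.5):
    # First and second classification losses against the pre-labeled scene categories.
    cls_loss_1 = F.nll_loss(torch.log(p1 + 1e-8), labels)
    cls_loss_2 = F.nll_loss(torch.log(p2 + 1e-8), labels)
    # First consistency loss: divergence between second- and third-sample features.
    cons_loss_1 = F.kl_div(F.log_softmax(f3, dim=-1), F.softmax(f2, dim=-1),
                           reduction='batchmean')
    # Second consistency loss: divergence between the first sample and its enhancement.
    cons_loss_2 = F.kl_div(F.log_softmax(f1_aug, dim=-1), F.softmax(f1, dim=-1),
                           reduction='batchmean')
    # Fusion of the four losses into a single objective to be minimized.
    return w_cls * (cls_loss_1 + cls_loss_2) + w_cons * (cons_loss_1 + cons_loss_2)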
11. A scene recognition device based on artificial intelligence is characterized by comprising:
the acquisition module is used for acquiring a first image sample without scene category marking in a first field;
the style module is used for carrying out style transformation processing on a second image sample with scene type labels in a second field through the first image sample to obtain a third image sample with scene type labels in the first field;
wherein the third image sample has the same scene class label as the second image sample;
the training module is used for carrying out data enhancement processing on the first image sample to obtain an enhanced image sample corresponding to the first image sample; performing feature extraction processing on the second image sample through a feature extraction network to obtain a classification feature of the second image sample, and mapping the classification feature of the second image sample to a first prediction probability that the second image sample belongs to a pre-labeled scene category through a second classification network; performing feature extraction processing on the third image sample through the feature extraction network to obtain a classification feature of the third image sample, and mapping the classification feature of the third image sample to a second prediction probability that the third image sample belongs to the pre-labeled scene category through a first classification network; the scene recognition model comprises the feature extraction network, a second classification network corresponding to the second field and a first classification network corresponding to the first field; combining the first prediction probability, the second prediction probability, the classification feature of the second image sample and the classification feature of the third image sample to obtain a first forward propagation result; carrying out forward propagation on the first image sample and the enhanced image sample in a feature extraction network of the scene recognition model to obtain a second forward propagation result; updating the scene recognition model according to the first forward propagation result and the second forward propagation result;
and the recognition module is used for carrying out scene recognition processing on the fourth image sample in the first field through the trained scene recognition model to obtain the scene category of the fourth image sample.
12. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the artificial intelligence based scene recognition method of any one of claims 1 to 10 when executing executable instructions stored in the memory.
13. A computer-readable storage medium storing executable instructions for implementing the artificial intelligence based scene recognition method of any one of claims 1 to 10 when executed by a processor.
CN202110501512.6A 2021-05-08 2021-05-08 Scene recognition method and device based on artificial intelligence and electronic equipment Active CN112990378B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110501512.6A CN112990378B (en) 2021-05-08 2021-05-08 Scene recognition method and device based on artificial intelligence and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110501512.6A CN112990378B (en) 2021-05-08 2021-05-08 Scene recognition method and device based on artificial intelligence and electronic equipment

Publications (2)

Publication Number Publication Date
CN112990378A (en) 2021-06-18
CN112990378B (en) 2021-08-13

Family

ID=76337294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110501512.6A Active CN112990378B (en) 2021-05-08 2021-05-08 Scene recognition method and device based on artificial intelligence and electronic equipment

Country Status (1)

Country Link
CN (1) CN112990378B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516090B (en) * 2021-07-27 2024-05-14 盛景智能科技(嘉兴)有限公司 Factory building scene recognition method and device, electronic equipment and storage medium
CN114842342B (en) * 2022-05-16 2023-01-24 网思科技集团有限公司 Method and device for detecting disordered scene based on artificial intelligence and related equipment
TWI826336B (en) * 2023-07-04 2023-12-11 凌網科技股份有限公司 Frame image acquisition method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681752A (en) * 2018-05-28 2018-10-19 电子科技大学 A kind of image scene mask method based on deep learning

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10521691B2 (en) * 2017-03-31 2019-12-31 Ebay Inc. Saliency-based object counting and localization
CN111242844B (en) * 2020-01-19 2023-09-22 腾讯科技(深圳)有限公司 Image processing method, device, server and storage medium
CN111325664B (en) * 2020-02-27 2023-08-29 Oppo广东移动通信有限公司 Style migration method and device, storage medium and electronic equipment
CN111354079B (en) * 2020-03-11 2023-05-02 腾讯科技(深圳)有限公司 Three-dimensional face reconstruction network training and virtual face image generation method and device
CN111489287B (en) * 2020-04-10 2024-02-09 腾讯科技(深圳)有限公司 Image conversion method, device, computer equipment and storage medium
CN111753908A (en) * 2020-06-24 2020-10-09 北京百度网讯科技有限公司 Image classification method and device and style migration model training method and device
CN111832443B (en) * 2020-06-28 2022-04-12 华中科技大学 Construction method and application of construction violation detection model
CN111783646B (en) * 2020-06-30 2024-01-23 北京百度网讯科技有限公司 Training method, device, equipment and storage medium of pedestrian re-identification model
CN112016592B (en) * 2020-08-04 2024-01-26 杰创智能科技股份有限公司 Domain adaptive semantic segmentation method and device based on cross domain category perception
CN112733946B (en) * 2021-01-14 2023-09-19 北京市商汤科技开发有限公司 Training sample generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112990378A (en) 2021-06-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40046483

Country of ref document: HK